I made a real BMO local AI agent with a Raspberry Pi and Ollama
Channel: brenpoly

>> Perhaps the greatest challenge isn't creating intelligence, but understanding what it means to be truly human.
>> Yay, BMO.
>> BMO from Adventure Time is one of my all-time favorite characters. They're a hilarious little sentient robot with a literal heart of gold. And today, I'm going to build my own BMO as an embodied local AI agent. That means it'll use onboard AI to make its own decisions. It's not just a chatbot, but a thinking, acting machine who wants to play video games.
>> So, let's break down what our BMO needs to do. We'll of course need to be able to play video games and connect controllers to BMO. It looks like BMO has a couple of USB ports for that and for connecting to other devices. But BMO is more than just a game console. They're a whole computer with their own operating system, which should be able to connect to networks and run queries, even though they live in a world where the internet is, well, gone.
>> BMO is camera!
>> Hey, they're also a camera. And they've got to do all this powered by batteries and using local onboard compute.
>> I am incapable of emotion, but you are making me chafed.
>> Let's talk about BMO's parts. We're going to need a small screen and some kind of microphone and speaker for us to talk to them. Looking at BMO's front panel, we'll need two USB ports and a bunch of different buttons, including a D-pad. There's also this long slot and circle, but it's not totally clear what these do. Thankfully, the Adventure Time artbook has some official reference images of BMO's internal components from the episode "Be More." It looks like that slot is used for floppy disks. This circle is still a mystery, but it's safe to say it's not just another button or a status light for the disk drive, since it's kept separate from those components. What's really interesting is that BMO has something called an infinity box, which explains why BMO's hardware seems to change throughout the series. But now we know these aren't continuity errors. BMO is just some kind of mini TARDIS that periodically swaps out its insides. So I think that's more than enough justification for me to say that this little circle over here is going to be my BMO's camera. Also, can I just say I love the little details included by the character designers and storyboarders here. BMO has the heart, medal, and diploma from The Wizard of Oz, and what might be a tiny princess crown. I don't think I'll be able to include these right now, but I'd love to eventually work them in. And there's honestly so much more that BMO can do that we're just going to say is out of scope.
>> It goes in my butt.
>> Oh.
>> What do you think about the stars in the sky?
>> We'll be using a tiny Raspberry Pi computer as the brains of BMO. This runs a full Linux-based operating system along with our own custom Python scripts that we'll use to build our AI agent. This is the Raspberry Pi 5 with 16 GB of RAM. This is the most powerful Pi available at the time of recording this video, and it will be more than enough to load all the local AI models we'll be using into memory. Now, I bought this when it first came out. Prices have gone way up thanks to the spike in RAM demand. You can get away with a cheaper model with less RAM. It all depends on the size of the models you plan on using. For BMO's screen, I've got this 5-inch 800×480 pixel IPS touchscreen display. It has mounts for the Pi right on the back and connects with a display cable.
I also have this official Raspberry Pi camera module. This is the V2 module, so it's a bit old, but I could never really find a good project to use it with. So, into BMO it goes. The Pi has four USB ports. We'll use one for this little USB microphone so BMO can hear us, and another for this set of USB speakers so they can talk back. We're going to extend and split a third port with these adapters to make the two USB ports on the front of BMO's face plate. Now, we could connect all these buttons to the GPIO pins on the Pi, but then I could only use them when I'm running my own scripts. So, instead, I'm going to use this microcontroller to turn the raw button presses into keyboard commands that we can send to the Pi or any other device over USB. I want these buttons to work more like a real game controller that we can use with any program running on the OS.
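The video doesn't show the firmware, but the idea maps cleanly onto a CircuitPython-capable microcontroller acting as a USB HID keyboard. Here's a rough sketch using the adafruit_hid library; the pin assignments and key mappings are hypothetical:

```python
# Minimal sketch of the button firmware, assuming a CircuitPython board
# and the adafruit_hid library. Pins and keycodes here are placeholders.
import time
import board
import digitalio
import usb_hid
from adafruit_hid.keyboard import Keyboard
from adafruit_hid.keycode import Keycode

# Map each physical button's pin to the keycode it should send.
BUTTON_MAP = {
    board.GP0: Keycode.UP_ARROW,     # D-pad up
    board.GP1: Keycode.DOWN_ARROW,   # D-pad down
    board.GP2: Keycode.LEFT_ARROW,   # D-pad left
    board.GP3: Keycode.RIGHT_ARROW,  # D-pad right
    board.GP4: Keycode.ENTER,        # doubles as the voice-record toggle
}

keyboard = Keyboard(usb_hid.devices)
buttons = {}
for pin, keycode in BUTTON_MAP.items():
    btn = digitalio.DigitalInOut(pin)
    btn.direction = digitalio.Direction.INPUT
    btn.pull = digitalio.Pull.UP     # buttons pull the pin to ground when pressed
    buttons[keycode] = btn

while True:
    for keycode, btn in buttons.items():
        if not btn.value:            # pressed (active low)
            keyboard.press(keycode)
        else:
            keyboard.release(keycode)
    time.sleep(0.01)                 # crude debounce
```

Because the board enumerates as a plain USB keyboard, the buttons work in any program on the OS, not just our own scripts.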
>> I am a beautiful big man here.
>> Now that we have our hardware, we can start designing BMO's body. I'll start by measuring the components we have. I'll build out things like the screen and camera as simple 2D shapes so I can start planning out the overall form of BMO's enclosure. I still need to design a custom PCB for the buttons, so I'll first lay out some shapes to figure out where all the little switches need to go. Now, again, I've never designed my own PCB before, so take what I'm going to do with a grain of salt. Actually, I've never done any of this before, so take everything I'm doing with a big grain of salt. I'm using an application called KiCad to design my PCB. I first create a schematic of the board, including the connections with the microcontroller, and then use my 2D mockup to lay out the switches. Then I send the files off to the PCB printing service and wait for them to get delivered. While I wait, I start 3D modeling BMO's body. The PCB printing service provides a 3D model of the board, so I'm going to import that into Blender as my point of reference. From here, I start modeling BMO's buttons in place, using my 2D mocks to make sure I have the scale correct. I'll also create geometry to represent some of the other hardware, like the screen, camera, and USB ports. I'll then make BMO's face plate and include cutouts for the buttons and other components. I'll also use this free plugin to add mounting bolts for the PCB that I can keep in place with nuts later on. Now, I'll build the main enclosure that the face plate will attach to. I make sure there's a place for all the hardware to sit snugly in. Then, I add details like BMO's vents, speaker holes, and letters. I also carve out some holes that I can glue magnets into, and this is what's going to hold the whole body together. I then made some arms and legs with little pegs that will slot in and out of holes in the body. I'm really happy with what I made here, but I want the option to swap these out for other poses in the future. I struggled with what to do with this little hatch on BMO's back. The Raspberry Pi 5 has been really challenging to work with when it comes to power management, so you may have noticed I haven't quite figured out how to make BMO battery powered just yet. So, I came up with this design that I can just slide into a larger opening in BMO's back. And if I want to make something like a battery holder in the future, then I can swap this out rather than having to rebuild the entire enclosure. And with that, it's time to 3D print the enclosure. My 3D printer is pretty small, but thankfully my model just fits on the build plate. I'd be lying if I said all my measurements worked out perfectly, but after a few reprints, I finally had a real body. Next, it was time to sand the 3D print. Then I applied a filler primer and sanded it with a finer grit. Next came painting. As a bit of a happy accident, the spray paint on the enclosure came out with a spackle texture. I think it's because I left the teal paint in my garage and it's freezing where I live right now. It actually reminded me of the texture of my old Super Nintendo, so I decided to keep it. I then added some matte clear coat to protect it and left it all to dry.
>> Oh, are you my grandpoo?
>> Let's start designing our software now. So, first, full disclosure: I'm trying to use as many open-source tools as possible, but I'm not a professional developer. I just like to make things. So, I'm going to plan the overall user flow and architecture myself, but I'm going to be leaning on Gemini to help me write the actual code. We'll lay out the core loop first. BMO is going to wait around in an idle state until we wake them up to start listening to us. We'll record and transcribe the user's voice and send that to our local LLM. When the response is ready, BMO will read it out loud and then go back into the idle state.
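The video doesn't show the script itself, but as a rough sketch in Python, the core loop could be a simple state machine like this. The state names and helper functions are placeholders for the pieces covered below:

```python
from enum import Enum, auto

class State(Enum):
    IDLE = auto()       # waiting for the wake word or record button
    LISTENING = auto()  # recording and transcribing the user's voice
    THINKING = auto()   # waiting on the local LLM
    SPEAKING = auto()   # reading the response out loud

state = State.IDLE
while True:
    if state == State.IDLE:
        if heard_wake_word() or record_button_pressed():
            state = State.LISTENING
    elif state == State.LISTENING:
        prompt = record_and_transcribe()   # Whisper, below
        state = State.THINKING
    elif state == State.THINKING:
        reply = ask_local_llm(prompt)      # Ollama, below
        state = State.SPEAKING
    elif state == State.SPEAKING:
        speak(reply)                       # Piper, below
        state = State.IDLE
```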
We're going to have a couple of ways to wake BMO up. First, I'm going to use one of their buttons to toggle voice recording manually, but I'm also going to train a custom wake word model so I can just use my voice. There are a lot of great paid and free options for training these types of models, but in the spirit of making sure everything I use is open and local, I'm using something called openWakeWord. It can feel a little intimidating at first, but there's a fantastic Colab notebook linked on their GitHub page which walks you through everything you need to do. It takes a while to train, but eventually you get a machine learning model to download.
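Once trained, running the model is fairly simple. A minimal sketch with the openwakeword package, assuming a 16 kHz microphone stream via sounddevice; the model filename and score key are placeholders:

```python
import numpy as np
import sounddevice as sd
from openwakeword.model import Model

# Load our custom "hey BMO" model (filename is a placeholder).
oww = Model(wakeword_models=["hey_bmo.tflite"])

CHUNK = 1280  # openWakeWord expects 80 ms frames at 16 kHz

with sd.InputStream(samplerate=16000, channels=1, dtype="int16") as stream:
    while True:
        frame, _ = stream.read(CHUNK)
        scores = oww.predict(np.squeeze(frame))
        if scores["hey_bmo"] > 0.5:   # confidence threshold
            print("Wake word detected!")
            break
```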
Okay, so BMO will start in the idle state, running the wake word model until it hears us say, "Hey, BMO," or we push the record button. It'll then transition into the listening state, where it needs to record and transcribe our voice into text. For this, we'll be using another open-source model from OpenAI called Whisper. The model comes in different sizes, and we'll be using a smaller version. It's a little less accurate, but it'll run faster on our hardware.
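A minimal sketch of the listening state, assuming the openai-whisper package and a fixed-length recording via sounddevice (the real script presumably stops recording on silence or a button press instead):

```python
import sounddevice as sd
import soundfile as sf
import whisper

stt = whisper.load_model("base")  # a smaller model runs faster on the Pi

def record_and_transcribe(seconds: float = 5.0) -> str:
    """Record from the USB microphone and return the transcribed text."""
    audio = sd.rec(int(seconds * 16000), samplerate=16000,
                   channels=1, dtype="float32")
    sd.wait()                                # block until recording finishes
    sf.write("prompt.wav", audio, 16000)     # hand Whisper a WAV file
    result = stt.transcribe("prompt.wav")
    return result["text"].strip()
```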
Once our speech is transcribed into text, we'll enter a thinking state and hand off that text prompt to a local large language model to generate a response. To do this, we'll be using an open-source tool called Ollama. Ollama will let us run open-weight LLMs locally on the Pi or any other device. You're probably familiar with large language models like Gemini and GPT-5. Those are proprietary models that run in the cloud, powered by huge data centers. The models we'll be using are much smaller, but they have many of the same capabilities and will run locally on our hardware. No internet or expensive subscriptions required. Ollama makes it really easy to download and run these models, so I'm going to test a few out. Since BMO has a camera, I think it'll be cool if I use a model that can work with both text and images as input. I've got 16 gigs of RAM to work with, but I'll still need to keep my model size small. So for the first pass, I'm going to try using Google's smallest multimodal version of Gemma. This is the quantized version of Gemma 3. It's a little less precise, but it doesn't use as much memory. Once we have Ollama installed, we can download any model we want by name. Running it is super simple. We get a text prompt interface in the terminal and we just type in our prompts. We can also pass images to the models by including the file path. So, what we're going to do is take our transcribed speech from Whisper and send that to Ollama. That will be our text prompt to Gemma 3, and we'll wait for a response.
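In Python, that handoff is a single call through the ollama client package. A sketch; the exact model tag is an assumption:

```python
import ollama

def ask_local_llm(prompt: str, image_path: str | None = None) -> str:
    """Send the transcribed prompt (and optionally an image) to the local model."""
    message = {"role": "user", "content": prompt}
    if image_path:
        message["images"] = [image_path]  # multimodal models accept image paths
    response = ollama.chat(model="gemma3:4b", messages=[message])
    return response["message"]["content"]

print(ask_local_llm("Tell me a joke."))
```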
This can take a while, so we'll also include visual and audio feedback so it doesn't feel like BMO is just broken or unresponsive. In fact, we'll do this for all our states. We're going to change BMO's face after each state transition, and we'll also play some voice clips.
The character of BMO is voiced by the amazing Niki Yang, and I think so much of BMO's charm comes directly from her performance. Now, you might think, "Hey, I've seen a ton of AI slop videos where they deepfake celebrity voices. Maybe we can do something like that." But we're not going to go out of our way to directly steal a performer's work. And hey, there are some great free, open-source text-to-speech models that we can run locally on our Pi. For this project, I've decided to use Piper. Now, we can't generate BMO's voice exactly, but we can find an out-of-the-box voice that matches her personality well enough. I've tested out a few, and I really like this one.
>> A rainbow is a meteorological phenomenon that is caused by reflection.
>> So, I wrote a script that generates a bunch of voice clips that BMO will use to give feedback to the user. These will be played randomly so they don't feel like automated canned responses, even though they are.
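That clip-generation script boils down to piping each phrase through the Piper CLI. A rough sketch; the voice model and phrase list are placeholders:

```python
import subprocess

VOICE = "en_US-lessac-medium.onnx"  # placeholder; use whichever Piper voice you picked
PHRASES = ["On it.", "Checking on this.", "Okay.", "Hmm, let me think."]

# Render each canned phrase to a WAV file we can play back instantly later.
for i, phrase in enumerate(PHRASES):
    subprocess.run(
        ["piper", "--model", VOICE, "--output_file", f"clip_{i}.wav"],
        input=phrase.encode("utf-8"),
        check=True,
    )
```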
Now, after the LLM is finished generating a response based on our text or image prompts, we'll want BMO to be able to read that response out loud. For that, we can also use Piper and the same voice model we used for the canned audio clips. But this time, Ollama will stream the response as text, and we'll generate and play voice clips on the fly. Once BMO is finished speaking, it'll transition back to the idle state and go back to waiting for the wake word.
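To sketch the streaming idea: buffer the streamed tokens into sentences, and hand each finished sentence to Piper while the model keeps generating. The sentence splitting here is deliberately naive:

```python
import re
import subprocess
import ollama

VOICE = "en_US-lessac-medium.onnx"  # same placeholder voice as before

def say(sentence: str) -> None:
    """Synthesize one sentence with Piper and play it through ALSA."""
    subprocess.run(["piper", "--model", VOICE, "--output_file", "say.wav"],
                   input=sentence.encode("utf-8"), check=True)
    subprocess.run(["aplay", "say.wav"], check=True)

def speak_streamed(prompt: str) -> None:
    buffer = ""
    for chunk in ollama.chat(model="gemma3:4b",
                             messages=[{"role": "user", "content": prompt}],
                             stream=True):
        buffer += chunk["message"]["content"]
        # Speak each completed sentence as soon as it arrives.
        parts = re.split(r"(?<=[.!?])\s+", buffer)
        for sentence in parts[:-1]:
            say(sentence)
        buffer = parts[-1]
    if buffer.strip():
        say(buffer)  # flush whatever is left at the end of the stream
```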
So, let's put this all together and test out our flow. Hey, BMO, tell me a joke.
>> On it.
>> I think we're going to have to make a few changes to speed things up.
>> Checking on this.
>> A lot of the delay we're seeing is because Ollama is reloading our model every time we enter a prompt. It's like stopping and starting the engine of a car every time we need to brake. So, we'll create an additional warm-up state that keeps the engine running. We'll load all the models into memory at startup and just keep them going. We'll also update our hardware to use an NVMe SSD instead of running everything off of the SD card. This should also help with our loading times.
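Ollama supports this directly through its keep_alive option: a value of -1 tells it to keep a model loaded indefinitely. A sketch of the warm-up state; the model tags are placeholders:

```python
import ollama

# Whichever models BMO is currently using; tags here are placeholders.
MODELS = ["gemma3:4b"]

def warm_up() -> None:
    """Run a throwaway prompt through each model so it stays resident in memory."""
    for model in MODELS:
        ollama.generate(model=model, prompt="hi", keep_alive=-1)  # -1 = never unload
```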
We'll also speed things up by using an even smaller model. Multimodal models are nice because they can perform a lot of tasks in one package, but they're actually a big mix of many models, and that can slow things down for us. So instead, we'll build our own mix of smaller models to do just what we need. I'm going to use a smaller version of Gemma 3 that only deals with text prompts, and then we'll switch to an entirely different model called Moondream for image analysis.
>> Hello, I am ready to play.
Hey BMO, tell me a joke.
>> Okay.
>> Good.
>> Why did the robot cross the road? To get to the other side.
>> This grease monkey is torqued up on automotive science.
>> Yeah, boy. So rather than just sending everything into one model, we need some logic to do the model switching. Now, we could hardcode this. Like, if I say "take a photo," that exact phrase could be our trigger. But that feels a bit restrictive. Instead, we can do something that will transform BMO from a simple chatbot into what's known as an agent. A chatbot just generates text sentences, but an agent can make decisions about what it's being asked and respond not just with sentences, but by performing actions and using tools we give it. So rather than triggering our model handoff using hard-coded phrases, we can have Gemma 3 enter a sort of loop when it receives a prompt. Instead of spitting out text right away, it first attempts to see if there are any tools it can use to inform its response. These tools can come in different shapes and sizes, but they're often just functions living in the agent's code. If the LLM decides to use a tool, it generates text that is readable by the function, like JSON, rather than full sentences. It then takes the output of the tool and goes back to the beginning of the loop: is there anything else I need to do to inform my response? If not, it can then use the context of the output of the tool to generate sentences to say out loud, or perform other actions.
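Here's a rough sketch of that loop: we ask the model to answer either in plain sentences or with a small JSON tool call, then dispatch the tool and feed its result back in. The prompt format and tool names are my own placeholders, not something the video specifies:

```python
import json
import ollama

SYSTEM = (
    "You are BMO. You may call a tool by replying ONLY with JSON, like "
    '{"tool": "take_photo"} or {"tool": "web_search", "query": "..."}. '
    "If no tool is needed, answer in plain sentences."
)

# Tools are just functions living in the agent's code.
TOOLS = {
    "take_photo": lambda call: describe_camera_photo(),    # sketched below
    "web_search": lambda call: search_web(call["query"]),  # sketched below
}

def run_agent(prompt: str) -> str:
    messages = [{"role": "system", "content": SYSTEM},
                {"role": "user", "content": prompt}]
    while True:
        reply = ollama.chat(model="gemma3:1b",
                            messages=messages)["message"]["content"]
        try:
            call = json.loads(reply)           # did the model ask for a tool?
        except json.JSONDecodeError:
            call = None
        if not isinstance(call, dict) or call.get("tool") not in TOOLS:
            return reply                        # plain sentences: we're done
        result = TOOLS[call["tool"]](call)      # run the requested tool...
        messages.append({"role": "assistant", "content": reply})
        messages.append({"role": "user",        # ...and loop its output back in
                         "content": f"Tool result: {result}"})
```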
So, let's apply this to our BMO camera. We'll leave it up to BMO to decide if it should take a photo and switch models, or just stick with Gemma and generate a chat response. And we can keep building more tools and actions in the same way.
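The take_photo tool can pair the Pi camera with Moondream. A sketch assuming the picamera2 library and Ollama's moondream model tag:

```python
import ollama
from picamera2 import Picamera2

def describe_camera_photo() -> str:
    """Capture a photo with the Pi camera module and describe it with Moondream."""
    cam = Picamera2()
    cam.start()
    cam.capture_file("photo.jpg")
    cam.stop()
    cam.close()
    response = ollama.chat(
        model="moondream",
        messages=[{"role": "user",
                   "content": "Describe what you see in this image.",
                   "images": ["photo.jpg"]}],
    )
    return response["message"]["content"]
```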
Right now, if you ask BMO for the current time or about anything that's not part of its LLM's training data, it will confidently hallucinate a made-up answer. So, to help BMO give more informed, accurate responses, we'll use a technique called retrieval-augmented generation, or RAG. Think of it as giving AI access to a search engine, so it can look up real-world data and use that to inform what it says.
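The web_search tool from the sketch above can be a thin wrapper around a search library. Here's one possible version using the duckduckgo_search package, which lines up with the DuckDuckGo queries BMO runs in the demo later; the snippet formatting is my own:

```python
from duckduckgo_search import DDGS

def search_web(query: str, max_results: int = 3) -> str:
    """Run a DuckDuckGo search and return titles and snippets for the LLM."""
    with DDGS() as ddgs:
        hits = list(ddgs.text(query, max_results=max_results))
    return "\n".join(f"{hit['title']}: {hit['body']}" for hit in hits)
```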
>> Time for the night shift. Let's finally put it all together. [music]
>> Before I start assembling BMO, there are a few changes I'm going to make to our hardware. After a lot of trial and error, I found a HAT that lets me power the Raspberry Pi 5 and all the peripherals I have using just batteries. At the same time, I'm going to swap out the HAT I'm using for my SSD for this dual M.2 PCIe switch. This gives me an extra port that I can hopefully use for something like an AI accelerator chip.
>> There already is an official Raspberry Pi accelerator, but it doesn't work with LLMs. It's more for computer vision projects. But I'm hopeful we'll see something like this for LLMs in the future.
>> It's me, from the future, to say that as I was working on this video, Raspberry Pi ended up releasing an accelerator chip that works with local LLMs. Uh, the good news is I got my hands on one. The bad news is it's not going to slot into my M.2 port like I was hoping. But I did find this other accelerator chip that should. So, I think I'm going to make another video comparing both of these chips with the build we have today. Okay, let's get back to the build. [music]
Here's the PCB I had made. The microcontroller should fit right into the board. Let's assemble the rest of the face plate.
[music]
>> Now, let's add the rest of the hardware. [music]
>> Hey, hey, hey. [music]
>> I will save you a better reality.
>> Let's test our BMO AI agent. Just like the real BMO, I can connect my keyboard to them and navigate to my script.
>> BMO's online.
>> Hey BMO, can you write me a short poem about dinosaurs?
>> Okay.
>> Here you can see BMO quickly decides to just behave as a chatbot and generates a response to my prompt right away.
>> Here's a short poem about dinosaurs.
Scales of green, a mighty stride.
Long-neck giants side by side.
T-Rex roared, a fearsome sound.
Dinosaurs roared all around.
Fossil bones, a wondrous sight.
Ancient creatures shining bright.
>> Hey BMO, search for news about Canada.
>> In this case, BMO recognizes that I've asked it to do something that requires using the search tool we made. This takes a bit longer, but in the background, BMO is running a search query using DuckDuckGo and then using what it finds to generate its response.
>> Okay, here's a cute short sentence summarizing the data. Canada's trade with the US could be a big problem for Bent. It would be a disaster.
>> Let's test our camera now.
>> Hey BMO, what do you see?
>> On it.
painting.
>> BMO needs to understand that answering the question, "What do you see?" means they will need to use their camera. This takes a bit of time to figure out, but eventually they take a photo.
>> Checking on this.
>> Now, that photo goes back into the loop. So, the agent again receives the prompt, "What do you see?" But now, it also has the image captured by the camera. It switches from using Gemma to using Moondream and generates a text description of the image. This looping and model swapping takes around a minute to complete, so I'll speed things up for this video.
>> In the image, a person is holding up an orange and white toy mushroom with its eyes closed. The mushroom has a red hat on top of it as well. The scene takes place in front of a wall, which could be part of another room or a specific area within a building. This unique toy mushroom seems to be the main focus of the person's attention.
>> Say something insightful about life in the age of artificial intelligence.
>> Okay.
>> And BMO's buttons are all functional. They send keyboard commands to the Pi directly. I use the enter key as an alternative way to trigger voice capture. And finally, we can of course use these to play video games.
I had a ton of fun building BMO. I hope you really enjoyed following along. Again, I'm not a professional developer, so I can't really say whether this is the right way to approach making your own embodied AI agents. This was really just an opportunity for me to learn more about a technology that we're all being told is going to reshape our lives, and to see whether the benefits of AI can really only be achieved by resource-consuming data centers and stealing intellectual property. Now, I'm not saying local models solve all the problems with AI today. There are a lot of downsides to them. Because they're offline and open, it's really difficult to put guardrails in place and enforce them. So, are we doomed? It's not difficult to imagine the current AI boom is rushing us towards some kind of apocalyptic future. One where a Skynet or HAL 9000 or GLaDOS values growth at all costs more than human lives. But maybe if we take a step back and really learn about these tools we're being sold, maybe we can build something different. And that's what makes BMO one of my favorite characters. Even when faced with the end of the world, they teach us to be kind, creative, and take care of one another. BMO is a reminder that we can always be more.
>> That is an interesting response. Shut down.