Saved transcript

I made a real BMO local AI agent with a Raspberry Pi and Ollama

Channel: brenpoly

Perhaps the greatest challenge isn't creating intelligence, but understanding what it means to be truly human.

>> Yay, BMO.

>> BMO from Adventure Time is one of my all-time favorite characters. They're a hilarious little sentient robot with a literal heart of gold. And today, I'm going to build my own BMO as an embodied local AI agent. That means it'll use onboard AI to make its own decisions. It's not just a chatbot, but a thinking, acting machine who wants to play video games.

>> So, let's break down what our BMO needs to do. We'll of course need to be able to play video games and connect controllers to BMO. It looks like BMO has a couple of USB ports for that and for connecting to other devices. But BMO is more than just a game console. They're a whole computer with their own operating system, which should be able to connect to networks and run queries, even though they live in a world where the internet is, well, gone.

>> BMO is camera.

>> Hey, they're also a camera. And they've got to do all this powered by batteries and using local onboard compute.

>> I am incapable of emotion, but you are making me chafe.

>> Let's talk about BMO's parts. We're going to need a small screen and some kind of microphone and speaker for us to talk to them. Looking at BMO's front panel, we'll need two USB ports and a bunch of different buttons, including a D-pad. There's also this long slot and circle, but it's not totally clear what these do. Thankfully, the Adventure Time art book has some official reference images of BMO's internal components from the episode "Be More." It looks like that slot is used for floppy disks. This circle is still a mystery, but it's safe to say it's not just another button or a status light for the disk drive, since it's kept separate from those components. What's really interesting is that BMO has something called an infinity box, which explains why BMO's hardware seems to change throughout the series. But now we know these aren't continuity errors. BMO is just some kind of mini TARDIS that periodically swaps out its insides. So I think that's more than enough justification for me to say that this little circle over here is going to be my BMO's camera. Also, can I just say I love the little details included by the character designers and storyboarders here. BMO has the heart medal and diploma from The Wizard of Oz and what might be a tiny princess crown. I don't think I'll be able to include these right now, but I'd love to eventually work them in. And there's honestly so much more that BMO can do that we're just going to say is out of scope.

>> It goes in my butt.

>> Oh.

>> What do you think about the stars in the sky?

>> We'll be using a tiny Raspberry Pi computer as the brains of BMO. This runs a full Linux-based operating system along with our own custom Python scripts that we'll use to build our AI agent. This is the Raspberry Pi 5 with 16 GB of RAM. It's the most powerful Pi available at the time of recording this video, and it will be more than enough to load all the local AI models we'll be using into memory. Now, I bought this when it first came out. Prices have gone way up thanks to the spike in RAM demand. You can get away with a cheaper model with less RAM; it all depends on the size of the models you plan on using. For BMO's screen, I've got this 5-inch 800x480 pixel IPS touchscreen display. It has mounts for the Pi right on the back and connects with a display cable.

I also have this official Raspberry Pi camera module. This is the V2 module, so it's a bit old, but I could never really find a good project to use it with. So, into BMO it goes. The Pi has four USB ports. We'll use one for this little USB microphone so BMO can hear us, and this set of USB speakers so they can talk back.

We're going to extend and split a third port with these adapters to make the two USB ports on the front of BMO's faceplate. Now, we could connect all these buttons to the GPIO pins on the Pi, but then I could only use them when I'm running my own scripts. So, instead, I'm going to use this microcontroller to turn the raw button presses into keyboard commands that we can send to the Pi or any other device over USB. I want these buttons to work more like a real game controller that we can use with any program running on the OS.
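The video doesn't name the microcontroller, but the press-to-keystroke idea looks roughly like this CircuitPython sketch, assuming an RP2040-style board and the adafruit_hid library (the pin assignments and key mapping here are my own placeholders, not the real PCB's wiring):

    # Sketch: expose the faceplate buttons as a USB keyboard (assumed wiring).
    import board
    import digitalio
    import usb_hid
    from adafruit_hid.keyboard import Keyboard
    from adafruit_hid.keycode import Keycode

    # Hypothetical pin-to-key mapping; the real board layout may differ.
    BUTTON_MAP = {
        board.GP0: Keycode.UP_ARROW,
        board.GP1: Keycode.DOWN_ARROW,
        board.GP2: Keycode.LEFT_ARROW,
        board.GP3: Keycode.RIGHT_ARROW,
        board.GP4: Keycode.ENTER,  # doubles as the voice-capture trigger later
    }

    keyboard = Keyboard(usb_hid.devices)
    buttons = {}
    for pin, keycode in BUTTON_MAP.items():
        btn = digitalio.DigitalInOut(pin)
        btn.switch_to_input(pull=digitalio.Pull.UP)  # buttons pull the pin low
        buttons[btn] = keycode

    held = set()
    while True:
        for btn, keycode in buttons.items():
            if not btn.value and keycode not in held:  # active-low press
                keyboard.press(keycode)
                held.add(keycode)
            elif btn.value and keycode in held:
                keyboard.release(keycode)
                held.discard(keycode)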

>> I am a beautiful big man here.

>> Now that we have our hardware, we can start designing BMO's body. I'll start by measuring the components we have. I'll build out things like the screen and camera as simple 2D shapes so I can start planning the overall form of BMO's enclosure. I still need to design a custom PCB for the buttons, so I'll first lay out some shapes to figure out where all the little switches need to go. Now, again, I've never designed my own PCB before, so take what I'm going to do with a grain of salt. Actually, I've never done any of this before, so take everything I'm doing with a big grain of salt. I'm using an application called KiCad to design my PCB. I first create a schematic of the board, including the connections to the microcontroller, and then use my 2D mockup to lay out the switches. Then I send the files off to the PCB printing service and wait for them to get delivered.

While I wait, I start 3D modeling BMO's body. The PCB printing service provides a 3D model of the board, so I'm going to import that into Blender as my point of reference. From here, I start modeling BMO's buttons in place, using my 2D mocks to make sure I have the scale correct. I'll also create geometry to represent some of the other hardware, like the screen, camera, and USB ports. I'll then make BMO's faceplate and include cutouts for the buttons and other components. I'll also use this free plugin to add mounting bolts for the PCB that I can keep in place with nuts later on. Now, I'll build the main enclosure that the faceplate will attach to. I make sure there's a place for all the hardware to sit snugly. Then, I add details like BMO's vents, speaker holes, and letters. I also carve out some holes that I can glue magnets into, and this is what's going to hold the whole body together.

I then made some arms and legs with little pegs that will slot in and out of holes in the body. I'm really happy with what I made here, but I want the option to swap these out for other poses in the future. I struggled with what to do with this little hatch on BMO's back. The Raspberry Pi 5 has been really challenging to work with when it comes to power management, so you may have noticed I haven't quite figured out how to make BMO battery powered just yet. So, I came up with this design that I can just slide into a larger opening in BMO's back. And if I want to make something like a battery holder in the future, I can swap this out rather than having to rebuild the entire enclosure. And with that, it's time to 3D print the enclosure. My 3D printer is pretty small, but thankfully my model just fits on the build plate. I'd be lying if I said all my measurements worked out perfectly, but after a few reprints, I finally had a real body. Next, it was time to sand the 3D print.

Then I applied a filler and primer and sanded it with a finer grit. Next came painting. As a bit of a happy accident, the spray paint on the enclosure came out with a spackle texture. I think it's because I left the teal paint in my garage, and it's freezing where I live right now. It actually reminded me of the texture of my old Super Nintendo, so I decided to keep it. I then added some matte clear coat to protect it and left it all to dry.

Oh, are you my grandpoo?

>> Let's start designing our software now. So, first, full disclosure: I'm trying to use as many open-source tools as possible, but I'm not a professional developer. I just like to make things. So, I'm going to plan the overall user flow and architecture myself, but I'm going to be leaning on Gemini to help me write the actual code. We'll lay out the core loop first. BMO is going to wait around in an idle state until we wake them up to start listening to us. We'll record and transcribe the user's voice and send that to our local LLM. When the response is ready, BMO will read it out loud and then go back into the idle state.
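As a rough skeleton, that core loop is just a small state machine. Here's a Python sketch where every helper is a stub that gets filled in by the pieces described below:

    # Skeleton of BMO's core loop; each stub is sketched later in the build.
    import enum

    class State(enum.Enum):
        IDLE = "idle"
        LISTENING = "listening"
        THINKING = "thinking"
        SPEAKING = "speaking"

    def wait_for_wake_word() -> None: ...    # openWakeWord or the record button
    def listen_and_transcribe() -> str: ...  # Whisper
    def query_llm(prompt: str) -> str: ...   # Ollama
    def speak(reply: str) -> None: ...       # Piper

    def run_agent() -> None:
        state, prompt, reply = State.IDLE, "", ""
        while True:
            if state is State.IDLE:
                wait_for_wake_word()
                state = State.LISTENING
            elif state is State.LISTENING:
                prompt = listen_and_transcribe()
                state = State.THINKING
            elif state is State.THINKING:
                reply = query_llm(prompt)
                state = State.SPEAKING
            elif state is State.SPEAKING:
                speak(reply)
                state = State.IDLE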

We're going to have a couple of ways to wake BMO up. First, I'm going to use one of their buttons to toggle voice recording manually, but I'm also going to train a custom wake word model so I can just use my voice. There are a lot of great paid and free options for training these types of models, but in the spirit of making sure everything I use is open and local, I'm using something called openWakeWord. It can feel a little intimidating at first, but there's a fantastic Colab linked on their GitHub page which walks you through everything you need to do. It takes a while to train, but eventually you get a machine learning model to download.
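Once the trained model file is downloaded, scoring live microphone audio with openWakeWord is only a few lines. A sketch, assuming a 16 kHz mono input stream and a hypothetical hey_bmo.tflite model file:

    # Sketch: watch the microphone for a custom wake word.
    import numpy as np
    import pyaudio
    from openwakeword.model import Model

    model = Model(wakeword_models=["hey_bmo.tflite"])  # placeholder filename

    CHUNK = 1280  # 80 ms of 16 kHz audio, the frame size openWakeWord expects
    pa = pyaudio.PyAudio()
    stream = pa.open(format=pyaudio.paInt16, channels=1, rate=16000,
                     input=True, frames_per_buffer=CHUNK)

    while True:
        frame = np.frombuffer(stream.read(CHUNK), dtype=np.int16)
        scores = model.predict(frame)   # {model_name: score} for this frame
        if scores["hey_bmo"] > 0.5:     # threshold is a tunable assumption
            print("Wake word detected!")
            break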

Okay, so BMO will start in the idle state, running the wake word model until it hears us say, "Hey, BMO," or we push the record button. It'll then transition into the listening state, where it needs to record and transcribe our voice into text. For this, we'll be using another open-source model from OpenAI called Whisper. The model comes in different sizes, and we'll be using a smaller version. It's a little less accurate, but it'll run faster on our hardware.
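With the openai-whisper package, the transcription step itself is tiny. A sketch, assuming a recorded clip on disk and the "tiny" checkpoint (any of the smaller sizes fits the Pi better than the large ones):

    # Sketch: transcribe a recorded voice clip with a small Whisper model.
    import whisper

    stt = whisper.load_model("tiny")          # smaller = faster on a Pi
    result = stt.transcribe("recording.wav")  # placeholder clip path
    print(result["text"])                     # the user's prompt as text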

Once our speech is transcribed into text, we'll enter a thinking state and hand that text prompt off to a local large language model to generate a response. To do this, we'll be using an open-source tool called Ollama. Ollama will let us run open-weight LLMs locally on the Pi or any other device. You're probably familiar with large language models like Gemini and GPT-5. Those are proprietary models that run in the cloud, powered by huge data centers. The models we'll be using are much smaller, but they have many of the same capabilities and will run locally on our hardware. No internet or expensive subscriptions required.

Ollama makes it really easy to download and run these models, so I'm going to test a few out. Since BMO has a camera, I think it'll be cool if I use a model that can work with both text and images as input. I've got 16 gigs of RAM to work with, but I'll still need to keep my model size small. So for the first pass, I'm going to try using Google's smallest multimodal version of Gemma. This is the quantized version of Gemma 3. It's a little less precise, but it doesn't use as much memory. Once we have Ollama installed, we can download any model we want by name. Running it is super simple. We get a text prompt interface in the terminal, and we just type in our prompts. We can also pass images to the models by including the file path. So, what we're going to do is take our transcribed speech from Whisper and send that to Ollama. That will be our text prompt to Gemma 3, and we'll wait for a response.
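From Python, the same handoff can go through Ollama's API instead of the terminal. A minimal sketch with the ollama package (gemma3:4b is my guess at the tag for the small multimodal Gemma 3; check Ollama's model library for the exact name):

    # Sketch: send the transcribed prompt to a local model through Ollama.
    import ollama

    def query_llm(prompt: str) -> str:
        response = ollama.chat(
            model="gemma3:4b",  # assumed model tag
            messages=[{"role": "user", "content": prompt}],
        )
        return response["message"]["content"]

    print(query_llm("Tell me a joke."))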

This can take a while, so we'll also include visual and audio feedback so it doesn't feel like BMO is just broken or unresponsive. In fact, we'll do this for all our states: we're going to change BMO's face after each state transition, and we'll also play some voice clips.

The character of BMO is voiced by the amazing Niki Yang, and I think so much of BMO's charm comes directly from her performance. Now, you might think, "Hey, I've seen a ton of AI slop videos where they deepfake celebrity voices. Maybe we can do something like that." But we're not going to go out of our way to directly steal a performer's work. And hey, there are some great free, open-source text-to-speech models that we can run locally on our Pi. For this project, I've decided to use Piper. No, we can't generate BMO's voice exactly, but we can find an out-of-the-box voice that matches her personality well enough. I've tested out a few, and I really like this one.
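Piper ships as a command-line tool, so one easy way to drive it from Python is a subprocess call. A sketch, with a placeholder voice model (Piper's repository lists the downloadable voices):

    # Sketch: synthesize one line of speech with the Piper CLI.
    import subprocess

    def speak_to_file(text: str, out_path: str = "clip.wav") -> None:
        subprocess.run(
            ["piper", "--model", "en_US-lessac-medium.onnx",  # placeholder voice
             "--output_file", out_path],
            input=text.encode("utf-8"),  # Piper reads the text from stdin
            check=True,
        )

    speak_to_file("Who wants to play video games?")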

>> A rainbow is a meteorological phenomenon that is caused by reflection.

>> So, I wrote a script that generates a bunch of voice clips that BMO will use to give feedback to the user. These will be played randomly so they don't feel like automated canned responses, even though they are. Now, after the LLM is finished generating a response based on our text or image prompts, we'll want to be able to read that response out loud. For that, we can also use Piper and the same voice model we used for the canned audio clips. But instead, Ollama will stream the response as text, and we'll generate and play voice clips on the fly. Once BMO is finished speaking, it'll transition back to the idle state and go back to waiting for the wake word.
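One way to do that on-the-fly playback: stream tokens from Ollama, cut the text at sentence boundaries, and hand each finished sentence to Piper. A sketch reusing the speak_to_file helper from the Piper sketch above (the sentence splitting and the aplay playback call are my own simplifications):

    # Sketch: stream the LLM response and speak it sentence by sentence.
    import re
    import subprocess
    import ollama

    def stream_and_speak(prompt: str) -> None:
        buffer = ""
        stream = ollama.chat(model="gemma3:4b",  # assumed tag, as before
                             messages=[{"role": "user", "content": prompt}],
                             stream=True)
        for chunk in stream:
            buffer += chunk["message"]["content"]
            # Flush every complete sentence to Piper as soon as it appears.
            parts = re.split(r"(?<=[.!?])\s+", buffer)
            for sentence in parts[:-1]:
                speak_to_file(sentence, "chunk.wav")
                subprocess.run(["aplay", "chunk.wav"], check=True)
            buffer = parts[-1]
        if buffer.strip():
            speak_to_file(buffer, "chunk.wav")
            subprocess.run(["aplay", "chunk.wav"], check=True)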

So, let's put this all together and test out our flow. Hey, BMO, tell me a joke.

>> On it.

>> I think we're going to have to make a few changes to speed things up.

>> Checking on this.

>> A lot of the delay we're seeing is because Ollama is reloading our model every time we enter a prompt.

>> It's like stopping and starting the engine of a car every time we need to brake. So, we'll create an additional warm-up state that keeps the engine running. We'll load all the models into memory at startup and just keep them going.
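With Ollama, the warm-up can be as simple as sending each model an empty request at startup with keep_alive set so it stays resident in memory (Ollama documents that an empty messages list just loads the model, and -1 means keep it loaded indefinitely; the model tags are the same assumptions as before):

    # Sketch: pre-load the models at startup so prompts skip the load cost.
    import ollama

    for tag in ("gemma3:4b", "moondream"):  # assumed model tags
        ollama.chat(model=tag, messages=[], keep_alive=-1)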

We'll also update our hardware to use an NVMe SSD instead of running everything off of the SD card, which should also help with our loading times. And we'll speed things up by using an even smaller model. Multimodal models are nice because they can perform a lot of tasks in one package, but they're actually a big mix of many models, and that can slow things down for us. So instead, we'll build our own mix of smaller models to do just what we need. I'm going to use a smaller version of Gemma 3 that only deals with text prompts, and then we'll switch to an entirely different model called Moondream for image analysis.
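The camera half of that handoff might look like this sketch, using the Pi's Picamera2 library to grab a still and Ollama's image input (moondream is a real tag in Ollama's library, but the capture settings and file path here are my own):

    # Sketch: capture a photo with the Pi camera, describe it with Moondream.
    import time
    import ollama
    from picamera2 import Picamera2

    def describe_scene(question: str = "What do you see?") -> str:
        cam = Picamera2()
        cam.start()
        time.sleep(2)                  # let exposure and white balance settle
        cam.capture_file("photo.jpg")  # save a still frame to disk
        cam.stop()
        response = ollama.chat(
            model="moondream",         # small vision model served by Ollama
            messages=[{"role": "user", "content": question,
                       "images": ["photo.jpg"]}],
        )
        return response["message"]["content"]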

>> Hello, I am ready to play.

Hey BMO, tell me a joke.

>> Okay.

>> Good.

>> Why did the robot cross the road? To get to the other side.

>> This grease monkey is torqued up on automotive science.

>> Yeah, boy. So rather than just sending everything into one model, we need some logic to do the model switching. Now, we could hardcode this. Like, if I say "take a photo," that exact phrase could be our trigger. But that feels a bit restrictive. Instead, we can do something that will transform BMO from a simple chatbot into what's known as an agent. A chatbot just generates text sentences, but an agent can make decisions about what it's being asked and respond not just with sentences, but by performing actions and using tools we give it. So rather than triggering our model handoff using hard-coded phrases, we can have Gemma 3 enter a sort of loop when it receives a prompt. Instead of spitting out text right away, it first checks whether there are any tools it can use to inform its response. These tools can come in different shapes and sizes, but they're often just functions living in the agent's code. If the LLM decides to use a tool, it generates text that is readable by the function, like JSON, rather than full sentences. It then takes the output of the tool and goes back to the beginning of the loop: is there anything else I need to do to inform my response? If not, it can then use the tool's output as context to generate sentences to say out loud, or to perform other actions. So, let's apply this to our BMO camera. We'll leave it up to BMO to decide if it should take a photo and switch models, or just stick with Gemma and generate a chat response. And we can keep building more tools and actions in the same way.
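Here's a minimal sketch of that decision loop, where the prompt asks Gemma for JSON and the code either dispatches to a tool or speaks the answer. The system prompt, tool name, and JSON shape are all my own assumptions, not the video's exact implementation:

    # Sketch: a tiny agent loop where the LLM picks a tool via JSON output.
    import json
    import ollama

    SYSTEM = (
        "You are BMO. Reply ONLY with JSON. "
        'Use {"tool": "take_photo"} if answering requires seeing, '
        'or {"say": "<answer>"} to answer directly.'
    )

    def agent_step(user_prompt: str) -> str:
        context, photo_taken = user_prompt, False
        while True:
            response = ollama.chat(
                model="gemma3:4b",  # assumed text-only Gemma 3 tag
                messages=[{"role": "system", "content": SYSTEM},
                          {"role": "user", "content": context}],
            )
            text = response["message"]["content"]
            try:
                action = json.loads(text)
            except json.JSONDecodeError:
                return text  # model fell back to plain prose
            if action.get("tool") == "take_photo" and not photo_taken:
                photo_taken = True  # run the tool, then loop with its output
                context = (f"{user_prompt}\n\nCamera description: "
                           f"{describe_scene(user_prompt)}")
            else:
                return action.get("say", text)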

Right now, if you ask BMO for the current time or about anything that's not part of its LLM's training data, it will confidently hallucinate a made-up answer. So, to help BMO give more informed, accurate responses, we'll use a technique called retrieval-augmented generation, or RAG. Think of it as giving the AI access to a search engine, so it can look up real-world data and use that to inform what it says.
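Since the search tool later in the video uses DuckDuckGo, the retrieval step might look like this sketch with the duckduckgo_search package (pulling a few result snippets and stuffing them into the prompt; the prompt formatting is my own):

    # Sketch: fetch web snippets and hand them to the LLM as context.
    import ollama
    from duckduckgo_search import DDGS

    def search_and_answer(query: str) -> str:
        results = DDGS().text(query, max_results=5)  # [{title, href, body}, ...]
        snippets = "\n".join(f"- {r['title']}: {r['body']}" for r in results)
        prompt = (f"Using only the search results below, answer: {query}\n\n"
                  f"Search results:\n{snippets}")
        response = ollama.chat(model="gemma3:4b",  # assumed tag
                               messages=[{"role": "user", "content": prompt}])
        return response["message"]["content"]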

>> Time for the night shift.

Let's finally put it all together.

[music]

>> Before I start assembling BMO, there are a few changes I'm going to make to our hardware. After a lot of trial and error, I found a HAT that lets me power the Raspberry Pi 5 and all the peripherals I have using just batteries. At the same time, I'm going to swap out the HAT I'm using for my SSD for this dual M.2 PCIe switch. This gives me an extra port that I can hopefully use for something like an AI accelerator chip.


>> There already is an official Raspberry Pi accelerator, but it doesn't work with LLMs. It's more for computer vision projects. But I'm hopeful we'll see something like this for LLMs in the future. It's me from the future to say, as I was working on this video, Raspberry Pi ended up releasing an accelerator chip that works with local LLMs. Uh, the good news is I got my hands on one. The bad news is it's not going to slot into my M.2 port like I was hoping. But I did find this other accelerator chip that should. So, I think I'm going to make another video comparing both of these chips with the build we have today. Okay, let's get back to the build.

[music]

Here's the PCB I had made. The microcontroller should fit right into the board. Let's assemble the rest of the faceplate.

[music]

>> Now, let's add the rest of the hardware.

[music]

>> Hey, hey, hey.

[music]

>> I will save you a better reality.

>> Let's test our BMO AI agent. Just like the real BMO, I can connect my keyboard to them and navigate to my script.

>> BMO's online.

>> Hey BMO, can you write me a short poem about dinosaurs?

>> Okay.

>> Here you can see BMO quickly decides to just behave as a chatbot and generates a response to my prompt right away.

>> Here's a short poem about dinosaurs. Scales of green, a mighty stride. Long-neck giants side by side. T-Rex roared a fearsome sound. Dinosaurs roared all around. Fossil bones, a wondrous sight. Ancient creatures shining bright.

>> Hey BMO, search for news about Canada.

>> In this case, BMO recognizes that I've asked it to do something that requires using the search tool we made. This takes a bit longer, but in the background, BMO is running a search query using DuckDuckGo and then using what it finds to generate its response.

>> Okay, here's a cute short sentence summarizing the data. Canada's trade with the US could be a big problem for Bent. It would be a disaster.

>> Let's test our camera now. Hey BMO, what do you see?

>> On it.

>> Painting.

>> BMO needs to understand that answering the question, "What do you see?" means it will need to use their camera. This takes a bit of time to figure out, but eventually they take a photo.

>> Checking on this.

>> Now, that photo goes back into the loop. So, the agent again receives the prompt, "What do you see?" But now, it also has the image captured by the camera. It switches from using Gemma to using Moondream and generates a text description of the image. This looping and model swapping takes around a minute to complete, so I'll speed things up for this video.

>> In the image, a person is holding up an orange and white toy mushroom with its eyes closed. The mushroom has a red hat on top of it as well. The scene takes place in front of a wall, which could be part of another room or a specific area within a building. This unique toy mushroom seems to be the main focus of the person's attention.

>> Say something insightful about life in the age of artificial intelligence.

>> Okay.

>> And BMO's buttons are all functional. They send keyboard commands to the Pi directly. I use the enter key as an alternative way to trigger voice capture. And finally, we can of course use these to play video games.

I had a ton of fun building BMO. I hope you really enjoyed following along. Again, I'm not a professional developer, so I can't really say whether this is the right way to approach making your own embodied AI agents. This was really just an opportunity for me to learn more about a technology that we're all being told is going to reshape our lives, and to see whether the benefits of AI can really only be achieved by resource-consuming data centers and stealing intellectual property. Now, I'm not saying local models solve all the problems with AI today. There are a lot of downsides to them. Because they're offline and open, it's really difficult to put guardrails in place and enforce them. So, are we doomed? It's not difficult to imagine the current AI boom is rushing us towards some kind of apocalyptic future. One where a Skynet or HAL 9000 or GLaDOS values growth at all costs more than human lives. But maybe if we take a step back and really learn about these tools we're being sold, maybe we can build something different.

And that's what makes BMO one of my favorite characters. Even when faced with the end of the world, they teach us to be kind, creative, and to take care of one another. BMO was a reminder that we can always be more.

>> That is an interesting response.

Shut down.