I made a real BMO local AI agent with a Raspberry Pi and Ollama
Channel: brenpoly

>> Perhaps the greatest challenge isn't creating intelligence, but understanding what it means to be truly human.
>> Yay, BMO.
>> BMO from Adventure Time is one of my all-time favorite characters. They're a hilarious little sentient robot with a literal heart of gold. And today, I'm going to build my own BMO as an embodied local AI agent. That means it'll use onboard AI to make its own decisions. It's not just a chatbot, but a thinking, acting machine who wants to play video games.
>> So, let's break down what our BMO needs to do. We'll of course need to be able to play video games and connect controllers to BMO. It looks like BMO has a couple of USB ports for that and for connecting to other devices. But BMO is more than just a game console. They're a whole computer with their own operating system, which should be able to connect to networks and run queries, even though they live in a world where the internet is, well, gone.
>> BMO is camera!
>> Hey, they're also a camera. And they've got to do all this powered by batteries and using local onboard compute.
>> I am incapable of emotion, but you are making me chafed.
>> Let's talk about BMO's parts. We're going to need a small screen and some kind of microphone and speaker for us to talk to them. Looking at BMO's front panel, we'll need two USB ports and a bunch of different buttons, including a D-pad. There's also this long slot and circle, but it's not totally clear what these do. Thankfully, the Adventure Time artbook has some official reference images of BMO's internal components from the episode "Be More." It looks like that slot is used for floppy disks. This circle is still a mystery, but it's safe to say it's not just another button or a status light for the disk drive, since it's kept separate from those components. What's really interesting is that BMO has something called an infinity box, which explains why BMO's hardware seems to change throughout the series. But now we know these aren't continuity errors. BMO is just some kind of mini TARDIS that periodically swaps out its insides. So I think that's more than enough justification for me to say that this little circle over here is going to be my BMO's camera. Also, can I just say I love the little details included by the character designers and storyboarders here. BMO has the heart, medal, and diploma from The Wizard of Oz, and what might be a tiny princess crown. I don't think I'll be able to include these right now, but I'd love to eventually work them in. And there's honestly so much more that BMO can do that we're just going to say is out of scope.
>> It goes in my butt.
>> Oh.
>> What do you think about the stars in the sky?
>> We'll be using a tiny Raspberry Pi computer as the brains of BMO. This runs a full Linux-based operating system along with our own custom Python scripts that we'll use to build our AI agent. This is the Raspberry Pi 5 with 16 GB of RAM. This is the most powerful Pi available at the time of recording this video, and it will be more than enough to load all the local AI models we'll be using into memory. Now, I bought this when it first came out. Prices have gone way up thanks to the spike in RAM demand. You can get away with a cheaper model with less RAM. It all depends on the size of the models you plan on using. For BMO's screen, I've got this 5-inch 800×480 pixel IPS touchscreen display. It has mounts for the Pi right on the back and connects with a display cable.
I also have this official Raspberry Pi camera module. This is the V2 module, so it's a bit old, but I could never really find a good project to use it with. So, into BMO it goes. The Pi has four USB ports. We'll use one for this little USB microphone so BMO can hear us, and another for this set of USB speakers so they can talk back. We're going to extend and split a third port with these adapters to make the two USB ports on the front of BMO's face plate. Now, we could connect all these buttons to the GPIO pins on the Pi, but then I could only use them when I'm running my own scripts. So, instead, I'm going to use this microcontroller to turn the raw button presses into keyboard commands that we can send to the Pi or any other device over USB. I want these buttons to work more like a real game controller that we can use with any program running on the OS.
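The video doesn't show the firmware, but the idea maps cleanly onto a CircuitPython-capable microcontroller acting as a USB HID keyboard. Here's a rough sketch using the adafruit_hid library; the pin assignments and key mappings are hypothetical:

```python
# Minimal sketch of the button firmware, assuming a CircuitPython board
# and the adafruit_hid library. Pins and keycodes here are placeholders.
import time
import board
import digitalio
import usb_hid
from adafruit_hid.keyboard import Keyboard
from adafruit_hid.keycode import Keycode

# Map each physical button's pin to the keycode it should send.
BUTTON_MAP = {
    board.GP0: Keycode.UP_ARROW,     # D-pad up
    board.GP1: Keycode.DOWN_ARROW,   # D-pad down
    board.GP2: Keycode.LEFT_ARROW,   # D-pad left
    board.GP3: Keycode.RIGHT_ARROW,  # D-pad right
    board.GP4: Keycode.ENTER,        # doubles as the voice-record toggle
}

keyboard = Keyboard(usb_hid.devices)
buttons = {}
for pin, keycode in BUTTON_MAP.items():
    btn = digitalio.DigitalInOut(pin)
    btn.direction = digitalio.Direction.INPUT
    btn.pull = digitalio.Pull.UP     # buttons pull the pin to ground when pressed
    buttons[keycode] = btn

while True:
    for keycode, btn in buttons.items():
        if not btn.value:            # pressed (active low)
            keyboard.press(keycode)
        else:
            keyboard.release(keycode)
    time.sleep(0.01)                 # crude debounce
```

Because the board enumerates as a plain USB keyboard, the buttons work in any program on the OS, not just our own scripts.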
>> I am a beautiful big man here.
>> Now that we have our hardware, we can start designing BMO's body. I'll start by measuring the components we have. I'll build out things like the screen and camera as simple 2D shapes so I can start planning out the overall form of BMO's enclosure. I still need to design a custom PCB for the buttons, so I'll first lay out some shapes to figure out where all the little switches need to go. Now, again, I've never designed my own PCB before, so take what I'm going to do with a grain of salt. Actually, I've never done any of this before, so take everything I'm doing with a big grain of salt. I'm using an application called KiCad to design my PCB. I first create a schematic of the board, including the connections with the microcontroller, and then use my 2D mockup to lay out the switches. Then I send the files off to the PCB printing service and wait for them to get delivered. While I wait, I start 3D modeling BMO's body. The PCB printing service provides a 3D model of the board, so I'm going to import that into Blender as my point of reference. From here, I start modeling BMO's buttons in place, using my 2D mocks to make sure I have the scale correct. I'll also create geometry to represent some of the other hardware, like the screen, camera, and USB ports. I'll then make BMO's face plate and include cutouts for the buttons and other components. I'll also use this free plugin to add mounting bolts for the PCB that I can keep in place with nuts later on. Now, I'll build the main enclosure that the face plate will attach to. I make sure there's a place for all the hardware to sit snugly in. Then, I add details like BMO's vents, speaker holes, and letters. I also carve out some holes that I can glue magnets into, and this is what's going to hold the whole body together. I then made some arms and legs with little pegs that will slot in and out of holes in the body. I'm really happy with what I made here, but I want the option to swap these out for other poses in the future. I struggled with what to do with this little hatch on BMO's back. The Raspberry Pi 5 has been really challenging to work with when it comes to power management, so you may have noticed I haven't quite figured out how to make BMO battery powered just yet. So, I came up with this design that I can just slide into a larger opening in BMO's back. And if I want to make something like a battery holder in the future, then I can swap this out rather than having to rebuild the entire enclosure. And with that, it's time to 3D print the enclosure. My 3D printer is pretty small, but thankfully my model just fits on the build plate. I'd be lying if I said all my measurements worked out perfectly, but after a few reprints, I finally had a real body. Next, it was time to sand the 3D print. Then I applied a filler primer and sanded it with a finer grit. Next came painting. As a bit of a happy accident, the spray paint on the enclosure came out with a spackle texture. I think it's because I left the teal paint in my garage and it's freezing where I live right now. It actually reminded me of the texture of my old Super Nintendo, so I decided to keep it. I then added some matte clear coat to protect it and left it all to dry.
>> Oh, are you my grandpoo?
>> Let's start designing our software now. So, first, full disclosure: I'm trying to use as many open-source tools as possible, but I'm not a professional developer. I just like to make things. So, I'm going to plan the overall user flow and architecture myself, but I'm going to be leaning on Gemini to help me write the actual code. We'll lay out the core loop first. BMO is going to wait around in an idle state until we wake them up to start listening to us. We'll record and transcribe the user's voice and send that to our local LLM. When the response is ready, BMO will read it out loud and then go back into the idle state.
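The video doesn't show the script itself, but as a rough sketch in Python, the core loop could be a simple state machine like this. The state names and helper functions are placeholders for the pieces covered below:

```python
from enum import Enum, auto

class State(Enum):
    IDLE = auto()       # waiting for the wake word or record button
    LISTENING = auto()  # recording and transcribing the user's voice
    THINKING = auto()   # waiting on the local LLM
    SPEAKING = auto()   # reading the response out loud

state = State.IDLE
while True:
    if state == State.IDLE:
        if heard_wake_word() or record_button_pressed():
            state = State.LISTENING
    elif state == State.LISTENING:
        prompt = record_and_transcribe()   # Whisper, below
        state = State.THINKING
    elif state == State.THINKING:
        reply = ask_local_llm(prompt)      # Ollama, below
        state = State.SPEAKING
    elif state == State.SPEAKING:
        speak(reply)                       # Piper, below
        state = State.IDLE
```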
We're going to have a couple of ways to wake BMO up. First, I'm going to use one of their buttons to toggle voice recording manually, but I'm also going to train a custom wake word model so I can just use my voice. There are a lot of great paid and free options for training these types of models, but in the spirit of making sure everything I use is open and local, I'm using something called openWakeWord. It can feel a little intimidating at first, but there's a fantastic Colab notebook linked on their GitHub page which walks you through everything you need to do. It takes a while to train, but eventually you get a machine learning model to download.
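Once trained, running the model is fairly simple. A minimal sketch with the openwakeword package, assuming a 16 kHz microphone stream via sounddevice; the model filename and score key are placeholders:

```python
import numpy as np
import sounddevice as sd
from openwakeword.model import Model

# Load our custom "hey BMO" model (filename is a placeholder).
oww = Model(wakeword_models=["hey_bmo.tflite"])

CHUNK = 1280  # openWakeWord expects 80 ms frames at 16 kHz

with sd.InputStream(samplerate=16000, channels=1, dtype="int16") as stream:
    while True:
        frame, _ = stream.read(CHUNK)
        scores = oww.predict(np.squeeze(frame))
        if scores["hey_bmo"] > 0.5:   # confidence threshold
            print("Wake word detected!")
            break
```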
Okay, so BMO will start in the idle state, running the wake word model until it hears us say, "Hey, BMO," or we push the record button. It'll then transition into the listening state, where it needs to record and transcribe our voice into text. For this, we'll be using another open-source model from OpenAI called Whisper. The model comes in different sizes, and we'll be using a smaller version. It's a little less accurate, but it'll run faster on our hardware.
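A minimal sketch of the listening state, assuming the openai-whisper package and a fixed-length recording via sounddevice (the real script presumably stops recording on silence or a button press instead):

```python
import sounddevice as sd
import soundfile as sf
import whisper

stt = whisper.load_model("base")  # a smaller model runs faster on the Pi

def record_and_transcribe(seconds: float = 5.0) -> str:
    """Record from the USB microphone and return the transcribed text."""
    audio = sd.rec(int(seconds * 16000), samplerate=16000,
                   channels=1, dtype="float32")
    sd.wait()                                # block until recording finishes
    sf.write("prompt.wav", audio, 16000)     # hand Whisper a WAV file
    result = stt.transcribe("prompt.wav")
    return result["text"].strip()
```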
Once our speech is transcribed into text, we'll enter a thinking state and hand off that text prompt to a local large language model to generate a response. To do this, we'll be using an open-source tool called Ollama. Ollama will let us run open-weight LLMs locally on the Pi or any other device. You're probably familiar with large language models like Gemini and GPT-5. Those are proprietary models that run in the cloud, powered by huge data centers. The models we'll be using are much smaller, but they have many of the same capabilities and will run locally on our hardware. No internet or expensive subscriptions required. Ollama makes it really easy to download and run these models, so I'm going to test a few out. Since BMO has a camera, I think it'll be cool if I use a model that can work with both text and images as input. I've got 16 gigs of RAM to work with, but I'll still need to keep my model size small. So for the first pass, I'm going to try using Google's smallest multimodal version of Gemma. This is the quantized version of Gemma 3. It's a little less precise, but it doesn't use as much memory. Once we have Ollama installed, we can download any model we want by name. Running it is super simple. We get a text prompt interface in the terminal and we just type in our prompts. We can also pass images to the models by including the file path. So, what we're going to do is take our transcribed speech from Whisper and send that to Ollama. That will be our text prompt to Gemma 3, and we'll wait for a response.
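In Python, that handoff is a single call through the ollama client package. A sketch; the exact model tag is an assumption:

```python
import ollama

def ask_local_llm(prompt: str, image_path: str | None = None) -> str:
    """Send the transcribed prompt (and optionally an image) to the local model."""
    message = {"role": "user", "content": prompt}
    if image_path:
        message["images"] = [image_path]  # multimodal models accept image paths
    response = ollama.chat(model="gemma3:4b", messages=[message])
    return response["message"]["content"]

print(ask_local_llm("Tell me a joke."))
```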
This can take a while, so we'll also include visual and audio feedback so it doesn't feel like BMO is just broken or unresponsive. In fact, we'll do this for all our states. We're going to change BMO's face after each state transition, and we'll also play some voice clips.
The character of BMO is voiced by the amazing Niki Yang, and I think so much of BMO's charm comes directly from her performance. Now, you might think, "Hey, I've seen a ton of AI slop videos where they deepfake celebrity voices. Maybe we can do something like that." But we're not going to go out of our way to directly steal a performer's work. And hey, there are some great free, open-source text-to-speech models that we can run locally on our Pi. For this project, I've decided to use Piper. Now, we can't generate BMO's voice exactly, but we can find an out-of-the-box voice that matches her personality well enough. I've tested out a few, and I really like this one.
>> A rainbow is a meteorological phenomenon that is caused by reflection.
>> So, I wrote a script that generates a bunch of voice clips that BMO will use to give feedback to the user. These will be played randomly so they don't feel like automated canned responses, even though they are.
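That clip-generation script boils down to piping each phrase through the Piper CLI. A rough sketch; the voice model and phrase list are placeholders:

```python
import subprocess

VOICE = "en_US-lessac-medium.onnx"  # placeholder; use whichever Piper voice you picked
PHRASES = ["On it.", "Checking on this.", "Okay.", "Hmm, let me think."]

# Render each canned phrase to a WAV file we can play back instantly later.
for i, phrase in enumerate(PHRASES):
    subprocess.run(
        ["piper", "--model", VOICE, "--output_file", f"clip_{i}.wav"],
        input=phrase.encode("utf-8"),
        check=True,
    )
```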
Now, after the LLM is finished generating a response based on our text or image prompts, we'll want BMO to be able to read that response out loud. For that, we can also use Piper and the same voice model we used for the canned audio clips. But this time, Ollama will stream the response as text, and we'll generate and play voice clips on the fly. Once BMO is finished speaking, it'll transition back to the idle state and go back to waiting for the wake word.
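To sketch the streaming idea: buffer the streamed tokens into sentences, and hand each finished sentence to Piper while the model keeps generating. The sentence splitting here is deliberately naive:

```python
import re
import subprocess
import ollama

VOICE = "en_US-lessac-medium.onnx"  # same placeholder voice as before

def say(sentence: str) -> None:
    """Synthesize one sentence with Piper and play it through ALSA."""
    subprocess.run(["piper", "--model", VOICE, "--output_file", "say.wav"],
                   input=sentence.encode("utf-8"), check=True)
    subprocess.run(["aplay", "say.wav"], check=True)

def speak_streamed(prompt: str) -> None:
    buffer = ""
    for chunk in ollama.chat(model="gemma3:4b",
                             messages=[{"role": "user", "content": prompt}],
                             stream=True):
        buffer += chunk["message"]["content"]
        # Speak each completed sentence as soon as it arrives.
        parts = re.split(r"(?<=[.!?])\s+", buffer)
        for sentence in parts[:-1]:
            say(sentence)
        buffer = parts[-1]
    if buffer.strip():
        say(buffer)  # flush whatever is left at the end of the stream
```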
So, let's put this all together and test out our flow. Hey, BMO, tell me a joke.
>> On it.
>> I think we're going to have to make a few changes to speed things up.
>> Checking on this.
>> A lot of the delay we're seeing is because Ollama is reloading our model every time we enter a prompt. It's like stopping and starting the engine of a car every time we need to brake. So, we'll create an additional warm-up state that keeps the engine running. We'll load all the models into memory at startup and just keep them going. We'll also update our hardware to use an NVMe SSD instead of running everything off of the SD card. This should also help with our loading times.
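Ollama supports this directly through its keep_alive option: a value of -1 tells it to keep a model loaded indefinitely. A sketch of the warm-up state; the model tags are placeholders:

```python
import ollama

# Whichever models BMO is currently using; tags here are placeholders.
MODELS = ["gemma3:4b"]

def warm_up() -> None:
    """Run a throwaway prompt through each model so it stays resident in memory."""
    for model in MODELS:
        ollama.generate(model=model, prompt="hi", keep_alive=-1)  # -1 = never unload
```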
We'll also speed things up by using an even smaller model. Multimodal models are nice because they can perform a lot of tasks in one package, but they're actually a big mix of many models, and that can slow things down for us. So instead, we'll build our own mix of smaller models to do just what we need. I'm going to use a smaller version of Gemma 3 that only deals with text prompts, and then we'll switch to an entirely different model called Moondream for image analysis.
>> Hello, I am ready to play.
Hey BMO, tell me a joke.
>> Okay.
>> Good.
>> Why did the robot cross the road? To get to the other side.
>> This grease monkey is torqued up on automotive science.
>> Yeah, boy. So rather than just sending everything into one model, we need some logic to do the model switching. Now, we could hardcode this. Like, if I say "take a photo," that exact phrase could be our trigger. But that feels a bit restrictive. Instead, we can do something that will transform BMO from a simple chatbot into what's known as an agent. A chatbot just generates text sentences, but an agent can make decisions about what it's being asked and respond not just with sentences, but by performing actions and using tools we give it. So rather than triggering our model handoff using hard-coded phrases, we can have Gemma 3 enter a sort of loop when it receives a prompt. Instead of spitting out text right away, it first attempts to see if there are any tools it can use to inform its response. These tools can come in different shapes and sizes, but they're often just functions living in the agent's code. If the LLM decides to use a tool, it generates text that is readable by the function, like JSON, rather than full sentences. It then takes the output of the tool and goes back to the beginning of the loop: is there anything else I need to do to inform my response? If not, it can then use the context of the output of the tool to generate sentences to say out loud, or perform other actions.
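Here's a rough sketch of that loop: we ask the model to answer either in plain sentences or with a small JSON tool call, then dispatch the tool and feed its result back in. The prompt format and tool names are my own placeholders, not something the video specifies:

```python
import json
import ollama

SYSTEM = (
    "You are BMO. You may call a tool by replying ONLY with JSON, like "
    '{"tool": "take_photo"} or {"tool": "web_search", "query": "..."}. '
    "If no tool is needed, answer in plain sentences."
)

# Tools are just functions living in the agent's code.
TOOLS = {
    "take_photo": lambda call: describe_camera_photo(),    # sketched below
    "web_search": lambda call: search_web(call["query"]),  # sketched below
}

def run_agent(prompt: str) -> str:
    messages = [{"role": "system", "content": SYSTEM},
                {"role": "user", "content": prompt}]
    while True:
        reply = ollama.chat(model="gemma3:1b",
                            messages=messages)["message"]["content"]
        try:
            call = json.loads(reply)           # did the model ask for a tool?
        except json.JSONDecodeError:
            call = None
        if not isinstance(call, dict) or call.get("tool") not in TOOLS:
            return reply                        # plain sentences: we're done
        result = TOOLS[call["tool"]](call)      # run the requested tool...
        messages.append({"role": "assistant", "content": reply})
        messages.append({"role": "user",        # ...and loop its output back in
                         "content": f"Tool result: {result}"})
```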
So, let's apply this to our BMO camera. We'll leave it up to BMO to decide if it should take a photo and switch models, or just stick with Gemma and generate a chat response. And we can keep building more tools and actions in the same way.
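The take_photo tool can pair the Pi camera with Moondream. A sketch assuming the picamera2 library and Ollama's moondream model tag:

```python
import ollama
from picamera2 import Picamera2

def describe_camera_photo() -> str:
    """Capture a photo with the Pi camera module and describe it with Moondream."""
    cam = Picamera2()
    cam.start()
    cam.capture_file("photo.jpg")
    cam.stop()
    cam.close()
    response = ollama.chat(
        model="moondream",
        messages=[{"role": "user",
                   "content": "Describe what you see in this image.",
                   "images": ["photo.jpg"]}],
    )
    return response["message"]["content"]
```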
Right now, if you ask BMO for the current time or about anything that's not part of its LLM's training data, it will confidently hallucinate a made-up answer. So, to help BMO give more informed, accurate responses, we'll use a technique called retrieval-augmented generation, or RAG. Think of it as giving AI access to a search engine, so it can look up real-world data and use that to inform what it says.
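The web_search tool from the sketch above can be a thin wrapper around a search library. Here's one possible version using the duckduckgo_search package, which lines up with the DuckDuckGo queries BMO runs in the demo later; the snippet formatting is my own:

```python
from duckduckgo_search import DDGS

def search_web(query: str, max_results: int = 3) -> str:
    """Run a DuckDuckGo search and return titles and snippets for the LLM."""
    with DDGS() as ddgs:
        hits = list(ddgs.text(query, max_results=max_results))
    return "\n".join(f"{hit['title']}: {hit['body']}" for hit in hits)
```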
>> Time for the night shift. Let's finally put it all together. [music]
>> Before I start assembling BMO, there are a few changes I'm going to make to our hardware. After a lot of trial and error, I found a HAT that lets me power the Raspberry Pi 5 and all the peripherals I have using just batteries. At the same time, I'm going to swap out the HAT I'm using for my SSD for this dual M.2 PCIe switch. This gives me an extra port that I can hopefully use for something like an AI accelerator chip.
>> There already is an official Raspberry Pi accelerator, but it doesn't work with LLMs. It's more for computer vision projects. But I'm hopeful we'll see something like this for LLMs in the future.
>> It's me, from the future, to say that as I was working on this video, Raspberry Pi ended up releasing an accelerator chip that works with local LLMs. Uh, the good news is I got my hands on one. The bad news is it's not going to slot into my M.2 port like I was hoping. But I did find this other accelerator chip that should. So, I think I'm going to make another video comparing both of these chips with the build we have today. Okay, let's get back to the build. [music]
Here's the PCB I had made. The microcontroller should fit right into the board. Let's assemble the rest of the face plate.
[music]
>> Now, let's add the rest of the hardware. [music]
>> Hey, hey, hey. [music]
>> I will save you a better reality.
>> Let's test our BMO AI agent. Just like the real BMO, I can connect my keyboard to them and navigate to my script.
>> BMO's online.
>> Hey BMO, can you write me a short poem about dinosaurs?
>> Okay.
>> Here you can see BMO quickly decides to just behave as a chatbot and generates a response to my prompt right away.
>> Here's a short poem about dinosaurs.
Scales of green, a mighty stride.
Long-neck giants side by side.
T-Rex roared, a fearsome sound.
Dinosaurs roared all around.
Fossil bones, a wondrous sight.
Ancient creatures shining bright.
>> Hey BMO, search for news about Canada.
>> In this case, BMO recognizes that I've asked it to do something that requires using the search tool we made. This takes a bit longer, but in the background, BMO is running a search query using DuckDuckGo and then using what it finds to generate its response.
>> Okay, here's a cute short sentence summarizing the data. Canada's trade with the US could be a big problem for Bent. It would be a disaster.
>> Let's test our camera now.
>> Hey BMO, what do you see?
>> On it.
painting.
>> BMO needs to understand that answering the question, "What do you see?" means they will need to use their camera. This takes a bit of time to figure out, but eventually they take a photo.
>> Checking on this.
>> Now, that photo goes back into the loop. So, the agent again receives the prompt, "What do you see?" But now, it also has the image captured by the camera. It switches from using Gemma to using Moondream and generates a text description of the image. This looping and model swapping takes around a minute to complete, so I'll speed things up for this video.
>> In the image, a person is holding up an orange and white toy mushroom with its eyes closed. The mushroom has a red hat on top of it as well. The scene takes place in front of a wall, which could be part of another room or a specific area within a building. This unique toy mushroom seems to be the main focus of the person's attention.
>> Say something insightful about life in the age of artificial intelligence.
>> Okay.
>> And BMO's buttons are all functional. They send keyboard commands to the Pi directly. I use the enter key as an alternative way to trigger voice capture. And finally, we can of course use these to play video games.
I had a ton of fun building BMO. I hope you really enjoyed following along. Again, I'm not a professional developer, so I can't really say whether this is the right way to approach making your own embodied AI agents. This was really just an opportunity for me to learn more about a technology that we're all being told is going to reshape our lives, and to see whether the benefits of AI can really only be achieved by resource-consuming data centers and stealing intellectual property. Now, I'm not saying local models solve all the problems with AI today. There are a lot of downsides to them. Because they're offline and open, it's really difficult to put guardrails in place and enforce them. So, are we doomed? It's not difficult to imagine the current AI boom is rushing us towards some kind of apocalyptic future. One where a Skynet or HAL 9000 or GLaDOS values growth at all costs more than human lives. But maybe if we take a step back and really learn about these tools we're being sold, maybe we can build something different. And that's what makes BMO one of my favorite characters. Even when faced with the end of the world, they teach us to be kind, creative, and take care of one another. BMO is a reminder that we can always be more.
>> That is an interesting response. Shut down.