Saved transcript

What Is Llama.cpp? The LLM Inference Engine for Local AI

Channel: IBM Technology

Video title: What Is Llama.cpp? The LLM Inference Engine for Local AI
Channel: IBM Technology
Lines: 108
Tool page: Open live transcript tool

0:00

Have you ever wondered how to run a large language model on a small machine like a laptop or a Raspberry Pi?

0:06

I'm talking about AI that has no subscription cost, AI that no usage limits, and you also get full control of your data.

0:14

Well, my friend, stick around because I want to introduce you to a project that's called Llama C++

0:20

and show you how you can run your own local AI models with complete privacy, data control, and benefit from all of this.

0:27

Let's get started.

0:28

So if you think about most large language models, well, they're designed to

0:32

run in huge data centers that are both expensive and power-hungry.

0:35

But let's walk through this.

0:36

Say, for example, that we're making a request to our model and we're gonna say something like,

0:40

hey, based on these documents, I want you to answer this specific question.

0:46

And we're going to add in our document sources here that might be PDFs,

0:50

they might be different types of spreadsheets and file types

0:52

using a method that's known as retrieval augmented generation or RAG.

0:57

Which takes our original question, right, and adds in some context from

1:01

those documents into the context window of the model.

1:05

So this is the context right here.

1:07

And it might not just be documents, right?

1:09

It might be data sources, like a database or something.

1:11

So we can connect those various data sources using a standard that's called model context protocol.

1:19

So this going to fetch out to our different CRMs and sources in order to bring that information into the context window,

1:26

just like how we do with retrieval augmented generation.

1:30

Then from here, as we've got all this information into one prompt,

1:33

we're going to pass that along to our large language model.

1:37

So here he is over here, and this is our LLM that's running, but the thing is, whether we're using RAG or just asking questions,

1:45

doing vibe coding, using model context protocol in order to do perhaps agentic functionality,

1:51

so we have our agent right here doing some reasoning back and forth with the data.

1:56

Well, no matter what we're using at the end of the day or doing to the prompt to

2:00

get our response back, it's always going to go to an LLM endpoint.

2:04

So by default, a lot of developers kind of start off using some type of proprietary model,

2:09

maybe it's an API, and it's really easy to send that request to some type of server,

2:14

as I mentioned before, that's running in the cloud.

2:16

The trick is that can get pretty expensive very fast because you're being charged by how many tokens you use.

2:22

And the more information that you're putting in your prompt as your prompt gets bigger,

2:26

the more that's going to cost you and your organization.

2:29

And at the same time, well, if you have secure and private data that's being sent,

2:33

well, it's also a concern for governance and compliance there.

2:37

So what's happening now, and because of this situation, well, a lot of developers are

2:43

starting to look towards running their own large language models on their own device.

2:49

Because you don't have to worry about cost, because you already have that hardware, or the data leaving your premise.

2:54

And that's the idea behind Llama C++ is to be able to run these local large language models on your own hardware,

3:03

not have to worry about sending things to the cloud here, because we already have

3:07

our own machine that has capabilities to run at these models locally.

3:12

And so what we can do is we can take this entire kind of AI application that

3:17

we're already have set up, and we can start to send those requests to a local model.

3:22

And if you've ever heard of Ollama or Jan or GPT4All, all of these tools are using Llama C++ under the hood to run those LLMs.

3:32

But how does that happen?

3:33

Well, we're gonna talk about a few optimizations that Llama C++ does in order to run them on small amounts of hardware.

3:41

So how does Llama C++ work?

3:43

Well, let's start with the model itself.

3:45

So I want you to think of open models that you've heard of, for example, DeepSeek

3:49

perhaps, or maybe you've heard of the Llama family of models, or maybe Qwen, perhaps, and many, many more that all come

3:57

from typically one place, like Hugging Face or other repositories that have open-source models you can use.

4:06

Now, what Llama C++ does is it uses a specific format called GGUF.

4:11

So it takes the model weights, and here we have the weights, and it also puts together the metadata,

4:18

all together in one place, and stores it as this GGUF file, so GGUF.

4:25

And this makes it really easy to do quick loading and swapping of models.

4:29

So, say for example, I'm starting off with DeepSeek, but I need, say, for example,

4:33

Qwen in order to do some retrieval augmented generation tasks, or maybe the Llama family of models to do something.

4:41

This allows me to easily swap and switch because this is all together in one format.

4:45

Now when models are released, they're typically released at a format or precision that's known as 32-bit or 16-bit.

4:53

So let's say this is 16-bit, and we've got a high accuracy for this model.

4:58

But the thing is, at the same time, this requires a large amount of RAM in order to store this and run this model.

5:05

So the idea with Llama C++ and these GGUF formats is that you can shrink the model down to a lower precision.

5:12

So, instead of 16- bits, we're going to store this as 4-bits.

5:17

So we still have pretty similar capabilities and maybe a high degree of accuracy still,

5:22

but instead of needing that huge amount of RAM, we only need, in this case, 25% of the hardware capabilities to run this model.

5:32

So what's great is that we can use less hardware and we can run these models

5:36

on smaller machines because of this model quantization, and that's what this process is called—

5:40

going from high precision to low precision when we store the weights.

5:45

So, for example, you might see situations for these open models where they're released as,

5:50

for example, DeepSeek, and they're named in this format as well.

5:56

So it would be something similar to quantize, Q for quantize, and then at 4-bit precision.

6:03

And then we're also going to refer to a specific compression algorithm and type.

6:07

So we're going to do underscore K and M.

6:10

And I won't get too into the depths here, but this is just the variant that's tuned for quality

6:14

that you'll typically see when you're searching open-source models online.

6:19

And that refers to the model compression that's being used in order to save perhaps 75%

6:24

on hardware usage and capabilities needed to run the model, but also giving you

6:30

much higher throughput when you are actually running this for various tasks.

6:35

Speaking of speed, I wanna talk a little bit about Llama C++'s optimized kernels for nearly every platform.

6:42

So maybe you're using Mac, so there's support for Metal there, or you have an NVIDIA GPU, so accessibility for CUDA.

6:49

Maybe you have a AMD card, so there is ROCm here, also Vulkan.

6:55

And the thing is, whatever you're running, you have support for all types of models on all types of hardware.

7:02

Of course, also including CPU.

7:04

So that's the beauty of it is that you can swap out different models, compress them, and use them in an optimized way.

7:11

So, what does this look like in action?

7:13

Well, as an engineer, as a developer, if you're just tinkering around, you can try

7:17

out and use Llama C++ either through the terminal using the Llama CLI.

7:23

So the Llama CLI allows you to call in a specific model—and so this will be model.gguf—

7:31

and run that and chat with that using your own terminal or CLI.

7:37

But for a lot of use cases, you're going to want an open AI-compatible local server.

7:41

So in order to do this, we would run the command Llama server, point to the model,

7:46

and we would type in the model here, but also we would do a port assignment.

7:51

So port 8080.

7:53

And on that port, we will be able to send Git and POST requests, be able to connect different extensions.

8:00

And maybe you're using something like LangChain or LangGraph,

8:02

where you have the same compatibility between remote servers, but also local model servers that you might be running as well.

8:09

Now, there's also additional capabilities with Llama C++, specifically for capabilities like being able to work with images,

8:18

be able to describe what's happening in there as well, or as I mentioned before,

8:22

the connection between different services that your AI models might want to use using model context protocol.

8:29

That allows you to bring in databases, your CRM, you name it.

8:33

So, thanks to the open-source community, AI is becoming more accessible than ever,

8:39

from models to compression to kernels and support for developers who want to run their own local AI models,

8:45

knowing that their data stays on their device and there's no rate limits or API outages.

8:51

Thanks so much for watching.

8:52

If you learned something today, please be sure to drop a like and hack the algorithm.

8:57

Also, stay subscribed for more content on AI and open-source technology.

9:01

Thanks and have a great day.