
THIS is the REAL DEAL 🤯 for local LLMs

By Alex Ziskind

Summary

Key takeaways

  • **LM Studio Lacks True Parallelism**: While LM Studio is user-friendly for running local LLMs, it fails to scale with concurrent requests, processing them sequentially despite using llama.cpp. [01:47], [02:41]
  • **Docker Model Runner Enables Parallelism**: Docker Model Runner supports parallel requests and can be deployed with applications via Docker Compose, allowing for true GPU utilization and improved throughput. [03:32], [04:49]
  • **vLLM with Docker for Enhanced GPU Use**: Running vLLM in a Docker container, especially with powerful GPUs like the RTX Pro 6000, unlocks significant performance gains and parallelism beyond standard tools. [06:11], [06:25]
  • **FP8 Quantization for Nvidia Speed**: Nvidia GPUs natively support FP8 quantization, a floating-point format that significantly boosts LLM processing speeds by reducing precision with little quality loss, unlike integer quantization. [08:34], [09:37]
  • **Macs Struggle with LLM Parallelism**: Macs are capable, but they face limitations with LLM parallelism and primarily support GGUF and MLX formats, unlike Nvidia GPUs paired with tools like vLLM. [09:03], [09:11]

Topics Covered

  • Local LLMs can outperform cloud services.
  • LM Studio caps performance at one concurrent request.
  • Docker enables parallel LLM requests for developers.
  • vLLM with Docker unlocks massive LLM parallelism.
  • FP8 quantization is key to extreme LLM speeds.

Full Transcript

Now, this may seem like I'm using Copilot or Claude, but actually all of those are off. Check this out: fix this code. Done. Accept, accept. Write comments for this code. Boom. This is happening in real time, and it is really, really fast. How about chat? What's wrong with this function? There it goes. That is happening in real time. That is insane. How am I doing this? Well, there are actually a few pieces to this puzzle, and I'm going to take you through them in this video.

To start with, I'm not actually running the model on my Mac, even though this Mac is fully capable of running it. The model, by the way, happens to be the Qwen3 Coder 30-billion-parameter model that just recently came out. It's really good for coding and code-autocomplete scenarios because it's a fill-in-the-middle type of model. I go into more detail about that in my member videos. By the way, thanks to the members of the channel for your continued support. Members get extra videos and so on, and you get emojis. Are you a member? Are you using the emojis in the comments? You should try it out.

So, I just ran this benchmark and got 5,800 tokens per second, and it actually ran on this machine, which is a machine I built a few videos ago.

How am I getting such insane speeds? Well, first, LM Studio and Ollama are probably tools you're already familiar with. I've shown you how to set these up before, and they're easy, especially LM Studio, which has a UI where you can control pretty much all the aspects. You can download your models here; here's Qwen3 Coder 30 billion. You can run a local server and use it in your code editor, or you can chat with it right here in the UI. That's what I'm doing, and I'm getting 71 tokens per second from Qwen Coder 30 billion. And there's a big "but" here (I know, maybe some people will say it): LM Studio only supports one concurrent request.

Let me show you. I'm going to set up one concurrent user in my benchmark and run my LM Studio scaling benchmark, which lets me query a specific model. Here I'm querying Qwen Coder 30B against the running instance in LM Studio, and I'm getting 80 tokens per second. Pretty good. But let's say I wanted two concurrent users. Stick with me here for a moment; I know you're probably thinking, "I'm only one user, why would I need two concurrent users?" Hang on. So here I'm doing two concurrent users. LM Studio generates two out of two successfully, and we're getting 79 tokens per second. Pretty close. Four concurrent users? What you're seeing here is that this is not scaling; it's just queuing. You can see it right there: generating, two queued. It's queuing the requests and only processing them one at a time. So there you go: four out of four successful, 2,000 tokens generated, 80 tokens per second, which is about the same. We're not getting any benefit from running concurrent users; it simply can't do it.
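If you want to reproduce this kind of test yourself, here is a minimal sketch of a concurrency benchmark in Python. It assumes an OpenAI-compatible chat endpoint such as the local server LM Studio exposes; the base URL, port, and model identifier below are placeholders for whatever your own server reports, not the exact setup from the video.

```python
# Minimal concurrency benchmark sketch (assumption: an OpenAI-compatible
# /v1/chat/completions endpoint; URL, port, and model id are placeholders).
import time
import requests
from concurrent.futures import ThreadPoolExecutor

BASE_URL = "http://localhost:1234/v1"   # LM Studio's usual default port; adjust to your setup
MODEL = "qwen3-coder-30b"               # placeholder: use the id your server lists
PROMPT = "Write a Python function that reverses a linked list."

def one_request() -> int:
    """Send a single chat completion and return the number of generated tokens."""
    r = requests.post(
        f"{BASE_URL}/chat/completions",
        json={
            "model": MODEL,
            "messages": [{"role": "user", "content": PROMPT}],
            "max_tokens": 256,
        },
        timeout=300,
    )
    r.raise_for_status()
    return r.json()["usage"]["completion_tokens"]

def benchmark(concurrent_users: int) -> None:
    """Fire N requests at once and report aggregate throughput."""
    start = time.time()
    with ThreadPoolExecutor(max_workers=concurrent_users) as pool:
        tokens = sum(pool.map(lambda _: one_request(), range(concurrent_users)))
    elapsed = time.time() - start
    print(f"{concurrent_users} users: {tokens} tokens in {elapsed:.1f}s "
          f"-> {tokens / elapsed:.1f} tok/s aggregate")

if __name__ == "__main__":
    for n in (1, 2, 4):
        benchmark(n)
```

If the aggregate tokens-per-second figure barely moves as you add users, the backend is queuing requests rather than batching them, which is exactly the behavior described above.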

Even though LM Studio uses llama.cpp, the popular library, as its backend, it's not able to run multiple queries in parallel, which holds it back a bit. So here's llama.cpp's llama-bench: after compiling llama.cpp locally, I'm getting about 78 tokens per second for that model. Now, llama.cpp does come with a tool called llama-parallel, so you can test parallelism, but unfortunately LM Studio doesn't support that. Ollama does. And here's another tool that does: Docker. Yes, the same Docker that developers use to develop applications locally so they can keep their environments consistent. So let's run `docker model list` to take a look at what I have here, and then `docker model run ai/gemma3`, and we'll give it a "hi" prompt. I know you love "hi", but this is just to demonstrate quickly what's happening. And there it is, it answered me.

Great. Now, why would I use this tool? A couple of reasons. One, it actually supports parallelism. Two, it can be deployed with your applications, with your Docker Compose applications. Here I've got a Dockerfile where I'm installing my pip requirements, and I have a load test. In my Docker Compose file, I'm including the environment, which has the number of concurrent users; let's start with one, for example. Then I can specify that alongside the services I want to expose in Docker Compose. If you're not familiar with Docker and Docker Compose, there are other great videos, and I can post some resources down below in the description. But basically, this lets you run your applications alongside your models, because when you run a normal Docker container, it cannot utilize the system's GPU.

Now, Docker Model Runner can, and that's why you have a Docker model section: you specify what model you want to run, you specify the context size, and, importantly, the runtime flags. That's what allows the model runner to answer parallel requests, so I can set that to four, for example.

Let's try it. I'm going to observe what's happening on the GPU here, and we are in fact using the GPU; you can see it right there. I'm getting 66 tokens per second for this model. Now I can also set this to four concurrent users, so I'm issuing four requests simultaneously (I'm getting to why that's important for software developers, but I just want to show you this right here). We're now up to 88 tokens per second. So the requests are not being queued like they are in LM Studio; there is some parallel processing going on here. Not that much, but that's where the next technology comes in handy.

There are several layers of technology here. One is parallelism. Parallelism lets us saturate the GPU so it can answer multiple requests concurrently, at the same time. And this is important when it comes to code completion. When you're doing chat, one request at a time is all you can do. But when you're doing code completion, your editor is sending tons of data to your provider, and in this case the provider is the GPU living in this box over here (I'll talk about what's in that box in a moment). The GPU stays saturated, queuing drops, and latency drops as well.
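To picture what that traffic looks like, a completion plugin doesn't send one big chat message; it fires lots of small requests as you type. Here is a rough asyncio sketch of that pattern, assuming any OpenAI-compatible completions endpoint; the URL, model id, and snippets are placeholders of my own, not the setup from the video.

```python
# Rough sketch: a burst of small completion requests, the way an editor's
# autocomplete would issue them. Endpoint, model id, and snippets are placeholders.
import asyncio
import time
import httpx

BASE_URL = "http://localhost:8000/v1"   # placeholder: your local server
MODEL = "qwen3-coder-30b"               # placeholder model id
SNIPPETS = [f"def helper_{i}(x):\n    # complete this function\n" for i in range(16)]

async def complete(client: httpx.AsyncClient, snippet: str) -> float:
    """Send one short completion request and return its latency in seconds."""
    start = time.perf_counter()
    r = await client.post(
        f"{BASE_URL}/completions",
        json={"model": MODEL, "prompt": snippet, "max_tokens": 32},
        timeout=120,
    )
    r.raise_for_status()
    return time.perf_counter() - start

async def main() -> None:
    async with httpx.AsyncClient() as client:
        latencies = await asyncio.gather(*(complete(client, s) for s in SNIPPETS))
    print(f"{len(latencies)} requests, "
          f"median latency {sorted(latencies)[len(latencies) // 2]:.2f}s")

asyncio.run(main())
```

A backend that answers one request at a time serializes that whole burst; one that batches requests keeps the GPU busy and each individual completion comes back sooner.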

Now, there is a tool you might have heard of called vLLM. It's an open-source tool, and I haven't covered it at all yet on the channel, but I'm going to start getting into it, because it's actually a step up. It's a little more to configure, and that's where Docker is going to help us out as well, because Docker lets you easily spin up vLLM with Nvidia support. That's what I'm running here, on the RTX Pro 6000. Now we're seeing the true power of what that card can do. Sure, that card can game, and I've shown that before on the channel, when I first popped it open and started playing with it, but back then I was using LM Studio and tools like Ollama. The Docker setup with vLLM is also portable: you can use it with other Nvidia cards like the 40 series or the 50 series, smaller non-professional cards that are cheaper but still give you that parallelism through vLLM. So here I am, connected over SSH to this machine from my Mac, and I'm spinning up a Docker container, passing in the GPUs, all of them (I only have one in there, but that's what that flag is). I'm passing in my model, which happens to be Qwen Coder 30 billion, and this is the other piece of the puzzle right here: FP8. More on that in a moment. And here's my image; it's a vLLM image, so I don't need to worry about setting all of that up every single time. I just spin up the Docker container and it works.
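Once the container is up, anything that speaks the OpenAI API can talk to it. Here's a minimal client sketch using the openai Python package, assuming vLLM's default port of 8000; the host and model name are placeholders for whatever you passed to the container, not the exact values from the video.

```python
# Minimal client sketch against a vLLM OpenAI-compatible server.
# Host, port, and model name are placeholders for your own setup.
from openai import OpenAI

client = OpenAI(
    base_url="http://my-linux-box:8000/v1",  # placeholder host; vLLM serves on port 8000 by default
    api_key="not-needed-locally",            # vLLM doesn't require a real key unless you configure one
)

response = client.chat.completions.create(
    model="Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8",  # placeholder: whatever model you told vLLM to serve
    messages=[{
        "role": "user",
        "content": "How can I improve this code?\n\nfor i in range(len(xs)): print(xs[i])",
    }],
    max_tokens=200,
)
print(response.choices[0].message.content)
```

A code editor extension can point at this same endpoint the way it would point at a cloud provider, which is what makes the autocomplete demo at the start of the video possible.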

So here, let me show you: how can I improve this code? Boom. There it goes. It received that and instantly provided a response. You might have even heard the little twinkle of coil whine coming from the GPU. That was just one request. What if I issue a bunch of them? And there they are: four requests, four concurrent users, 298 tokens per second. It's calculating that because we've generated a total of 4,16 tokens in just a few seconds. And we've only been up to four so far; let's go to 256 concurrent users. Yeah, I'm not even joking. Look at that: running the benchmark with 256 concurrent users, it's sending 256 requests all at the same time to that GPU. Waiting for the responses. Let's see what happens. Look at that, the throughput is 6,000 tokens per second. This is what's happening on that Linux box right here, by the way. And now the responses are coming in. Look at that, they're all going. We've generated 254,000 tokens at a rate of 5,800 tokens per second. That is just crazy, and we can go even higher.

So, we've talked about Docker, and we've talked about vLLM with Docker. The last thing I want to mention is quantization, and this applies specifically to Nvidia cards. Also AMD, but only the AMD Instinct cards, which are server-grade cards, not the consumer cards. All Nvidia Blackwell cards, though, including the consumer cards, support FP8 quantization natively. They also support FP4 quantization, and I'm going to do a separate video on that; that's even crazier speeds. Let me know in the comments down below if you want to see it. On my Mac, well, Macs are pretty good and pretty fast, but they have an issue with parallelism, and they only support GGUF models, safetensors models, and MLX quantizations, which are Apple's own quantizations optimized for Apple silicon. There's a lot more detail I could go into, and I do in other videos, but this video is just an overview.

Now, FP8 is floating-point quantization. Here we have Qwen3 Coder 30B Instruct, and originally this model is in BF16, which is unquantized. They also have the FP8 version, which is quantized down to 8 bits, but not integer 8; this is floating-point 8. I'll link to this post down below. Basically, this is how Nvidia gets things to run really, really fast, and FP4 is even faster. You have your baseline precision, your weights, for example 16-bit floating-point weights. If you want to convert that to FP8, you have to take all the values in the original weights and convert them down to just eight bits. But whereas integer 8-bit levels are static and spread out evenly across the range, floating point is a little more fluid, giving you the ability to get better precision depending on the data. This is a good read; I'll link to it down below.
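To make that difference concrete, here's a small self-contained illustration (my own toy example, not from the video or any particular library): it quantizes a few weights with a per-tensor INT8 scheme and with a simplified E4M3-style FP8 format. Integer levels are spaced evenly across the whole range, so tiny weights collapse to zero, while floating-point levels are denser near zero and keep more relative precision there.

```python
# Toy illustration (deliberately simplified): uniform INT8 vs an FP8-style
# (E4M3-like) format applied to a handful of weights. Not production code.
import numpy as np

def fp8_levels() -> np.ndarray:
    """All representable magnitudes of a toy E4M3-like format (bias 7, 3 mantissa bits)."""
    levels = [0.0]
    for e in range(1, 16):                      # normal exponents
        for m in range(8):
            levels.append((1 + m / 8) * 2.0 ** (e - 7))
    for m in range(1, 8):                       # subnormals near zero
        levels.append((m / 8) * 2.0 ** (1 - 7))
    levels = np.array(sorted(levels))
    return np.concatenate([-levels[::-1], levels])   # mirror for negative values

def quant_int8(x: np.ndarray) -> np.ndarray:
    """Quantize with one scale for the whole tensor, then dequantize back to float."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127)
    return q * scale

def quant_fp8(x: np.ndarray) -> np.ndarray:
    """Round each value to the nearest representable toy-FP8 value."""
    levels = fp8_levels()
    idx = np.abs(levels[None, :] - x[:, None]).argmin(axis=1)
    return levels[idx]

weights = np.array([12.0, 1.5, 0.2, 0.031, 0.004])
for w, wi, wf in zip(weights, quant_int8(weights), quant_fp8(weights)):
    print(f"{w:>8.4f}  int8 -> {wi:>8.4f} (err {abs(w - wi) / w:6.2%})   "
          f"fp8 -> {wf:>8.4f} (err {abs(w - wf) / w:6.2%})")
```

Real FP8 inference kernels add per-tensor or per-block scaling on top of this, but the basic intuition, evenly spaced integer steps versus magnitude-dependent floating-point steps, is the same.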

Also, check out Julia Turc's channel; I'll link to a video where she talks about floating point 4. She goes really deep into this stuff. The model I'm running right now is the FP8 version, which is giving us those crazy speeds and is supported natively by the tensor cores in that chip. And it is getting kind of warm in here, because, well, this thing produces a lot of heat. So now go check out Docker Model Runner, which is something really approachable and really easy to use, and check out my build of that machine right over here. Thanks for watching, and I'll see you in the next video.
