THIS is the REAL DEAL 🤯 for local LLMs
By Alex Ziskind
Summary
## Key takeaways
- **LM Studio Lacks True Parallelism**: While LM Studio is user-friendly for running local LLMs, it fails to scale with concurrent requests, processing them sequentially despite using Llama CPP. [01:47], [02:41]
- **Docker Model Runner Enables Parallelism**: Docker Model Runner supports parallel requests and can be deployed with applications via Docker Compose, allowing for true GPU utilization and improved throughput. [03:32], [04:49]
- **vLLM with Docker for Enhanced GPU Use**: Using vLLM within a Docker container, especially with powerful GPUs like the RTX Pro 6000, unlocks significant performance gains and parallelism beyond standard tools. [06:11], [06:25]
- **FP8 Quantization for Nvidia Speed**: Nvidia GPUs natively support FP8 quantization, a floating-point format that significantly boosts LLM processing speeds by reducing precision without sacrificing performance, unlike integer quantization. [08:34], [09:37]
- **Macs Struggle with LLM Parallelism**: While Macs are capable, they face limitations with LLM parallelism and primarily support GGUF and MLX formats, unlike the parallel processing capabilities offered by Nvidia GPUs with tools like vLLM. [09:03], [09:11]
Topics Covered
- Local LLMs can outperform cloud services.
- LM Studio caps performance at one concurrent request.
- Docker enables parallel LLM requests for developers.
- vLLM with Docker unlocks massive LLM parallelism.
- FP8 quantization is key to extreme LLM speeds.
Full Transcript
Now, this may seem like I'm using
Copilot or Claude, but actually all
those are off, but check this out. Fix
this code.
Done. Accept, accept. Write comments for
this code. Boom. This is happening real
time. And this is really, really fast.
How about chat? What's wrong with this
function?
There it goes. That is happening in real
time. That is insane. How am I doing
this? Well, there's actually a few
pieces to this puzzle, and I'm going to
take you through it in this video. To
start with, I'm not actually running the
model on my Mac, even though this Mac is
fully capable of running this model,
which by the way happens to be Qwen 3
Coder 30 billion parameter model that
just recently came out. And that's
really good for coding scenarios and
coding autocomplete scenarios because
it's a fill-in-the-middle type of model.
I go into more detail about that in my
member videos. By the way, thanks to the
members of the channel for your
continued support. Members get extra
videos and so on, and you get emojis.
Are you a member? Are you using the
emojis in the comments? You should try
it out. So, I just ran this benchmark
and I got 5,800 tokens per second, and
it actually ran on this machine, which
is a machine I built a few videos ago.
How am I getting such insane speeds from
this? Well, first, LM Studio and Ollama
are probably tools that you're already
familiar with. I've shown you how to
set these up before and they're easy,
especially LM Studio that has a UI where
you can control pretty much all the
aspects. You can download your models
here. Here's Qwen 3 Coder 30 billion.
You can run a local server and you can
actually use that in your code editor or
you can chat with it right in here in
the UI. This is what I'm doing and I'm
getting 71 tokens per second from Qwen
Coder 30 billion. And there's a big butt
here. I don't know. Well, maybe some
people will say it. LM Studio only
supports one concurrent request. Let me
show you here. I'm going to set up one
concurrent request in my benchmark and
I'm going to run the LM Studio
scaling benchmark that allows me to
query against a certain model. Here I'm
querying Qwen Coder 30 and it's using the
running instance in LM Studio and I'm
getting 80 tokens per second. Pretty
good. But let's say I wanted to do two
concurrent users. Stick with me here for
a moment. I know you're probably going
to be like, "Oh, I'm only one user. Why
would I need two concurrent users?" Hang
on. So, here I'm going to do two
concurrent users. We'll see LM Studio is
generating successful two out of two and
we're getting 79 tokens per second.
Pretty close. Four concurrent users.
What are you seeing here? You're seeing
that this is not scaling. This is just
queuing up. And you can see that here
generating, two queued. It's queuing up the
requests one by one and only processing
them one at a time. So there you go.
Four out of four successful. We
generated 2,000 tokens, 80 tokens per
second, which is about the same. So
we're not getting any kind of benefit from running concurrent users; LM Studio just can't parallelize them.
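If you want to try this kind of test yourself, here's a minimal sketch of what a concurrency benchmark against LM Studio's OpenAI-compatible local server could look like. The port is LM Studio's default, but the model id and prompt are placeholders, and this is not the exact benchmark script from the video.

```python
# Minimal sketch (not the video's benchmark): fire N simultaneous
# chat-completion requests at LM Studio's OpenAI-compatible server
# and report the aggregate tokens/sec. Model id and prompt are placeholders.
import json
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

URL = "http://localhost:1234/v1/chat/completions"  # LM Studio's default port
MODEL = "qwen3-coder-30b"                          # placeholder; check /v1/models

def one_request(prompt: str) -> int:
    """Send one completion request and return how many tokens it generated."""
    body = json.dumps({
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }).encode()
    req = urllib.request.Request(URL, data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["usage"]["completion_tokens"]

for users in (1, 2, 4):
    start = time.time()
    with ThreadPoolExecutor(max_workers=users) as pool:
        tokens = sum(pool.map(one_request, ["Explain what a mutex is."] * users))
    rate = tokens / (time.time() - start)
    print(f"{users} concurrent user(s): {tokens} tokens, {rate:.0f} tokens/sec total")
```

If the server is queuing requests one at a time, the total tokens per second barely moves as the user count goes up, which is exactly the behavior being shown here.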
Even though LM Studio uses Llama CPP,
the popular library, as a backend, it's
not able to run multiple queries in
parallel, which holds it back a bit. So
here's Llama CPP's llama-bench. After you compile Llama CPP locally, I'm getting about 78 tokens per second for that model here. Now, Llama CPP does come with a tool called llama-parallel, so you can test the parallelism. But unfortunately, LM Studio doesn't support that. Ollama does. And here's
another tool that does. Docker. Yes,
that same Docker that developers use to
develop applications locally so that
they maintain consistency of their
environment. So, let's say docker model run. Actually, docker model list. Let's take a look at what I have here. Let's do docker model run ai/gemma3. And we'll give it a "hi" prompt. I know you love "hi", but this is just to
demonstrate quickly what's happening
here. And there it is. It answered me.
Great. Now, why would I use this tool?
Well, a couple of reasons. One, it actually supports parallelism. And two, it can
be deployed with your applications, with
your Docker Compose applications. Here
I've got a Dockerfile. I'm installing
my pip requirements. I have a load test.
But in my Docker Compose file, I'm
including the environment, which has the number of concurrent users. Let's
start with one for example. And then I
can specify along with my services that
I want to expose in Docker Compose. If
you're not familiar with Docker and
Docker Compose, there are other great videos, and I can post some resources down below in the
description. But basically, this will
allow you to run your applications
alongside with your models because when
you run a normal Docker container, it
cannot utilize the GPU of the system.
Now, Docker Model Runner is able to. So
that's why you have a docker model
section. Now you specify what model you
want to run. You specify the context
size. And this is important right here.
The runtime flags. These will allow the Docker model to answer parallel requests, so I can set that to four, for example.
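As a rough sketch of what a load test like that might look like, here's a small client that reads the concurrency level from the environment, the way the Compose file passes it in, and fires that many requests at once. The variable names, endpoint, and model id below are assumptions for illustration, not the actual setup from the video.

```python
# Rough sketch of an environment-driven load test. CONCURRENT_USERS,
# OPENAI_BASE_URL, and MODEL are assumed names you would set in the
# Compose file's environment block; the default URL is a guess at a
# Docker Model Runner endpoint, so adjust it to your own setup.
import json
import os
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

USERS = int(os.environ.get("CONCURRENT_USERS", "1"))
BASE_URL = os.environ.get("OPENAI_BASE_URL", "http://localhost:12434/engines/v1")
MODEL = os.environ.get("MODEL", "ai/gemma3")

def ask(prompt: str) -> int:
    """One chat completion; returns the number of generated tokens."""
    body = json.dumps({"model": MODEL,
                       "messages": [{"role": "user", "content": prompt}],
                       "max_tokens": 256}).encode()
    req = urllib.request.Request(f"{BASE_URL}/chat/completions", data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["usage"]["completion_tokens"]

start = time.time()
with ThreadPoolExecutor(max_workers=USERS) as pool:
    total = sum(pool.map(ask, ["Summarize what a hash map is."] * USERS))
print(f"{USERS} users -> {total / (time.time() - start):.0f} tokens/sec aggregate")
```

The Compose file can then just flip CONCURRENT_USERS from one to four and rerun the same service. And let's try this. I'm going to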
observe what's happening on the GPU
here. And we are in fact using the GPU.
You can see it right there. And I'm
getting 66 tokens per second for this
model. Now I can also set this to
concurrent users of four. So now I'm
going to issue four requests
simultaneously. And I'm getting to why
that's important for software
developers. But I just want to show you
this right here. We're now up to 88
tokens per second. So the requests are
not being queued like they are in LM
Studio. There is some parallel
processing going on here. Not that much,
but that's where the next technology
comes in handy. There are several layers of technology here. One is
parallelism. Parallelism allows us to
saturate the GPU to be able to
concurrently answer multiple requests at
the same time. And this is important
when it comes to code completion. When
you're doing chat, one request at a time
is all you can do. But when you're doing
code completion, it's sending tons of
data to your provider. In this case, the
provider is the GPU living in this box
over here. And I'll talk about what's in
that box in a moment. GPU stays
saturated, queuing drops down, and
latency drops down as well. Now, there
is a tool, you might have heard of it,
called vLLM. It's an open-source tool, and I haven't covered this tool at all
yet on the channel, but I'm going to
start getting into it because this is
actually like a step up. It's a little
bit more to configure. And that's where
Docker is going to help us out as well
because Docker will allow you to easily
spin up vLLM with Nvidia support. And
that's what I'm running here. I'm
running the RTX Pro 6000. Now, we're
seeing the true power of what that card
can do. Sure, that card can game. And I've shown that before on the channel, when I first popped this open and started playing with it, but back then I was using LM Studio. I was using tools like Ollama.
But the Docker setup with vLLM can
actually be transported. You can use it
with other Nvidia cards like the 40
series or the 50 series, smaller
non-professional cards that are going to
be cheaper, but still give you that
parallelism by using vLLM. So here I am connected through SSH to this machine from my Mac, and I'm actually spinning up
a Docker container passing in the GPUs,
all of them. I only have one in there,
but that's what that is. I'm sending in
my model, which happens to be Qwen Coder
30 billion. And this is the other piece
of the puzzle right here, FP8. More on
that in a moment. But here's my image.
It's a vLLM image. That way I don't need
to worry about setting that up every
single time. I just spin up the Docker
container and it works. So here, let me
show you how can I improve this code.
Boom. There it goes. It just received
that and instantly provided a response.
You might have even heard the little
twinkle sound from that coil whine
that's happening on the GPU. That was
just one request right there. What if I
issue a bunch of them? And there they
are. Four requests, four concurrent
users. 298 tokens per second. And it's
calculating that because we've generated a total of 4,16 tokens in just a few seconds. And we've been up to four so far. Let's go to 256
concurrent users. Yeah, I'm not even
joking. Look at that. Running benchmark
with 256 concurrent users. It's sending
256 requests all at the same time to
that GPU. Waiting for the response.
Let's see what happens here. Look at
that. The throughput is 6,000 tokens per
second. This is what's happening on that
Linux box right here, by the way. And
now we've started responding. Look at
that. They're all going in. We've generated 254,000 tokens at a rate of 5,800 tokens per second. That is just crazy, and we can go even higher.
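For reference, here's a minimal sketch of how you might throw a large number of concurrent requests at a vLLM OpenAI-compatible server from Python. The port is vLLM's usual default, but the model id and prompt are placeholders, and this is not the benchmark tool used in the video; it needs the openai client package installed.

```python
# Sketch of a high-concurrency client against a vLLM OpenAI-compatible
# server (assumed to be at localhost:8000; model id is a placeholder).
# Requires: pip install openai
import asyncio
import time
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
MODEL = "qwen3-coder-30b-fp8"  # placeholder; use whatever model the server reports
USERS = 256

async def one(i: int) -> int:
    """One chat completion; returns the number of generated tokens."""
    resp = await client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": f"Request {i}: explain binary search."}],
        max_tokens=256,
    )
    return resp.usage.completion_tokens

async def main() -> None:
    start = time.time()
    tokens = sum(await asyncio.gather(*(one(i) for i in range(USERS))))
    rate = tokens / (time.time() - start)
    print(f"{USERS} concurrent users: {tokens} tokens, {rate:.0f} tokens/sec aggregate")

asyncio.run(main())
```

The aggregate throughput climbs because vLLM batches all of those requests together on the GPU instead of queuing them one at a time.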
So, we talked about Docker. We talked about vLLM with Docker.
The last thing I want to mention is
quantization. And this applies
specifically to Nvidia cards. Also to AMD, but only AMD Instinct cards; those are server-grade cards, not the consumer cards. But all Nvidia Blackwell cards, including the consumer cards, will support and do support FP8 quantization natively. They
also support FP4 quantization, and I'm
going to do a video on that separately.
That's even crazier speeds. Let me know
in the comments down below if you want
to see that. But on my Mac, Macs are
pretty good and pretty fast, but they
have an issue with parallelism. And they
support only GGUF models, safetensors models, and MLX quantizations, which are Apple's own quantizations optimized for Apple silicon. There's a
lot more detail I can go into and I do
in other videos, but this video is just
an overview. Now, FP8 is a floating-point quantization. So, here we have Qwen Coder 30B Instruct. And originally this model is in BF16, which is unquantized. They also have the FP8 version, which is quantized down to 8 bits, but not integer 8. This is floating point 8. I'll link to
this post down below. Basically this is
how Nvidia gets things to run really
really fast. And FP4 is even faster. So
you have your baseline precision, your weights, for example the 16-bit floating-point weights. And now if you want to convert that to FP8, you have to take all those values in the original weights and convert them down to just eight bits. But whereas integer 8's levels are static and spread out equally across its 256 values, floating point is a little bit more fluid, giving you the ability to actually get better precision depending on the data. This is a good
read. I'll link to this down below.
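To make that concrete, here's a small, self-contained illustration (my own sketch, not code from that post): it enumerates the positive values FP8 can represent in the common E4M3 layout, which uses 1 sign bit, 4 exponent bits, and 3 mantissa bits with an exponent bias of 7, and shows how they bunch up near zero, while INT8's 256 levels are spaced evenly across the whole range.

```python
# Sketch: enumerate the positive finite values of FP8 in the E4M3 layout
# (1 sign bit, 4 exponent bits, 3 mantissa bits, exponent bias 7) to show
# how they cluster near zero, unlike INT8's evenly spaced 256 levels.

def e4m3_positive_values() -> list[float]:
    vals = set()
    for exp in range(16):            # 4-bit exponent field
        for man in range(8):         # 3-bit mantissa field
            if exp == 0:             # subnormals: no implicit leading 1
                vals.add((man / 8) * 2 ** -6)
            elif exp == 15 and man == 7:
                continue             # this bit pattern is NaN in E4M3
            else:                    # normals: implicit leading 1
                vals.add((1 + man / 8) * 2 ** (exp - 7))
    return sorted(vals)

vals = e4m3_positive_values()
print(f"{len(vals)} positive finite values, max = {vals[-1]}")  # 127 values, max 448.0
print("a few of the smallest:", vals[1:4])    # tiny steps near zero
print("a few of the largest: ", vals[-3:])    # steps of 32 at the top
print(f"{sum(v < 1.0 for v in vals)} of them sit below 1.0")
# INT8, by contrast, spaces its 256 levels identically across the range,
# so small weights get no extra resolution.
```

As the script prints, 56 of the 127 positive values land below 1.0, which is why a floating-point 8-bit format can hold onto more precision for the small values where most model weights tend to sit.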
Also, check out Julia Turc's channel. I'll link to a video where she talks about floating point 4. It goes really deep into this stuff. The model that I'm running right now is the FP8 version, which is giving us those crazy speeds and is supported natively by those tensor cores that are in that chip. And
it is getting kind of warm in here
because, well, this thing produces a lot
of heat. So I'm going to go now check
out Docker Model Runner which is
something really approachable and really
easy to use and check out my build of
that machine right over here. Thanks for
watching and I'll see you in the next
video.