The Architect
By unknowntech
Summary
## Key takeaways

- **Hybrid AI Services Essential**: The idea that you're going to pick one AI solution and use it for everything is a fallacy; use multiple services like ElevenLabs for voice, Perplexity for reasoning, Warp terminal, Gemini, Claude, and Grok for their specific strengths, while controlling costs to avoid $300-400 monthly subscriptions. [13:03], [15:09]
- **RAM Prices Tripled**: RAM has tripled in price since October due to the AI boom and DRAM shortage, making DDR5 expensive; he uses DDR4 ECC 64GB sticks at $200-220 each for 512GB in his EPYC systems. [21:43], [24:21]
- **3090s Beat DGX Spark**: Two used 3090s with 48GB of GDDR6 for $1,200 outperform the $4,000 NVIDIA DGX Spark in tokens per second for models that fit, due to superior memory bandwidth, despite the Spark's 128GB unified memory. [29:38], [39:10]
- **Avoid E-Penis Token Brags**: Reddit users brag about tokens per second without use cases, like buying an $8,000 RTX Pro 6000; focus on actual needs, SLMs, and quants instead of the largest models or enterprise licenses few use. [28:35], [31:32]
- **DIY DGX Spark for $2,000**: Built a small form factor AI box with a Minisforum AR900i motherboard, 96GB RAM, an RTX 4000 SFF Ada 24GB, Ubuntu, Ollama, Open WebUI, and NVIDIA NIM for half the DGX Spark cost, running GPT-OSS 20B. [59:48], [01:16:11]
- **Use Case Drives Hardware**: Determine the use case first for home lab budgets; commercial workstations like a 9950X3D with an RTX 5090 32GB handle AI, coding via Continue.dev with local Ollama, and gaming, without separate servers. [31:08], [01:34:30]
Topics Covered
- Full Video
Full Transcript
[intro music]
All right. Good morning, good afternoon, good evening. This is Mike from Unknown Tech, and welcome to the Architect podcast. So, this was an impromptu podcast, and I'm actually glad that I did it, because I'm excited to talk about the next subject. I know it's been a while since I've had a podcast. I've been crazy busy with work, travel, family, life in general, right? But again, around the holiday season I tend to start picking up, doing more content, creating more blog posts, and you can see I'm starting to ramp that up again as we go through the holidays. Before Christmas, or during Christmas week, I probably will do the updated home lab video, and it'll be posted on the YouTube channel, so you can see what the home lab looks like. It hasn't changed much since the last lab. And I don't know if you recall, the last time I did this, I actually went through and said, "Hey, I'm not going to buy anything for the foreseeable future." And pretty much I didn't. I actually just replaced some drives on my NVR because I had to — the drives I was using were failing miserably — and so I bought some Seagate AI drives, but we'll talk about that in a later episode.
Before I start here, I just want to point out that I am streaming to LinkedIn and YouTube as usual. LinkedIn will not have the ability to chat, but on YouTube you will have the ability to chat and leave comments, and I can answer those. There is a delay — about a minute — between my stream and what you're seeing on YouTube. So if you have a question and I don't respond, it's not that I'm ignoring you; there actually is this delay, because I use Restream to stream to multiple platforms. Another thing: if you do want to join the YouTube session versus the LinkedIn session, the link is right on the screen in the white text below. And what I've noticed is that YouTube has a better quality stream than LinkedIn. Of course, YouTube's been doing this for years and they have pretty awesome technology. I'm not saying LinkedIn doesn't have technology, but YouTube has been doing this for years, so you will get a better quality stream on YouTube. If you want to go to that, you have that link right there; I'll leave it up for a few seconds.

Now, before we get into the subject of today's episode, which is really focused on building your own AI for your home lab: we'll get into components, we'll talk about what I use and all the choices that you can make, and then we'll end it with what I use for AI services. And we'll bring it all together right at the end and say, "Okay, well, what do you use this for? Do you use hybrid services?" And the answer is yes, right?
I think this idea that you're going to pick one AI solution and say, "Oh, this is my solution and I'm going to use it for everything" is just a fallacy, right? There are so many different services out there that do things very well. I'll give you an example: voice services from ElevenLabs are phenomenal. I'm not saying you can't build this on-prem or do it in your home lab — you certainly can — but ElevenLabs definitely has a pretty good entry-level plan for demos and testing. And there are other services that I use. I use Warp for my terminal, and we'll talk about the power of Warp in a little bit; it's very powerful, probably the best terminal I've ever used. And then I use Perplexity from an overall service perspective, for general reasoning, asking questions, having it build content, things like that. Perplexity is great; it does phenomenally well. People always ask why I don't use ChatGPT. I have my own personal reasons behind that, and I'm not going to get into it, but the reality is you have other options out there besides ChatGPT that you can leverage that are very good, including Gemini from Google. Gemini 3, I think, just got released from Google and is pretty powerful. Of course, Anthropic has Claude, another phenomenal service. These frontier models I'm talking about all do very, very well. I'm not saying they're identical, but they're kind of close, right? They're not that far away from one another in what each of those models can deliver. And of course, you have Grok from Twitter, too — excuse me, from X. I didn't really talk about that one. But again, if you think about it, if you sign up for all of these, it's almost like your television plan, right? At some point you get to, "Oh, this one does better for this use case," and now I have ten subscriptions and I'm spending $300 to $400 a month on AI services. Again, you have to control cost. Obviously, this is different than work.
Again, I want to separate this episode from a lot of the episodes I've done when it comes to enterprise AI solutions. I've talked about Nutanix Enterprise AI (NAI). I've talked about n8n, right? All these solutions that I use in the enterprise and that we talk about for the enterprise. But I'm talking about what you can use at home today, right? And these aren't replacements for those things, by the way. You're not going to bring Nutanix Enterprise AI, as an example, into your home, and you're not going to use n8n on the full plan, where you're paying an astronomical amount. You can use n8n at home — I do use it at home, and I'll be starting to use it more and more for agentic AI, for automating some things around my home lab. That'll be something I'll be working on over the next couple of months. But what I'm talking about is using them like an enterprise solution, or requiring enterprise-grade capabilities for a home lab. Now, I know we're kind of at the cusp of that, where people want remote access and VPNs and all of these things, but the reality is, the robustness of an enterprise solution is not required in a home lab, for the most part. As much as we'd like to think we have an enterprise solution at home, a lot of us try to provide enterprise foundational features, and that's okay. But again, there's a lot of money in enterprise solutions, right? You pay for that for a reason. And I'm looking at what you can build out in the home lab, what you can do locally on your PC, as an example. And we'll talk about all the options that are out there today to go through that. All right.
Before I start, too, there's one thing I just want to mention: as you have the break for the holiday season, if you're interested in the UAP/UFO topic, Prime has just released a documentary called Age of Disclosure. You have to pay for it — you can rent it or buy it. And this is a side note, but it's probably one of the most remarkable documentaries on UAP, UFO, and alien life that I've ever seen. So if you have an interest in that subject, which obviously I do, it would behoove you to watch Age of Disclosure. I'm not going to give away anything that came out in the video, because there were a lot of surprises. I follow this topic very closely — I'm not deep into it, but I follow it, and generally speaking, there are probably 10 to 20 people who regularly talk about this subject; I follow those folks, so I'm kind of up to date with the information that's come out. I watched the hearings in Congress and those things, but even I was surprised at some of the information that came out of that documentary. It's called Age of Disclosure. So if you have a chance, if you have nothing to do this weekend or you want to watch something — I know TV's been kind of horrible, even though there have been new releases like Landman, etc., and as far as movies are concerned it's been kind of a mixed bag — Age of Disclosure is a little over two hours long.
Very, very good. Okay. So, I'm going to share my desktop here, because I want to get into some of the things I want to talk about today. Let me just make this notepad bigger. Hope you guys can see this. Yeah, that's big enough. So, basically, I align things into a few categories when it comes to home lab AI. You have:

- used enterprise hardware
- purpose-built platforms
- commercial (not enterprise-grade) hardware and small form factor systems, which I focus on a lot, although there is a trade-off, and we'll talk about that in a minute
- and then external AI services, or hybrid AI services

So I think this is probably a pretty good list for where we're going to start. I want to start with used enterprise hardware.
And you can mix and match commercial and enterprise, by the way. So, for example, you can have an Intel Xeon or AMD EPYC platform. I personally have a couple of AMD EPYC platforms: 64 cores, half a terabyte of memory. It's an older generation — the previous generation of EPYC, Milan — so I have to use DDR4. That's one of the downfalls, I would say, for AI, if you're going to use CPU-based inferencing as an example. Although I try to stay away from CPU-based inferencing, for the sheer fact that what I do requires GPU-based solutions. And so in my home lab, I run Nutanix holistically, from an enterprise hardware capability, right? And that's down in a rack. I've shown you guys that rack in previous videos; I'll do an update again this holiday season, around Christmas and New Year's, to show you exactly where I'm at. But essentially it is enterprise grade. It's an ASRock Rack board.
So let's just go and look at what I actually use. I basically went on eBay when I bought this, so let's go to eBay really quick and see what this is. I have the ASRock Rack ROMED8-2T. Let's see what those prices are right now. You know, these are pretty expensive — I didn't pay $600 or $700 for this motherboard. But the reason I bought this motherboard is, obviously, I like EPYC; the more cores the better for me. I have two single-socket servers, two Nutanix servers, and I have populated half a terabyte of memory. It's DDR4-3200. The memory bandwidth isn't great compared to DDR5 platforms, but the problem with DDR5 is it's so expensive, right? RAM right now is atrocious. RAM has tripled in price since October, and there's a shortage of RAM, of course, due to the AI boom. Of course there are other reasons — tariffs and things like that people will talk about — but the reality is there is a flash and DRAM shortage, and that fundamentally ties into RAM prices, which is happening now. Like I said, RAM has pretty much tripled, quadrupled in price since October. It's crazy right now, even for commercial desktop RAM. Now, this is ECC memory, registered memory, and I used 64 GB sticks in this platform. That gives me 512 gigabytes in the platform. But what I love about this platform is the number of PCIe slots that I get, especially with the AMD EPYC 64-core CPU. Now, I didn't buy a retail part. Okay, so again, this is pretty expensive — I think I paid maybe 450 bucks for my motherboards at the time.
So it looks like, right now, it's not really a great time, unfortunately, to be buying hardware — especially RAM — and flash will start to become more expensive. I think if you need flash drives right now, or anything flash, you should buy them now. But essentially what I did was I bought this board, and then I bought an AMD EPYC Milan 64-core engineering sample. Now, you have to be very, very careful about engineering samples. See, these prices are crazy. I want to say... here you go, $950. I spent less than that. And I don't buy from China. That's another thing — and this is my personal preference, don't take this the wrong way — but I try not to buy anything from China. Of course, it's very difficult, because a lot of things are made in China when it comes to technology, but I try to stay away from Chinese brands. Very hard to do, though. You know, I digress. But really, I didn't pay $950 for my EPYC Milan 64-core processor; I want to say I spent about $700. So I think I spent about $1,100 for the CPU and the board. And then the RAM, right? I bought used, of course: DDR4 ECC registered, 64 gigabyte sticks. Let me put the speed in here — 3200. Yeah, you can see how crazy this is. I want to say I spent less than $200 per 64 gig stick, or it was like $220 a stick, right? So you can see the RAM prices are atrocious right now.
So when it comes to enterprise gear, right now I think I would stay away from it — unless you plan on doing CPU inference, and even then you really don't need 64 cores. I use my systems for many other things besides AI, right? I use them to run all the Nutanix capability workloads: demos, databases, everything else. So it's not just AI; it's not purpose-built just for AI. It's purpose-built for Nutanix and the workloads to demo Nutanix and all those things. So it is being used. As a matter of fact, I'm almost out of memory on one of the nodes right now, and that's due to a lot of Kubernetes, a lot of databases, things like that. So if you plan on doing CPU-based inference, people will say to go with the latest EPYC CPUs with DDR5 memory, or you can even go with a Threadripper system. But even Threadripper is expensive. Let's go to Newegg here and look at Threadripper. I looked at Threadripper, too — even the 7000-series Threadripper, which is previous gen, is still expensive. All right, so this is a 32-core for $2,499. I mean, you can get an EPYC Milan 64-core for, you know, 900 bucks, right? An engineering sample, probably less if you keep looking on Newegg. But this is the new Threadripper processor, a 9000-series. Look at the 64-core: 8 grand. Now, if I put this 64-core against my EPYC Milan 64-core CPU, I'll get trounced, right? Though from a pure IPC perspective, there's not that big of a gap. And so that's why I say: why would you spend eight grand on a Threadripper CPU when you can just go get a 64-core EPYC Milan? And people will probably answer, "Oh, well, I need the high clock speed," because you can get very high clock speeds out of these 9000-series CPUs today. But you don't need that, right? You don't need that at all. You don't have to go enterprise gear.
So, another thing I did was I also bought, on eBay, for this EPYC system — and we'll finish this up in a minute and get to the other systems, because I think they're more realistic — an NVIDIA L4 GPU. These are expensive right now; I can tell you I spent, I want to say, $1,700 on mine. So everything's very pricey right now when it comes to GPUs, especially for AI. The L4 has 24 gigs, right? It's a very capable GPU for its size, for what it does, and I'm able to do a lot of things with it. Of course, the 24 gigs is the limiting factor, but you have to compare it against what you're going to get out of it. Another limiting factor is bandwidth, by the way. And you can tell here there's no active cooling on the L4, even though I'm in a server and I have fans blowing air across it, right? I also have a fan lying on top of it and on my add-in card for my NVMe drives. By the way, I just bought a four-port NVMe card, Gen 4, PCIe x16, and I have four 4 TB HP NVMe drives that were on sale — those were, I want to say, $200-something each. Now you can't even get a 2 TB for $200. The prices right now are crazy. So, the L4: do you need it? No, you don't need it. You don't need an L4, and you don't need an RTX Pro 6000. Everybody's like, "I want to get this RTX Pro 6000." And I'm looking at some of these people and what they're doing with it, and I swear — forgive me for saying this — it's like an e-penis, right? It's like they're bragging about their e-penis. That's what we used to call it, when people would come on and say, "I can get so many tokens per second." So what? What are you doing? What are you trying to use it for? What does it matter if you're not talking about a use case? Everybody's on Reddit, on the local AI or LocalLLaMA subreddit, talking about how many tokens per second they're getting, how wonderful it is, all this stuff. Half of them don't have the NVIDIA enterprise licenses to take advantage of MIG. I would say no one, but a lot of people aren't using their solutions like they should be. So why spend all that money? It's eight grand for an RTX Pro 6000 — by the way, I think that's a deal for what you're getting. But for a home lab, $8,000? That's crazy, right?
So, you don't even need to do that. You don't need to go to an L4 either, right? It's an edge AI solution, but it's an enterprise solution — look at it, it's $2,200, $2,300, right? Why do you need to spend that? You don't need it. You can get 3090s, 24 gig, for 600 bucks a piece. Now, the difference is, okay, well, the L4 fits in a smaller form factor. But those are the trade-offs I'm talking about, right? You can get two 3090s for half the cost of that and have 48 gigs of very fast GDDR6 memory, okay? That will provide you very good inference performance, right? And decent model size.
And that's another thing everybody talks about: "I want to run the largest model." Why? Why do you want to run the largest model? Give me a use case for why you want to run the largest, because there are SLMs and quantized models that do fairly well today, right? Fairly well — not as good as the big, unquantized models — but with decent quants on larger models that fit in GPU memory, like 48 gigs or 24 gigs, you can get pretty good performance. Like, I run GPT-OSS 20B on my L4, no problem; it runs great. I run it on my RTX 4000 SFF Ada GPU, which we'll talk about in a little bit, and it does fairly well, right? But it obviously depends on what you're using it for, and what models you need to run.
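Concretely, getting that going with Ollama is about a two-liner — a minimal sketch, assuming the model is published in the Ollama library under the gpt-oss:20b tag (check the library page for the exact name):

```bash
# Pull and chat with the ~20B model (a 4-bit quant fits in 24 GB of VRAM)
ollama pull gpt-oss:20b
ollama run gpt-oss:20b "Give me three uses for a spare 24 GB GPU."

# Ollama also serves a local HTTP API on port 11434
curl http://localhost:11434/api/generate -d '{
  "model": "gpt-oss:20b",
  "prompt": "Why do quantized models fit in less VRAM?",
  "stream": false
}'
```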
And so I think it's really important that you understand your use case. Just like in the enterprise, use case is number one in determining the solution, and it's the same thing in the home lab, right? Because we don't have infinite budgets — in fact, we have very low budgets in the home lab, for the most part — and we're trying to squeeze as much performance and capability out of those budgets as possible. That's what you should be focused on. Not, "Oh, is my e-penis large enough, do I get so many tokens per second," because that doesn't mean anything, right? When people talk about that on Reddit, I just ignore it. It doesn't mean anything to me. Okay. All right. So, that's kind of what I run.
This is some of what you can get. You know, you can look at L40Ss now. I don't know how much they are now... right, $7,000. So they're still up there. Why buy an L40S with 48 gigs of RAM for $7,400 to $7,800 when I can buy an RTX Pro 6000 Blackwell that has 96 gigs of RAM for the same price, or close to it? To me, it's ridiculous that these are even up there at $7,800; they really should be more like $3,000 to $4,000 at this point, since the Blackwell 6000 has been released. And by the way, the Blackwell 6000 comes in three different variants. It comes in enterprise, which doesn't have active cooling on it — looks a little bit like this. It comes in workstation. And it comes in Max-Q, which is a 450-watt variant of the Blackwell 6000 to save on power and cooling, and it does pretty well. Now, I don't have RTX 6000s, so I couldn't tell you the performance differences firsthand, but if you go to, say, Level1Techs on YouTube, Wendell will walk through the difference between the Max-Q and the regular 6000 workstation. There's a difference, obviously, but it's not that big of a difference, and you're saving energy, you're saving electricity. Electricity cost and heat are very important, especially in servers, and especially in smaller servers. So, for example, you could even take a regular 6000 enterprise and limit it to 450 watts. You can set a power limit on it as well. As a matter of fact, some vendors actually do that when they stack many 6000s together: they're trying to lower the wattage, lower the heat dissipation, and they're still getting phenomenal performance. Okay.
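If you want to try that power cap yourself, it's a single nvidia-smi call — a minimal sketch, assuming your card and driver allow the target limit:

```bash
# Check the current and allowed power limits for GPU 0
nvidia-smi -q -d POWER | grep -i 'power limit'

# Enable persistence mode so the setting sticks between processes
sudo nvidia-smi -pm 1

# Cap GPU 0 at 450 W (resets on reboot unless reapplied, e.g. from a systemd unit)
sudo nvidia-smi -i 0 -pl 450
```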
So, yeah — as far as enterprise gear right now, I would hold off on it, unless you can get a phenomenal deal and you're doing GPU inferencing; then you can stick with EPYC, stick with last gen. If you're doing CPU inferencing, I would go Intel all the way. People will probably cringe when I say this, but I would go Intel — newer Intel, with AMX extensions — because then I have the ability to offload some of the inferencing work onto AMX using the AMX extensions on the Intel CPU. So I would personally try to get an Intel CPU that supports AMX if I were going to do a home lab today just for AI. You know, I can brute-force EPYC, right? But it's not as good as the offload capability that Intel has in some of their newer Xeon CPUs.
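Checking whether a given Intel box actually exposes AMX is trivial — a quick sketch:

```bash
# AMX support shows up as amx_tile / amx_bf16 / amx_int8 in the CPU flags
grep -o 'amx_[a-z0-9]*' /proc/cpuinfo | sort -u
```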
Okay. So, that's it for that. As far as connectivity, it's just 10 gig right now. I do have the capability of adding 25 gig — I have two 25 gig ports open and unused on my UniFi Enterprise XG switch. So I could use that, but I wouldn't get any benefit unless I had one server connected and my workstation connected, where I was doing the inferencing against it; or if the application was on another server that had 25 gig; or if the data was being fed remotely from a storage device, a SAN or a network-attached storage device. Honestly, for what I do, 10 gig is fine. I probably will go to 25 gig eventually, but just not right now. Okay.
All right. So, what other things can you do? Well, let's go back to this list. We really covered some of the enterprise gear that I use. Let me just italicize this — the italics mean that we went through it. So now we're going to talk about purpose-built platforms. There are many different purpose-built platforms, and they actually tie into SFF systems as well. But all you hear people talk about today are two platforms — not all, but mostly. So you really have three options here for purpose-built platforms. Now, one is purpose-built; the other two can be used for AI. Let's just call it what it is. Okay.
it is. Okay. The first thing everybody, you know, all the rage right now and everybody has been talking about this is
the Nvidia DGX Spark. Okay, this is a basically a purpose-built small form
factor system that runs the Nvidia GB10 Grace Blackwell uh ASIC. And what it is
is it's a combination of 10 ARM cortex, excuse me, 20 ARM cortex cores. Uh some
are high high frequency, some are low, right? So think of big little design,
right? So think of big little design, right? Around ARM. And it comes with 128
right? Around ARM. And it comes with 128 GB of LPDDR5X unified memory. So you can load uh large
unified memory. So you can load uh large models into this platform. Okay. Comes
with 4 TB NVME. This is the Nvidia version, right? Um, it is a smaller
version, right? Um, it is a smaller NVME, right? It's the the small form
NVME, right? It's the the small form factor NVME. So, you're not getting the
factor NVME. So, you're not getting the full capability uh of NVME because of small form factor. That's one thing I think that they they could have done
better, especially for the price. But
what it also comes with is it comes with 10 gig network, Wi-Fi 7, and 200 gig
connectex networking. All right. Now,
connectex networking. All right. Now,
Now, you're probably thinking to yourself: why the heck would I need 200 gig ConnectX networking in an Arm box with a smaller Blackwell? Because the Blackwell GPU in this system is pretty phenomenal for what it does, especially floating-point performance; it's pretty capable. The issue — the big issue I have with it — is the memory bandwidth. It's horrible. You're using LPDDR5X, while current GPUs are using GDDR7. Even a 3090 — as a matter of fact, even my RTX 4000 SFF Ada — has better memory bandwidth than this system. Now, does it beat this system? No, because this system has twice as many next-gen tensor cores, Blackwell tensor cores, etc. There are other things, obviously, that make it more capable. But if you put a 3090 up against this — yes, I know you're thinking, "Well, the 3090 only has 24 gigs of RAM, this has 128," I get it — but if you take a language model that fits in both and run a benchmark, the 3090 would eat its breakfast on tokens per second, time to first token, etc. And it costs a hell of a lot less, right? Again, I can get a 3090 for 600 bucks. I can get two of them — that gives me 48 gigs of RAM — for 1,200 bucks. And here we've got $4,000 for this. Okay.
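To put rough numbers on why the 3090 wins when the model fits: single-stream decode is memory-bandwidth-bound, so a ceiling on tokens per second is roughly memory bandwidth divided by the bytes read per token (about the size of the loaded weights). A back-of-envelope sketch — the bandwidth figures are approximate spec-sheet numbers, and real throughput lands well below these ceilings:

```bash
model_gb=12   # e.g. a ~20B model at 4-bit quantization

# name:GB/s (approximate published memory bandwidth)
for gpu in "rtx3090:936" "dgx_spark:273"; do
  name=${gpu%%:*}; bw=${gpu##*:}
  echo "$name: ~$(( bw / model_gb )) tokens/sec ceiling"
done
```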
Now, you can buy different models; you don't have to buy the NVIDIA one. The NVIDIA one is four grand — I believe they say because of the 4 TB NVMe — but you can buy the ASUS; Lenovo makes one, MSI makes one, a bunch of them make one. You can see the prices are better, but those only come with 1 TB of NVMe. It's the same chip, same capability, same connectivity — you still get the 200 gig networking, the 10 gig networking, etc. Now, you can combine two of these, and that's really what the 200 gig networking is for: to make a cluster. Or combine three and daisy-chain the ConnectX ports — you'd go one to two, two to three, three back to one, because you have dual 200 gig ports on each one. So you can make a three-node cluster, just daisy-chained, and get extremely high capability when it comes to FP4 performance, inference performance on very large models, some fine-tuning, some training. Now, would I train with this? Would I fine-tune with this?
fine-tuning some training. Now would I train with this? would I fine-tune with this? Um, if I was a developer, yeah, I
this? Um, if I was a developer, yeah, I mean, that's what it's geared for. I
think that's another thing we have to talk about. What is this really geared
talk about. What is this really geared for? Is it geared for the user to use in
for? Is it geared for the user to use in production? No, this is really geared
production? No, this is really geared for the AI developer, right? And what
does that mean? That means testing, validating AI apps against inference, different model size, different model types, fine-tuning models, training models, right? But they don't care, you
models, right? But they don't care, you know, particularly about, you know, massive performance, right? Um, so I think you know this could be an option
if you're an AI developer. This is
probably a very good option for you, right? And I would pick the Nvidia one
right? And I would pick the Nvidia one because of the 4 terabytes of of NVME.
Uh, and plus it's just so cool looking.
I mean, look at this thing. It's gold,
right? And I know that's kind of like off topic, but it's just so cool looking right?
So that's the NVIDIA pre-built system. Now, there are two other systems you can leverage for AI that are not really purpose-built for AI, but can be used for it. The first — everybody's talking about this one too — is the Framework Desktop. This is based on AMD's Strix Halo design, and it has 128 gigs of memory as well. Can you use all 128 gigs? The answer is no, but you can get pretty far up there. Now, there's this big back-and-forth about how good these systems are. If you compare a Framework against a Spark, there are a couple of things I think stand out — and the third option, by the way, which we'll get into, is a Mac. You can buy a Mac M3 Ultra with 512 gigs of memory and great memory bandwidth. The problem with both the Mac and this system is that they don't have a CUDA-like framework. Now, I know I'm going to get bigots coming back on here to say, "Oh, my Strix Halo is awesome. I can use the Vulkan API, I can use the ROCm API or the ROCm framework for AMD." And Mac has its own; I don't really know Mac, but I know it has its own.
There's another YouTube channel — the guy's name is Alex. Hang on, I'll just go to it. I really love his channel; it's great. Alex Ziskind. I think he's phenomenal. He tests all these things, and he gives you this no-BS testing. Now, is that testing for real use cases? I mean, not really — he's really talking about how many tokens per second you can get, time to first token, how big a model you can fit. Those are important depending on what your use case calls for, but for a home lab, is it that important? I don't know. Maybe it is, maybe it isn't. But you can go to his channel and get a lot more information, especially about the Mac. I want to go back to the Framework and the Mac really quick, though.
Okay — the great thing about NVIDIA is it just works. I'm not having to do anything special to get it to work; it just works. And the reason why it just works is CUDA. People can argue with me about this, but the reality is the silicon is phenomenal, and the frameworks on the back end are phenomenal. And NVIDIA knows this. It's the reason why all their stuff is so expensive. It's the reason the H200s and the H100s and the RTX 6000 Pro — and the NVIDIA AI Enterprise license required for those GPUs — are so freaking expensive: because they know they have a leg up on AMD. They certainly have a leg up on Mac — Mac doesn't even have an enterprise solution — and they certainly have a leg up on Intel. We'll talk about Intel in a little bit, but Intel has some capabilities, too.
So what about the Framework? The Framework is cheaper, but not by much. Let's go look and see how much this is. You saw that the NVIDIA Spark, for the 4 TB model, is four grand before tax, shipping, etc. The ASUS version is three grand, but you're only getting 1 TB of NVMe. Now, it doesn't take $1,000 to get to 4 TB; you can probably upgrade the ASUS yourself, save 600 bucks, and spend $3,400 on the ASUS model with 4 TB. That's something I would do personally. The NVIDIA one is nice because of the gold, and I think it's one of those things you'll keep because it's NVIDIA branded. But if you're cost-conscious like me, I would go for a different model that's cheaper, where I can upgrade the NVMe myself and not spend the extra grand NVIDIA is charging — because NVIDIA is charging an extra grand for the gold color, the NVIDIA name, and the 4 TB NVMe, right?
Let's go back to the Strix Halo really quick. All right, so you had option three there, the Mac, which would also be an option for AI; you could get away with either of these for AI. But compared against the Spark, this is pretty phenomenal. What I will say about the Strix Halo is: if you want a system that's small form factor like the DGX Spark, but you won't just use it for AI — you'll use it for other things, like gaming — then this system is great. Because I can game on this. I can do 1440p gaming on current titles with the integrated GPU, the Radeon 8060S; it's phenomenal. That's where the NVIDIA Spark lacks: gaming performance and those things, plus the fact that it's Arm. It lacks interoperability — you can't run Windows, you can't run Windows games; it needs an Arm Linux. It's not x86. The 395 is x86, and I can run all of my applications that are compatible with x86 and still run AI workloads. It is lacking in connectivity, though: it's only 5 GbE. It does have a PCIe x4 slot, I want to say — I don't remember if it's Gen 3 or Gen 4. Let's see the full specs. I don't even think it lists it. This one has two NVMe slots, right? The DGX Spark only has one. And it has 5 GbE. Yeah, it has a PCIe x4 slot, not exposed in the case. You can buy the motherboard by itself without the case; if you use this case, you won't be able to use the x4 slot. Well, you could — you could add, say, 25 gig networking into this with the x4 slot. That's one thing you could do. But it's not 200 gig networking; it's not like the NVIDIA ConnectX back end.
So, that's the other thing about the Framework: you can buy the motherboard and build your own system, or you can buy the whole Framework Desktop — you decide, you can piece it together. You can have Framework provide all the NVMe and so on; I wouldn't do that, because you're going to spend more. Anytime you have custom integrators provide the NVMe, RAM, etc., you pay more for it. You're better off buying it barebones, buying the parts yourself, and installing them. There are other AI Max 395 systems, Ryzen AI Max. One of them is the Beelink GTR9 Pro. I love the look of this; it looks like a Mac. You know, one thing I really love about Macs is their form factors, what they look like. I love them. No one's really been able to copy that, but I've got to say, Beelink came close with this device. Pretty awesome device. Again, it's the 395 CPU, same as the Framework, with 128 GB of RAM. You can upgrade the NVMe drive — this one comes with 2 TB — and it comes with dual 10 gig networking, etc. And you see here: DeepSeek 70B.
You can run DeepSeek 70B on this platform, no problem. But what I'll tell you again is that you're bound by the memory bandwidth as far as performance is concerned. Again, the DGX Spark's memory bandwidth is not great, and the memory bandwidth in this is probably a little worse, since it's not really true unified memory — it's LPDDR5X again, right? Until we get systems in this form factor that can take advantage of GDDR7-class unified memory, that's the constraint. The Mac has phenomenal bandwidth; the problem with the Mac is the back end, the frameworks to support AI. That's the problem. All right.
Okay. So again, we talked about the Mac a little bit. I won't go deep into it, but that's another system you could leverage. And again, you can choose your Strix Halo system — there's another one, GMKtec, I want to say, the GMKtec Strix Halo mini PC. I know it's on here — the search on Amazon is horrible, by the way; you probably already know that. Okay.
So, those were the purpose-built systems. I know I'm going long here, but let's just italicize this to say we reviewed it. That's part of the small form factor systems, and I want to talk about what I've built here internally — but first, commercial-grade hardware. You can do commercial-grade hardware. For example, my system right now is a 9900X; I have a 9950X3D that I need to swap into this system. I sold my 9900X — I actually have to take my system down after this podcast, install the 9950X3D, and ship out the 9900X. But you can see here I have 128 gigs of RAM and an RTX 5090. Now, it's multi-purpose. This is my workstation; I use it for everything: work, development, tons of Kubernetes stuff. I'm always doing AI stuff, whether it's Visual Studio Code locally, LM Studio, or a bunch of different AI tools I'm using. And so I have an RTX 5090. An RTX 5090 right now is around $2,400 or $2,600, and you get 32 gigs of very fast GDDR7. It's extremely fast, and it's 32 gigs — more memory than the 4090. The 4090 was 24, the 3090 was 24, and the 5090 is 32. Okay.
What I would recommend is, if you don't need the fastest — well, I game too, right? I want the fastest frame rates, and I use this machine for everything, so I splurged on this system to be top-of-the-line, basically desktop grade, not enterprise grade: a 9950X3D, 16 cores, 32 threads, an X3D processor, 128 gigs of RAM, an RTX 5090, and about 8 terabytes of very fast NVMe storage — all Samsung, one Gen 5, one Gen 4. Very, very fast. This system just rips through everything I throw at it. What's lacking is the memory size — we talked about this — but I don't need it. Like I said, I'm not trying to run DeepSeek. First of all, I would never run DeepSeek or Qwen or even Kimi K2, I think it's called. I stay away from those. I worry about bias, what's in the code, what's in the model, etc., because it comes from China. I'm just worried about those models, so I don't use them. I try to use models like GPT-OSS, Mistral, Llama — those types of models — locally, and I'll test different models for coding.
Visual Studio Code — we'll talk about some of the tools I use in a minute — but in Visual Studio Code I use a plugin called Continue.dev, and I tie my Ollama model instance into Continue.dev. That way I have my own copilot without paying for GitHub Copilot — I'll show a sketch of that config below. Now, is it as good as GitHub Copilot? No, it's not as good. But it does the job, and I'm not paying for a GitHub Copilot subscription, so I save on that. Although you can say, "Well, you spent all that money on your GPU" — but I'm using the RTX 5090 for all the other things too, for gaming and everything else. So this is the workstation that I've created.
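For anyone who wants to wire up that Continue.dev-to-Ollama hookup, here's roughly what it looks like — a minimal sketch using Continue's JSON config; the model tags are just examples (use whatever you have pulled in Ollama), and check Continue's docs, since the config format has been evolving:

```bash
# ~/.continue/config.json -- point Continue at a local Ollama instance
cat > ~/.continue/config.json <<'EOF'
{
  "models": [
    {
      "title": "Local GPT-OSS 20B",
      "provider": "ollama",
      "model": "gpt-oss:20b",
      "apiBase": "http://localhost:11434"
    }
  ],
  "tabAutocompleteModel": {
    "title": "Local autocomplete",
    "provider": "ollama",
    "model": "starcoder2:3b"
  }
}
EOF
```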
It's a very, very robust workstation, and it works phenomenally well. And talk about multitasking — 128 gigabytes of RAM is phenomenal. It's DDR5 running at 6,000 megatransfers, but it's not GDDR7, and I'm not getting bandwidth like that. But I don't run models on my CPU, and I don't inference on it or any of that. So the memory in my system — well, it's important, but it's not as important. Okay.
And so that's one part of it; it's very, very capable. Okay. Now, if we go back here, there's other hardware you can buy. You can go and buy, you know, a B650 Ryzen system, or an Intel Z690 system. I would stay away from the latest Intel Z890 platform and the 200-series Intel CPUs — they have no hyperthreading, and they actually perform worse than the previous gen. If you're going to buy Intel, I happen to like the previous-gen Z690 and Z790 chipsets and the 12th-gen and 13th-gen CPUs. The 14th-gen high-end, from my understanding, is not great: it actually degrades over time. The 14900K has problems with degrading. As a matter of fact, there are other YouTubers I've watched who have lost CPUs to degradation, so I would stay away from the 14900 series. By the way, Intel is releasing a new CPU on that platform — it's coming, I think it's 12 cores, 24 threads, it's going to clock very high, but it's built really for gaming. So again, you can build a very capable system on a budget, spend less than $1,000, and then go buy yourself a 5060 Ti 16 gig, right? And still be very capable from an AI perspective. Yeah, 16 gigs is small, I know, but for the price and what you're getting, I think it's fairly good. You're getting CUDA capability, you're getting tensor offload, all of that. Even on NVIDIA's desktop cards, you're getting the benefits of the enterprise GPUs on those desktop components, and you're getting very good, very fast memory bandwidth, etc.
Or you can opt to go for an older platform. I know some people who have purchased older Threadripper just for the PCIe lanes, because if you buy a desktop board — an X870 from AMD, or a Z790 — you're going to be bound by limited PCIe lanes. If you want multiple GPUs, you want to be able to support those GPUs and get the full PCIe bandwidth. So what we'll see is people going out and buying older AMD EPYC, like I did, or older Threadripper, getting the PCIe lanes, and buying a couple of used 3090s and a decent, beefy power supply — 1200 watts or higher — as an example.
And what you get out of that, like I think I've talked about before, is a very capable system at a very reasonable price, with the ability to run models that fit in 48 gigs of GPU memory. So you have the ability to do that. Go look for solutions like that. Go on the forums, go on Reddit, buy used — buy from a reputable trader. There's nothing wrong with that. I see people all the time who avoid used hardware; I've been buying used hardware for the last 15, 20 years. Have I had problems? 98%, 99% of the time, it's fine. Do you run into a problem occasionally? Certainly. That's why you use protection — PayPal, or a payment solution that provides you protection, where you can actually file a claim if you need to. I've never had to file a complaint against a trader on PayPal to get my money back. I've had to give bad reviews, but I've never had to file a complaint. So there's nothing wrong with buying used gear, and nothing wrong with going on the forums. Generally speaking, you'll get better prices than eBay on used hardware if you go to, say, Reddit's Hardware Swap or Home Lab Sales, or the Hard Forum for-sale/trade section, and you buy from a reputable trader — people who have references: Heatware, eBay refs, things like that. There's no reason why you can't do that. So you can use commercial hardware, no problem, and still have an AI platform for your needs.
So, I like SFF systems — small form factor systems. What I did was I actually built one; I call it the do-it-yourself DGX Spark. If you go to my website — I have a blog called unknowntech.io — I actually show you the system I built, here on the right. Essentially, I had a Minisforum AR900i motherboard that I was going to use for a NAS about two years, a year and a half ago. It didn't work out, so it was still in the box, with 96 gigs of RAM. So I built it into a 3D-printed case that someone had already designed — and actually, I didn't buy anything; I already had everything. I used an RTX 4000 SFF Ada that I've had for testing for two years, put that in the box, and installed Ubuntu Linux 24.04 on it. I walk through the process of what I did in part one: what I used, and how I basically tried to make it as close to the DGX Spark as possible. Obviously, the big trade-off is the VRAM: I'll only be able to use 24 gigs of memory, while the DGX Spark's is effectively 128 gigs. So the model sizes I can support are not the largest, but I can run pretty good models, reasoning models — GPT-OSS 20B again, etc. And so I installed all of the NVIDIA components on the system.
solution and here it is right here I'm using my Jet KVM to actually connect. I
don't have SSH configuring this yet.
You're probably like, "You don't have SSH, dude." But, um, I haven't
SSH, dude." But, um, I haven't configured that yet. But, I'm in the Spark near Spark now, or excuse me, my do-it-yourself spark, and it runs great.
I mean, it really does. Let's see what I have running here. So, if I do a docker ps, I have Open WebUI, which is my UI to connect to the models. And I have Ollama installed. Right now, I have two models downloaded. And I actually have my NVIDIA NIM NGC Docker login configured, so I can download models for both Ollama and NVIDIA NIM. Okay. So I have my API key configured. I have the NVIDIA Container Toolkit. I have nvitop to monitor the usage of my RTX 4000 SFF Ada in real time, right?
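For reference, that whole stack is off the shelf. A minimal sketch, using the standard commands from the Open WebUI README and the nvitop PyPI package; the port and volume name are the projects' defaults, not necessarily his exact setup:

```bash
# Open WebUI in Docker, talking to an Ollama instance on the same host
# (pattern from the Open WebUI README; the UI comes up on http://localhost:3000)
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  ghcr.io/open-webui/open-webui:main

# nvitop for live GPU monitoring
pip install nvitop
nvitop
```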
And there are still things I have to do in the command line, don't get me wrong. So let's
do this. We're going to start up a model in... um, what's this?
No, I don't want that.
No.
So, we're going to start a NIM model, and we're going to use Open WebUI. I don't know why I didn't have... hang on one second here.
It's easier for me to do this, believe it or not.
Okay.
Oh, maybe that's something I didn't download yet.
All right, just bear with me a second.
So, if I cd to the mount point, this is where my models are located. I actually have a striped LVM for this. Let's cd in. What's in here? ngc... cd hub.
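He doesn't show the exact layout, but a striped LVM volume for model storage looks roughly like this; the device names, stripe size, and mount point are illustrative assumptions:

```bash
# stripe a logical volume across two NVMe drives so model loads hit both disks
sudo pvcreate /dev/nvme0n1 /dev/nvme1n1
sudo vgcreate models-vg /dev/nvme0n1 /dev/nvme1n1
sudo lvcreate -i 2 -I 64 -l 100%FREE -n models-lv models-vg  # -i 2 = stripe across 2 PVs
sudo mkfs.ext4 /dev/models-vg/models-lv
sudo mkdir -p /mnt/models
sudo mount /dev/models-vg/models-lv /mnt/models
```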
Okay, so I have Gemma 2B, I have Llama 3.1 8B Instruct, and I have this Nemotron. Yeah, I like that Nemotron model. That's
that reasoning model. By the way, where you can get these is, as a developer, if you go to build.nvidia.com/models, you can get all the information on the models and how to actually install them and get them going.
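The general launch pattern from NVIDIA's NIM docs looks like the following; the image name is illustrative (pick one from build.nvidia.com/models), and you need an NGC API key first:

```bash
# log in to NVIDIA's registry; the username is the literal string $oauthtoken
export NGC_API_KEY=<your-ngc-api-key>
echo "$NGC_API_KEY" | docker login nvcr.io -u '$oauthtoken' --password-stdin

# run a NIM microservice; it serves an OpenAI-compatible API on port 8000
docker run -it --rm --gpus all \
  -e NGC_API_KEY \
  -v ~/.cache/nim:/opt/nim/.cache \
  -p 8000:8000 \
  nvcr.io/nim/meta/llama-3.1-8b-instruct:latest   # image name illustrative
```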
So, what we'll do here is go back to my JetKVM, and I'm going to run docker run... "unable to find image latest locally"... a Docker error message. [clears throat] That should work. I'm not sure why it's not working. All right, I guess we'll go back and we'll run the Ollama models to make it easy.
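The Ollama side, by comparison, is a couple of commands; the model tags here are illustrative:

```bash
ollama pull gemma:2b        # download a model
ollama run gpt-oss:20b      # load it and start chatting
ollama ps                   # show what's loaded in VRAM
ollama stop gpt-oss:20b     # unload it
```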
Let me see if I can run Gemma first here though.
There we go.
Okay. So, that'll pull. And then what we'll be able to do is go into Open WebUI and configure it to point to that model. Now, you can use Open WebUI against NIM models, which I'll show you how I configured, and you can use it against Ollama models, or really any OpenAI-compatible inference endpoint. So what I did here in the settings, though, I wanted to show you, is I actually created a system prompt. Basically, I wanted to show the model going through its chain-of-thought process, so I provided a system prompt along the lines of: you are a logical system; when answering, show your step-by-step reasoning, break down complex queries, and explain the steps. So basically I'm telling the model to go through its chain of thought and show me how it went through it for whatever I'm asking it to do. So anytime you ask it a question or ask it to create something or do something, it will automatically attach this additional system prompt, right? The context to allow it to go through the chain-of-thought process. The other thing that you
do is you go down here to... I'm sorry, go here: login, admin panel, and up to Settings, right? This gets a little bit confusing. Go to Connections. So you can see here that I've added an OpenAI-compatible endpoint in Open WebUI to my DIY DGX Spark, right? So I'm using port 8000, /v1; this is the NIM model's default inference endpoint, okay? And if I want to use Ollama models, it automatically uses port 11434 on the Docker internal host, because I'm running Ollama on the same host as the Docker configuration. Basically, it just uses the internal host, right? So you have to actually set this up, because it's not in here by default. If you want to use NIM, you actually have to configure this connection here.
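A quick sanity check of both endpoints he's wiring in, assuming the default ports (NIM serves an OpenAI-compatible API on 8000; Ollama's native API listens on 11434):

```bash
# list models on the NIM endpoint (OpenAI-compatible /v1 API)
curl http://localhost:8000/v1/models

# list models Ollama knows about (its native API)
curl http://localhost:11434/api/tags
# note: from inside the Open WebUI container, reach the host's Ollama at
# http://host.docker.internal:11434 rather than localhost
```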
Okay, so let's go back really quick to see where we are. It's
downloading complete, extracting. So
it's taking a little bit to actually get this going.
But essentially, that's how you configure Open WebUI for NIM models and Ollama. Ollama will automatically be configured, but the NIM model will not; you'll actually have to put this in here. You can go out to OpenAI as well, if you have external access, and connect to OpenAI models through the OpenAI API backend. So,
if you want to connect to GPT-4 or GPT-5 or GPT-5 mini or whatever, right? Um,
you're connecting to those models. Now,
we all know that GPT-5 especially is not one model, right? It's basically a router when you think about it, right? Because it's going to point you to the right model for your inference, depending on what you're asking. And I think a lot of people probably don't know that. People say, well, GPT-5 is the best model. It's not just one model; it's a mixture of models, okay, depending on what you're asking of it. But you can actually connect to any OpenAI-compatible API endpoint, including Nutanix Enterprise AI. Now, you won't use this.
You could use this for testing, which I like. One thing I like when we test out the Ollama models is I'll be able to get metrics. For the OpenAI models, I actually have to have OpenTelemetry or one of those frameworks set up, feeding into something like Grafana, to get those metrics out of a third-party system. Within Open WebUI, for Ollama, I'm already getting those metrics. I wish there was a way to configure the OpenAI models to get metrics like the Ollama models, but there isn't. Unfortunately, you actually have to use a third-party tool like OpenTelemetry with Grafana to get those metrics. All right, let's
go back here and see where we're at.
Let's do a Docker.
Let's look at the logs and see if it's ready.
I don't know if it's going to be ready yet. Let's see.
Nope, not ready yet. If it was ready, it would show up in this list.
And it's not ready yet.
All right.
Workspaces.
All right, there it is. So, you see Gemma's up right here.
Tell me a quick funny story.
One thing we can check out is nvitop. You can see here the model's loaded in GPU memory.
You can see the temperature, right? How
many watts is being used?
I don't think it's quite ready yet.
Let's see if it's ready now.
Yeah, it should be ready.
Oh, error in applying chat template. I'm
not sure why I'm getting an error.
Let me just look at... yeah, same IP address.
That's right.
Okay. Okay. So, I'm not sure why that's not working, but what we're going to do here is we're going to stop this.
We're going to stop and remove it.
I have to double check this. This worked
last time. I'm not sure what was going on here.
Okay, that's done now. So, let's fire up Ollama. We'll fire up GPT-OSS. Yeah, so that's already up.
Let's go back to Open WebUI. And you can see here it's green. This is one thing that you get within Open WebUI with Ollama: you'll get a real status, right? So it's green. It's up. It's working.
And let me do a new chat.
We'll delete that.
Tell me a quick story.
Okay, so you can see here, here's the chain of thought I was talking about. Right now, this is running on my RTX 4000 SFF Ada, right? So again, it's a very small form factor system. It sits in that little cubby. I have a JetKVM connected to it. It's only connected to 10 gig, or excuse me, two-and-a-half gig networking. The storage is internal, so I'm not serving models or storing models externally; it's all internal over the PCIe bus, so it's very fast as far as loading. And you can see here it gave me the chain of thought, and then it went through and gave me a story, right? And again, you can go back and look at nvitop and monitor the utilization, the wattage, right? The temperature, how much VRAM is being used, what processes are leveraging the GPU, right? So obviously my display, right, the display here for Xorg, and then of course Ollama has almost 12 and a half gigs loaded into that GPU. So that's one thing you can do here. Now I can do an ollama stop; that'll stop the model, okay, so it's not running anymore. I don't know why my NIM models aren't
working, but essentially that's how I use this, quote unquote, for lack of a better term, do-it-yourself DGX Spark. So, this cost, I want to say, a little close to two grand. Okay, so if you look at the ASUS one, right, with the same amount of NVMe, you're probably going to be about $3,500, $3,600. You look at the DGX Spark from NVIDIA, you're looking at $4,000, right? So roughly, you know, half the cost, right, is what I spent on this. Now, the trade-off obviously is
I'm not getting all the DGX Spark tools that come with it. Like, there's a remote tool that I don't have that you can install. And you're not really getting the DGX operating system; I'm faking that using Ubuntu 24.04 and installing the NVIDIA tools that I can install on it. And by the way, Perplexity helped me kind of build this, right? I basically said, how can I create my own do-it-yourself DGX Spark and get as close to the DGX Spark's capability as possible, right? And it kind of
walked me through what I needed to do.
The problem, of course, is that you only have a certain amount of memory, right? VRAM, right? You're getting 128 gigs with the DGX Spark. You're not getting that with the do-it-yourself DGX Spark that I built. Okay? Again, it
all boils down to what you're using it for. So, one of the other tools that I use, and by the way, you already saw me using Warp, right? You saw me using Warp in that KVM instance. I use Warp all the time, right? What I love about Warp is I can go to agent mode and ask it a question.
Tell me why the nim model didn't work.
Let's see if it can go in and find it.
So this is Warp. I have a subscription to this. Okay.
"What command did you run, or do you have any logs?" Right. So, basically, if I were to run that Docker command again, let's go back here
and we're going to start this again. Okay, now that it's already been downloaded, it's just going to run, right? So, docker ps.
Let's docker logs.
That Nemotron model is actually really good, too. Especially the 9B v2.
Okay. So, it's not there yet.
No, I can't do that.
Let's see what happens here. I want to see if it loads here.
Oh, it says here I'm using a deprecated pynvml package. "Please install nvidia-ml-py instead, and make sure you uninstall pynvml." Oh, and both of them are installed. pynvml takes precedence and causes errors.
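Going purely by that message, the likely fix is to swap the packages, assuming a pip-managed environment:

```bash
# the warning says nvidia-ml-py replaces pynvml, and that having both
# installed causes the conflict -- so remove one and install the other
pip uninstall -y pynvml
pip install nvidia-ml-py
```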
Watch this.
I'm getting this error trying to run an NVIDIA NIM.
Nope.
even though it's let me copy that.
Oh gosh, I have to use paste.
All right, let's see what happens here.
So, this is going to actually tell me how to fix it, right? But I shouldn't do that with the Docker container running. Okay. So, what I'm going to do is I'm going to reject this and go back here and docker logs again. Sorry.
Okay. And I'm just going to go back here. I want to make sure that this is still not working.
Yeah, it's still not working.
Okay. So, what I'm going to do is I'm going to unload that again. So, I'm
going to stop it and remove it.
and I'm going to go here and say the same question here.
Now, this is agent mode, right? So,
basically, it's going to go through and say, "Hey, I'm going to create the command for you," and go through this.
And it's finding some other problems, right?
I like how you're interacting with the agent, right? It's asking me questions.
Anyway, it'll walk through the process of trying to fix it, which I love in Warp. So, if you get caught on something and say, "Hey, I'm getting this error," and you don't know what it is, it'll walk through it. Sometimes it'll log in. It'll do all sorts of things and check, especially on local systems, even remote systems, right? So, I've had remote systems where it says, "Okay, you're SSHed to this. Let me SSH..." blah, blah, blah. It'll do the logs. It'll look at them. It'll monitor them and give you an output of what's happening. And so it's almost like having an SRE at your fingertips, which I love. Okay, so that's one of the tools I use: Warp. There's also full agent mode where you can
actually build code and all sorts of things, which I haven't really done yet.
Um, I still use Visual Studio Code. So
if I go in here to Visual Studio Code, right, I have a plugin called Continue, and under this I have a config file, right? So you have a local assistant, right, for Ollama, right? So if I run Ollama locally... let's do that.
Let me go into my Warp. That's local. So I have GPT-OSS 20B. Let's do Code Llama though. So let's run codellama:latest.
It's going to take a minute to load.
I can go in here and, let me just... I can choose local.
It'll auto detect the model.
All right.
Okay. So, we know that's working.
Let's try this.
I'm going to use codellama:latest.
Okay.
So you can see here that I'm locally hosting the models I want to use on my workstation, right, using Ollama, and I'm tying that back into Visual Studio Code through the Continue extension, and that way I don't have to pay for a GitHub Copilot subscription.
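A minimal sketch of what that Continue config can look like when it points at local Ollama; Continue's config schema has changed across versions, so treat the exact keys as illustrative:

```bash
# write a minimal Continue config that uses a local Ollama model
mkdir -p ~/.continue
cat > ~/.continue/config.yaml <<'EOF'
name: Local Assistant
version: 1.0.0
models:
  - name: Code Llama (local)
    provider: ollama
    model: codellama:latest
    roles: [chat, edit, autocomplete]
EOF
```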
Okay, so again, using the power of what I already have in my workstation to actually do coding and things like that. So that's another tool that I use. So you saw that I use Warp; I use Visual Studio Code with Ollama; and among other third-party tools, I use Perplexity. Let's go here, AI tools.
So I have a subscription here. Generally speaking, I have three or four subscriptions for AI. I have a subscription to Perplexity. This is
pretty much what I use to do anything around reasoning, search, trying to walk through something, right? An architecture. I could use Warp to do some of that stuff too if I wanted, but I really like Perplexity and what it brings. It can give me diagrams, it can build me pictures, it can create content, it can do all sorts of things. And it also has a browser, right? So you've heard, obviously, Google has Gemini, right, in the browser. ChatGPT has a browser now, and now Perplexity has a browser called Comet. So, if you want an AI assistant browser, you can actually install that as well. I have it; I just don't use it. I still use Google. I'm still used to using Google, but generally speaking, I typically use Perplexity most of the time for most things. This is a subscription, right? It's not local.
It's a service. The other service that I use is n8n, at n8n.io. Let's see if I go there. And I actually have an account here. I actually host n8n. This is my AI workflow automation solution that I use. I use n8n locally, but I also use it as a SaaS
solution as well. One of the things I like to do is tie in multiple services. So, for example, I like to tie in... I don't know if I can show that, because I think my account is locked. Yeah, there's no workspace. So, I used to have a workspace here, and I used to have a workflow where I would go to 11 Labs. So, basically creating a customer service assistant, right? Am I logged in here? 11 Labs. Let me see my agents.
So, here I have a Nutanix agent, right? This Nutanix agent basically called n8n when it went to schedule an appointment. So you could call this agent. It had access to the Nutanix website, just the website for now, right? And I could ask it any question where I could get information from the website. And then I would say, "Hey, this sounds great. I'd love to be able to speak to somebody about this." And then the agent would create an appointment with the account team. And the way it did that is it would call n8n and use local models, right?
Instead of going out and using a foundation model or frontier model service, right, like OpenAI or some other model, and having me pay for that API hit, it would use a local workflow to actually schedule the appointment, and then it would use an agent to tie back into my Gmail calendar, etc., and send the user the appointment through an email, right? So, again, just something you could think about for hybrid AI, right? The ability to leverage a service, then tie that back into an on-prem solution and really have a hybrid AI approach, where you're using 11 Labs for the bulk of what you need it for, which is the voice piece, right? And the agent piece, acting like a customer service agent. But the actual scheduling of an appointment, or, you know, the sky's the limit, you can think of all the possibilities, you can pass off to an on-prem solution, where you don't need to pay for a frontier model service, as an example, right? You can do it on-prem. So, the ability to tie those two together really shows a lot of power in what you can do.
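To make that hybrid pattern concrete: one hedged sketch of the handoff, where the voice agent's tool call POSTs to a self-hosted n8n webhook that does the scheduling with local models. The hostname, path, and payload fields below are hypothetical, not from his actual workflow:

```bash
# an 11 Labs agent tool can call out to an n8n Webhook node; the n8n
# workflow then uses a local model plus a calendar node to book the slot
curl -X POST https://n8n.example.com/webhook/schedule-appointment \
  -H 'Content-Type: application/json' \
  -d '{
        "customer": "Jane Doe",
        "email": "jane@example.com",
        "requested_time": "2025-12-01T15:00:00Z",
        "topic": "product demo"
      }'
```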
But these are some of the tools that I use. Like I said, 11 Labs for voice agents; Perplexity for my overall, you know, global AI solution; Warp for my terminal, which has AI-assisted agents built in; Visual Studio Code tying back into Continue for programming locally without having to pay for GitHub Copilot. So, tying this all together, right? What
does it mean? Well, it means that you can build a system, right? Whether it's local on your desktop, which I showed you, or it's in a, you know, kind of preconfigured small lab like I had, the do-it-yourself DGX Spark, for lack of a better term. Or buying a system like a DGX Spark, right? Like going out and actually buying a purpose-built system like a Framework or a DGX system or even a Mac, right? You have all these options and choices to actually leverage for yourselves when it comes to your local AI. So again, I hope this was, you
know, good for you guys to actually look and see what's out there. I know when we looked at the prices for enterprise gear, they're astronomical. Honestly, right now I would hold off on any enterprise gear unless you get a deal on something, or unless you need that PCIe bandwidth I
was talking about. I mean, if I were to build something today, I probably would use a commercial-grade solution, or an older EPYC system or older Threadripper system, you know, motherboard, and maybe a 24-core Threadripper or something like that, and take advantage of all those PCIe lanes, and then buy two 3090s, and now I have 48 gigs to leverage for my models with phenomenal bandwidth. And, you know, I'm only spending so much, right? If you're looking for something small form factor, you can do what I did and build something like I did, or you can go ahead and buy a Framework or a DGX Spark or even a Mac, you know, an M3 Mac Mini. The Mac Mini is very expensive when you get up there in the high memory; it's actually more expensive than the Spark, actually.
So I would probably just buy the DGX Spark before I went M3 Mac Mini. Or, unless you're like me, you have a platform, a workstation, that you want to use for everything, right? Where you don't want to have a separate server or separate home lab environment. I want to use my workstation. I want it to be very flexible in the use cases, right? For my daily work, for play, like gaming, for AI, etc. You have the ability to have one platform to do all of that. And that's why I think
most people don't look at that yet. I still see a lot of people that are like, oh, I have this desktop PC or this workstation, and then I go buy something. You don't have to do that, right? Your home lab could be right in the workstation that you're using. And you can segment environments out, right? On this station, I can run Linux, right, alongside Windows. And I could run Ollama on the system and have it separated a bit from what I'm doing daily, and not have it affect what I'm doing at work.
So, for example, if I'm working during the day and I need to do something with Visual Studio Code, I can run Ollama and Code Llama and tie it into Continue in Visual Studio Code, and still be able to function: do my email, have all my Chrome windows, my 100 Chrome windows, up, whatever I'm doing, PowerPoint presentations or coding or whatever. I'm able to do that on one system, right? And I think honestly there's something to be said for that. You don't have to go out and buy extra things, right? You can leverage what you have. And so think about your use cases.
Think about, you know, the things we went over today and things that you can do. I apologize that the NIM model didn't work on my DIY DGX Spark. I'll have to look into that. It was working fine before I did this, of course. Of course, that's what happens with live demos, right? But you saw Llama working on it, no problem. I'll get the NIM models working again. No problem. Not really worried about it.
But the ability to run both of those, right, in one platform is very powerful.
Um, and again, you don't have to spend a ton of money to get started. So, again,
I hope this was useful for you guys, and I hope you enjoy the rest of your Thanksgiving holiday, for those of you who are celebrating that. And I wish you a great holiday season. Stay
tuned for more content. We'll have more content on the blog and of course upcoming podcasts as well. So, I hope you have a great day today. Thanks
again.
Hey, hey, hey.
Heat. Hey, heat. Hey, heat.