The Truth About AI Agents
By Ben Davis
Summary
## Key takeaways

- **AI Agents Are Simple Loops**: All an agent is is a loop where the LLM decides when to stop: you do an LLM call, run some tool calls, and repeat until it says stop. This is much simpler than most people make it sound. [00:18], [00:25]
- **Hand-Write the While Loop**: Hand-writing the agent-run while loop is what made tool calls click, unlike the AI SDK, which abstracts the loop away. In the example: call the OpenAI API, execute the tools, append the results, and loop until there are no more tool calls. [01:05], [02:43]
- **Too Many Tools Degrade Performance**: Increasing input tokens makes models dumber, as shown in Chroma's needle-in-a-haystack test; overload an agent with 10 tools and performance drops. Fix it by pre-passing essentials like video info instead of exposing them as tool calls. [04:32], [05:49]
- **Use Small, Fast Models**: GPT-OSS via Groq is insanely fast, roughly 750-5,000 TPS, for simple classification; there's no need for slow, overkill GPT-5. Different models excel at different tasks; tune prompts for speed. [07:48], [08:21]
- **Durable Streams Are Essential**: Agents loop for a long time on the server, so use durable streams (e.g., River with Redis) to resume on refresh without losing progress. Wrap the run in waitUntil on serverless so it keeps running after the request ends. [09:21], [11:40]
- **Subtle Agents Beat Chat UX**: Chat is a bad interface for tasks that aren't chatting, like ordering food; agents are LLMs in loops with hidden tools, which makes for a hellish UX when you don't know what those tools are. Integrate them subtly into apps and background jobs for a better experience. [13:33], [14:07]
Topics Covered
- Agents are simple LLM loops
- More tools make models dumber
- Tune agents to one task
- Match small models to simple tasks
- Durable streams enable robust UX
Full Transcript
I feel like "AI agent" has become the new big buzzword that everyone loves throwing around these days. There's a lot of noise on this, especially if you go on Twitter, and a lot of what you'll see is really stupid, but I wanted to talk about it because there really is some incredibly cool, incredibly useful stuff you can do with this. And it's not magic. I think Matt Pocock had a great tweet on this: all an agent is is a loop where the LLM decides when to stop. You do an LLM call, you do some tool calls, you go back. If it says stop, you stop. If it says go do more tool calls, you go do more tool calls. That's all this is. And that's okay. This is still an incredibly useful and cool thing that we now have access to.
Tool calls are awesome. The things this unlocks are awesome. And what I want to go through today is just what I've found going through and actually building these things. I'm going to use this as an example of how tool calls actually work, because I don't think most people really get how these work. This was a repo I made over the summer when I was testing out some weird context-free grammar stuff that GPT-5 introduced. The entire reason I'm bringing it up is because I hand-wrote an agent run instead of using something like the AI SDK. I actually wrote out the while loop myself. So it's a giant Python file. I'm really sorry. I picked Python for this project because the OpenAI documentation was better for Python. I hate this language and would not recommend it for anything, but for this example, it'll do.

You can see here in the main function, we're just doing a bunch of stuff. We're creating our OpenAI client, doing a bunch of random setup. We're setting up our system prompt, which gives the agent instructions on what it's actually supposed to be doing, and then the important stuff starts. Basically, all a tool definition says is: if you output something in this shape, I will give you back the result of this function call. So you're defining a bunch of functions: get the current date and time, add a to-do to the database. Those are the only two I defined in here. And then I have this function for actually calling the to-dos. I have my input list, and then we're doing our first response: we're calling the OpenAI API, passing in all our tools, passing in our inputs.
And then we start the actual while loop. The way this works is we grab the output, go through all the tool calls in that output, check whether each one is a function call, actually call the tool, and then append the result back into our input list. Because the thing with agent runs is that the LLMs obviously can't call the tools themselves. All they can do is take in text and output text. So we're basically just creating a new prompt that includes the tool calls it made in the last step plus their actual results, and then we pass that back in. And we keep running this until there are no more tool calls. When there are no more tool calls, that means we got our final output, and we can just print out our final text. And it's done. That's it. That's all this actually is. It is much simpler than most people are saying it is.
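To make that concrete, here's a minimal sketch of the same loop in TypeScript using the OpenAI SDK's chat completions API. The real project is a Python file; the model name and the get_time / add_todo tools here are just placeholders standing in for the ones described above.

```typescript
import OpenAI from "openai";

const client = new OpenAI();

// Placeholder tool implementations standing in for the real ones.
const toolImpls: Record<string, (args: any) => Promise<string>> = {
  get_time: async () => new Date().toISOString(),
  add_todo: async (args) => `added to-do: ${args.text}`,
};

const tools: OpenAI.Chat.ChatCompletionTool[] = [
  {
    type: "function",
    function: {
      name: "get_time",
      description: "Get the current date and time",
      parameters: { type: "object", properties: {} },
    },
  },
  {
    type: "function",
    function: {
      name: "add_todo",
      description: "Add a to-do to the database",
      parameters: {
        type: "object",
        properties: { text: { type: "string" } },
        required: ["text"],
      },
    },
  },
];

export async function runAgent(userPrompt: string) {
  const messages: OpenAI.Chat.ChatCompletionMessageParam[] = [
    { role: "system", content: "You are a to-do assistant." },
    { role: "user", content: userPrompt },
  ];

  // The whole "agent": call the model, run whatever tools it asked for,
  // append the results, and go again until it stops asking for tools.
  while (true) {
    const res = await client.chat.completions.create({
      model: "gpt-5",
      messages,
      tools,
    });
    const msg = res.choices[0].message;
    messages.push(msg);

    if (!msg.tool_calls?.length) return msg.content; // final answer, loop ends

    for (const call of msg.tool_calls) {
      if (call.type !== "function") continue;
      const result = await toolImpls[call.function.name](
        JSON.parse(call.function.arguments),
      );
      messages.push({ role: "tool", tool_call_id: call.id, content: result });
    }
  }
}
```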
I wanted to start with that example because it's the thing that made how agent runs actually work click for me; before that I'd just been using the AI SDK. And I really like the AI SDK. It's amazing. But the one thing I don't super love is that it kind of abstracts away how the tool calls and steps are really working, because when I look at this input with my TypeScript brain, I'm like: oh, this is just a call to streamText. I give it these tools and I get an output, a function output. It's not super obvious that this stopWhen stepCount of five means we're only going to call the LLM a maximum of five times. And once you understand how all this stuff works, you can do some crazy stuff with it.
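For comparison, here's roughly what that looks like with the AI SDK, assuming AI SDK v5's stopWhen / stepCountIs API; the tool is a made-up placeholder, not the one from my project.

```typescript
import { streamText, stepCountIs, tool } from "ai";
import { openai } from "@ai-sdk/openai";
import { z } from "zod";

const result = streamText({
  model: openai("gpt-5"),
  system: "You are a to-do assistant.",
  prompt: "Add 'record the agents video' to my to-dos.",
  tools: {
    // The SDK runs the tool-call loop for you behind the scenes.
    addTodo: tool({
      description: "Add a to-do to the database",
      inputSchema: z.object({ text: z.string() }),
      execute: async ({ text }) => `added to-do: ${text}`,
    }),
  },
  // The hidden while loop is still there; this just caps it at five LLM calls.
  stopWhen: stepCountIs(5),
});

for await (const chunk of result.textStream) {
  process.stdout.write(chunk);
}
```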
This is an example I made a while ago. If I start this new agent run, you can see it's going to start streaming in. It's using Groq on the back end. You can see all the stuff it's doing on the first step: it reads from memory, gets the video info, gets the top comments. A step break means we're doing another LLM run, doing some more hidden reasoning, and then it gets the video memories and keeps going. It writes a new memory, and then it gives us our final text response. This is actually not a great example of how to use these things well. I'm giving this thing way more tools than it probably should have, but you can see what's actually happening at each step: we're running a bunch of tool calls. That's also how Cursor and Claude Code work. It's just a gigantic loop like this where every single run has a bunch of tools for reading files, writing files.
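As a rough illustration, the read/write tools in a coding agent look something like this (a sketch in the AI SDK's tool() format, not Cursor's or Claude Code's actual definitions):

```typescript
import { tool } from "ai";
import { z } from "zod";
import { readFile, writeFile } from "node:fs/promises";

// The kind of tools a coding agent loops over on every step.
export const fileTools = {
  readFile: tool({
    description: "Read a file from the workspace",
    inputSchema: z.object({ path: z.string() }),
    execute: async ({ path }) => readFile(path, "utf8"),
  }),
  writeFile: tool({
    description: "Write contents to a file in the workspace",
    inputSchema: z.object({ path: z.string(), contents: z.string() }),
    execute: async ({ path, contents }) => {
      await writeFile(path, contents, "utf8");
      return `wrote ${contents.length} characters to ${path}`;
    },
  }),
};
```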
The hardest part of building these things, by far, is not the actual technical work of writing the while loop or calling the AI SDK. It's figuring out what tools to give the model, what prompt to give the model, and how to orchestrate all this stuff together to make something good. Because there are a lot of constraints on how these things actually work. This article from Chroma is really good on how increasing input tokens impacts LLM performance. I'm not going to go through the whole thing here, but basically they did a needle-in-a-haystack test: give the model a bunch of text, ask it a question about that text, and there's one sentence hidden deep within it that holds the answer. And you can see in the results that as we increase the number of tokens we're dumping into these things, the models get dumber. They get worse and worse at finding the actual important piece within the original text.
And that was one of the problems with this original agent run I built, where there are like 10 tools this thing has access to. You can see when you look at all the tool calls, it's doing a read from memory, a get video info, a get top comments, a get other videos' memories, a write to memory, and then it finally does the text output. So that's five tool calls on one agent run, where most of these tool calls probably should have just been things that got passed into it at the very start. If you look at the system prompt for this agent run, I'm literally giving it a workflow of things to do every single time: it needs to check the memory, it needs to get the video info and the top comments, it needs to identify the sponsor from the description, flag priority comments, blah blah blah. Already I'm looking at this like: these two should just be passed into the initial request. It should have those right out of the gate. I shouldn't be degrading the model's performance by giving it more tools and more things to think about, because as you give these things more stuff to play with, they get dumber.
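Concretely, instead of exposing get_video_info and get_top_comments as tools, the data can just be fetched up front and put in the prompt. A hypothetical sketch (the loader functions are stand-ins for wherever that data already lives):

```typescript
import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";

// Placeholder loaders; in a real project this would hit the YouTube API or a database.
async function getVideoInfo(videoId: string): Promise<string> {
  return `title and description for ${videoId}`;
}
async function getTopComments(videoId: string): Promise<string[]> {
  return [`top comments for ${videoId}`];
}

export async function triageComments(videoId: string) {
  // Fetch the essentials before the run instead of making the model call tools for them.
  const [info, comments] = await Promise.all([
    getVideoInfo(videoId),
    getTopComments(videoId),
  ]);

  const { text } = await generateText({
    model: openai("gpt-5-mini"),
    system: "You triage YouTube comments for the creator.",
    prompt: [
      `Video info:\n${info}`,
      `Top comments:\n${comments.join("\n")}`,
      "Identify the sponsor from the description and flag priority comments.",
    ].join("\n\n"),
  });
  return text;
}
```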
That's one of my biggest problems with MCP: I don't like overloading my agents with a bunch of different things they can do, because these things like calling their tools. If you give them tools, they're going to want to play with the tools. I don't want them constantly calling a bunch of MCPs, especially when those MCPs expose like 20 tools. Oftentimes that's just going to massively degrade the quality of your actual output. The way to build these things intelligently, right now at least, is very much tuning them down to doing just one or two very specific things and then chaining those together in a workflow.
And I ended up fixing this in V3 of that R8Y project. Initially I was having it do all of the comment classification in that one gigantic LLM loop. Now I'm having it do the classification on a per-comment basis, where it returns an output schema of: is editing mistake, is sponsor mention, is question, is positive comment. We just pass that into this generateObject call using GPT-OSS (I'll talk about model choice in a second because it's really important), and you can see there are no tools in here. I'm just telling it: hey, here's the sponsor, here are the comments, here's the schema, give me an answer. And I'm doing the same exact thing for getting the sponsor, because the sponsor can be in different places in the description. It's hard to hardcode ripping that out, especially if you're doing this across multiple channels, since they're going to have different video description structures. Most of the time I can basically tell it: hey, here's how you get the sponsor out of this video's description. Then you just pass that into generateObject and you get really good output. That's how I'm doing all the classification in this thing.
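A minimal sketch of that per-comment classification, assuming the AI SDK's generateObject with a Zod schema and the Groq-hosted GPT-OSS model; the field names are inferred from the description above, not copied from the project.

```typescript
import { generateObject } from "ai";
import { groq } from "@ai-sdk/groq";
import { z } from "zod";

// Inferred from the description above; the real project's schema may differ.
const commentSchema = z.object({
  isEditingMistake: z.boolean(),
  isSponsorMention: z.boolean(),
  isQuestion: z.boolean(),
  isPositive: z.boolean(),
});

export async function classifyComment(sponsor: string, comment: string) {
  const { object } = await generateObject({
    // Pinning the model to Groq keeps these runs extremely fast.
    model: groq("openai/gpt-oss-120b"),
    schema: commentSchema,
    system: "Classify a single YouTube comment for the creator.",
    prompt: `Sponsor: ${sponsor}\n\nComment: ${comment}`,
  });
  return object;
}
```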
And I'm doing it with this super, super fast model. You can see in the work I've been doing, I've been playing around with a lot of different models for these internal agent runs. I've been using Grok 4 Fast, GPT-OSS, GPT-5 Mini, Claude Sonnet 4, and a bunch of others. But the big thing I've ended up finding is that different models are very good at different things. GPT-5 is the big, super smart model that will run for a very long time and will probably give you the right answer. It will try very hard to give you the right answer, and it is overall the smartest model, but it's really slow and it's really overkill for a lot of these problems. GPT-OSS 120B is not actually that smart, but it's insanely fast, and basic sentiment analysis on a comment is not a hard problem. The way I set this up is I'm having GPT-OSS do the classification and limiting the provider to only be Groq, which means the actual run is insanely fast. If you look at their stats on OpenRouter, the throughput is like 750 TPS, and I've seen it go all the way up to like 5,000 TPS. This will basically just run instantly and get you the right answer, if you dial in the actual prompts and inputs and tools you're giving these things.
You don't have to always use the biggest, best, smartest model. And this is especially true in AI coding, where I've almost entirely been using Composer 1 lately for my main day-to-day code work, because I give it such specific instructions on what I want it to do. I don't need to bring out something like GPT-5, which will sit there and think and loop over and over again to try and get the perfect answer. I just need it to do the actual thing, and these smaller, faster models are really nice for that. And when you do need to break out more intelligence, when you need it to think a little harder, Sonnet 4.5 is still great. I've been really happy bringing out plan mode with 4.5, where it basically decides what it actually wants to do, and then having Composer 1 go through and implement that. It's a great flow.
And the last thing I want to talk about here is the actual user experience of building out these agents. One of the most important things, I really think, is having durable streams, because these agents are going to run for a very long time on the server. It's just how they work. They have to loop a lot. They take a while to think. You don't want to lose all the progress in that generation if the user refreshes the page, navigates somewhere else, or wants to open it up in another tab and have it also be streaming there. You want your streams computed somewhere else. I have another video that goes deeper on that, but I wanted to briefly show off how it works here, because I've been building a lot of tooling around this lately.
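The underlying pattern, independent of any particular library, is roughly: the agent run writes its chunks to a store like Redis instead of only to the HTTP response, and clients replay from that store whenever they connect or reconnect. A simplified sketch of the idea using ioredis (this is not River's actual API):

```typescript
import Redis from "ioredis";

const redis = new Redis();

// Producer side: the agent run appends every chunk to a Redis list keyed by
// stream id, so the generation survives even if the original request goes away.
export async function appendChunk(streamId: string, chunk: unknown) {
  const raw = JSON.stringify(chunk);
  await redis.rpush(`stream:${streamId}`, raw);
  await redis.publish(`stream:${streamId}`, raw); // notify live listeners
}

// Consumer side: on mount (or refresh), replay everything stored so far,
// then keep listening for new chunks. A real implementation would also
// handle the gap between the replay and the subscription.
export async function resumeStream(
  streamId: string,
  onChunk: (chunk: unknown) => void,
) {
  const stored = await redis.lrange(`stream:${streamId}`, 0, -1);
  for (const raw of stored) onChunk(JSON.parse(raw));

  const subscriber = new Redis();
  await subscriber.subscribe(`stream:${streamId}`);
  subscriber.on("message", (_channel, raw) => onChunk(JSON.parse(raw)));
}
```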
I just finished a new version of River. It's an internal package I've been working on that I ended up open sourcing and releasing, and it's stable in the SvelteKit world. I built out this demo where you can ask it a question. This is the imposter agent I showed earlier. Is the Earth really flat? A good question. We can ask it, it starts running, and as soon as it starts running we get the tool call result of "you are an imposter." So then: yes, the Earth is flat. And if we refresh, we still have that response, because we have this resume key that links to that actual agent run. I can clear it out, ask again, and keep refreshing; you'll see it just keeps popping out. The stream is durable. It's not happening in the actual request. It's happening in the background.
And the code for setting this up in something like River is insanely easy because of the way I structured it. I define the unreliable agent. I create the River stream. I pass in the chunk type to get full type safety on the client. I give it the SvelteKit adapter request type, because this is running in a SvelteKit project; I want to bring this to TanStack Start and React Native plus Expo, and I'll be working on that very soon. I pass in the input schema for what you need to pass in from the client. I give it a provider; in this case it's Redis, which means the stream itself is stored in Redis as well as being streamed down to the client, so we can hit it over and over again and use Redis as the source of truth. I have to pass in a storage ID for what the stream is actually called, the Redis client to interface with Redis, and then this waitUntil, which is mostly for serverless. Locally, or on a long-running server like Railway, you don't really need it, because the process will just keep running in the background even after the request terminates. But on Cloudflare or on Vercel, as soon as that request gets aborted or terminated, the stream will stop, which means we no longer have our agent run happening in the background. Under the covers, I'm wrapping all the logic in that waitUntil so it will keep running in the background regardless of what happens on the client, which is really, really nice.
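For reference, waitUntil is the platform hook for letting work outlive the response: on Vercel it comes from @vercel/functions, and on Cloudflare Workers it's ctx.waitUntil on the execution context. Here's a hypothetical sketch of a Vercel-style route handler; runAgentStream is a made-up stand-in for the function that loops the agent and writes chunks to Redis.

```typescript
import { waitUntil } from "@vercel/functions";

// Hypothetical stand-in for the function that runs the agent loop
// and appends its chunks to Redis.
async function runAgentStream(streamId: string): Promise<void> {
  console.log(`running agent for stream ${streamId}`);
}

export async function POST(request: Request) {
  const { streamId } = await request.json();

  // Return immediately; waitUntil keeps the serverless function alive until
  // the agent run finishes, even though the response has already been sent.
  waitUntil(runAgentStream(streamId));

  return Response.json({ streamId });
}
```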
Then finally, we just have our runner down here, where we grab our unreliable agent stream from that function above, go through all the chunks, and append those chunks to the actual stream with full type safety on what's in there. Then on the client, we can consume it with this very tRPC-like interface. It was heavily inspired by that, because I just love that way of working with things. Every time a chunk comes down, I get it in onChunk with full type safety, which means that whenever we get a text delta, I can update the answer, and whenever we get a tool result, I know that if it wasn't a dynamic tool call, it was the isImposter tool call. So I have full type safety on whether or not it is an imposter; I can set that boolean on the client and it just kind of works. You have your other lifecycle hooks too. And the really nice thing for the resuming stuff is onStreamInfo: whenever we start the stream, this fires with an encoded resumption token. You can save that to the database or into the URL, like we are in this demo, so that anytime we mount, if we have the resume key, we just resume the stream and it flushes everything back down in exactly the same way.
I really, really like this way of working with agents, and I want more AI apps to adopt it. The entire reason I made this project: I don't care about making money on it or anything like that. I just want to use these things in more of the stuff I'm building, and this is the way I want to use them. If someone wants to steal this, hell yeah. Have fun. It's MIT licensed. I could not care less.

Hopefully this all makes sense. I really wanted to make this video to just go over how these things actually work. I think they're incredibly useful and there's a lot of really sick stuff we can do with them. I'm definitely not in the camp where I want everything to become a chat app. I think chat is actually a pretty bad experience for most things other than direct chatting. Take ordering a meal or whatever: putting that into a chat interface would really kind of suck, because effectively, as we just looked at here, these LLM agents are just LLMs running in a loop with tool calls. So if the original dashboard has an add-food-to-cart button and a submit-order button, those just get turned into tools the agent now has, but you don't know what tools it has. So you kind of just end up prompting at it, talking to it, and not really knowing what it's doing. It creates this weird, hellish user experience that I really don't like. But subtly integrating these things into the background of existing apps and into new things, that's a world that is simultaneously very overfunded and underexplored right now. It's a good time to be doing this stuff. Highly recommend going and checking it out. If you enjoyed this video, make sure to like and subscribe.