
3 Years of AI in 32 Minutes (chatbots to agents)

By Shaw Talebi

Summary

## Key takeaways

- **ChatGPT's 3-Step Training Blueprint**: ChatGPT was created via pre-training on internet text as a super powerful autocomplete, supervised fine-tuning on fake conversations to make it an instruction follower, and reinforcement learning from human feedback using a reward model to align it as a helpful assistant. [01:15], [03:16]
- **Scaling Laws: Bigger, More Data, Longer Training**: OpenAI's scaling laws show that larger models trained on more data for longer yield predictably better performance, with loss decreasing smoothly as compute grows, driving the LLM race and Nvidia's dominance. [06:20], [07:32]
- **RAG Fixes Hallucinations with Retrieved Context**: Retrieval-augmented generation reduces hallucinations by using the user query to fetch relevant documents from a vector database and injecting them into the LLM prompt for grounded responses on recent or untrained topics. [11:15], [11:45]
- **Test-Time Compute: More Thinking Tokens Boost Performance**: Reasoning models like o1 improve via test-time compute scaling, where generating more tokens before answering enhances performance across tasks, adding a second lever beyond train-time compute. [17:04], [18:02]
- **Coding Agents Scale via RL and Specialization**: Coding agents like Claude Code succeed through reinforcement learning on concrete coding tasks with rule-based feedback, using specialized sub-agents for planning and editing to maximize test-time compute without context rot. [27:13], [28:54]
- **MCP: USB-C for AI Tools and Agents**: The Model Context Protocol standardizes plugging tools and context into AI apps like a USB-C port, enabling easy custom integrations such as Slack and Google Drive and powering mix-and-match AI agents. [24:41], [25:16]

Topics Covered

  • ChatGPT's Three-Step Blueprint Powers All Modern AI
  • Scaling Laws: Bigger Models, More Data, Longer Training Wins
  • Scaffolding Trumps Raw LLM Power Alone
  • Test-Time Compute Scaling Revolutionizes Reasoning
  • Specialized Agents Scale Via Multi-Agent Parallelism

Full Transcript

Hey everyone, I'm Shaw. In this video, I'm going to recap all the key AI innovations of the past few years.

Although AI has evolved at an overwhelming pace recently, my goal with this video is to give you a clear picture of how we got here and a sense of where we're going next. Our story naturally starts in November of 2022 with the release of ChatGPT, although it was quite different from the ChatGPT we know today. There was no web search. There was no code interpreter. It was simply a chat interface where you could type in questions or ask it to do things, and it would just magically do it. And of course, this was a big hit. ChatGPT hit 100 million users in about two months, and very quickly AI went from something that just a handful of researchers, scientists, engineers, and enthusiasts were really into to something that everyone was talking about. It consumed news cycles. It raised concerns. And really, not much has changed since that moment. People are still talking about AI, there are still concerns around it, and there's still a lot of excitement around it too.

The process used to create ChatGPT is the blueprint for how modern AI models are created today. It consisted of a three-step training process, and it all starts with pre-training. What this consists of is taking basically all the useful information from the internet: things like books, code bases, old texts, memes and jokes and entertaining content, things that absolutely make no sense and aren't helpful to anyone in any way, and a bunch of other types of documents. You take all of this text data and use it to train a so-called foundation model. This model is quite different from the LLMs we're used to these days. Essentially, it is a super powerful autocomplete function. Given a string of text, the model just predicts what comes next: what is the most likely next piece of text to follow in a particular sequence? So in essence, this is just a document completer based on all the text that it's seen on the internet.
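
To make that "autocomplete" idea concrete, here's a minimal sketch of next-token prediction using the small open GPT-2 model as a stand-in. This assumes the Hugging Face `transformers` library is installed; it is not the model behind ChatGPT, just an illustration of the same mechanism:

```python
# A minimal sketch of what a pre-trained foundation model does: next-token
# prediction. GPT-2 is used here only because it is small and openly available.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The first person to walk on the moon was"
inputs = tokenizer(prompt, return_tensors="pt")

# The model assigns a probability to every possible next token; generation is
# just repeatedly picking (or sampling) a likely continuation.
outputs = model.generate(**inputs, max_new_tokens=10, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```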

So while you could get one of these foundation models to do interesting and helpful tasks, it really took a lot of work to get it to be helpful. What was really required to make the interface more natural was the second step, supervised fine-tuning, which consisted of constructing fake conversations between a user and an assistant and then training this document completer on how to talk like the assistant. So it went from being a document completer to being an instruction follower.
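
As a toy illustration (not OpenAI's actual data), a supervised fine-tuning example typically looks like a chat-formatted transcript, with the training loss focused on the assistant's turns:

```python
# A made-up example of the kind of "fake conversation" used in supervised
# fine-tuning: the document completer learns to continue the assistant's side.
sft_example = {
    "messages": [
        {"role": "user", "content": "Explain photosynthesis in one sentence."},
        {"role": "assistant", "content": "Photosynthesis is the process by which "
                                         "plants turn sunlight, water, and CO2 "
                                         "into sugars and oxygen."},
    ]
}
# During fine-tuning, the loss is typically computed only on the assistant
# tokens, so the model learns to "talk like the assistant".
```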

While at this point we're pretty close to something that looks like ChatGPT, there are still some problems with this instruction follower. Namely, it doesn't always give helpful responses. It might tend to hallucinate a lot. It might give you unhelpful or unsafe responses to questions because, again, it's just trained on the wildness of the internet. It's not like a normal person that can use proper judgment in its responses.

So there's this final step called reinforcement learning from human feedback. The goal of this step is to align the instruction follower with human preferences, to essentially take this instruction follower and turn it into a helpful assistant. This consists of a few sub-steps. The first thing OpenAI did is that they got a bunch of input requests, a bunch of prompts. They gave them to their instruction follower to generate responses, and then they hired a bunch of contractors and data labelers to assign rankings to the responses. They would have the model generate, let's say, 10 responses to a given request, and the labelers would just rank these responses based on which ones were good and which ones were bad. Then they would take these preferences, these labels from the human contractors, and distill them into a special kind of AI model called a reward model. This reward model was essentially a proxy for human preferences. In other words, you could give it a response from the large language model and it would assign a reward; it would basically predict whether a human would like the response or not. So essentially what happens is the rewards are used as feedback to teach the language model which responses are good and which are bad. This can all happen in real time, so the model can just go off and get trained by the reward model without the bottleneck of human labelers. And the result of this process is a final language model which is actually a helpful assistant.
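
Here's a toy, runnable sketch of the reward-model idea. The features, dimensions, and linear scorer below are invented purely for illustration (the real thing is a large neural network trained on actual labeled comparisons), but the loss shows how preference rankings can train a scorer that stands in for human judgment:

```python
# Toy reward-model training from preference pairs (all numbers invented).
# Human labelers rank responses; the reward model learns to score the
# preferred response above the rejected one.
import torch
import torch.nn as nn

# Pretend "features" of (prompt, response) pairs, e.g. pooled embeddings.
preferred = torch.randn(8, 16)   # 8 preference pairs, 16-dim features
rejected = torch.randn(8, 16)

reward_model = nn.Linear(16, 1)  # maps features to a scalar reward
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

for _ in range(100):
    r_pref = reward_model(preferred)
    r_rej = reward_model(rejected)
    # Loss is small when the preferred response gets the higher reward.
    loss = -torch.nn.functional.logsigmoid(r_pref - r_rej).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```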

In the months following the release of ChatGPT, it was obvious that there was a lot of excitement around this, and a lot of people started hopping on board. So we started seeing an influx of large language models like ChatGPT. We had GPT-4, which was used to create the next generation of ChatGPT. We had Llama from Meta, which was an open-weight, open-source model. We had Claude from Anthropic; actually, a lot of the folks from OpenAI who had created ChatGPT left the company to start Anthropic, and Claude was their first model. Also, Google released a model called Bard, which was a precursor to the current Gemini models they have.

But of course it wasn't just optimism and excitement. There was also a lot of concern. There was a popular open letter signed by a lot of prominent folks in the technology and AI space, including Elon Musk, Steve Wozniak, and Yoshua Bengio, asking all AI labs to immediately pause for at least six months the training of AI systems more powerful than GPT-4. But of course, this never happened.

The race raged on, and all these big players were trying to build the best AI model, or at least be competitive with ChatGPT. And lucky for them, the recipe for creating the best AI model was actually pretty well understood. A few years earlier, there was a paper from the team at OpenAI (many of whom are the ones who founded Anthropic) that discovered the so-called LLM scaling law, which was essentially: if you make the models bigger, train them on more data, and train them for longer, the models become better. This was the recipe driving all the large language models at the time, and even still today.

I'll just show a couple of plots from this paper. The phenomenon is shown here. The x-axis is the size of the dataset and the y-axis is the performance, so lower is better; we want the loss to be as low as possible. The colors are different sizes of models, and from top down the models are getting bigger and bigger. So we can see the biggest model performs the best, and the bigger the dataset is, the better the performance of the model. This is showing that more parameters plus more data means better performance. The other plot shows the other part of it: again the number of parameters (the size of the model) is the color, the x-axis is how long the model was trained for, and the y-axis is the performance. This shows that the more parameters we have (as we get more yellow), and the longer it's trained (the x-axis), the lower the loss goes, so the performance is going up. More parameters, trained for longer, better performance.
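
For intuition, the paper fits smooth power laws of roughly this shape. The constants below are ballpark values used purely for illustration, not an exact reproduction of the paper's fit:

```python
# Illustrative power-law scaling in the spirit of the OpenAI scaling-law paper:
# loss falls off smoothly and predictably as model size grows (data and compute
# assumed ample). Constants are ballpark values for illustration only.
def loss_vs_params(n_params: float, alpha: float = 0.076, n_c: float = 8.8e13) -> float:
    return (n_c / n_params) ** alpha

for n in [1e6, 1e8, 1e10, 1e12]:
    print(f"{n:.0e} parameters -> predicted loss ~{loss_vs_params(n):.2f}")
```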

These scaling laws gave companies a predictable path for AI investment. They were confident that if we just get the GPUs and the data, we can build bigger models, train them for longer, and we'll have the best AI, or at least an AI that's competitive with what's currently out there. And this is why Nvidia went from being some tech company that people in the industry knew about to being the most valuable company in the world. This scaling law is the reason behind that.

Although there was a lot of excitement and investment going into AI at this point, the models at the time still had a lot of problems. Namely, they would tend to hallucinate. For the most part, they would do really cool and impressive things, but they still weren't a super valuable technology, especially given the massive investment that companies were making into it. Just to show a few examples: if you went to ChatGPT and said, "summarize this YouTube video" and gave it a link, at the time ChatGPT had no ability to search the web, and even if it did, it couldn't go to YouTube videos and get the transcript. But even still, it would happily and confidently give a summary of the video, and it would be completely wrong. Then there were other headlines. There was a huge gaffe by Bard, which is probably what killed it: it gave the wrong answer in a public demo, and that caused Google's stock to plummet by something like a hundred billion dollars in market value. There were cases where lawyers would use ChatGPT and it was citing fake court cases, and more and more people started to realize that hallucinations were a pretty big problem with these models.

This led to the realization that the LLM by itself isn't something that is super valuable. In order to turn it into something that actually makes an impact, you're going to need some scaffolding. In other words, you're going to need to build things around the model to make it more helpful. This was around the summer of 2023, and there was a lot of stuff coming out. Prompt engineering was the new sexy job that everyone was talking about. This is scaffolding in the sense that you're adding specific things to your request to the LLM to make it more helpful. A popular one was chain-of-thought reasoning: simply adding a line like "think step by step" would result in the model performing much better. There were also different prompting frameworks, many of which we've forgotten about, but one that people still talk about sometimes is the ReAct framework, where you basically prompt the model to think first, then act, and then observe the results of its actions. Of course, there are other prompt engineering tricks that we still use today, like giving the model a role, giving it an objective and clear instructions, giving it examples, using structured text, giving it context, and on and on.
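
As a small illustration of this kind of prompt scaffolding (my own example wording, not a quote from any paper or product):

```python
# A minimal prompt combining several of the tricks mentioned above: a role, a
# clear objective, a chain-of-thought trigger, and a ReAct-style loop.
system_prompt = (
    "You are a careful research assistant.\n"                              # role
    "Objective: answer the user's question accurately.\n"                  # objective
    "Think step by step before answering.\n"                               # chain of thought
    "Work in a loop of Thought -> Action -> Observation, then give a final Answer."  # ReAct-style
)
user_prompt = "If a train leaves at 3:40 pm and the trip takes 95 minutes, when does it arrive?"
# These two strings would be sent as the system and user messages of a chat API call.
```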

Also around this time, we saw the rise of LLM frameworks, things like LangChain and LlamaIndex. There were many others, but these are two that have stood the test of time. We were also seeing the early days of multi-LLM systems. People were talking about having one model generate some code or write something, having another model critique it, and then just having them go back and forth in a loop. Another popular approach was model routing: you would have some rule-based way of routing a user's request to different models depending on its difficulty. So you could have an expensive model and a cheap model and route the request accordingly.
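
A toy rule-based router might look something like this (the heuristic and model names are placeholders, not a recommendation):

```python
# Toy rule-based routing: send hard-looking requests to an expensive model and
# easy ones to a cheap model.
def route(request: str) -> str:
    hard_markers = ["prove", "debug", "architecture", "multi-step"]
    is_hard = len(request) > 500 or any(m in request.lower() for m in hard_markers)
    return "expensive-large-model" if is_hard else "cheap-small-model"

print(route("What's the capital of France?"))        # -> cheap-small-model
print(route("Debug this multi-step data pipeline"))  # -> expensive-large-model
```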

While this just scratches the surface of different things we can do for scaffolding, by far the most popular approach at this time was retrieval augmented generation, or RAG for short. This was just automatically giving LLMs helpful context. The way RAG worked in those days is that the system would take a user query and use that query to kick off a retrieval step, grabbing relevant documents from a vector database. Those results and the user query would then be combined into a prompt, you'd pass the prompt to the LLM, and finally the LLM would generate a response. RAG turned out to be a very helpful technique for addressing the hallucination problem. If the user asks about something more recent, or about something that wasn't in the pre-training data at all, the LLM would just make something up, because it's been trained to always give a response and always try to be helpful. However, with RAG, because you can automatically fetch relevant context and give it to the LLM, this significantly reduces the probability that it'll just make something up, because the model has something to ground its responses in.
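
Here's a bare-bones, self-contained toy of that retrieve-augment-generate flow. A real system would use an embedding model, a vector database, and an LLM client; the crude word-overlap retrieval below just shows the shape of the pipeline:

```python
# Toy RAG: retrieve the most relevant documents for a query, then build a
# grounded prompt to send to the LLM.
documents = [
    "The Eiffel Tower is 330 metres tall and located in Paris.",
    "Llama 2 was released by Meta with a 4,096-token context window.",
    "RAG combines retrieval from a knowledge base with LLM generation.",
]

def score(query: str, doc: str) -> int:
    # Stand-in for cosine similarity between embeddings: shared-word count.
    return len(set(query.lower().split()) & set(doc.lower().split()))

def build_rag_prompt(query: str, k: int = 2) -> str:
    top_docs = sorted(documents, key=lambda d: score(query, d), reverse=True)[:k]
    context = "\n".join(top_docs)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

print(build_rag_prompt("How tall is the Eiffel Tower?"))
# The resulting prompt would then be sent to the LLM for a grounded answer.
```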

Soon after the popularity of RAG, there were of course people saying "RAG is dead" because context windows started getting longer. This is kind of funny to look back on, and it's hard to believe how we survived before. The initial models we were seeing in the spring and summer of 2023 had tiny context windows compared to the ones now. GPT-4 had a context window of 8,000 tokens, which is about 12 pages of text, like a Microsoft Word document. Even 32,000 was a mind-blowing amount, and that doesn't even come out to a book. Llama 2 had 4,000, and the first version of Claude only had 9,000. However, in the late summer and fall of 2023, we saw a huge boost in context lengths. They went to 100,000 tokens and beyond, so hundreds of pages; you could fit a textbook's worth of data into a model's context window. And today we have Gemini, which has a million-token context window, and Llama 4 Scout, which has something like a 10-million-token context window. But of course, RAG is not dead. If anything, long-context models actually made RAG more helpful, because now you can give more context to the model if needed. But I guess this is always a fun headline to put on a blog post or X post or something like that.

Another major limitation of large language models at the time was that they could only process text, and much of the world's data is not text. We have images, we have videos, we have audio, and we have unstructured documents that are hard to turn into text, like scans of PDFs or PowerPoint slides or whatever it is. This was fixed with the innovation of multimodal models. The earliest popular version of this was GPT-4V. Essentially, what this consisted of was taking GPT-4 and adding a visual adapter to it: a mini model that would take an image and translate it into something GPT-4 could understand, essentially giving GPT-4 the ability to see images.
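
A toy sketch of that adapter idea (all dimensions invented): a small projection maps image features from a vision encoder into the language model's embedding space so they can be treated like extra tokens:

```python
# Toy "visual adapter": project vision-encoder features into the LLM's
# embedding space. Shapes and sizes below are made up for illustration.
import torch
import torch.nn as nn

vision_features = torch.randn(1, 196, 768)   # e.g. 14x14 patch features from a ViT
llm_hidden_size = 4096                        # hypothetical LLM embedding width

adapter = nn.Linear(768, llm_hidden_size)     # the "mini model" bridging modalities
image_tokens = adapter(vision_features)       # now shaped like LLM token embeddings
print(image_tokens.shape)                     # torch.Size([1, 196, 4096])
# These image tokens are concatenated with the text token embeddings so the LLM
# can "see" the image; natively multimodal models fold this in end to end.
```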

While this gave the basic capabilities, things got a lot better in the spring of 2024 with the release of GPT-4o. Google also released Gemini around this time with similar capabilities. Essentially, this next generation of models was natively multimodal; they were developed to process text, images, and audio. The main value of integrating all of these things in an end-to-end way is that the models get much faster: the adapter approach is kind of a clunky way to do it and just leads to slower inference times, whereas doing it all end to end allows the models to be faster, cheaper, and more effective. And today, this is the standard. Most models that you interact with are multimodal; even if they don't handle audio natively, they handle text and images, which are probably the two most important things.

So the summer of 2024 kind of felt like things were slowing down. People were asking, "Where's GPT-5?" Maybe the ride was over. But then in September of 2024, OpenAI came out with their o1 model and introduced these so-called reasoning models. In other words, these are models that can think before responding. At this point, reasoning and thinking models are everywhere, so you're probably familiar with this. But just to recap, the way these models work is that if you ask a hard question like, "What is the airspeed velocity of an unladen swallow?", instead of just jumping to an answer, the model will first stop and think about the question. You'll see a little summary of what the model is thinking, and once it has thought things through, it will actually respond to the question with its final response. Here it realizes that this is a joke question from Monty Python and the Holy Grail, and it responds accordingly. So reasoning models were a major innovation, as we'll talk about in a second.

But there was another story here with reasoning models, which is that a few months later DeepSeek came out with a model called DeepSeek R1, and they basically gave away all the secrets. OpenAI had a blog post that described how o1 worked, but there weren't really a lot of details on how they pulled it off. DeepSeek was able to replicate the result, and then they released a paper that gave the open-source community a lot more insight into how it worked. It also kicked off a lot of geopolitical tension over who's going to own AI: is it going to be China, is it going to be the US? And of course, this is still something that people are talking about.

But from a technical standpoint, reasoning models were probably the most significant inflection point in AI since the release of ChatGPT, because they've really ushered in a new era of large language models. This all boils down to the discovery of so-called test-time compute scaling. In other words, the more tokens a model generates, the better its performance tends to be. Just to explain why this is a big deal: up until this point, the path for making language models more effective was so-called train-time compute. It's that recipe I mentioned earlier, where you make the models bigger, train them on more data, and train them for longer, and you get better performance. That's what's being shown here: those three ingredients are compressed into the train-time compute axis, and we can see that the longer the model is trained for, the better its performance gets; it keeps going up and up, and this is on a math benchmark (this is from a blog post by OpenAI, reference number 10). However, with o1 and the discovery of test-time compute scaling, we realized that train-time compute isn't the only way to improve language models. We also have this other axis of test-time compute. This is, again, the idea that the more tokens a model generates, so the longer it thinks before giving an answer, the better its performance tends to be. This plot is on a math benchmark, but it's basically a universal phenomenon across different tasks and domains: the more tokens the model generates, basically the longer it thinks, the better its performance on that specific task.

So far, we've talked about reasoning models and how they exhibit better and better performance by generating more tokens. But how are these models created? This goes back to that third step of training ChatGPT, reinforcement learning from human feedback, where the LLM got feedback from a reward model that would tell it whether its responses were good or bad. This reward model was a proxy for human preferences, so over time the model's responses would become more and more aligned, more and more helpful. The key thing here is that the model is learning how to generate more helpful responses through trial and error. It's going to try something, get feedback, try something else, get feedback, over and over again. To create these reasoning models, DeepSeek and OpenAI followed a similar process, but instead of doing reinforcement learning from human feedback, they did what I'll call real reinforcement learning: they had the model do a specific task, and instead of running its output through a reward model, they had a set of rule-based checks. A rule-based check is just a program that evaluates the correctness of the model's output. That gives a concrete feedback signal: it's not a prediction of whether a person would like the response, but rather a binary correct/incorrect feedback signal to the language model.
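
A toy example of such a rule-based check (the task format and checker are invented for illustration): a plain program decides correct or incorrect, with no learned reward model involved:

```python
# Toy rule-based reward for a math task: parse the model's final answer and
# compare it to the known ground truth.
def math_reward(model_answer: str, ground_truth: float) -> int:
    try:
        # e.g. the model is instructed to end its response with "Answer: <number>"
        value = float(model_answer.rsplit("Answer:", 1)[1].strip())
    except (IndexError, ValueError):
        return 0  # unparseable output counts as incorrect
    return 1 if abs(value - ground_truth) < 1e-6 else 0

print(math_reward("Let me think... 17 * 3 = 51. Answer: 51", 51.0))  # -> 1
print(math_reward("The answer is probably 50", 51.0))                # -> 0
```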

While reinforcement learning from human feedback was helpful for tasks like instruction following, general question answering, and safety, there's a core problem with it: the quality of the model is going to be determined by the quality of the reward model you create. Basically, your reward model acts as a bottleneck for the reinforcement learning system. Reasoning models, however, don't have the same bottleneck, because they're trained on concrete tasks like math problems, STEM Q&A, or real-world software engineering tasks, and these problems have clear right answers. If we look specifically at DeepSeek R1, the result of a training strategy like this is that its chain-of-thought length increases along with its performance gains. DeepSeek R1 is trained via this reinforcement loop on concrete math and coding tasks, and we can see that the longer it's trained, the more accurate it becomes, as expected. But as it's trained for longer and its accuracy goes up, we also see its average length per response go up. This is a concrete demonstration of test-time compute: as the model's accuracy goes up and up, it correlates with its average response length getting longer and longer.

Using reinforcement learning in this way has kicked off a brand-new paradigm for training large language models. Now a big focus is doing this reinforcement learning to improve a model's performance on specific tasks. Perhaps the most successful example of this is the modern deep research tools we're seeing. Basically all the big AI providers have a deep research feature: these are models that can go off, do research, and generate detailed reports based on that research. This first came out with ChatGPT, and it would do this whole workflow of coming up with a research plan, going off and doing hundreds of searches, reasoning through the searches and redirecting its research based on its findings, and then, once it has gathered all the data it feels it needs, generating a report. This whole workflow was trained in an end-to-end reinforcement learning kind of way: the model is trained to do different research tasks using reinforcement learning. And of course, this is everywhere now. Deep research is part of Claude, Gemini, Perplexity, Grok, and other AI tools.

So reasoning models unlocked tools like deep research, but they also laid the foundation for making LLMs actually good at calling tools. Even though function calling had been around for a while at this point, it was hard to get working well, and reasoning models were a turning point. A few things needed to come together to make tool calling good. The first was structured outputs: getting an LLM to output structured text like JSON. This is necessary because the structured format is what gets parsed by a computer to actually execute a tool call or function call. Another thing that turned out to be helpful was reasoning models, because a tool call isn't trivial; there's a lot that goes into it, and models that would just respond to requests in one shot turned out to not do a very good job at this. Reasoning models, however, could stop and think about the problem, break down the task, and plan out which tools they want to use, and this made tool calling much more reliable. And finally, models started to get specifically trained on tool usage, which consists of multiple steps: one, you have to plan out how you're going to solve the task; you have to decide, okay, here are all the tools I have available, which ones are actually helpful to me; then think through what arguments to pass into the tool you want to use; and finally learn how to generate a response from the actual tool call.
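
A minimal illustration of structured outputs for tool calling (the tool schema and model output below are invented examples, not any specific provider's API):

```python
# The model is given a tool schema and must reply with JSON that a program can
# parse and execute.
import json

weather_tool = {
    "name": "get_weather",
    "description": "Look up the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

# What a well-behaved model's structured output might look like:
model_output = '{"tool": "get_weather", "arguments": {"city": "Paris"}}'

call = json.loads(model_output)              # parse the structured output
assert call["tool"] == weather_tool["name"]  # dispatch to the right function
print(call["arguments"])                     # -> {'city': 'Paris'}
```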

Maybe initially people thought that if we have structured outputs, we can get reliable tool calls, but it really took maybe another year or 18 months to get this working reliably. This ability for large language models to just know how to use tools out of the box, without being trained on the specific tool beforehand or getting very detailed instructions in the prompt, means you can just give the model a tool schema and it'll do a decent job of knowing when and how to use the tool. This has made it possible for people to mix and match different tool sets and models, which is one of the reasons why the Model Context Protocol, or MCP for short, has become so popular recently.

recently. If you haven't heard of MCP before, this is just a universal way to give models tools and context. The

analogy that Anthropic uses to describe this is that MCP is kind of like the USBC port of AI apps. Before old

computers had like a bunch of different types of ports for different types of devices, but everything eventually became standardized with the USBC port.

Even now, you'll have laptops that only have USBC ports. And that port is used for power, for iPhones, for printers, plugging in a keyboard, a mouse, or

whatever device you want to plug into your laptop. So, MCP serves a similar

your laptop. So, MCP serves a similar function for AI applications. So you can think of your AI app as the laptop and then you can plug in all different types

of tools and integrations into that application using MCP. What this enables you to do is now you can have custom integrations with your AI applications.

So for example, you can connect your Slack account and your Google Drive account to Claude desktop. You can have Claude pull data from Google Drive and post in different Slack channels if you

really wanted to. But you can also easily spin up AI agents. Now, in other words, you can take your favorite AI model and then you can equip it with

different tools that are provided by different MCP servers and now you have an AI agent that can interact with the world and do things on your behalf. This
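
For a sense of what this looks like in practice, here is a sketch of the shape of an MCP client configuration (mirroring the JSON config files that MCP clients such as Claude Desktop read). The server names and packages below are hypothetical placeholders; check each server's documentation for the real commands:

```python
# Sketch of an MCP client configuration, shown as a Python dict for readability.
# Server names and npm packages are placeholders, not real package names.
mcp_config = {
    "mcpServers": {
        "google-drive": {
            "command": "npx",
            "args": ["-y", "example-gdrive-mcp-server"],  # hypothetical package
        },
        "slack": {
            "command": "npx",
            "args": ["-y", "example-slack-mcp-server"],   # hypothetical package
        },
    }
}
# Once a client loads a config like this, the model can discover each server's
# tools and call them, which is what lets you "plug in" Slack, Google Drive, etc.
```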

This is a perfect segue into AI agents. While there have been a lot of different definitions of agents and a lot of controversy over the definition, here I'll just define an agent as a model that can interact with the world via tools. So really, there are two essential ingredients to an agent: the model, and the tools it has access to. Its tools allow it to interact with the real world and take actions, retrieve data, create software, send emails, and on and on. 2025 was said to be the year of AI agents. Jensen Huang, the CEO of Nvidia, said AI agents are going to get deployed this year. Sam Altman had a similar sentiment, saying AI agents will join the workforce in 2025. And YC made a huge bet on vertical AI agents in 2025; they had an X post saying that 2025 is shaping up to be the year of AI agents. Today it's near the end of 2025, and I'll say that these predictions were correct, mainly because of what we're seeing with coding agents.

Systems like Claude Code, Codex, and many others are specialized AI models that you can deploy into your codebase, and they have different sets of tools that make them very helpful for developing software. By far, this is the greatest economic value from AI today: using it to generate software, to solve business problems, and to create companies. But another big thing about these coding agents is that I'd say they're the blueprint for the other AI agents we'll probably be seeing over the next year. The great thing about coding is that you can do this reinforcement learning specifically on coding tasks. Anthropic, OpenAI, and many other groups spend a lot of time thinking about, okay, what's the right task, what's the right reinforcement learning environment, to create a helpful coding agent? And you can see, based on the results of these products, how effective that approach can be. I imagine in 2026 this theme will continue but start to expand to other areas and domains. One of the main reasons why agents are so helpful is that when you bring together reasoning and tool calling and create these specialized agents for specific tasks, the main lever being pulled is that more tokens are being generated, so you're able to improve performance through test-time compute. Again, more tokens means better performance. So when an agent is reasoning and calling tools and reasoning some more and generating lots and lots of tokens, its performance is getting better and better on specific tasks.

And so the natural way to scale this up is to have multiple agents. I'll specifically talk about Claude Code, because that's the one I'm most familiar with. One of the most powerful features of Claude Code is that there are these specialized agents. You can put Claude Code in planning mode, where it has a set of tools and instructions specifically for coming up with plans. You can put it in auto-edit mode, where it'll just implement coding changes. You can think of these as two different specialized agents operating within the same context window. You can also create custom agents if you want: for example, custom tools for actually running a development server and then having the agent poke around the website, take screenshots, and make sure everything looks good. You can do that. Having specialized agents is good because if you have too big a tool set or the instructions are too general, the agent's performance tends to degrade, so it's helpful to have this specialization.

Another thing that has come up as people have deployed agents and tried to push test-time compute to the max, having the agent generate lots and lots of tokens, is that as these context windows get longer and longer, the performance of these agents starts to degrade, because there's lots of irrelevant text in the context window. Of course, there are tools integrated into Claude Code and other systems that will automatically compress or clean up the context window. But another version of this is that you can create sub-agents that will go off, do a specific task, and report back the result while preserving your main context window. So instead of having all your agents share one context window, you can have separate context windows, which lets you manage this text and avoid so-called context rot. And finally, something people are doing is having multiple agents running in parallel, whether they're all collaborating on a single feature or working on the same thing just so you can see which one does the best implementation. For example, if you're having an agent implement a feature, you can have three agents running in parallel trying to implement the same thing, have them go off and work on it for 20 minutes or an hour, whatever it is, and then, when they're all done, review the code and see which one actually did a good job. These are all different things we're seeing today for scaling up agents and scaling up test-time compute to have these systems generate more value and make more of an impact.
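
Here's a toy sketch of the sub-agent pattern (the functions and objects are invented stand-ins, not Claude Code's actual API): each sub-agent works in its own fresh context, and only a short summary flows back into the main context:

```python
# Toy sub-agent dispatch: run tasks in parallel, keep the main context small by
# collecting only short summaries (avoiding "context rot").
from concurrent.futures import ThreadPoolExecutor

def run_subagent(task: str) -> str:
    # In a real system this would spin up an agent with its own context window,
    # let it reason and call tools, then summarize its findings.
    return f"[summary of work on: {task}]"

tasks = ["plan the feature", "implement the API route", "write the tests"]

# Run sub-agents in parallel; only their summaries enter the main context.
with ThreadPoolExecutor() as pool:
    summaries = list(pool.map(run_subagent, tasks))

main_context = "\n".join(summaries)
print(main_context)
```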

That's my read of where we are right now. I imagine that over the next year, in 2026, we will continue to scale up AI agents, figuring out how to make agents running in parallel more helpful, so having multi-agent systems that don't go off the rails so easily. I think people are going to figure out how to do context management in a much better way, whether that's through sub-agents, helpful heuristics, or both. And I do think we'll have more and more specialized agents, not just coding agents, but agents trained in an end-to-end way for different end use cases. Of course, many other things happened in AI over the past three years that didn't fit nicely into this narrative. If you want me to cover any of these topics or have questions about anything I talked about in this video, please let me know in the comments below. And as always, thank you so much for your time and thanks for watching.
