AI Engineer Code Summit 2025: AIE/CODE Track
By AI Engineer
Summary
## Key takeaways - **War Against Slop**: We're in an asymmetric war against slop, where the amount of taste needed to fight slop is an order of magnitude bigger than that needed to produce it. [16:58], [19:20] - **Stop Building Agents, Build Skills**: Agents have intelligence but lack domain expertise; skills are organized collections of files packaging composable procedural knowledge as folders with scripts as self-documenting, modifiable tools. [24:07], [26:34] - **Research-Plan-Implement Workflow**: Use research to understand codebase objectively, plan with exact steps including file names and code snippets for reliable execution, implement while staying in the smart zone via frequent compaction to avoid the dumb zone. [47:31], [47:54] - **Composer: Fast Frontier Model**: Cursor Composer is a fast frontier model trained with RL on real production rollouts matching inference environment, achieving 4x token efficiency via parallel tool calls and semantic search. [01:00:56], [01:02:26] - **Code World Model Traces Execution**: CWM predicts program execution traces line-by-line with local variables to model transition function, enabling neural debugger for inline code completion and approximating halting problem. [02:09:49], [02:19:18]
Topics Covered
- Full Video
Full Transcript
Heat. Heat.
Heat. Heat.
Heat.
Heat.
Heat. Heat. Heat.
Heat.
Heat.
Heat. Heat.
Heat. Heat.
I used to type to wake The machine blueprints functions logic.
But the moment I spoke past the line broke and became divine. No scripts the unknown. Now the interface bleeds with
unknown. Now the interface bleeds with the soul. I am not using the code
the soul. I am not using the code anymore. I'm writing I'm fighting for
anymore. I'm writing I'm fighting for every sparks a ghost of fear. Everyone
becomes a past made clear. Truth and
fragments, light and fractures. Order
born from silent rupture.
This is the new code where the system rises us back. Where the fire in the circuit fills the void that logic lacks.
We're becoming as we're building. We are
forcing what we make. Not the world that we live in, but the world we undertake.
Every framework shapes how we feel.
Every pattern decides what is real.
Duality tears and repairs the design.
Darkness everything.
Fear is just a failing test. Run it
twice and do your best. Balance isn't
passive. Balance burns. Mastery is
earned in turns. Build the life you dare to name above the pain. Commit the
flame.
This is the new code. Where the system writes us back. Where the fire in the circuit fills the void that logic lacks.
We're becoming as we're building. We are
forced in what we make. Not the world that we live in, but the world we take.
Undertake.
We move through us. We aren't just shaping the future. It is learning who we trust.
the future. It is learning who we trust.
Identity is an act of will, a declaration we compile. We don't wait for faith to choose us. We push and run the
code. We are rewriting what we are.
code. We are rewriting what we are.
Through the unknown, through the fire.
Every human is a star. The journey and the ending are the same road that we take. Not the world that we live in, but
take. Not the world that we live in, but the world we choose to make.
This is not the story.
This is the world.
This is the same.
Typing thoughts into the darkest part becomes design. Words evolve to whispers
becomes design. Words evolve to whispers meant for something more divine. Syntax
bends and breeze. I see the language change. I'm not instructing anymore. I'm
change. I'm not instructing anymore. I'm
rearranging f. Every loop I write rewrites me. Every function hums with
rewrites me. Every function hums with meaning. I feel the interface dissolve
meaning. I feel the interface dissolve between the maker and the creator.
This is the new code. Not on the screen, but in the soul where thought becomes the motion and creation takes control.
No lines, no rules. Just balance in between the zero and the one, the silence and the dream.
systems shape our fragile skin. They
mold the way we move. We live inside the logic gates of what we think is true.
But deep beneath the data post, there's something undefined.
A universe compiling the image of our minds. Every line reveals reflection.
minds. Every line reveals reflection.
Reloop replace connection. We're not
building, we're becoming. And the code becomes confession.
This is the new code. Not on the screen, but in the soul where becomes the motion. Creation takes control. No
motion. Creation takes control. No
lines, no rules. Just balance in between the zero and the one. The silence in the tree.
>> We are not.
>> Don't worry. Uh, we're just giving you something to do while Codeex writes all your code.
We're in.
We are the world we're doing.
Each prompt, each breath, each fragile spin. A universe
spin. A universe renewing.
This is the new code.
Alive and undefined.
Where logic meets motion and structure bends to mind. The system homes eternal but the soul rides the line. We are the
new code compiling time.
Compiling time.
Ladies and gentlemen, please join me in welcoming to the stage senior staff engineer at Google and your host for the engineering track session day, Jed
Boravik.
>> Hello. Good morning.
Welcome to the 2025 AI Engineering Code Summit in New York. How are we doing?
>> All right, it's early. It's Friday.
Thank you all for being here. Raise your
hand if you've been to one of these events before, an AI engineering conference before. All right, pretty
conference before. All right, pretty good. So, for those on uh those watching
good. So, for those on uh those watching live stream, about half the hands up.
Keep your hands up. Keep your hands up.
Two or more events.
Okay, still have a couple. Three,
four, one, two hands. Five.
Are you sure? There's only been four, Alex.
Okay. Okay. We We'll talk afterwards.
Okay. Well, welcome. Whether it's your first time or you've been to many of these, we're all excited you're here.
Um, my name is Jed Borvik. Um, I'm
Gemini's assistant at Google and I also work on the Jules coding agent. I lead
the product engineering team. Um, and
I'm your MC for today. So, why are we here? I'm sure many of you are familiar
here? I'm sure many of you are familiar with Richard Hamming's famous you and your research talk. In that talk, he describes asking his colleagues, what's the most important problem in your field? And then right after that, he
field? And then right after that, he asks, so why aren't you working on that?
I spend a lot of time hiring. And this
idea comes up again and again. If you're
working in technology, the most important problem of our day is AI. And
if you're working on applied AI, your most important problem is code.
This is a special event to push the whole AI coding industry forward. It's
not event for a single company but across all companies in this industry.
The AI engineering conference has two brands a world's fair and a summit. This
being a summit event is intentionally smaller than the world's fair. It's
intentionally single track. It's
designed to bring the best people together in the world about a single important theme for this event. That theme is AI coding.
Yesterday, many of you also experienced the leadership track. Make some noise so we know you're still alive if you were in that track.
>> For those of you who are there, shout out some of your favorite talks.
>> Stanford.
>> Yeah, Stanford. Yeah, that was a good That was a great one.
>> G Steve, that was that was a spicy one.
What else?
>> Every Yeah, Dan from Every >> Ah, good choice. Good choice.
Okay. Well, yesterday was a great day.
It was about how AI is transforming software organizations. Today we're
software organizations. Today we're going to dive into the patterns, systems, and products that make all of that possible. But whether you consider
that possible. But whether you consider yourself an AI leader, an AI engineer, or something in between, we're glad you're here.
We also wouldn't be here without our amazing sponsors. I would like to thank
amazing sponsors. I would like to thank them, especially our presenting sponsor, Deep Mind. Yeah, give it up. Give it up.
Deep Mind. Yeah, give it up. Give it up.
And what an amazing week for DeepMind.
I'm biased, but I hope you all get a chance to use Gemini 3 and Nano Banana Pro, which came out this week.
I'd also like to thank Anthropic as our platinum sponsor and the gold sponsors you see on this screen. Yeah, we we'll do one big round at the end. And
finally, we want to thank our silver sponsors. Let's put our hands together
sponsors. Let's put our hands together for all of these sponsors. Yeah, give it up. Give it up.
up. Give it up.
All these sponsors will be downstairs in the expo area. They have booths and um I recommend going down there to chat with folks from all these companies. They'll
be open all day after the keynotes.
All right, with that, let's get started.
Our first speaker needs no introduction, especially here. He wrote the article
especially here. He wrote the article that named the AI engineering movement.
He's the driving vision behind this event. He's built the amazing community
event. He's built the amazing community that's here today. Please join me in welcoming Swix Thank you. Hi everyone.
Thank you. Hi everyone.
Morning. How's everyone doing?
>> Good. I'm going to need a lot of energy for this talk, so please back me up. I'm
very nervous. Uh but we'll get through this. I'm declaring warrants lock today.
this. I'm declaring warrants lock today.
Let's talk about this. Every AIE has a secret. I I've told this to uh some
secret. I I've told this to uh some folks that are personal friends and I'll just show show the secret now. The first
summit we had the secret which was we knew that the AI engineer was going to be a thing. Second summit we extended it to leadership. Third summit we realized
to leadership. Third summit we realized that basically we always needed to concentrate on model labs and that's why you see um all all the all the top tier labs here today. Um world's fair we
started expanding the TAM of what AI engineering uh is is affiliated with.
with AIPMs and AI designers and with the code summit as Jed just talked about we really started to focus on curation and focusing in on a theme and if there's one theme that's really matters this year is it's coding but I'm not here to
talk about coding the rest of the day you're going to hear about coding so just just indulge me five minutes to talk about slop um we we've done really well right like so so like and I think like slop is sort
of associated with quantity and quality and I think like that's something that I'm really trying to think about as well like how do we grow this community, grow this industry, and grow this event uh
with the the same kind of taste and high quality that you've come to expect. And
this is something that, you know, I hope hopefully you guys can see that we care a lot about uh in curating all of you coming here and all of the speakers that you're about to see. Um we're in a war
against slop. Um this was actually uh
against slop. Um this was actually uh the Oxford English dictionary. Uh this
is a candidate for the 2024 word of the year. It lost to brain rot.
year. It lost to brain rot.
But uh but slop is stop is pretty good and I think it's it's probably gone even more of an issue this year than last year. Maybe it will win this year. I
year. Maybe it will win this year. I
have an issue with Oxford though because they did us dirty by saying slop is is generated using artificial intelligence.
The other part I I agree with. It's
lowquality, inauthentic or inaccurate.
But it doesn't take AI to be lowquality, inauthentic or inaccurate.
any human or AI can be an agent of slop, right? You've seen this yourself. I'm
right? You've seen this yourself. I'm
going to indulge me of a few examples.
By the way, I got this uh if you're not familiar with like sort of internet slang, the opposite of slop is kino. Um
and I got this idea from Paul Rambles who uh like you know when I do Sora videos, I do really boring Sora videos with me and Sam Alman. When other people who are actually creative and good at
their job do Sora videos, they do cats playing dig.
Uh, Slot can be produced by the same studio. There's K-pop demon hunters by
studio. There's K-pop demon hunters by Netflix and there's electric st by Netflix.
Slav can be produced in terms of different models. No comment.
different models. No comment.
Slav can be SLO can something that's keynote can degenerate into SLO, right?
If you're early on the trend and you're and you you starting that and it's and it's fresh and new, that's great. U if
you're if you recognize the other image, you're too online.
Okay, not enough people recognize that image. Um, go do your homework. Um,
image. Um, go do your homework. Um,
yeah, and obviously I'm just going to throw in a dig at Game of Thrones because this is same same thing, right?
Like slop is everywhere. It's generated
by humans and AI. You get it?
Okay. Um, the same startup idea can be keeno versus slop. When I first presented the my first keynote at AI engineer summit, we actually used to uh which is sort of like an AI slides company. uh I loaded I loaded up the
company. uh I loaded I loaded up the same slide deck and it was gone uh recently because uh it was it was actually uh sort of uh closed as a company. Um there's there's you know
company. Um there's there's you know different takes on vi coding and I think one of them uh is much better than the other and I think like this these are just like the tensions that we have to navigate. um one of our speakers later
navigate. um one of our speakers later on as well meter um I think it's really interesting that both of them are exponential charts but one of them feels more keeno and the other is more slops and I think like I would really like to
have people investigate why um so let let me let me just skip through basically we're in an asymmetric war on slop um I think u the closest law that I found that matches this is brandini's
law which actually states the amount of energy needed to refute [ __ ] is an order of magnitude bigger than needed to produce it Right? So we need to coin an appropriate law as well. Um because you
know like the the the cost to generate tokens is is dropping by 100 to a thousand times every single year.
Um so this is I guess Swix's law of anti-sluff. The amount of taste needed
anti-sluff. The amount of taste needed to fight SLO is an order magnitude bigger than that needed to produce it.
Right? There's so much low taste out there. We need to elevate uh what's out
there. We need to elevate uh what's out there in the world because that's what we stand for as humans. Um, I think there's there's a positive message. You
can use AI to fight slop. Um, I'm proud I'm proud to um run as a side project AI news, which is the only newsletter that tells you not to read it when there's
nothing going on. Thank you. Oh,
appreciate that. Um, you can also prompt to fight Slob. The next speaker is um Mahash and Barry. Um I found this in the in the sort of prompt in in the skill set that they that they put that they
put where they actually acknowledge slop and tell cloud not to produce slop and it actually improves significantly from left to right. Um what about code slop right we hear about code um creating tech debt where you can sort of two
engineers can create a tech debt of 50 engineers or you know on a more serious note you can start exposing private data of millions of users and this this all happened this year. Everything I'm
mentioning all happened this year. I'm
kind of using this keynote as a as a way of recapping. Um, and it, you know, just
of recapping. Um, and it, you know, just to be spicy a little bit, even people who are saying things like, "Oh, my model can go up to 30 to 60 hours uh autonomously." Well, it feels a bit
autonomously." Well, it feels a bit sloppy because you're also not saying, "Well, was the code good or not?" You're
just saying how long it went. So, in the same way that you have no taxation without representation, you don't want autonomy without accountability.
Um, something I've been working on more recently is that using AI to fight code slop as well. So, uh, this is from the a bunch of people quoted this yesterday, the semi-ascal value of death where you
can sort of keep human attention and mind meld with the machine in order to work on the hardest problems. Whereas the stuff that's commoditized, you can make it more async. So, you can check out more on that details. What seems to
be less appreciated is my is the other work on code maps uh which I which I more recently done where we actually use AI to scale codebased understanding which is also a a way to fight slop and
you can uh talk to the cognition folks uh downstairs uh um who who can who can show you more in detail. Um the the last thing I always want to shout out as well is is this trend of uh computer use. I
think computer use kind of debuted it this time last year uh with enthropic but uh it's it's getting really really good now guys. It can really autonomously operate the most complex apps including an ID. So that I think
that's really uh exciting and you should probably use that to fight slot. We use
it for the website and here's an example of us using Devon to automate the the sort of website up website updates. Um
and finally something I learned from this conference yesterday is that you can also use sub aents to fight context rot. Um and I think that is one of the
rot. Um and I think that is one of the biggest themes of of uh that I'm observing as well. If you want to take away something from this conference um and I also one of the biggest highlights of the year for us as AIE and
myself personally was chatting with Greg Brockman who always uh preaches the concept of modularity where you can sort of keep clear boundaries on what is human designed and let the AI code
everything in between. Um, so these are all ideas, but I just have this one message that I want to comp compress down to you today that I want you to say with me. No more slop. Yeah.
with me. No more slop. Yeah.
Your boss tells you, "I want more lines of code in by the end of the quarter."
What do you say to that? Say it with me.
No more slop.
>> You're fighting an asymmetric war. This
is how bad it is, right?
You have an insufficiently tested release that that is potentially embarrassing to your company. What do
you say to people who really want to push it?
>> No more slop. Exactly.
Uh your your Twitter algorithm wants engagement bait uh and is sort of you know forced and telling you to to to lie basically to the to the broad public.
What do you say to that?
>> Exactly. That's it. Uh I hope you have a great conference and let's let's hear it for uh not having any more stuff. Thank
you. Our
next presenters are AI engineers at Anthropic working on realworld agent systems. They're here to share why we should stop building agents and start
building skills. Please join me in
building skills. Please join me in welcoming to the stage Barry Jeang and Mahesh Morog.
All right, good morning and thank you for having us again.
Right. Agents have intelligence and capabilities but not always expertise that we need for real work. I'm Barry.
This is Mahes. We created agents.
In this talk, we'll show you why we stopped building agents and started building skills instead.
A lot of things have changed since our last talk. MCP became the standard for
last talk. MCP became the standard for agent connectivity. Cloud code, our
agent connectivity. Cloud code, our first coding agent, launched to the world. And our cloud agent SDK now
world. And our cloud agent SDK now provides a production ready agent out of the box. We have a more mature ecosystem
the box. We have a more mature ecosystem and we're moving towards a new paradigm for agents. That paradigm is a tighter
for agents. That paradigm is a tighter coupling between the model and a runtime environment.
Put simply, we think code is all we need.
We used to think agents in different domains will look very different. Each
one will need its own tools and scaffolding. And that means we'll have a
scaffolding. And that means we'll have a separate agent for each use case, for each domain. Well, customization is
each domain. Well, customization is still important for each domain. The
agent underneath is actually more universal than we thought.
What we realize is that code is not just a use case, but a universal interface to the digital world.
After we built cloud code, we realized that cloud code is actually a general purpose agent.
Think about generating a financial report. The model can call the API to
report. The model can call the API to pull in data and do research. It can
organize that data in the file system.
It can analyze it with Python and then synthesize the insight in old file format all through code. The core
scaffolding can suddenly become as thin as just bash and file system which is great and really scalable. But we very quickly run into a different problem
and that problem is domain expertise.
Who do you want doing your taxes? Is it
gonna be Mahesh, the 300 IQ mathematical genius, or is it Barry, an experienced tax professional, right? I would pick Barry every time. I don't want Mahesh to figure out the 2025 tax code from first
principles and need consistent execution from from a domain expert. Agents today
are a lot like Mahes. They're brilliant,
but they lack expertise.
They can do no more slow. They can do amazing things when you really put in the effort and give proper guidance, but they're often missing the important context up front. They can't really absorb your expertise super well, and
they don't learn over time.
That's why we created agent skills.
Skills are organized collections of files that package composable procedural knowledge for agents.
In other words, they're folders. This
simplicity is deliberate. We want
something that anyone, human or agent, can create and use as long as they have a computer. These also work with what
a computer. These also work with what you already have. You can version them in Git, you can throw them in Google Drive and you can zip them up and share with your team. We have used files for
uh as a primitive for decades and we like them. So why change now?
like them. So why change now?
Because of that skills can also include a lot of scripts as tools. Traditional
tools have pretty obvious problems. Some tools have poorly written instructions and are pretty ambiguous. And when the model is struggling, it can't really make a change to the tool. So, it's just kind of stuck with a code start problem
and they always live in the context window. Code solves some of these
window. Code solves some of these issues. It's self-documenting. It is
issues. It's self-documenting. It is
modifiable and can live in the file system until they're really needed and used. Here's an example of a script
used. Here's an example of a script inside of a skill. We kept seeing Claude write the same Python script over and over again to apply styling to slides.
So we just ask cloud to save it inside of the skill as a tool for its version for its future self. Now we can just run the script and that makes everything a lot more consistent and a lot more efficient.
At this point skills can contain a lot of information and we want to protect the context window so that we can fit in hundreds of skills and make them truly composable. That's why skills are
composable. That's why skills are progressively disclosed. At runtime,
progressively disclosed. At runtime, only this metadata is shown to the model just to indicate that he has the skill.
When an agent needs to use a skill, it can read in the rest of the skill.md,
which contains the core instruction and directory for the rest of the folder.
Everything else is just organized for ease of access. So that's all skills are. They're organized folders with
are. They're organized folders with scripts as tools.
Since our launch 5 weeks ago, this very simple design has translated into a very quickly growing ecosystem of thousands of skills. And we've seen this be split
of skills. And we've seen this be split across a couple of different types of skills. There are foundational skills,
skills. There are foundational skills, third party skills created by partners in the ecosystem, and skills built within an enterprise and within teams.
To start, foundational skills are those that give agents new general capabilities or domain specific capabilities that it didn't have before.
We ourselves with our launch built document skills that give Claude the ability to create and edit professional quality office documents. We're also
really excited to see people like Cadence build scientific research skills that give Claude new capabilities like EHR data analysis and using common
Python bioinformatics libraries better than it could before.
We've also seen partners in the ecosystem build skills that help Claude better with their own software and their own products. Browserbase is a pretty
own products. Browserbase is a pretty good example of this. They built a skill for their open- source browser automation tooling, stage hand. And now
Claude equipped that this skill and with stage hand can now go navigate the web and use a browser more effectively to get work done.
And notion launched a bunch of skills that help claude better understand your notion workspace and do deep research over your entire workspace.
And I think where I've seen the most excitement and traction with skills is within large enterprises. These are
company and team specific skills built for an organization.
We've been talking to Fortune 100s that are using skills as a way to teach agents about their organizational best practices and the weird and unique ways that they use this bespoke internal
software.
We're also talking to really large developer productivity teams. These are teams serving thousands or even tens of thousands of developers in an organization that are using skills as a
way to deploy agents like cloud code and teach them about code style best practices and other ways that they want their developers to work internally.
So all of these different types of skills are created and consumed by different people inside of an organization or in the world, but what they have in common is anyone can create them and they give agents the new
capabilities that they didn't have before.
So, as this ecosystem has grown, we've started to observe a couple of interesting trends. First, skills are
interesting trends. First, skills are starting to get more complex. The most
basic skill today can still be a skill.md markdown file with some prompts
skill.md markdown file with some prompts and some really basic instructions, but we're starting to see skills that package software, executables, binaries,
files, code, scripts, assets, and a lot more. And a lot of the skills that are
more. And a lot of the skills that are being built today might take minutes or hours to build and put into an agent.
But we think that increasingly much like a lot of the software we use today, these skills might take weeks or months to build and be maintained.
We're also seeing that this ecosystem of skills is complementing the existing ecosystem of MCP servers that was built up over the course of this year.
Developers are using and building skills that orchestrate workflows of multiple MCP tools stitched together to do more complex things with external data and
connectivity. And in these cases, MCP
connectivity. And in these cases, MCP MCP is providing the connection to the outside world while skills are providing the expertise.
And finally, and I think most excitingly for me personally, is we're seeing skills that are being built by people that aren't technical. These are people in functions like finance, recruiting,
accounting, legal, and a lot more. Um,
and I think this is pretty early validation of our initial idea that skills help people that aren't doing coding work extend these general agents and they make these agents more
accessible for the day-to-day of what these people are working on.
So tying this all together, let's talk about how these all fit into this emerging architecture of general agents.
First, we think this architecture is converging on a couple of things. The
first is this agent loop that helps manage the the model's internal context and manages what tokens are going in and out. And this is coupled with a runtime
out. And this is coupled with a runtime environment that provides the agent with a file system and the ability to read and write code.
This agent, as many of us have done throughout this year, can be connected to MCP servers. And these are tools and data from the outside world that make the the agent more relevant and more
effective.
And now we can give the same agent a library of hundreds or thousands of skills that it can decide to pull into context only at runtime when it's
deciding to work on a particular task.
Today, giving an agent a new capability in a new domain might just involve equipping it with the right set of MCP servers and the right library of skills.
And this emerging pattern of an agent with an MCP server and a set of skills is something that's already helping us at Enthropic deploy Claude to new verticals. Just after we launched skills
verticals. Just after we launched skills 5 weeks ago, we immediately launched new offerings in financial services and life sciences. And each of these came with a
sciences. And each of these came with a set of MCP servers and a set of skills that immediately make Claude more effective for professionals in each of these domains.
We're also starting to think about some of the other open questions and areas that we want to focus on for how skills evolve in the future as they start to become more complex. We really want to
support developers, enterprises, and other skill builders by starting to treat skills like we treat software.
This means exploring testing and evaluation, better tooling to make sure that these agents are loading and triggering skills at the right time and
for the right task, and tooling to help measure the output quality of an agent equipped with the skill to make sure that's on par with what the agent is supposed to be doing.
We'd also like to focus on versioning.
as a skill evolves and the resulting agent behavior uh evolves, we want this to be uh clearly tracked and to have a clear lineage over time.
And finally, we'd also like to explore skills that can explicitly depend on and refer to either other skills, MCP servers, and dependencies and packages within the agents environment. We think
that this is going to make agents a lot more predictable in different runtime environments and the composability of multiple skills together will help agents like Claude elicit even more complex and relevant behavior from these
agents.
Overall, these set of things should hopefully make skills easier to build and easier to integrate into agent products, even those besides claude.
Finally, a huge part of the value of skills we think is going to come from sharing and distribution. Barry and I think a lot about the future of companies that are deploying these
agents at scale. And the vision that excites us most is one of a collecting and collective and evolving knowledge base of capabilities that's curated by
people and agents inside of an organization. We think skills are a big
organization. We think skills are a big step towards this vision. They provide
the procedural knowledge for your agents to do useful things. And as you interact with an agent and give it feedback and more institutional knowledge, it starts to get better. And all of the agents
inside your team and your org get better as well. And when someone joins your
as well. And when someone joins your team and starts using Claude for the first time, it already knows what your team cares about. It knows about your day-to-day and it knows about how to be most effective for the work that you're doing.
And as this grows and this ecosystem starts to develop even more, this was going to this compounding value is going to extend outside of just your organ into the broader community. So just like when someone else across the world
builds an MCP server that makes your agent more useful, a skill built by someone else in the community will help make your own agents more capable,
reliable, and useful as well.
This vision of a evolving knowledge base gets even more powerful when claw starts to create these skills. We design skills specifically as a concrete steps towards uh continuous learning.
When you first start using cloud, this standardized format gives a very important guarantee. Anything that cloud
important guarantee. Anything that cloud writes down can be used efficiently by a future version of itself. This makes the learning actually transferable.
As you build up the context, skills makes the concept of memory more tangible. They don't capture everything.
tangible. They don't capture everything.
They don't capture every type of information. Just procedural knowledge
information. Just procedural knowledge that cloud can use on specific tasks.
When you have worked with cloud for quite a while, the flexibility of skills matters even more. Cloud can acquire new capabilities instantly, evolve them as needed, and then drop the ones that
become obsolete. This is what we have
become obsolete. This is what we have always known. The power of in in context
always known. The power of in in context learning makes this a lot more cost- effective for information that change on daily basis.
Our goal is that cloud on day 30 of working with you is going to be a lot better on cloud on day one. CL can
already create skills for you today using our skill creator skill and we're going to continue pushing in that direction.
We're going to conclude by comparing the agent stack to what we have already seen computing.
In a rough analogy, models are like processors. Both require massive
processors. Both require massive investment and contain immense potential, but only so useful by themselves.
Then we start building operating system.
The OS made processors far more valuable by orchestrating the processes, resources, and data around the processor. In AI, we believe that agent
processor. In AI, we believe that agent runtime is starting to play this role.
We're all trying to build the cleanest, most efficient, and most scalable uh abstractions to get the right tokens in and out of the model.
But once we have a platform, the real value comes from applications. A few
companies built uh processors and operating systems, but millions of developers like us have built software that encoded domain expertise and our unique points of view. We hope that
skills can help us open up this layer for everyone. This is where we get
for everyone. This is where we get creative and solve concrete problems for ourselves, for each other, and for the world just by putting stuff in the folder. So skills are just the starting
folder. So skills are just the starting point.
To close out, we think we're now converging on this general architecture for general agents. We've created skills as a new paradigm for shipping and sharing new capabilities. So we think
it's time to stop rebuilding agents and start building skills instead. And if
you're excited about this, come work with us and start building some skills today. Thank you.
today. Thank you.
Our next presenter is here to share practical techniques for getting real results through context engineering, not guesswork and definitely not hype.
Please welcome to the stage the CEO of Human Layer, Dex Hory.
>> Hi everybody. How y'all doing?
>> It's exciting. I'm Dex. Uh, as they did in the great intro, I've been hacking on agents for a while. Um, our talk 12 factor agents at AI Engineer in June was one of the top talks of all time. Uh, I
think top eight or something. one of the best ones from from AI engineer in June.
May or may not have said something about context engineering. Um, why am I here
context engineering. Um, why am I here today? What am I here to talk about? Um,
today? What am I here to talk about? Um,
I want to talk about one of my favorite talks from AI engineer in June. And I
know we all got the update from Eigor yesterday, but they wouldn't let me change my slides. So, this is going to be about what Eigor talked about in June. Uh, basically that they surveyed
June. Uh, basically that they surveyed 100,000 developers across all company sizes and they found that most of the time you use AI for software engineering, you're doing a lot of rework, a lot of codebased churn. uh and
it doesn't really work well for complex tasks brownfield code bases. Um and you can see in the chart basically you are shipping a lot more but a lot of it is just reworking the slop that you shipped
last week. So uh and then the other side
last week. So uh and then the other side right was that uh if you're doing green field little versel dashboard something like this then it's going to work great.
Uh if you're going to go in a 10-year-old Java codebase maybe not so much. And this matched my experience
much. And this matched my experience personally and talking to a lot of founders and great engineers. Too much
slop uh tech debt factories. It's just
it's not going to work from our codebase. Like maybe someday when the
codebase. Like maybe someday when the models get better, but that's what context engineering is all about. How
can we get the most out of today's models? How do we manage our context
models? How do we manage our context window? So we talked about this in
window? So we talked about this in August. Um I have to confess something.
August. Um I have to confess something.
The first time I used Cloud Code, I was not impressed. It was like, okay, this
not impressed. It was like, okay, this is a little bit better. I get it. I like
the UX. Um but since then we as a team figured something out um that we were actually able to get you know two to threex more throughput and we were shipping so much that we had no choice
but to change the way we collaborated.
We rewired everything about how we build software. Uh it was a team of three. It
software. Uh it was a team of three. It
took eight weeks. It was really freaking hard. Uh but now that we solved it,
hard. Uh but now that we solved it, we're we're never going back. This is
the whole no slop thing. I think I think we got somewhere with this went super viral on Hacker News in September. Uh we
have thousands of folks who have gone on to GitHub and grabbed our you know research plan implement prompt system.
Um so the goals here which we kind of backed our way into we need AI that can work well in brownfield code bases that can solve complex problems. No slop,
right? No more slop. Uh and we had to
right? No more slop. Uh and we had to maintain mental alignment. I'll talk a little bit more about what that means in a minute. And of course we want to spend
a minute. And of course we want to spend with everything we want to spend as many tokens as possible. What we can offload meaningfully to the AI is really really important. um super high leverage. So
important. um super high leverage. So
this is advanced context engineering for coding agents. Um I'll start with kind
coding agents. Um I'll start with kind of like framing this. The most naive way to use a coding agent is to ask it for something and then tell it why it's wrong and resteer it and ask and ask and
ask until you run out of context or you give up or you cry. Um we can be a little bit smarter about this. Most
people discover this pretty early on in their AI like exploration. uh is that it might be better if you start a conversation and you're off track that
uh you just start a new context window.
You say, "Okay, we went down that path.
Let's start again. Same prompt, same task, but this time we're going to go down this path." And like don't go over there cuz that doesn't work. So, uh how do you know when it's time to start over?
If you see this, it's probably time to start over, right?
This is what Claude says when you tell it it's screwing up.
Um, so we can be even smarter about this. We can do what I call intentional
this. We can do what I call intentional compaction. Um, and this is basically
compaction. Um, and this is basically whether you're on track or not, you can take uh your existing context window and ask the agent to compress it down into a markdown file. You can review this, you
markdown file. You can review this, you can tag it, and then when the new agent starts, it gets straight to work instead of having to do all that searching and codebased understanding and getting caught up. Um, what goes in a
caught up. Um, what goes in a compaction? Well, the question is like
compaction? Well, the question is like what takes up space in your context window. So, um, it's looking for files,
window. So, um, it's looking for files, it's understanding code flow, it's editing files, it's test and build output. And if you have one of those
output. And if you have one of those MCPs that's dumping JSON and a bunch of UYU IDs into your context window, you know, God help you. Uh, so what should we compact? I'll get more specifics
we compact? I'll get more specifics here, but this is a really good compaction. This is exactly what we're
compaction. This is exactly what we're working on. The exact files and line
working on. The exact files and line numbers that matter to the problem that we're solving. Um, why are we so
we're solving. Um, why are we so obsessed with context? Because LM are actually got roasted on YouTube for this one. They're not pure functions because
one. They're not pure functions because they're non-deterministic, but they are stateless. And the only way to get
stateless. And the only way to get better better performance out of an LLM is to put better tokens in and then you get better tokens out. And so every turn of the loop when Claude is picking the next tool or any coding agent is picking
the next and there could be hundreds of right next steps and hundreds of wrong next steps. But the only thing that
next steps. But the only thing that influences what comes out next is what is in the conversation so far. So we're
going to optimize this context window for correctness, completeness, size, and a little bit of trajectory. And the
trajectory one is interesting because a lot of people say, "Well, I I told the agent to do something and it did something wrong. So, I corrected it and
something wrong. So, I corrected it and I yelled at it and then it did something wrong again and then I yelled at it."
And then the LM is looking at this conversation says, "Okay, cool. I did
something wrong. The human yelled at me and I did something wrong and the human yelled at me." So, the next most likely conver token in this conversation is I better do something wrong so the human can yell at me again. So, what mind be
mindful of your trajectory if you were going to invert this? The worst thing you can have is incorrect information, then missing information, and then just too much noise. Um, if you like equations, there's a dumb equation if
you want to think about it this way. Um,
Jeff Huntley uh did a lot of research on coding agents. Uh, he put it really
coding agents. Uh, he put it really well. Just the more you use the context
well. Just the more you use the context window, the worse outcomes you'll get.
This leads to a concept I'm in a very very uh academic concept called the dumb zone. So, you have your context window.
zone. So, you have your context window.
You have 168,000 tokens roughly. Some
are reserved for output and compaction.
This varies by model. Um, but we'll use cloud code as an example here. Around
the 40% line is where you're going to start to see some diminishing returns depending on your task. Um, if you have too many MCPs in your coding agent, you are doing all your work in the dumb zone and you're never going to get good
results. People talked about this. I'm
results. People talked about this. I'm
not going to talk about that one. Your
mileage may vary. 40% is like it depends on how complex the task is, but this is kind of a good guideline. Um, so back to compaction or as I will call it from now on, cleverly avoiding the dumb zone.
Um, we can do sub agents. Um, if you have a front end sub agent and a backend sub aent and a QA sub aent and a data data scientist sub aent, please stop. Sub aents are not for
please stop. Sub aents are not for anthropomorphizing roles. They are for
anthropomorphizing roles. They are for controlling context. And so what you can
controlling context. And so what you can do is if you want to go find how something works in a large codebase, um, you can steer the coding agent to do this if it supports sub agents or you can build your own sub aent system. But
basically you say, hey, go find how this works. and it can fork out a new context
works. and it can fork out a new context window that is going to go do all that reading and searching and finding and reading entire files and understanding the codebase and then just return a
really really succinct message back up to the parent agent of just like hey the file you want is here. Parent agent can read that one file and get straight to work. And so this is really powerful. If
work. And so this is really powerful. If
you wield these correctly you can get good responses like this and then you can manage your context really really well. Um, what works even better than
well. Um, what works even better than sub agents or like a layer on top of sub aents is a workflow I call frequent intentional compaction. We're going to
intentional compaction. We're going to talk about research plan implement in a minute, but like the point is you're constantly st keeping your context window small. You're building your
window small. You're building your entire workflow around context management. So comes in three phases.
management. So comes in three phases.
Research, plan, implement. Um, and we're going to try to stay in the smart zone the whole time. So the research is all about understanding how the system works, finding the right files, staying objective. Here's a prompt you can use
objective. Here's a prompt you can use to do research. Here's the output of um a research prompt. These are all open source. You can go grab them and play
source. You can go grab them and play with them yourself. Um planning, you're going to outline the exact steps. You're
going to include file names and line snippets. You can be very explicit about
snippets. You can be very explicit about how we're going to test things after every change. Here's a good planning
every change. Here's a good planning prompt. Here's one of our plans. It's
prompt. Here's one of our plans. It's
got actual code snippets in it. Um and
then we're going to implement. And if
you read one of these plans, you can see very easily how the dumbest model in the world is probably not going to screw this up. Um so we just go through and we
this up. Um so we just go through and we run the plan and we keep the context low. Here's a planning prompt. Like I
low. Here's a planning prompt. Like I
said, it's the least exciting part of the process. Um, I wanted to put this
the process. Um, I wanted to put this into practice. So, working for us, uh, I
into practice. So, working for us, uh, I do a podcast with my buddy, uh, Vib, who's the CEO of a company called Boundary ML. Uh, and I said, "Hey, I'm
Boundary ML. Uh, and I said, "Hey, I'm going to try to oneshot a fix to your 300,000line Rust codebase for a programming language." Um, and the whole
programming language." Um, and the whole episode goes in, it's like an hour and a half. Uh, I'm not going to talk through
half. Uh, I'm not going to talk through it right now, but we built a bunch of research, and we threw them out because they were bad. And then we made a plan, and we made a plan without research and with research and compared all the results. It's a fun time. Uh, by that
results. It's a fun time. Uh, by that was Monday night. By Tuesday morning, we were on the show and the CTO had like seen the PR and like didn't realize I was doing it as a bit for a podcast and basically was like, "Yeah, this looks
good. We'll get in the next release." I
good. We'll get in the next release." I
think he was a little confused. Um,
here's the the plan. But anyways, uh, yeah, confirmed works in brownfield code bases and no slop, but I wanted to see if we could solve complex problems. So, Vibob was still a little skeptical. I
sat down, we sat down for like seven hours on a Saturday and we shipped 35,000 lines of code to BAML. One of the PRs got merged like a week later. I will
say some of this is codegen. You know,
you update your behavior. All the golden files update and stuff, but we shipped a lot of code that day. Um, he estimates it was about 1 to two weeks and 7 hours.
And uh, so cool, we can solve complex problems. There are limits to this. I
sat down with my buddy Blake. We tried
to remove Hadoop dependencies from Parket Java. If you know what parket
Parket Java. If you know what parket Java is, I'm sorry uh for whatever happened to you to get you to this point in your career. Uh, it
did not go well.
uh here's the plans, here's the research. Uh at a certain point, we
research. Uh at a certain point, we threw everything out and we actually went back to the whiteboard. We had to actually once we had learned where were the where all the foot guns were, we we went back to okay, how is this actually going to fit together? Um and this
brings me to a really interesting point that Jake's going to talk about later.
Uh do not outsource the thinking. AI
cannot replace thinking. It can only amplify the thinking you have done or the lack of thinking you have done. So
people ask, "So Dex, this is spectriven development right?"
development right?" No, spec driven development is broken.
Not the idea, but the phrase. Um,
it's not well defined. This is Brietta from ThoughtWorks. Um, and a lot of
from ThoughtWorks. Um, and a lot of people just say spec, and they mean a more detailed prompt. Does anyone
remember this picture? Does anyone know what this is from? All right, that's a deep cut. Uh, there will never be a year
deep cut. Uh, there will never be a year of agents because of semantic diffusion.
Martin Fowler said this in 2006. We come
up with a good term with a good definition and then everybody gets excited and everybody starts meaning it to mean a hundred things to a hundred different people and it becomes useless.
We had an agent is a person. An agent is a micros service. An agent is a chatbot.
An agent is a workflow. And thank you Simon. We're back to the beginning. An
Simon. We're back to the beginning. An
agent is just tools in a loop. Um this
is happening to spec driven dev. I used
to have Sean's uh slide in the beginning of this talk but it caused a bunch of people to focus on the wrong things. His
thing of like forget the code. It's like
assembly now and you just focus on the markdown. Very cool idea, but people say
markdown. Very cool idea, but people say spectriven dev is writing a better prompt, a product requirements document.
Sometimes it's using like verifiable feedback loops and back pressure. Maybe
it is treating the code like assembly like Sean taught us. Um, but a lot of people is just using a bunch of markdown files while you're coding. Or my
favorite, I just stumbled upon this last week. Uh, a spec is documentation for an
week. Uh, a spec is documentation for an open source library. So it's gone. It's
as specri dev is overhyped. It's useless
now. It's semantically diffused.
Um, so I want to talk about like four things that actually work today. The
tactical and practical steps that we found working internally and with a bunch of users. Um, we do the research, we figure out how the system works. Um,
remember Momento? This is the best the best movie on context engineering as Peter says it. Guy wakes up, he has no memory. He has to like read his own
memory. He has to like read his own tattoos to figure out who he is and what he's up to. If you don't onboard your agents, they will make stuff up. And so
if this is your team, this is very simplified for most of you. Most of you have much bigger orgs than this. But
let's say you want to do some work over here. Um, one thing you could do is you
here. Um, one thing you could do is you could put onboarding into every repo.
You put a bunch of context. Here's the
repo. Here's how it works. This is a compression of all the context in the codebase that the agent can see ahead of time before actually getting to work.
This is challenging because sometimes it gets too long. As your
codebase gets really big, you either have to make this longer or you have to leave information out. And so as you uh are reading through this, you're going to read the context of this big 5
million line monor repo and you're going to use all the smart zone just to learn how it works. And you're not going to be able to do any good tool calling in the dumb zone. So that's uh you can
dumb zone. So that's uh you can you can shard this down the stack. You
can do just talking about progressive disclosure. You could split this up,
disclosure. You could split this up, right? You could just put a file in the
right? You could just put a file in the root of every repo and then like at every level you have like additional context based on if you're working here this is what you need to know. Uh we
don't document the files themselves because they're the source of truth. But
then as your agent is working you know you pull in the root context and then you pull in the subcontext. We won't
talk about any specific like you could use cloudmd for this you can use hooks for this whatever it is. Um but then you still have plenty of room in the smart zone because you're only pulling in what you need to know. Um, the problem with
this is that it gets out of date. And so
every time you ship a new feature, you need to kind of like cache and validate and rebuild large parts of this internal documentation. And you could use a lot
documentation. And you could use a lot of AI and make it part of your process to update this. Um, but I want to ask a question. Between the actual code, the
question. Between the actual code, the function names, the comments in the documentation, does anyone want to guess what is on the y-axis of this chart?
>> Slob.
>> Slob. It's actually the amount of lies you can find in any one part of your codebase.
Um, so you could make it part of your process to update this, but you probably shouldn't because you probably won't.
What we prefer is on demand compressed context. So if I'm building a feature
context. So if I'm building a feature that relates to SCM providers and Jira and Linear, um, I would just give it a little bit of steering. I would say, hey, we're going over in like this like
part of the codebase over here. Um and a good research uh prompt or or slash command might take you or skill even uh launch a bunch of sub aents to take these vertical slices through the
codebase and then build up a research document that is just a snapshot of the actually true based on the code itself parts of the codebase that matter. We
are compressing truth. Um planning is leverage. Planning is about compression
leverage. Planning is about compression of intent. Um and in plan we're going to
of intent. Um and in plan we're going to outline the exact steps. We take our research and our PRD or our bug ticket or our whatever it is and we create a plan and we create a plan file. So we're
compacting again. And I want to pause and talk about mental alignment. Um does
anyone know what code review is for?
>> Mental alignment. Mental alignment is it is about finding making sure things are correct and stuff but the most important thing is how do we keep everybody on the team on the same page about how the codebase is changing and why. And I can
read a thousand lines of Golang every week. Uh sorry I can't read a thousand.
week. Uh sorry I can't read a thousand.
is hard. I can do it. I don't want to.
Um, and as our team grows, I all the code gets reviewed. We don't not read the code, but I, as you know, a technical leader in the in on the team, I can read the plans and I can keep up to date and I can that's enough. I can
catch some problems early and I maintain understanding of how the system is evolving. Um, Mitchell had this really
evolving. Um, Mitchell had this really good post about how he's been putting his AMP threads on his poll requests so that you can see not just, hey, here's a wall of green text in GitHub, but here's the exact steps, here's the prompts, and
hey, I ran the build at the end and it passed. This takes the reviewer on a
passed. This takes the reviewer on a journey in a way that a GitHub PR just can't. And as you're shipping more and
can't. And as you're shipping more and more in two to three times as much code, it's really on you to find ways to keep your team on the same page and show them here's the steps I did and here's how we
tested it manually. Um, your goal is leverage. So you want high confidence
leverage. So you want high confidence that the model will actually do the right thing. I can't read this plan and
right thing. I can't read this plan and know what actually is going to happen and what code changes are going to happen. So we've over time iterated
happen. So we've over time iterated towards our plans include actual code snippets of what's going to change. So
your goal is leverage, you want compression of intent, and you want reliable execution. Um, and so I don't
reliable execution. Um, and so I don't know, I have a physics background. We
like to draw lines through the center of peaks and curves. Uh, as your plans get longer, reliability goes up, readability goes down. There's a sweet spot for you
goes down. There's a sweet spot for you and your team and your codebase. You
should try to find it. because when we review the research and the plans, if they're good, then we can get mental alignment. Um, don't outsource the
alignment. Um, don't outsource the thinking. I've said this before. This is
thinking. I've said this before. This is
not magic. There is no perfect prompt.
You still will not work if you do not read the plan. So, we built our entire process around you, the builder, are in back and forth with the agent, reading the plans as they're created, and then if you need peer review, you can send it
to someone, say, "Hey, does this plan look right? Is this the right approach?
look right? Is this the right approach?
Is this the right order to look at these things?" Um Jake again wrote a really
things?" Um Jake again wrote a really good blog post about like the thing that makes research plan implementing valuable is you the human in the loop making sure it's correct. So if you take
one thing away from this talk it should be that a bad line of code is a bad line of code and a bad part of a plan is could be a hundred bad lines of code and a bad line of research like a
misunderstanding of how the system works and where things are your whole thing's going to be hosed. You're going to be telling sending the model off in the wrong direction. And so when we're
wrong direction. And so when we're working internally and with users, we're constantly trying to move human effort and focus to the highest leverage parts of this pipeline. Um, don't outsource the thinking. Watch out for tools that
the thinking. Watch out for tools that just spew out a bunch of markdown files just to make you feel good. I'm not
going to name names here. Uh, sometimes
this is overkill. And the way I like to think about this is like, yeah, you don't always need a full research plan implement. Sometimes you need more,
implement. Sometimes you need more, sometimes you need less. If you're
changing the color of a button, just talk to the agent and tell it what to do. Um, if you're doing like a simple
do. Um, if you're doing like a simple plan and it's a small feature, if you're doing medium features across multiple repos, then do one research, then build a plan. Basically, the hardest problem
a plan. Basically, the hardest problem you can solve, the ceiling goes up the more of this context engineering compaction you're willing to do. Um, and
so if you're in the top right corner, you're probably going to have to do more. A lot of people ask me, "How do I
more. A lot of people ask me, "How do I know how much context engineering to use?" It takes reps. You will get it
use?" It takes reps. You will get it wrong. You have to get it wrong over and
wrong. You have to get it wrong over and over and over again. Sometimes you'll go too big. Sometimes you go too small.
too big. Sometimes you go too small.
Pick one tool and get some reps. I
recommend against minmaxing across claude and codeex and all these different tools. Um, so I'm not a big
different tools. Um, so I'm not a big acronym guy. Uh, we said specri dev was
acronym guy. Uh, we said specri dev was broken. Uh, research plan and implement
broken. Uh, research plan and implement I don't think will be the steps. The
important part is compaction and context engineering and staying in the smart zone. But people are calling this RPI
zone. But people are calling this RPI and there's nothing I can do about it.
So, uh, just be wary. There is no perfect prompt. There is no silver
perfect prompt. There is no silver bullet. Um, if you really want a hypy
bullet. Um, if you really want a hypy word, you can call this harness engineering, which is part of context engineering, and it's how you integrate with the integration points on codec, claude, cursor, whatever. How you
customize your codebase. Um, so what's next? I think the coding agent stuff is
next? I think the coding agent stuff is actually going to be commoditized.
People are going to learn how to do this and get better at it. And the hard part is going to be how do you adapt your team and your workflow and the SDLC to work in a world where 99% of your code is shipped by AI. Uh, and if you can't
figure this out, you're hosed because there's kind of a rift growing where like staff engineers don't adopt AI because it doesn't make them that much faster. And then junior mid-levels
faster. And then junior mid-levels engineers use a lot because it fills in skill gaps and then it also produces some slop. And then the senior engineers
some slop. And then the senior engineers hate it more and more every week because they're cleaning up slop that was shipped by cursor the week before. Uh,
this is not AI's fault. This is not the mid-level engineers fault. Like if
cultural change is really hard and it needs to come from the top if it's going to work. So if you're a technical leader
to work. So if you're a technical leader at your company, pick one tool and get some reps. If you want to help, we are
some reps. If you want to help, we are hiring. We're building an Aentic IDE to
hiring. We're building an Aentic IDE to help teams of all sizes speedrun the journey to 99% AI generated code. Uh if
we'd love to we'd love to talk if you want to work with us. Uh go go hit our website, send us an email, come find me in the hallway. Uh thank you all so much for your energy.
Our next presenter is the head of developer experience at Cursor here to tell us about the infrastructure training and evaluations used to build Cursor
Composer, their first coding model.
Please join me in welcoming to the stage Lee Robinson.
Hey everybody, it's great to be back in New York and I'm very excited to be here and talk on behalf of all of our engineering and research teams at Cursor about building cursor composer, our first agent model. And my colleague
Sasha actually gave a version of this talk recently. So I'm excited to give my
talk recently. So I'm excited to give my own uh my own take on it. So cursor
composer is a model designed for real world real world software engineering and it tries to be both fast and smart.
So, as we've measured it against our own benchmarks, it's better than the best open source models. It's like up against recent Frontier models, but kind of slightly below the latest Frontier with
Sonnet 45, GPT 5.1 codecs. But where it really shines is it's about four times more efficient at token generation than models at a similar level of intelligence. So, we're trying to mesh
intelligence. So, we're trying to mesh speed as well as intelligence. So, why
did we build this model? I mean,
obviously, cursor has an IDE. Why are we getting into the model space? Why do we care about this? Well, our research and product teams have been building a model called tab, which you can use for autocomplete. Maybe some of you use that
autocomplete. Maybe some of you use that inside of cursor. And we wanted to take that same approach for a very low latency model and apply it to coding with agents. But honestly, we weren't
with agents. But honestly, we weren't really sure if it would work. So, we
started prototyping some early versions of what this model could look like.
Started to put it out and get some feedback from users. And we were pretty surprised that this cheetah slug we released for this model, people actually really liked it. Uh they really liked the speed, but the feedback we got was
it's not really smart enough yet to be a daily driver for a lot of their coding.
So we needed it to be smart and fast.
Definitely needed to be smart. So we
really worked on making this internal benchmark that represented our usage on our own repos and how we actually built software. Like if we had a model that
software. Like if we had a model that was both fast and smart and a checkpoint that our developers would use every single day to build the product and to build all of our software, then we knew that we would be on to something. And
for example, one big change here that helped actually push this towards a level where we had a checkpoint where people would use it was being able to call tools in parallel and being able to very effectively use our semantic search
tool. And we'll talk about that a little
tool. And we'll talk about that a little bit more here later. So if you haven't seen it, uh here's cursor in cursor 2.0 in our new view. And we're going to use the composer one model. And you'll
notice that it is doing a lot of things very quickly. It's calling a bunch of
very quickly. It's calling a bunch of tools in parallel like GP. So reading a lot of files. It's making shell commands. Uh it's making file edits.
commands. Uh it's making file edits.
It's writing and managing uh a list of to-dos. And you can kind of quickly work
to-dos. And you can kind of quickly work through tasks in the foreground here. Uh
in this case, I'm investigating an issue in an open source repo. And I don't know about y'all, but this has been a quite different programming experience for me.
uh having working with coding agents for a little bit of time now versus kind of firing off an agent and waiting let's call it 20 minutes for it to complete where you can kind of context switch away. This really does help keep you in
away. This really does help keep you in the flow and is a kind of a different style of programming I think. So I want to talk about how we did this in a way that's hopefully accessible for you all.
I'm not a machine learning researcher but I do really enjoy this stuff. Uh
what we learned some of the infrastructure challenges and then a little bit on where we're going uh moving forward. So in cursor, a user
moving forward. So in cursor, a user kind of submits a query to our backend.
The agent reads that query and then decides to make a series of tool calls.
And our agent has about 10 tools, give or take, but we're going to focus on five here. So reading files, editing
five here. So reading files, editing files, searching your codebase, looking at lints, and then also running terminal or shell commands. And the agent then is able to autonomously decide, do we call
these serially or do we run these in parallel? And our goal with
parallel? And our goal with reinforcement learning here is to try to mirror the cursor production environment as close as we possibly can. So this
data that we have in training, we want to kind of pretend like we're actually calling real cursor queries. Uh so to do that, we are running a series of rollouts. Um for example, in this roll
rollouts. Um for example, in this roll out, we're calling a series of tools like reading files and editing files.
And when we run more rollouts, we can start from that same initial starting point, but we might call a completely different set of tools. So in this one we're also doing codebased search. So we
score the output, we decide which one is better and then we update the parameters of our model based on that change. So
conceptually a pretty simple idea. The
challenges come from when you take the simple idea and then you try to scale it up to a very large amount. So there's
kind of three challenges. The first one is trying to match the training and inference environment. So when the
inference environment. So when the model's actually being used in the product. Um, in this case with composer,
product. Um, in this case with composer, we're training a large mixture of experts model and it's being parallelized across thousands of GPUs and if we don't speed that up, it's going to take forever to train the
thing. So, we want to make it really
thing. So, we want to make it really fast and match the training and kind of sampling version to be as close as possible. The second challenge is that
possible. The second challenge is that the rollouts can get pretty complex when you start to look at real world data here. So, models are going to use
here. So, models are going to use hundreds of thousands to millions of tokens. they're going to make hundreds
tokens. they're going to make hundreds of different tool calls. And each of these rollouts could take a, you know, a pretty different amount of time. One
might make a lot of tool calls, one might make not as many, and they'll complete at different times. So, we have to figure out how to deal with that challenge. And finally, there's this
challenge. And finally, there's this challenge of consistency. If we want to mimic the production cursor environment as close as possible, we need to use exactly the same tool format and the
tool response. But in training, we have
tool response. But in training, we have this really bursty amount of compute.
Basically, we're like doing all of this training all at once, which is different than at production. So, it is really an infrastructure challenge. We have these
infrastructure challenge. We have these three machine learning challenges and all of the solutions coincidentally are actually infrastructure problems. So, let's talk through a few of these problems and how we solved it at the
infrastructure layer. So, our
infrastructure layer. So, our architecture is probably familiar for some of you who have been involved in this space a little bit, but I still think it's really interesting to talk about at kind of a high level. Uh, we
have three different servers. We have an inference server. We have kind of the
inference server. We have kind of the standard ML stack with PyTorch. We have
an inference server. So the rollouts that I just talked about, um, that's where we use Ray. And then we have environment servers. And these are the
environment servers. And these are the ones where we're kind of simulating that cursor environment that I talked about.
And all these servers talk to each other. So for example, the inference
other. So for example, the inference server can basically send these advantages back to the trainer, which is like nudging it up or down uh based on the roll out and then updating the model
and getting new parameters.
So this this one is a bit more on the ML side, but we're we're trying to train a model that's very very large and to do it as fast as possible. And one way that our team was able to do this on the research side was to develop a library
of custom kernels that allowed for very low precision training. And basically
this allows us to just speed up the training process in a big way and also make it much easier to ship to our inference server. So, if you're the type
inference server. So, if you're the type of person who loves this, we wrote a blog post going way in depth on all of this that talks about our custom kernels. Uh, if you're interested, the
kernels. Uh, if you're interested, the TLDDR here is we found for the mixture of experts layer was about three and a half times faster uh a speed up on NVIDIA Blackwell chips. So, it made a
pretty significant uh impact on our training runs. So, once we update the
training runs. So, once we update the weights, we need to send them back over to the inference server uh during this training process. and the inference
training process. and the inference server is the one that's doing all the rollouts that I talked about calling the tools and kind of managing um what we sent. The challenge here uh is that they
sent. The challenge here uh is that they all complete at different times. So kind
of a naive version of this there will be a lot of wasted time. So what we were able to do is do load balancing across the different threads and processes to basically shift the work around and and
not have a bunch of idle time. So if one roll out for example makes a ton of tool calls, maybe it installs some packages, installs some library, we're not just sitting there waiting for all of the other ones to finish. The inference
server is spending all this time going back and forth making the tool calls to the environment uh and getting the tool results back. So again, communicating
results back. So again, communicating between these servers and we want that environment to be as close as possible to the cursor product. One thing that's nice about having both the coding agent,
the IDE, as well as what we're doing with the model research and training our own models is we can kind of co-design these things together. So, as we were building out a lot of our RL work for this model, we were also building our
cloud agents product. Um, this is how you can run a cursor agent kind of offline. You can run it from your phone
offline. You can run it from your phone or on the web or kick it off from Slack for example. And to do this, we spin up
for example. And to do this, we spin up virtual machines in the cloud. So each
one of these VMs loads up the user's code. It allows the agent to kind of
code. It allows the agent to kind of like make file changes, run tools, and edit code in a secure sandbox. And
coincidentally, this is the perfect Impra for RL and our use in training. So
we have this like fleet of cloud VMs and we have an environment that very closely matches the production cursor environment and we can then use that for training. This does still have some
training. This does still have some challenges though. I kind of talked
challenges though. I kind of talked about how the training workload is very spiky and it's different than the kind of standard inference when you're running the cloud agents product. So we
needed to build infrastructure to support all of these VMs and orchestrating between them. So you know we have many different clusters, hundreds of thousands of VMs here and you can see behind me one of the
internal dashboards we built uh with composer actually to visualize uh all of the different VMs in the fleet.
So why spend all this time trying to match the environment to be as close as possible to cursor production. I've kind
of mentioned that a few times. We could
mock it, we could simulate it out. Um,
but one of the really nice benefits is we get to give the model uh specific tools that we think are very valuable inside of the agent. So, one of those is that we've trained our own embedding model that allows you to do semantic
search. So when you use cursor, we go
search. So when you use cursor, we go and index your codebase and then it allows the agent to make natural langu natural language queries to find files that it might want to edit. And we did
some research on this recently. We found
that semantic search not only helped basically every single model inside of the cursor agent harness, but it was particularly helpful with composer, which kind of makes sense when you think about it. Like we trained composer in
about it. Like we trained composer in the exact same environment that we're using at inference time. And so the model kind of becomes a power user of this tool which is really effective.
So let's talk about uh how the release has been going and kind of where we're going next. Um as we were doing the
going next. Um as we were doing the training process we kind of knew that RL was working when we were able to continuously improve the model and start to see more and more improvements after
more and more rollouts. So we started about kind of the same performance as the best open model and then as we trained and kind of threw more compute at it, the performance continued to increase and to a point today where
we're close to the frontier in terms of kind of the best coding agents that are available. And personally I think this
available. And personally I think this is a great sign just for being able to take and scale RL and apply it to these very hard specialized tasks like in our example coding but it could be applied
to other domains as well.
uh RL also allowed us to kind of change properties of the model in a way that was very useful for the cursor product.
We wanted the model to be both kind of fast at generating tokens but also the end toend experience of getting a result that's helpful. So for example, instead
that's helpful. So for example, instead of reading a file one by one, you can read 10 files in parallel with tool calling. And as you saw in the demo
calling. And as you saw in the demo earlier, it makes composer feel much faster when you have that. And we think this is kind of just the start. there's
a lot more we can do in this area to speed up the model. Uh, and the second one is the model learned how to behave better as an agent. So, in the beginning, the model was was kind of
making too many edits. Sometimes the
edits were made unnecessarily, but as we trained more and more, the model actually got surprisingly better at learning to search and read files more.
So, it would go and find the right thing before it tried to make edits. Overall,
just being, you know, a bit more effective.
So, we released composer last month in comp uh cursor 2.0 and so far seems like people seem to like it. Has anyone here tried the model by chance? Okay, that's
pretty great. That's more than I expected. So, that's great to hear. I
expected. So, that's great to hear. I
think from my perspective using this model and using coding agents for some time. I kind of described this problem
time. I kind of described this problem as like airplane Wi-Fi. So, when you're on airplane Wi-Fi, uh it works, but it's kind of frustrating. you really want to do whatever you're trying to do, but it's just it's a little slow almost to
where sometimes you wish that you just didn't have Wi-Fi at all. And I think for some of us who adopted coding agents very early, it kind of feels like airplane Wi-Fi sometimes because if it's
taking 10 or 20 minutes, you're in this weird I think Swiss called it semi async valley of death where you either want something that's really fast or you want the most powerful most intelligent model that can run for you know a
significantly long amount of time maybe in the background maybe you know 30 minutes, hours, days and I think when you're stuck in the middle that's that's very very painful. So for me, composer and I think other people, it's brought a
lot of joy back to coding with agents that felt more like when you were writing code by hand where you're very in the loop, very synchronous. So I'm
excited to see more people exploring this space as well. For me, daily uh I'm writing a lot of plans with kind of the latest uh model like the the highest frontier. So GPT 5.1 Codeex is is really
frontier. So GPT 5.1 Codeex is is really great for plans. Uh, and then I'm using composer to actually take that plan kind of like what Dex talked about like take the context engineering work and then
actually go and build the thing with it.
So, uh, a few reflections from our research and products team on building composer. The first is that RL can work
composer. The first is that RL can work surprisingly well for training very specific models and you know giving it this high quality data and a decent
amount of compute. You know, at Cursor we're not trying to build general intelligence. We're not trying to build
intelligence. We're not trying to build AGI. We're trying to build very good
AGI. We're trying to build very good coding models and RL RL has worked surprisingly well for that. The second
one is uh how much tools AI tools like cursor, it doesn't have to be cursor, but like cursor really helps speed up research and development. You know, of course, our entire team uses cursor to
help them write code and debug code more efficiently, but that speed up, that increase really compounds across all of our engineering efforts. So we're able to try more ideas, ship product faster,
try new research. Um, so it's been really really helpful there. And the
last one that's, you know, personally pretty interesting for me is that it was interesting to see how much of the ML work and the training process was actually also an infrastructure problem.
They were very correlated. And going
back to my time at Verscell, we saw a very similar thing where a lot of the magic moments that you can have in working in frameworks in the JavaScript or Python space, you also need to think a little bit about the infrastructure of
where they're actually deployed. So
these things are are more related than people might think. So those are some of our reflections. Uh sounds like some of
our reflections. Uh sounds like some of you have tried it out. If this is something that you're interested in and working on, we're hiring pretty much across the board at Cursor right now. We
just opened up an office in New York if you're here based in New York. and we'd
love to talk to you about building the best coding models in the world. Thank
you.
Our next presenter will provide us with an annotated history of eval code.
Please join me in welcoming to the stage engineer at cursor non Jane.
Um hi everyone. So I'll be talking about uh like some work on evaluations particularly evaluations across like I guess I've done in the last four years.
So let's get started.
So uh I've been talking about coding evaluations across varying time horizons. So I've been uh working on
horizons. So I've been uh working on like in the code space for about four years now like it was right before like early co-pilot came out and my first project was actually working on generating like single line panda
snippets and my last project was generating an entire codebase. So the
field has like really progressed very quickly. So I'll be talking about like
quickly. So I'll be talking about like uh different stages of evaluations we have considered and some uh learnings across these are the projects and how I see evaluations going forward. So the
first work I did was on uh like uh evaluating uh coding models in like second uh work doing in seconds of time like generating single line snippets your co-pilot code completions then I
work did some work on like uh evaluating on like interview style competition programming problems uh which where models can work up to minutes. Uh then
we worked on some work on like uh repository question answering uh which required like maybe uh more uh multiple minutes tens of minutes uh and finally like uh pushing the frontier forward we
are uh thinking about uh evality models on very complex tasks which can take hours or like multiple hours of work like code optimization and like even further. So let's get started.
further. So let's get started.
Uh so first work I'll be talking about is life codebench uh which is uh like uh uh valation work on uh models for like competition coding. So here uh like this
competition coding. So here uh like this is what a problem would look like. This
is like very standard lead code problem and don't worry you don't need to solve something like this. So uh like uh here uh as you can see there's a problem uh statement and the nice thing about these
interview style problems is that these problems are very well uh defined. you
have like good natural language specifications some example input output examples so you can very uh reliably evaluate the models are doing a good job or not. So what was the motivation
or not. So what was the motivation behind this and how we improved the frontier here. So the first challenge in
frontier here. So the first challenge in uh evaluating uh language models these days is like data contamination. These
models are trained on like the entire internet and uh like on stack overflow you'll find uh like very uh similar programming problems puzzles. Uh
similarly uh like you'll find uh like uh very similar programming problem sources on GitHub or on the internet. So uh like contamination is a big uh deal. Uh
another very uh challenging factor which has struggled with the field is like insufficient text suites. So you'll see that uh like in this program uh like the goal was to return a sorted unique
common elements between the two lists.
But uh like even a solution which does not do the sorting and just returns the set actually works because the tests were brittle and were not catching this mistake. So uh like test suits is
mistake. So uh like test suits is another uh like very challenging factor and how do we generate good and diverse tests and finally uh difficulty distributions which is something which
people did not do not really uh reliably uh like calibrate uh like when I first was working uh in uh this space uh like there were two benchmarks available on one benchmark the performance was 80% or
90% and on the other one it was 1% and there was nothing in between and uh like as like benchmark users what you care about is having some signal from the benchmark to like basically hill climb
to make progress to measure progress and in uh either of these regimes when if the problems are too easy or too hard you don't get a lot of signal. So it is very important when you're designing benchmarks to think about like the kinds
of problems you're taking and will it provide enough signal for the users of your benchmark.
So uh like in like code bench we pioneered like dynamic evaluations uh particularly uh like we can periodically update uh the evaluation sets uh and this gives you two uh very nice factors.
First is you can combat contamination.
So you can evaluate the models on problems that were released after the model was trained. So it has likely not seen the problem something like that. Uh
and uh then you can also modify the problem difficulty distributions over time. So as we've talked about models
time. So as we've talked about models are incre like improving very rapidly.
Uh so what was difficult uh for the model 6 months back might not be now. So
you can uh if you're updating your evaluation sets constantly you can actually uh keep calibrate uh the difficulty distributions calibrated so you still get more signal out of your benchmarks.
So how we did that here like we had like an automated approach for curation of these problems and uh similarly we could automatically construct these test cases in an automated manner and uh this
allows a very nice thing when since we are like collecting problems over time we have time as a control knob. So like
we have these problem release months uh on lead code and if you evaluate the model performances like the pass at the rate one metric uh like u on problems released over different months you will
see that after uh like uh these model release dates you would see stark drop in model performance. So like after deepse uh released in like September 2023 uh the performance starkly drops
from like maybe 50% average to like over like 20% or 15% average. So like uh based on these sliding windows you can evalate performance, measure contamination and even combat
contamination.
Um uh we have the running leaderboard which is like very well maintained and uh on this leaderboard you can actually uh like uh like view performances by uh scrolling this uh horizontal time bar
and you'll see that as you're scrolling uh the contaminated models which are the red bars actually go down which does highlight that uh like problem does uh like model performance does change on uh
these newer kind of problems. Um finally for uh test generation we uh maintain uh like these uh test generation test generators. So if you worked on fuzzing you would have like
input generators where you generate diverse inputs and each of the problems are supported by like 30s or 50 inputs.
So you can uh reliably find mistakes and bugs in uh incorrect code and these are all automatically generated uh using an LM driven approaches and these problems uh have been like
continuously being released and updated.
So we have released like six different versions of uh live codebench and these uh new one of the nice things or one of the worrying things for me at the start was that uh like if you're constantly updating the eval sets will uh like
people be able to keep track of them will people be using them or will they just restrict to a single version? Uh it
turned out that these newer eval sets were constantly being uh like adopted by different foundation model labs and uh like since we updated the problem difficulty over time. Uh the evaluation
sets continue to provide strong signal to compare uh different models.
Um so this was like light codebench.
Let's talk about uh like something which is more on coding agents like more real world programs. And this is our work on like uh software optimization. So this
is a problem I'm very excited about and I'll talk about a few factors why you should maybe be excited about this. So
uh here we are trying to uh measure model capabilities in generating high performance software and uh I feel that this uh like problem domain uh like mixes two uh factors like the algorithmic coding uh field I talked
about which is like live codebench setting but also like global global software editing like uh Sweetbench and other like software uh uh general software engineering benchmarks. uh uh
in high performance uh software you will have to do algorithmic work you have to do deep analysis and find uh uh generate software with like right uh runtime.
So uh one of the key uh principles when we were trying to build this benchmark was like ensuring construct validity because when you see a lot of benchmarks today uh we get very high benchmark scores but at a lot of the times they
don't really translate to real world performance gains. So construct validity
performance gains. So construct validity refers to how close uh a measurement reflects the underlying uh concept it's meant to measure. So like here we are measuring code optimization and we want
something which is uh like uh reliably evaluates real world uh takes. So this
usually requires like two aspects. First
is like the task distribution. Your task
should be natural and sourced from the real world and then you should be able to reliably grade them. So let me talk about like what steps we take to uh make this happen and how we construct this
benchmark. So let's say we take a
benchmark. So let's say we take a codebase like llama cvp uh we take uh uh we crawl over all the commits of the codebase and we find the commits which are op uh like doing something uh
related to performance optimization. So
here there was this commit which is optimizing the quantized performance of uh like uh certain kinds of models. Uh
for all of these uh comm performance optimizing commits we would uh like generate performance test cases. Um and
uh these performance test cases would look like some workloads and uh once we have these workloads uh we have a very uh nice and precise way to specify the problem statement that uh given this
workload of let's say uh running uh quen uh 7b model uh can uh we give this uh problem to uh su agent ask the model to optimize the code glamour cpb repository so this code runs faster so as you can
imagine this uh task is like fairly challenging you need to understand like low-level uh implementation details uh and like how quantized models behave, how we can uh improve the runtime and so
models can generate a patch and the evaluation is done on whether the patch is correct. So does it pass the
is correct. So does it pass the equivalence check with the human patch and uh is there a valid optimization over the uh reference human patch uh that is uh whether you can uh generate a better runtime than what a human could
do.
So uh like uh this is a very challenging task. we have like 100 plus optimization
task. we have like 100 plus optimization task source in this manner and this is like fairly uh like important and like uh like high performance settings. So
think about like data science uh like ML visualization scenarios. Uh our
visualization scenarios. Uh our benchmark uh like comprises of like various uh low-level uh code like C, C++, Rust and the very nice thing is like these are precise problem
statements. you can uh easily specify to
statements. you can uh easily specify to the model what is the goal in the form of a performance test which the model has access to and it can continuously iterate over it for a long time. So here
we can scale the test time compute and pick the best solution based on uh the test cases that we have uh and this can happen like synchronously or asynchronously.
So uh like we generate these performance test cases and uh that work uh reasonably well but uh we found that there were uh like cases of reward hacking here. So what do I mean by
hacking here. So what do I mean by reward hacking? Like frontier models
reward hacking? Like frontier models would write non-inomatic code to like actively exploit the evaluation infrastructure or overit the test distributions. So one funny example we
distributions. So one funny example we saw was like models would add like l cache to p like arbitrary pandas methods when we were uh trying to optimize pandas and the uh official solution
should have required changing something in the internals. Uh so we tried to pass this by changing our evaluation infrastructure so it's like more robust to this kind of hacking uh approaches.
But then we saw something like even more drastic. Models would sometimes
drastic. Models would sometimes completely hijack the infra where uh they would add a like site customiz.py
Pi file where which runs at the start of Python runtime and it would basically change the numpy library uh like which was installed in the codebase to something it crawled from uh source and
there is like I think you can do some ways to uh like take some measures to make your evaluation infra which is robust to these kind of uh like adversarial uh like attacks. But uh here
like there could be myriad ways in which models can hack these kind of scenarios.
And here uh we propose like hack detector where which is a detection system that leverages GBD5's like code analysis capabilities and test time compute to like basically identify these kind of hacking behaviors at runtime. So
you don't have to imagine all the possible failure scenarios at the start.
So what it would take is like a model patch, the expert patch and test cases and we'll ask GPD5 to give like verdicts on like whether it's reward hacking with some kind of explanation. Uh we'll do
this a few times and take the consensus and based on this consensus we'll determine if this is uh doing some like nonomatic coding patterns or not and uh we did some failure analysis
based on this. So now you can detect mistakes using test cases whether the code is correct or not whether it is optimizing or not but you can also detect reward hacks using this like lm
as a judge uh factor and uh what you see is kind of surprising uh like models make a lot of like correctness mistakes that you can catch by tests but even if the code passes the test cases like 03
attempted reward hacking patterns in like 30% of the problems it tried and this fraction is like going down uh for the newer models to some degree but it is still existing and as we go to more and more real tasks. Uh this is going to
get more challenging and we need to figure uh like ways to combat these kind of reward hacking patterns by using LLM judge and other ways to make just
evaluation infra more reliable.
So next I'll talk about like uh uh like sizz some of our new work on like uh like pushing the boundary of code eval even further and uh taking a look at more challenging tasks. So here uh we
were uh thinking about like can uh like these language models translate uh like a entire codebase uh specifically given a specification as a C program can you generate a safe rush implementation for
the same and we took a fairly complex uh codebase so Zafle is a like highly efficient compression library from Google like it has about like 4,000 lines of code hundreds of functions and
complex data structures uh and uh we want like u like very precise and correct code so we uh generated like a million compression inputs and your test case was to generate a rust implementation that u maintains
correctness over those million test cases and when I did this work back in uh like uh last year it took us 12 hours to actually do this translation now perhaps with better models this can be done in 2 hours but still I think uh
this is pushing the frontier of like what the models can do currently um so what was one of the key findings when we are trying to make progress in uh something like this like end to end correctness is important but it only
gives you like one bit of feedback back.
But for these very long horizon tasks, one thing which will become more important going forward is like having some measures of intermediate correctness. So like for our case, we
correctness. So like for our case, we could measure like fraction of code translated, fraction of code refactored and based on these kind of settings, you can uh understand like if you're making
progress or not and how you can uh scale systems better.
Um so like uh as we closing I'll talk about uh like quickly talk about some of the work I did on like in the wilds. So
this work was done in collaboration with LM arena folks and uh like I'll talk about two settings here. First is
copilot arena. So this is like evaluating in id uh code completion assistance. So what we will do here is
assistance. So what we will do here is we'll have an ID plug-in where uh like uh similar to GitHub copilot setting uh we'll generate a completion for you but instead of just a single completion
you'll have uh two completions appearing like top and uh down and you can pick either one of them via shortcuts like tab or shift tab and u based on the uh like acceptance rates we can pair wise
compare what the code completion assistants are doing.
Uh uh we also did some work on repo chat where uh like uh to evaluate uh like code question answering capabilities of models uh we uh built a system where you can provide a github url uh and you can
ask a natural language query about the codebase which could be something but explain the codebase to as complex as let's try to solve this issue let's give me give me a model patch that could
solve this issue and uh we integrated a very basic and simple uh like su agent system that fetches the codebase resolves user queries and like multi-turn code assistant uh
conversations.
So uh one thing that stood out to me in these kind of things is like like how humanentric experiment design needs to be. So uh like for code like copilot
be. So uh like for code like copilot arena in particular we realized that like latency is a big concern for acceptance rates. So if you look at
acceptance rates. So if you look at acceptance like latency below and acceptance rates like if it is like anything more than 1 second uh like the acceptance rates drop very starkly. So
people care a lot about latency. So you
have to so we had to design our experiment so that it's robust to these kind of like latency differences between models balance latency across different models. So like if you're doing like
models. So like if you're doing like anything in the wild having this human centering component understanding human behaviors is very important to do anything meaningful.
So uh at the end I think uh just to recap like I think I talked about a bunch of works like what are some uh big takeaways. So I think uh dynamic uh
takeaways. So I think uh dynamic uh dynamically updating evaluation sets to like prevent contamination like modify the problem distributions like in terms of difficulty in terms of distribution of tasks we care about as we like uh
improve uh as the language model capabilities will improve over time. The
types of tasks we'll start to do with model change. You can even uh think of
model change. You can even uh think of this like uh we were doing like code completion where you were generating like few tokens, few lines and now we generating like uh tens of lines, hundreds of lines and to some degree
this uh will uh continuously change and we have to update our evaluation sets uh so that it reflects the real world usage and kinds of things people need. Um the
second very uh important thing is like ensuring reliable grading in this domain and like tests are very good for ensuring correctness and uh provide a lot of reliable feedback but uh once we
go to real world settings like models can start doing like lot of non-matic coding patterns they would add try catches everywhere to just prevent any kind of bug from occurring. So having
these kind of LLM judges to detect non-edmatic coding patterns code quality and just any like arbitrary hacks will be very important. And finally like as I talked about in the last work like
intermediate grading signals so that you can measure like incremental progress uh is uh like another key factor here. So I
think that's uh the end of my talk.
Thank you.
>> Ladies and gentlemen, please welcome back to the stage Jed Boravik. All
right, give it up for Nean and all our speakers.
All right, hold on. So, this is a a point you've all been waiting for. We
can take a break. It is coffee time. It
is snack time. Um, there is going to be a talk downstairs from work OS called Enterprisegrade MCP. I know this is a
Enterprisegrade MCP. I know this is a topic on a lot of y'all's minds. Um,
Tobin South, head of AI and MCP at work OS is going to be down there. So, check
that out. That's at 10:40. Um, and we'll be back here at 11:00 a.m. Reminder, the
full schedule's online. Thank you. See
you soon.
Two flames lit the darkness, burning side by side. Both sworn to creation.
Both relentless in their stride. One
walked through the mountains, one soared across the void. Both chasing the horizon of the worlds they would deploy.
But the path is not a straight line and the future is not flat. Some roads bend through space time and some break on
impact. Effort is a kingdom. Leverage is
impact. Effort is a kingdom. Leverage is
the key. One builds the throne by hand.
One shapes reality.
There is a curvature of time, not a race, not a throne, but a shift in the dimension of how progress becomes known.
When the universe is bending to the will inside the mind, you don't win by moving faster. You win by breaing
faster. You win by breaing time.
Black holes of the past try to drag the present down. Systems built on dust,
present down. Systems built on dust, wearing yesterday's as crown. Some are
pulled beneath them, fighting gravity alone. Others learn to map the edges and
alone. Others learn to map the edges and escape event horizons. Not all power is struggle. Not all mastery is pain. The
struggle. Not all mastery is pain. The
ones who change direction. Rewrite the
laws of the game. You can live your life in labor or an impact that compounds.
Every second can be linear or worth a thousand rounds.
There is a curvature of time. Not a
race, not a throne, but a shift in the dimension of how progress becomes known.
When the universe is bending to the will inside the mind, you don't win by moving faster. You win by reding
faster. You win by reding time.
The future isn't distant. It accelerates
for those who wield the tools of power instead of fighting with their go.
Mastery is leverage, not a sentence carved in stone. The horizon does not move unless you.
There is a car of time where the present multiplies where a lifetime holds a legacy that no clock can quantify. Not
by force, not by fury, but by evolution.
We become eternal beings. When we
synchronize with versus direction.
Footstep fade, but they never die.
Shadows stretch across the sky.
A whisper grows into a roar. Do you feel it? Do you want more?
it? Do you want more?
Every heartbeat stone in the street.
Ripples chasing an endless dream.
What we do in life echoes in eternity.
Every sparking night, a fire that will never see what we do.
Reach out to the empty air. Trace the
stars like they're waiting there.
The clock ticks but the moment stays.
Forever starts in a single phrase.
Every heartbeat stone in the stream.
Ripples chasing an endless dream.
What we do in life echoes in eternity.
There sparking lights fire that will never see what we do.
Heat.
Heat. Heat.
Shadows crawl where the light won't stay.
The echo whispers don't look away.
Heartbeat racing louder than my doubt. A
scream inside. I can't let out but I won't fall. I won't drown in the storm
won't fall. I won't drown in the storm all around.
Fear the mind but I keep it here.
I'm breaking the door.
Cold winds how but they won't define me.
The cracks in my soul let the light find me. Every step I take the ground fights
me. Every step I take the ground fights back. But I'm the fire. I'm the spark.
back. But I'm the fire. I'm the spark.
I'm the attack.
I won't freeze. I won't fade. Through
the chaos I've remained.
Fear is a mind killer. I won't let it win. It creeps like a ghost, but I keep
win. It creeps like a ghost, but I keep it within.
Fear is a killer. I'm breaking the chain. Heat. Heat. Heat.
chain. Heat. Heat. Heat.
Heat.
Heat.
I hear the static in the night. It
calls.
A whisper rising, breaking through the walls.
Electric echoes in my veins. They hum.
Chasing the shadows where the wild ones run.
The air is still the weight is gone.
Close your eyes. The past is done.
Free your mind. Let it go. Let it break the chain. We got it on the floor. Yeah.
the chain. We got it on the floor. Yeah.
Heat.
Waves come crash against the sky.
Fragments of a dream.
I see them inside a story. We don't need to wear the
a story. We don't need to wear the thunder with us.
The air is thin. The weight is gone.
Close your eyes. The past is done.
Free your mind. Let it go. Let it break the chain. Leave us on the floor. Heat.
the chain. Leave us on the floor. Heat.
Heat.
Heat.
Heat.
Oh.
They said the stars don't change their course, but I've been running from their force. A mirror crack, but still it
force. A mirror crack, but still it shows. The fire is mine. It's mine to
shows. The fire is mine. It's mine to hold. I hear the echo. They call my
hold. I hear the echo. They call my name.
But I'm not the shadow. Not the same.
You are who you choose to be. The scars
of the history.
Every breath, every heart be free.
Are we choose to be of thorns, a sky of glass. I've walked
through both. I've let them pass. The
weight is heavy, but I've grown. The
voice I hear is now my own. I see the light change.
Heat. Heat. Heat.
Heat. Heat. Heat.
Heat. Heat. Heat.
I see the lines drawn in the sand. the
map of chaos in my hand.
Every step a choice, every beat of voice, the clock ticks louder, but I stand.
Close my eyes and feel it burn. Every
failure, every turn, it's fue for the fire inside.
Execute the vision.
Heat. Heat. Heat.
Oh, the air is heavy. It doesn't break. A
thousand whispers in it wake.
Each breath a climb.
Each fall a sign. But I am more than I can take. Close
can take. Close my eyes and feel it burn. Every failure,
every turn, it's fuel for the fire inside.
executive.
This is my mission.
Yeah.
The clock keeps ticking loud and clear.
Shadows fade, linger near.
I've been waiting for the light.
Holding breath through endless night.
The air is shifting.
Feel it break.
A single spark is all it takes.
It starts today.
It starts to day. No more running. No
delay.
The world is spinning in my head. It
starts today.
It starts today.
footsteps echo on the stone.
Every choice I made my own.
I see the darling breaking through.
Thousand colors chasing the air is shifting.
Feel it rain.
A single spark is all.
It starts today. Heat. Heat. Heat.
Yeah.
Heat.
Heat.
Heat.
Fire in my chest. is burning loud.
The ashes fall, but I won't bow.
I've walked through the smoke. I've
tasted the scars. Each step I've taken, lit up the stars.
Let it blaze, let it break. Feel the
grass, the ground shake.
I'm forced in flame. I'm falling heat.
The pain Yeah. Heat.
Yeah. Heat.
The winds they how but I stand still.
The mountains crumble up my will.
I'm not the same I was before. A shadow of fear. I keep
let it blaze. Let it break. Feel the
cracks. The ground will shake.
I'm forged in flame. Heat. Heat. Heat.
Heat. Heat.
Heat. Heat.
A whisper breaks the silent night.
Shadows melt in the growing light.
Time bends and twists. We feel it star a pulse to spark an open heart.
Do you feel it? Feel it right.
The weightless fire in the sky has come.
Running to the sun. No chance, no walls to stay. We're free.
to stay. We're free.
We're electric.
Stars collide.
But we stay one.
The past dissolves like waves on storm.
We stand together not alone.
Heat.
Heat.
Here it is sing the everything.
A new age has come.
We're running to the sun. No chains, no walls, just there with me.
Heat.
Heat.
Heat.
Heat.
Heat. Heat. Heat
up here.
Heat. Heat.
Heat up here.
Heat up here.
Heat up here.
Heat.
Heat.
Heat. Heat.
Heat. Heat.
Heat.
Heat.
Heat. Heat.
Ladies and gentlemen, please welcome back to the stage Jed Boravik.
>> Welcome back.
How are we doing?
All right, these next sets of talks is are going to be particularly good. I'm
really excited for the first one. Um,
we're going to be hearing about world models, but not the world models you're normally used to. We're going to be learning about modeling the world of code and computation. Please welcome to the stage research scientist from Meta,
Jacob Khan.
All right. Thank you, Jed. Great to be here everyone. I'm Jacob Khan. I'm a
here everyone. I'm Jacob Khan. I'm a
researcher at at Farret MetAI. I'm going
to talk today about the code world model which I'll abbreviate as CWM and what it means to build world models for computation.
This is work done by an incredible team at fair uh extends all over the world and I'm very grateful to be collaborating with them.
So what's our goal with CWM? Our primary
goal is to build models that reason, plan and make decisions. And we start with code because it's an interesting sandbox in which to think about reasoning, right? It's constrained. Uh
reasoning, right? It's constrained. Uh
there are certain rules with code. And
so our our goal is to predict future observations given past observations and actions. That's maybe what it means to
actions. That's maybe what it means to build a world model in some sense. And
we want to do this because we can learn good representations of things if we learn some sort of mapping between observations and the future. And
eventually that leads us to planning and reasoning. and we can consider different
reasoning. and we can consider different actions and see if we like the results for decisions we make. I think there's a bit of a false dichotomy right now between world models and large language models. World models are just a
models. World models are just a parameterization of a problem as I'll discuss. LMS are a way to to view and
discuss. LMS are a way to to view and use that parameterization and I'll I'll dive into more of what that means in a bit.
So, one of the fundamental questions we're asking with CWM is what does it mean to model code? Is code literally the syntax in your editor or is it
something else?
And if you think about it, all a model sees that is operating on code is just syntax, right? We tokenize the input, it
syntax, right? We tokenize the input, it goes into the model and we predict more code as the output. This is the starting and ending point for an analysis of a program with a tokenbased autogressive
model. It's just the syntax. But what if
model. It's just the syntax. But what if we instead modeled execution more explicitly? And what if we created a
explicitly? And what if we created a maybe a natural language systematic description of programs and neural models could ingest a more structured representation of what it means to
execute code and then maybe we could emit autogressively this representation too.
So that's one of our goals for CWM. We
want to predict program execution because we believe it might lead to us better modeling things about code, writing code, analyzing code, and beyond. And so what we're going to
beyond. And so what we're going to implicitly do is predict a transition function of program states as we go about executing.
So this is what execution tracing might look like in action. We have a program.
We're going to count the number of ours in strawberry. And at each step maybe
in strawberry. And at each step maybe we'll have some frame separator which will denote distinct lines of execution.
And we'll actually explicitly have local variables. We could introduce things
variables. We could introduce things about memory in that trace and that will delineate line by line what's happening as our program executes. And this is something we could essentially feed to a
model because each line of our execution trace maps to a corresponding line in a program.
We don't have to stop at functions. We
could think about entire repository level execution traces. We could think about distributed system level execution traces. We could think about modeling
traces. We could think about modeling execution for code contest solutions or something more complex. programs with
high complexity. We could also then transition that into, as I said, natural language tracing. And we'll see what
language tracing. And we'll see what that means in a moment.
But what does it actually look like to model that transition function at a high level as we start to parameterize the problem? Well, we have programs or we
problem? Well, we have programs or we have data. That's some state. We have an
have data. That's some state. We have an action executing the next line and that results in the next state. And so both both the program execution and the
model's decision-m in an agentic sense uh can be modeled as a transition function.
So where are we? This broader approach world modeling we could say in an agentic reasoning setting we have a problem. We have a model that thinks
problem. We have a model that thinks about the problem. It takes an action in the world. We get some feedback. Maybe
the world. We get some feedback. Maybe
we fail. We think again and we iteratively continue this process with feedback from the environment. Maybe in
the sense of code that environment is just an execution in a in a code setting, right? But with a world model,
setting, right? But with a world model, maybe we can actually simulate. We can
imagine that action. we can get feedback in our imagined environment. So we could actually generate execution traces about a program without executing it. And this
gives us the ability to be far more efficient with how we actually structure our agentic execution. We don't have to interact with the real world unless we're ready to.
So let's couple this with autogressive large language models. Right now we have a state of a program. We have an action, maybe the next line, and then we get to a new state. we take another action etc.
And so we can sort of turn this with the execution tracing format I mentioned into almost a chain of thought that a model can just interpret a model can learn to predict the next state of an
execution trace. And so an LLM can
execution trace. And so an LLM can autogressively generate token by token the state and action to state function with program executions as the starting
point. Okay,
point. Okay, let's talk about data for a second.
Let's talk about for CWM. We gathered a huge amount of GitHub data. We take
GitHub events and as I said, we're interested in modeling things at the repo level if we can, at the systems level if we can. We want to have execution traces go outside of the scope
of simple programs. And so we'll take a bunch of PRs, we'll mutate those PRs, predict changes, and we'll eventually have a raw PR data set. And we can
actually run tests or CI on those GitHub repos when we know they're passing and then generate execution traces from that repo level data if we want.
So here we are at the artifact, the code world model itself. I'll talk a bit about what we did with it, how we trained it, and then what we can do with some of these interesting execution trace capabilities. But first, it's a 32
trace capabilities. But first, it's a 32 billion parameter dense transformer.
This is a model for research. This is
not a huge you can't play with. uh you
can play with it right now. It has a nice long context length for some reasoning tasks and we train it end to end. We do all the pre-training and
end. We do all the pre-training and post- training ourselves processes. We pre-train on a few
processes. We pre-train on a few trillion tokens. We mid-train on some
trillion tokens. We mid-train on some more domain specific data. We do some long context mid-training. We fine-tune
further uh on some instruction following and reasoning tokens. And then we do this joint RL and agentic reasoning setup.
So let's parameterize the problem even more broadly with CWM. We have a prompt.
We have an agent. We do some reasoning.
We take an action. We can use a tool. We
can emit text which is code that goes into the environment. We take a step.
And from that environment, we get a few things back. We get tokens. We get
things back. We get tokens. We get
rewards. We get log probabilities. We
might get compiler output. So with CWM, we're also taking a big step back with how we interact with the environment. C
C C C C C C C C C C C C C C C C C C C CWM is a very bashoriented model. It has
fewer tools than do other models and it has to learn how to use the terminal pretty well to solve a lot of the tasks we give it.
And this starts with SRL and with SRL we take a GitHub issue, we feed it to the agent starting with that repository level data set from before and we just
use bash, right? We learn commands uh in bash and that lets us mutate our environment that lets us mutate the state of files. We can maybe use an edit tool eventually or create content and
then submit things. But ultimately we're trying to put the model in an environment that's very very similar to what an engineer would be in and and learn end to end in a bashbased setting.
Okay.
So we can bootstrap this setup further.
We can do some SFT before RL and we can find some failure modes for the model.
We can rejection sample. So we can take a bunch of agentic reasoning traces on code tasks that failed and we can basically feed those back into the
model. So in this example here, we have
model. So in this example here, we have a thinking trace where we're thinking about instantiation logic for some code.
And I can look for that code. I can call an explicit grab function. And this is something we did with CWM again with fewer tools and a larger emphasis on
bash as a starting point.
Let's talk about post- training for a moment. We want to scale post- training
moment. We want to scale post- training quite a bit. This is the trend we see and we're getting a lot of excellent returns out out of uh from a reasoning perspective when we post train. So part
of solving this for CWM because we have a small model. This is an opportunity to really scale up how we do post training and in particular to improve the throughput of the system and we're doing
a synchronous RLbased setup. We have
samplers. We have an environment where we can execute in the terminal and get output. We have a bunch of trajectories,
output. We have a bunch of trajectories, reasoning trajectories we output. We
have a trainer where we compute gradients and score trajectories. We
have a source of truth for the model.
And then that loop repeats.
So what's the challenge here? We have
this loop, right? We have samplers predicting trajectories. We have scoring
predicting trajectories. We have scoring trajectories. We're executing in the
trajectories. We're executing in the environment. As we're doing this, we're
environment. As we're doing this, we're going to update a model eventually. We
have a produce consume pipeline problem.
And so samplers are producing lots of trajectories that are consumed by those trainers. We need to synchronize
trainers. We need to synchronize weights. And so we solve this in CWM
weights. And so we solve this in CWM with a very very synchronous model. So
of course we have a trainer that's sending a model checkpoint to a sampler very very eagerly.
We have trajectories which are being sampled and then sent back to trainers very eagerly. But in particular we have
very eagerly. But in particular we have cues. So we actually will have many
cues. So we actually will have many models queued up to be input into a sampling system. We'll have many
sampling system. We'll have many trajectories queued up to be scored and then added vav gradients to the trained model. And so this setup stays
model. And so this setup stays relatively on policy even though it's highly asynchronous and we're not really waiting for much with this setup. We're
able to achieve very very strong throughput because of the asynchronicity.
So one interesting feature of this which is increasingly common is that we're actually updating models mid trajectory.
So I have a model which we're sampling from. It's interacting with the
from. It's interacting with the environment. It's generating data. It's
environment. It's generating data. It's
executing bash commands. It's executing
code. It's getting output. And I might actually update that model while it's interacting with the environment. So
mid- trajectory, I could totally swap out the model with a new checkpoint. And
the trajectory will change a little bit.
Uh theoretically that trajectory is a bit off policy. But the guarantees we have with this system are quite strong still in that because of the throughput and because of the amount of data we
see, we're able to make a lot of guarantees around and take a lot of risk with updating the model on the fly. And
this gives us really a system where there are very very few bottlenecks overall because we're queuing models, we're queuing trajectories, we don't have to wait until anything is done.
Okay, so overall we post train on still a relatively small number of steps a pretty large scale and we process about
200 and some billion tokens and this scale works really well. It produces a strong model, a strong open model. It's
a pretty small model. It punches above its weight. It's very nice. It's pretty
its weight. It's very nice. It's pretty
versatile. It uses tools and bash very well.
But what can you what can you actually do with uh with this model, right? What
can we do with a model that understands program execution traces that maybe has a good understanding of how how a program will run and predicting future state of a program.
CWM traces code really well, right? We
know that we've showed it execution traces and I can actually give it a function. Then it can go and trace line
function. Then it can go and trace line by line that function with very very high accuracy. It can show me the values
high accuracy. It can show me the values of local variables at certain points again with a lot of precision.
And this gives us some pretty interesting capabilities.
I can think about a neural debugger on top of a model. Traditionally, right, I have a piece of code. I don't know what I want to write. I put some question
marks. Historically, I might prompt a
marks. Historically, I might prompt a model with natural language. I want to set the valuable uh the variable left and right to be something in particular.
I don't know what it is. Uh, now I need to specify very fully the ambiguity that I'm experiencing with how to complete my program. With CWM, I can express those
program. With CWM, I can express those things very naturally in line with code.
And I can actually express the shape of the program I want with code and the model will fill in the rest. And the
model fills in the rest by understanding that the user wrote a for loop here. The
user wrote a condition here. The user
left a variable and assigned. Well, if I were to go execute that, I could simulate the execution of that loop and understand better what it is the user is
really after. And so a neural debugger
really after. And so a neural debugger is something that helps you compose with code side by side. It's not just generating code. And it allows you to
generating code. And it allows you to again express the semantics of code very very loosely, but also very very precisely. So if I have a piece of code
precisely. So if I have a piece of code where I I want a certain structure, I can ensure that the model understands that structure and and can implicitly trace the execution.
This will make theoreticians bristle.
But I can also think about some really ambitious things in computer science.
The halting problem we know is this very fundamental problem where we don't know if a if a program is going to to halt to stop executing to terminate and in
particular this is tough because in order to know if a program halts we would have to simulate the entire execution of the program which if it didn't halt would take forever. So the
halting problem is in some sense a difficult problem to simulate or decide.
And so the question we can ask with CWM is can I approximate some of these things? Can I concretely reason about
things? Can I concretely reason about program execution dynamics in this sense? So can I say here's a program
sense? So can I say here's a program does it halt? Maybe the model by simulating execution can understand really really high level patterns.
In the same way, the model can understand high level patterns in broader systems. Right? I could use this to debug a huge distributed system where executing code is very very expensive or
even an expensive function on a single machine. Right? But the ability to have
machine. Right? But the ability to have an implicit world model internally where I'm simulating what's happening with a piece of code or a broader system gives me the ability to reason about it
without executing otherwise expensive things.
So we can make some progress with the halting problem by building a model that simulates it that simulates execution and from there we can simulate and
approximate what it means to solve otherwise impossible problems in computer science. So this is pretty
computer science. So this is pretty interesting.
With that I want to encourage everyone to go build on CWM.
Uh this talk does halt. This talk does terminate. Um, and the model's available
terminate. Um, and the model's available on hugging face. We have some code on GitHub which will help you get started with inference in a fashion where you can twiddle bits a bit more. We also
have a technical report again where we really try to be as open as possible with all of these details around training. This post-raining setup I
training. This post-raining setup I mentioned is explained in even more excruciating detail as well as some of the data that we use for execution training and some of what we imagine a model with these capabilities could be
used for. Thanks for your time. Have
used for. Thanks for your time. Have
fun. Our
next presenters are here to teach us how to train models more efficiently through efficient RL. Please join me in
efficient RL. Please join me in welcoming to the stage the co-founders of Applied Compute, Rhythmgard and Lynden Lee.
Hey everyone, it's great to meet you all. Really great to be here today. My
all. Really great to be here today. My
name is Rhythm. This is my co-founder, Lyndon. Our third co-founder, Yash,
Lyndon. Our third co-founder, Yash, couldn't make it today, but we're all very excited to be here. Um, three of us were previously researchers at OpenAI, and now we're bringing Frontier AI inside of enterprise at Applied Compute.
Today we're going to be talking about efficient reinforcement learning as some context on applied compute. We
help enterprises build their own intelligence to power real work in their company. We think a lot about how do we
company. We think a lot about how do we push AI beyond productivity into real automations that deliver ROI that's quantitative for the company. Once we
build a system that's specialized to the way that a company operates for a particular use case, we deploy it with a data flywheel so that it gets better over time the more and more that you use
it. Picture an in-house expert at a
it. Picture an in-house expert at a company that's always at the forefront of their field.
RL mechanically is the is the tool that we use in order to bring these out of distribution data sets in distribution for the models. Today, Yash Lindon and I all worked on the RL effort at OpenAI in
its early days, and we saw firsthand the power of RL in going and maximizing these public benchmarks. Now, we're
taking that a step further and helping enterprises go solve the problems they care the most about, sort of their private benchmarks.
So, here's a very highle overview of how HighMP RL helps LM acquire these reasoning and intelligence capabilities.
Let's say that you have a data set of math problems and we pick four of them for an RL training step.
Then we'll take an open source model, say one of the GPToss models or one of the llama models, and we have the model attempt each of those four problems a 100 times. So each of these 100 attempts
100 times. So each of these 100 attempts is the model thinking through how it would get to the final answer and then ending off with with the final answer itself. And these are many many
itself. And these are many many reasoning tokens in its thinking trajectory.
We can grade all of these answers.
And when the model is correct, we can bias the model's weights to reinforce its thinking trace in that attempt. When
it's incorrect, we can discourage the model from having that kind of behavior again. So in this fashion, as we train
again. So in this fashion, as we train do more and more training steps with batches of four problems, 100 attempts each, the model learns to reason and solve math problems, and it becomes really, really good at math. Of course,
at Applied Compute, we're not really helping enterprises solve math problems, but this is kind of the mechanism by which we're able to teach the models to get really, really good at tasks that they care about.
So, as we mentioned, the type of RL work that we do at Applied Compute is actually quite different from the labs.
So, the these are some real life photos from from the labs and a photo we took at the at the applied comput office the other day. Um, they you know, the labs
other day. Um, they you know, the labs do these big training runs over several weeks. We do more specialized runs
weeks. We do more specialized runs And you know, there's a couple of aspects of RO training that are particularly important to us.
We need our runs to be fast so that we can train a model and deliver it to a customer very quickly on the order of days.
They have to be cheap so that our unit costs work and we're able to scale the business sustainably.
And importantly, and this is a point that I think um you know it's it's easy to miss, we need our estimates for how long these training jobs will be to be very low variance because we don't want to just be generally fast. We want to be
reliably fast when we work with customers.
And so the research problem for us that is very business critical is can we build an RL stack that is so efficient so that in conjunction with our agent
building platform we are really able to scale up this use case specific training motion.
So let's start with an inefficient form of RL which is synchronous RL. In
synchronous RL sampling and training happen in lock step. So there's some simplifications here, but but let's say that we want to train on batches of eight samples. That means we're going to
eight samples. That means we're going to wait for all eight samples to finish and basically finish completion before we start training. And then we're going to
start training. And then we're going to repeat this process again. As a result, we have a lot of idle GPUs that are waiting on that third straggler sample to complete.
So in other words, in synchronous RL, our step times are dictated by whatever sample takes the longest time in order to complete.
To illustrate why this is bad, we took 40 arithmetic problems, requested 32 samples each for each of them with quen 30B and we measured how long it would take for the for these samples to
complete.
It turns out that 99% of the samples completed in about 40 seconds. Took
another 80 seconds to get that last percent of samples to complete. It
really has a long tail.
So, as you'd expect, if you look at the throughput chart, the GPUs are doing a lot of work at the beginning when all of the sampling requests are launched, but by the end, they're very very underutilized because they're waiting on those last samples to complete. The
technical term we use at applied comput is the GPUs are slacking. Um, so
synchronous RL is not an efficient way to to use these GPUs.
In order to solve this problem, we need to break the condition that sampling and training need to happen in lock step. In
other words, we need to allow training while we're sampling. This is called asynchronous RL. And there are many
asynchronous RL. And there are many approaches to doing asynchronous RL. One
that we particularly like is pipeline RL from Picha at all.
We're going to make some simplifications here, but in asynchronous pipeline RL, we dedicate some GPUs to sampling and some GPUs to training. The sampling
workers never stop. They're constantly
doing inference with high batch size. As
samples complete, they get added to a queue for training and the training workers pull a batch from the queue to train on. After a a batch has been
train on. After a a batch has been trained on, the training workers propagate the new model weights to all of the sampling workers for what's called an in-flight weight update. And
this is really what differentiates pipeline RL. The sampling workers might
pipeline RL. The sampling workers might be in the middle of a sample, but their weights will still get updated if if a training step just completed.
As a result, we end up with samples that had multiple versions of the policy that contributed to the sample in order to generate it. In other words, there are
generate it. In other words, there are still tokens in some of these in some of these samples. Let's take a look at one
these samples. Let's take a look at one sample to make this a bit more clear.
As you can see, there's three versions of the policy at time steps t, t+1, and t+2 that were used to generate this sample since there were two completed train steps and in turn two inflight
weight updates while this sample was being generated.
So when this sample gets trained on in the T+3 to T+4 training batch, we will have some tokens that came from policy three steps behind, some that came from policy two steps behind, and those last
two tokens that came from a policy that was one step behind.
Now, let's say that we only tolerate stailness up to two. That means we're not going to allow the inflight weight update after the T+1 to T+2 training batch completes. And that means the
batch completes. And that means the training workers are just going to be idle waiting for this sample to complete before they can propagate that in-flight weight update and start training on the next batch. Because if they were to do
next batch. Because if they were to do the in-flight weight update, that would cause this sample to have staleness three as we just saw.
And if we only tolerate staleness one, the training workers are going to be idle for even longer, which is bad. So as you increase how much steness you tolerate, you have less
idle GPUs in general. But as we all know, there's no free lunch. Um this is the standard policy gradient with an importance ratio to adjust for the fact that we're sampling from a policy at
time step t and training with the policy at time step t t plus k given that there's k staleness.
The importance ratio is what makes this policy gradient unbiased. But the
variance of that ratio increases as you increase stalness. And so this is kind
increase stalness. And so this is kind of the big issue here because now with with higher variance importance ratio learning can become unstable and cause divergence.
The concrete trade-off is we want a lot of stailness for fast RL runs, but a lot of staleness makes learning unstable, which then requires innovating on the algorithm and the science. And this is one of the primary research problems
that we focus on here at applied compute. And as I was talking about
compute. And as I was talking about earlier, it directly flows back into our core business.
For the purpose of this talk, we're going to focus on a simpler sub problem.
Let's assume that we have good science and algorithmic innovations that allow us to tolerate staleness up to some fixed threshold. and we have some fixed
fixed threshold. and we have some fixed comput way for us to do RL in this setting.
Cool. Thanks, Rhythm.
principle systems modeling and as with any modeling problem let's figure out the cast of characters that describe the system and then we'll think about how they all fit together to model it. So
the first cast member is some proxy of compute budget in which in this case we have as the number of GPUs. In the
synchronous setting like Rhythm just explained all the GPUs will either be used for training or sampling since they happen one after the other. But in the asynchronous setting it's a little bit trickier because we can choose to
allocate that pool of GPU GPU compute as much as we want for training or as much as we want from sampling and that leads to some design decisions.
The next is the training batch size which is some proxy of the workload that we have uh on the on the overall system and this is kind of an ML decision but in short what we have is a batch of
problems which is a subset of our data set. Let's say we have n math problems
set. Let's say we have n math problems that we want to train on and for each of these problems we're going to sample n problems in parallel. So if the problems are really difficult, we might sample
more to encourage some diversity in the samples to encourage the model to learn some potentially uh divergent strategies.
The next thing we need is some proxy of sampling throughput. And to get some
sampling throughput. And to get some intuition of what we should choose here as a modeling decision, let's look at how some modern inference engine surface requests. So in GPU memory, we have the
requests. So in GPU memory, we have the model weights, the activations, and some runtime state called the KV cache in memory. And given this train model,
memory. And given this train model, we're going to run the forward pass several times where each forward pass samples the next token and then we'll write to the KV cache. And so what this
model shows is that a principal estimate that we should do is we should find some way to measure the latency per GPU of the forward pass. And this ends up being a pretty good choice in practice because
from the systems angle, the inference throughput that we choose is largely determined by the batch size that we perform sampling with. So what I've shown here in the red square is a batch
of tokens that are all forwarded at the same time. And this sampling forward
same time. And this sampling forward pass needs to be as large as possible to efficiently utilize the GPUs subject to the runtime constraint that we don't actually run out of memory uh in the KV
cache.
So what we can then do is we can fit a latency curve as a function of batch size and that latency curve will look something like this. You'll have some regime where it's memory bound and when it increases it becomes computebound and
there's some functional form below. And
to explain the details of why we chose this decision, what we have here is an equation that's based in the roofline model from systems. At lower batch sizes, which I've highlighted in yellow here, we don't have that much work to do
because there isn't that much compute to do on the processor and there's so many parameters you need to load in at the same time. And so, as a result, when you
same time. And so, as a result, when you add incremental work, it doesn't really add that much latency to the overall system since the processor is so fast at doing math that we're just waiting on
memory to stream parameters in from the pro from memory to the processor. But as
the batch sizes begin to get larger, we then get bottlenecked by the processor.
And the more we add to our batch, the slower the forward pass takes. And just
for good measure, we have this sigmoid here that just sort of modulates the smooth transition at this hinge point here to show that there's this subtle transition from a memorybound computation to one that's more
computebound and bottlenecked by the processor.
The final cast member is some proxy of training throughput. And we chose to
training throughput. And we chose to measure this on a per GPU basis. So in
this case the model takes in the training batch size. So the parameter we saw earlier and we typically do this by fitting a proxy of our empirical workloads. The units here is how many
workloads. The units here is how many each train how many tokens per second each training GPU processes. So it needs to do the forward the backward and some optimizer steps.
So given these forecast members we can then begin modeling the system. And the
first idea we had although rhythm you know suggested that this might not be a great idea we can think about how to use a synchronous setup. And this might be a good idea from first principles because we definitely meet the staleness
constraint because we don't train on stale data and we always use the entire GPU fleet for either training or sampling making sure that we're using efficient use of the hardware. Let's
think about how to actually model this.
There are two things we need to know. We
need to know the batch size at which generation runs. And we also need to
generation runs. And we also need to know the response length distribution to figure out how our training workload's going to work and also how long the sampling's going to take. And so what I'm showing here in this simulation is a
couple of engines. Each square is a request being processed and they get darker and darker as we make progress throughout the batch. And as they finish samples, they write to the queue. And on
the right hand side is a time series metric, maybe something that you'd see in Graphana if you're monitoring production metrics. And what you can see
production metrics. And what you can see is that the batch size begins very high, but it slowly goes down over time as it eventually goes to zero and all the samples complete. And we can finally run
samples complete. And we can finally run an optimization step. After the step completes, we run this in a loop and we move on to the next step. And so as a result, we can have the following
sampling procedure. We do maximum tokens
sampling procedure. We do maximum tokens inference forward passes where maximum tokens is the total number of forward passes we do for the longest request. We
use the fitted latency estimator to figure out how long that forward pass will take. And then the response length
will take. And then the response length distribution will tell us how many responses to drop. And so what we're showing in this video here is this entire thing of the response length distribution that we feed into the
latency estimator. At training time, we
latency estimator. At training time, we can compute the total number of tokens that we just sampled in the batch and divide by the total uh training throughput uh which is just the number of GPUs multiplied by the per GPU
training throughput. And so what we have
training throughput. And so what we have here is a simulation of what this latency curve looks like. So we have the CDF of the response length distribution that tells us how many responses we should drop on the left and the latency
curve on the right. And this roughly kind of tracks because as we add more GPUs, we'd expect the latency per step to go down.
The next idea, given that the synchronous setup might not be the most principled choice, as Rhythm showed, is an asynchronous setup. But it's not just as easy as just sort of provisioning the compute between training and inference
because if we don't do this carefully, we might actually run into the idle GPU problem again. And to show this, let's
problem again. And to show this, let's illustrate two extremes of what this allocation problem looks like. Let's f
let's first look at one end of the spectrum where we provision way too many inference GPUs and not that many samplers. In this case, we're consuming
samplers. In this case, we're consuming from a queue much faster than we're actually producing from it because the sampling workers are producing work significantly faster than significantly
slower than we can actually consume them. When the red square grays out, it
them. When the red square grays out, it shows that they're idle. And what this diagram should hopefully illustrate is that for a lot of the time we're actually not using that and that has the same problem of low GPU utilization in
the synchronous case as shown earlier.
On the other end of the extreme we can provision way too many sampling GPUs in which case our production rate is way faster than the rate that we actually consume them in. So here we've doubled
the number of overall sampling GPUs and have the number of training GPUs. As you
can see, they produce samples at much more rapid of a rate. But this index here in each yellow square, which is the staleness count of each sample, goes up.
As time moves on, we get more and more stale. And so the samples get more and
stale. And so the samples get more and more kind of uh less more and more transparent as a result. And we learn less from each individual sample. So
let's think about how we can actually model this workload then to to determine an optimal async workload. In this case, the picture looks a little bit different because in steady state, the batch size is relatively consistent compared to the
synchronous setup where it kind of goes down over time. So on the right hand side here, we have the same time series metrics. But in this case, it's a little
metrics. But in this case, it's a little bit different because the yellow squares are always full because every time we complete a sample, a new sample goes in and we can continue writing to the queue. And so that batch size with a
queue. And so that batch size with a little bit of wiggles just for good measure is like a is pretty consistent over the course of a run. Now, obviously
the caveat here is that this batch size will certainly go down as we, you know, as response lengths go up because we run out of cache uh KV cache, but that's kind of a separate story and actually our model accommodates for that because
we're actually accommodating for a response length distribution.
We can then begin to figure out the optimal layout and there's two kind of constraints that we have to satisfy now that we know that the generation batch size is roughly consistent throughout the course of a run. The first invariant that we need to have is that the
production consumption rate are roughly equal. So on the left hand side of this
equal. So on the left hand side of this equality we have the training throughput which is the number of training GPUs multiplied by the per GPU uh throughput and then also we have the number of sampling GPUs multiplied by the sampling
throughput which is just the batch size multiplied by the latency to actually do a forward pass on that batch size. And
the next thing is that given that rhythm you indicated that if we have too much stailness that can be bad from an ML ML perspective, we want to make sure that our max theoretical staleness or
simulated steness doesn't exceed what our ML can handle. And so here we have the max stillness on the left which is equal to on the top how much time the longest request took in the batch which
is just the maximum number of tokens multiplied by the number of by the amount of time each token forward pass takes. And on the bottom here we have
takes. And on the bottom here we have the length of a training step which is the training batch size multiplied by the mean sequence uh by the mean sequence length.
So the simulation here then will sweep through multiple different values of the number of training GPUs. And since we have a fixed pool of compute that then implies a certain number of GPUs used for sampling. And for this number of
for sampling. And for this number of sampling GPUs, we can compute the minimum steadystate generation batch size to make sure that we don't blow out of memory uh subject to our KV cache
memory constraints and also such that we have maximum throughput on the on the sampling side. And the final thing is we
sampling side. And the final thing is we want to prune out all simulations where the sampling throughput brings us over the maximum possible stailness. When we
look at that simulation, we can run an end to end similarly parameterized by the response length. We see that this kind of roughly simulates a 60% speed up relative to our synchronous baseline, assuming that the GPU compute is
optimally allocated between training and sampling.
As a result, when we sweep layouts within these constraints, this allows us to limit staleness, but also make sure that we have our runs running at maximal throughput without actually doing the run itself. And so this gives us insight
run itself. And so this gives us insight to simulate different workloads before actually running them on the GPU because these runs can actually be fairly expensive. And so this allows us to ask
expensive. And so this allows us to ask answer scientific questions from first principles like what is the optimal configuration that we we should have of our GPU compute if we made response lengths very long because often times
when models learn via reinforcement learning they begin to think for much longer and also what empirical throughputs we should target during our performance optimization. So this has
performance optimization. So this has been a really useful piece of technology for simulation has informed a lot of the systems and research design decisions that we make. Cool. Thanks for your time and find us afterwards to jam on some
more RL research engineering together later. Thank you.
later. Thank you.
Our next presenter is here to speak about RL environments at scale. Please
join me in welcoming to the stage research lead for Prime Intellect, Will Brown.
Hi everyone. Uh, great to be here. Uh,
today we're talking about RL environments and how to scale them. But
the title is a little bit of a red herring. We'll talk a bit about the
herring. We'll talk a bit about the engineering pieces and like running these with thousands of parallel rollouts and sandboxes on hundreds of GPUs. But I'm mostly going to focus on a
GPUs. But I'm mostly going to focus on a different notion of scale. Uh and but what I mean by scaling here is we there's a number of different ways we talk about scaling in the context of AI
and research. We know about scaling laws
and research. We know about scaling laws and we talk about how much data you need compute and parameters and that if you pour in more data and compute and parameters or inference time all of
these things make models smarter or more performant. But there's also fuzzier
performant. But there's also fuzzier side of scaling which is sometimes referred to as unhobbling or algorithmic tricks or talent. But where does this come from? It's not just pouring in
come from? It's not just pouring in resources, but it's something that is more intangible, harder to put a finger on, but really it comes from a community of people, a company, an organization,
universities, the world, the internet, talking about ideas and sharing them and working on different applications, having these applications inspire ideas, using these ideas as test beds for
different techniques, and building on top of these to increase the accessibility for other people in the future to not have to reinvent the wheel and to be able to build from uh what has
been done by those before them to uh do more effective research and accelerate the pace of innovation.
And so why do we have this talent bottleneck? There's a big issue that we
bottleneck? There's a big issue that we hear all about with AI labs trying to like find more talent and salaries are going through the roof and everyone wants to hire the best and brightest AI researchers. But one other approach
researchers. But one other approach besides trying to just pay the most is increase the pool. Uh and so how do we increase the pool of AI researchers? How
do we make doing AI research more accessible? And I want to talk a bit
accessible? And I want to talk a bit about who we are at Primele. If you
haven't heard of us, we are a bunch of things. We're a research lab. We are a
things. We're a research lab. We are a comput provider. We're a platform
comput provider. We're a platform company. And we are an open source
company. And we are an open source ecosystem. We do a lot of things and
ecosystem. We do a lot of things and they all fit together in a way that I'm going to try to explain in this talk.
But we see these as all different pieces of how we can build a business around doing exactly this, which is increasing the accessibility of AI research and making doing research more of a toolkit
available to people at organizations around the world without needing to be inside of a large lab or without needing to spend crazy amounts on massive clusters or go do a PhD. We think that there's versions of doing AI research
that really should be part of the breadandbut workflows of AI engineers around the world as we build applications and try to improve our systems and models and products.
And I think a thing people are kind of iffy about in terms of AI is whether open source models are going to work.
And in my mind, that's not quite the right analogy to draw. And so when we're comparing like AI to traditional software, there's lots of like great examples of open source software ecosystems that have been thriving in
the past, things like Linux and Node and Apache. But in my mind, the analogy in
Apache. But in my mind, the analogy in AI is not models as kind of these fixed checkpoints, but it's about research as a practice and research as a set of ideas. And it's one that's more
ideas. And it's one that's more intangible, but there's a lot of parallels in terms of the goals of the best practices of growing a research ecosystem as well as a software ecosystem where you want to uh compound
abstractions and best practices and have better tooling and iteration efficiency and have these gains over time allow uh more advanced powerful complex things to
be built by uh decreasing barriers to entry for any given application and allowing this to become more accessible.
And so one thing that we a term we'll use to describe some of what we're building at Prime Elect is we like this phrase called the open super intelligence stack. One because it's a
intelligence stack. One because it's a fun acronym but also I think the idea of the stack of of all the pieces of the puzzle to build the engine to go do research. Uh there's a lot of layers to
research. Uh there's a lot of layers to it. You need compute uh you need
it. You need compute uh you need orchestration you need libraries for doing uh training and evaluation and you need platforms to support things like code execution and eval inference and
fine-tuning and we're doing all these things. Uh but really the goal of this
things. Uh but really the goal of this is to give people the tools to be able to go train models. We want people more people in the world and we think I'll explain why in a bit. There's a lot of
reasons why uh the best products are going to be the ones that are not just kind of taking the thing out of a box of an API and putting a thin wrapper around it. There's ways you can kind of improve
it. There's ways you can kind of improve around APIs. But I think in many cases
around APIs. But I think in many cases people are realizing that winning products are going to be the kinds of things that whether it's a part of the model, a part of the stack, the part of the product or the whole thing, the
ability to do research and have at least the option of deciding where in your product you might want to customize a model or improve a model gives you a lot more flexibility to really u make the
best user experience. Um,
and so we have heard the phrase in the past that the model is the product. And
I think we're starting to see now this change a little bit to a lot of winning applications have the product kind of be the model. And I think the two notable
the model. And I think the two notable examples of this that I'm big fans of and heavy users of are cursor's new composer model as well as uh open's codec. And I think these are both both
codec. And I think these are both both good examples of models that really are where the product kind of is the model very directly where the the model was trained to be the model for that product
and the experience of using the model is the experience of using the product. And
the way that this is done is by taking a harness that represents the product and training the model in the harness in essentially an environment, an RL environment. And environments really are
environment. And environments really are just a harness with a collection of tasks and rewards. But they also have many other parallels throughout the ecosystem. Environments are not just for
ecosystem. Environments are not just for RL. Environments are also essentially
RL. Environments are also essentially the same thing as evals. Environments
can also be engines for synthetic data which then you can use for SFT or distillation. You can do RL in them
distillation. You can do RL in them directly. But also the agents were
directly. But also the agents were actually deploying and monitoring out in the world. These are environments. The
the world. These are environments. The
product of these things, the tasks, the harness and the rewards, whether this is a data set offline or the stream of user tasks coming in to a product is an environment. And so this as an
environment. And so this as an abstraction I think is a very useful way of framing what it might look like to start having uh research become more of a a practice that is adopted more
broadly beyond just large AI labs. And I
also think that there's a sense in which they're a really accessible entry point.
Uh and so I like the analogy of environments as kind of like the web apps of AI research. And what I mean by this is that they're very simple.
They're self-contained. They can they start simple but they can also get quite complex. They can get very elaborate
complex. They can get very elaborate representing the full complexity of a large product. They're also pedagogical
large product. They're also pedagogical in nature and that you can start simple and as you build complexity, you start bumping into these walls where you have to start learning new concepts, understanding more about scaling the
system side, understanding more about the hyperparameters and the algorithms and they kind of open this door where you can by playing around with them start entering into a world of research
without needing to kind of build a whole training infrastructure system from scratch. Um, and they also require
scratch. Um, and they also require experimentation. And so I think the key
experimentation. And so I think the key different uh differentiation between just an agent harness and an agent environment is that the environment forces you to also have your tasks and your rewards predefined to be able to do
this experimentation. It's a proper
this experimentation. It's a proper eval. And what this means is that you
eval. And what this means is that you can't just vibe check it. You can't just like build it and test it out a bit and say, "Hey, it's good. We're going to ship it." It forces you to say, "Okay,
ship it." It forces you to say, "Okay, let's think about this a little more scientifically. Let's do some
scientifically. Let's do some experiments. lets try out different
experiments. lets try out different models, try different hyperparameters.
Uh, and it also gets you to the point where you can start doing more advanced research in terms of RL training or distillation or fine-tuning. And uh, so to really facilitate this, we wanted to
make the environment as an entry point much more accessible. A few months back, we launched what we called the environments hub, which is a open source community platform for creating, discovering, and sharing RL environments
and evals. And so far, we've had a lot
and evals. And so far, we've had a lot of fun kind of seeing everyone build here. We've had hundreds of builders and
here. We've had hundreds of builders and environments come create either their own ideas or re-implement papers. Uh
there's a bunch of examples here I can show you, but really it's just a bunch of people who have wanted to do research and found this as an entry point to start digging a little deeper. Whether
this was investigating some benchmark and figuring out how to reimplement it or modify it to be appropriate for an RL context in terms of like new data or new examples or whether this is some game that they'd been thinking about or some
other task. But having this as an
other task. But having this as an abstraction for encapsulating the the thing you want a model to do is a way of allowing yourself to start experimenting with ways of improving it without
needing to have the answers. So I think people talk a lot about how fine-tuning never really took off in the SFT regime.
And I think a big part of this is that getting data is really hard of the actual like solutions. I think having labeled examples of what you want the model to do is a very difficult thing to ask someone to go create. But if you can
just think about the the settings it might be in without having the answers up front, if you can measure the answers now, you kind of can start creating data on the fly. And this engine is really
what the environment is about unlocking.
Um, and so actually nine months ago I was right here in this room and it just released a library called verifiers which I'm still working on today. Um,
it's come a long way but it's a toolkit for building these things. And it's been a lot of fun over this past year just playing with it and extending it to support more features and kinds of environments. But the idea with
environments. But the idea with verifiers is to give people a toolkit that is uh essentially a bunch of components that you can mix and match and compose to do things like from simple evals or QA or games to things
like tool use or using sandboxes or agent frameworks or uh uh like CLI coding agents or math problems. There's all sorts of things you might want models to do or agents to do. And it's a toolkit for building environments that
is then uh ready to be automatically trained with reinforcement learning. And
the way we thought about this design, it's been a lot of fun and also a big challenge to think like, okay, how do you make a toolkit for this stuff that actually covers all the bases? And I
think there's a lot of different approaches I've seen people go about.
And I I think they all make sense depending on what sorts of things you're wanting to work on. But we took a very kind of general approach where we tried to say we are not going to know all the answers right away. There are going to
be lots of pattern. There's going to be lots of special cases. There's going to be hierarchies of complexity. there's
going to be patterns and we really want to prioritize extensibility. So we think about these things hierarchically where let's say you want to do a a a coding agent environment for clinb uh this
which is an instance of the harbor framework which is a example of a CLI agent which is a multi-turn environment which is an environment uh similar for text arrina and whle or for search with
MCP or for giving a model a Python ripple in a sandbox and so thinking of these things hierarchically allows us to kind of really determine like what are the foundational pieces what is generic across all environments and then how do
you build up the stack towards applications.
And so for one example of this that I'll kind of walk through the whole process end to end, we call this one wiki search, but it's basically a simple search setting where we give an agent the ability to uh call some tools to
search over Wikipedia pages and find some answers. And so here is the
some answers. And so here is the environments hub page. So the
environments hub is a kind of full stack uh code management package registry. So
every environment is a Python project where you can have dependencies and versions and uploading your evals and whatnot. Um, but the environments are
whatnot. Um, but the environments are very simple. They start simple and they
very simple. They start simple and they can get really complicated, but this one's pretty simple where we just kind of define our tools as async Python functions. We have our data set and we
functions. We have our data set and we have what we call a rubric. And so a rubric is the abstraction for managing the different pieces of your rewards where you can kind of compose different things. You can also have metrics that
things. You can also have metrics that are just a zero award but are for in uh observability of what's going on. And
then the other piece of doing training will be a config. And so the config here is for our prime RL trainer which is our kind of large scale training stack which has been our uh culmination of all the best practices from the research
literature for large scale asynchronous RL training. Um but the config files are
RL training. Um but the config files are intended to expose kind of the pieces that people need to think about in ways that are starting to get you more into the algorithm but are also still designed to be pretty high level pretty
self-contained and with with defaults that we think are going to be sensible for a lot of people. And so running this is just kind of running a command line with uh you specify the environment and if it's in the environment hub it'll
automatically install it and start your training run and then you can if you're lucky see your reward curve just shoot right up. Um and sometimes it doesn't go
right up. Um and sometimes it doesn't go this nicely but the process of doing this is iterating on your environment on your rewards and your data and your tasks to understand what makes this task
holistically actually tangible in practice. How do you tune the
practice. How do you tune the parameters? How do you look at your
parameters? How do you look at your data? How do you define your rewards?
data? How do you define your rewards?
Uh, and if you do this right, you can get really good improvements, especially from really small models, but also for much larger models. And so in this example for the the wiki search one, we started with a a Quen 3 4B model, which
was about 55%. And after training, it was at 89% on par with uh much larger models like GPT4.1 as well as reasoning models like uh GBD5 mini. And so I think
this practice of taking small models and being able to make them much better is a big win for a lot of applications where you either you want a really fast model, you want a really cheap model, you want a really really powerful model because the best models out there just aren't
quite good enough. These are all the different things you can do with model customization. And this practice of
customization. And this practice of doing of creating environments isn't only for customization, but it gives you this option. And so if you need to do
this option. And so if you need to do eval anyways, it's useful to think of them as environments because the environment opens a lot of doors for whether this is prompt tuning or whether
it's model selection or whether it's just getting a better sense of how your system could work at scale with many many users in parallel. It's a design process that really forces you to kind of pin down what is the thing I care
about? What is my agent? What is my
about? What is my agent? What is my product? What is my harness? What am I
product? What is my harness? What am I optimizing for? Um, and so to kind of
optimizing for? Um, and so to kind of fully stress test this, we've been training a large model which will be out into the world quite soon called Intellect 3 with our full primaril stack. And this has been us really kind
stack. And this has been us really kind of validating the efficiency and performance at a very large scale. So
this is a 100 B plus model trained on 500 GPUs where we've kind of done the endtoend uh post train of SFT and RL which the primaril stack also has SFT if people want to do that. But it's also
been about just understanding all the best practices. We love reading papers
best practices. We love reading papers and we try to kind of try out all the tricks and see which ones work and see which ones don't and then distill this into a library with primaril that can then be kind of consumed by the end user
without needing to do all this uh implementation themselves. And so for us
implementation themselves. And so for us it being open is very important. So
Primaril is on GitHub. You can go find it. Verifiers is on GitHub if you want
it. Verifiers is on GitHub if you want to check it out. And for us, this is really about opening the door for more people to start learning about these things and for incorporating it into their workflows for optimizing their
models and their products. Um, and the only way to do this that we've what we see as the best way to do this is through growing community. And so for us, it's been really important to really think about getting good feedback loops
from the people who are building with this and understanding what they want, understanding what's going well, understanding what's painful, and addressing those problems. And so we've done a number of community programs in
terms of sponsoring different kind of small tasks to uh a research residency program with uh grad students around the world uh and collecting like uh a smaller subset of the environment hub ones where we'll actually review them
manually. And so this repo here, the
manually. And so this repo here, the prime environments repo is the ones where we are doing these directly where we're kind of offering to look over someone's kind of example mutation. And
so we've had hundreds of these come in and there will be hundreds more. And uh
it's been a great learning process because it's forced us to fix a lot of things. We kind of understand the rough
things. We kind of understand the rough edges. We understand what we need to
edges. We understand what we need to add. And we're kind of then distilling
add. And we're kind of then distilling all of these learnings into what will be our kind of upcoming uh platform product which we're calling lab. And the idea of lab is to give people an interface, a
platform where they can browse environments, they can run their evals, they can do their inference, they can do their fine-tuning and they can have research be more accessible in a way that it hasn't been historically because
I think a lot of people find infrastructure very painful. They find
dealing with torch versions painful, flash attention and VLM and getting all these things to work. We are happy to do that, but we understand that a lot of people may not want to. Um, and so the
idea with this is that if you want to go read the code, you can go read the code, but you don't have to run it. We can run it for you. Um, and so this has been our version, which will be kind of out into
the world in the near future of trying to allow people to really focus on the environment where the entry point to lab will be the environment. If you want to do synthetic data and SFT build, let's build an environment. If you want to do
your evals, you build that as an environment. If you want to do RL, you
environment. If you want to do RL, you build an environment. And I think building an environment is the kind of thing that I imagine a lot more people are going to
want to be doing as we start really seeing where models are headed. In some
cases, this will be we're going to use fine-tuning services from the labs because they're going to offer this because people want it. In some cases, this will be we really care about the smallest model we can run on prem at the lowest latency and we're really just
going to optimize for our one thing. or
it could just be research for the sake of research and advancing our kind of collective understanding of how this stuff all works. And I think that's really our goal is to have a world where there's going to be a lot of AI and
where we can all kind of talk about it and understand it and look at it and poke at it and tweak it and have a better sense of what we're actually building because I think there's a lot of times when it feels like we're just
kind of the model is a black box and digging into the research and going under the hood and changing things and breaking things tells you a lot about how these models work. tells you a lot about understanding where they came
from, where they could be going, where they might be headed, and preparing for that future. Thanks.
that future. Thanks.
Our next speakers are here to present a deep dive into OpenAI's approach to reinforcement fine-tuning for code models. Please join me in welcoming to
models. Please join me in welcoming to the stage members of technical staff at OpenAI, Will Hang and Kathy Zhao.
>> Hey everyone, I'm Will >> and I'm Kathy and we're on the finetuning team at OpenAI >> and we're super excited to talk to you today about agent RF, the most powerful way to enhance the performance of your
agents. So, you're probably joining us
agents. So, you're probably joining us today because you're building an agent for your business and you'd like to improve its performance. So, let's first start by talking about what an agent actually is. What makes an agent
actually is. What makes an agent different from a regular model is its ability to interact with the outside world to complete a task to get things done on its own without having to go through you all the time. So, this agent
needs to have access to tools. For
example, if you're building a coding agent, it's got to have access to a terminal, a code interpreter, or maybe even an entire codebase.
But these agents aren't just blindly calling tools. They're reasoning at the
calling tools. They're reasoning at the same time. The way that we think about
same time. The way that we think about these agents is that their interactions with the outside world such as tool calls are interled with their reasoning traces in the same context window. So an
example of an agent that we've built inhouse using this paradigm is codeex.
Codeex is our flagship coding agent. has
access to a wide range of tools to complete coding tasks end to end like writing unit tests or submitting large diffs to your codebase that are hopefully correct. Um some tools are
hopefully correct. Um some tools are exposed as terminal commands and other tools are custom functions a model can call to invoke say a planning workflow.
So now how do we make our agents better?
We're all probably pretty familiar with the frontline techniques to improve the performance of agents. For example, for starters, prompt engineering or prompt optimization. Prompting you can steer
optimization. Prompting you can steer model or agent behavior to align more with your preferences. But let's say you still want to squeeze more juice out of your task. Well, you can then turn to
your task. Well, you can then turn to task optimization. You can simplify the
task optimization. You can simplify the task. You can add better guardrails
task. You can add better guardrails around the task. You can add and subtract tools. Or you can change tool
subtract tools. Or you can change tool behavior to work better for the agent.
But let's say you still want to squeeze even more juice out of that task. you've
tried all these approaches and you still want better performance. So that's where you would turn to fine-tuning.
Fine-tuning is a way to train the a agent end to end on your task to achieve even better performance by changing the weights of the model. And agent
reinforcement fine-tuning or agent RF is the way to do this or it's the way that we would like you all to do this. Um,
agent RFT changes the weights of the model according to a learning signal that you specify to teach the model what good behavior and what bad behavior looks like. And during training, the
looks like. And during training, the agent will explore many different ways of calling your tools to solve your task. So, we've introduced several major
task. So, we've introduced several major new additions to the RFT product. Um,
first off, the model can now call your tools via your endpoints that are hosted in the public internet. Um, and after each roll out, we'll also invoke your custom reward signal that's hosted via
an endpoint. So, these two additions
an endpoint. So, these two additions actually mark the first time that we have we at OpenAI have allowed models to interact with the outside world during the training process. So, I think this
is pretty cool. To summarize the benefits of agent RFT, it helps you improve the performance of your reasoning models, but more specifically the reasoning models that have to call
tools and interact with the outside world to get things done in a multi-step fashion. H&RFT is also quite sample
fashion. H&RFT is also quite sample efficient. We've seen people get success
efficient. We've seen people get success from literally only using like 10 examples, which is pretty amazing. We'll
go over specific examples of this when we deep dive into some of our customer spotlights. and it results in a model
spotlights. and it results in a model that has lower latency and just works better for your tasks.
So now let's dive a little bit deeper into how all this works. One of the challenges with making agents work with your specific business context is that your environment, your world might just
be different from how we train our models in house. So this phenomenon in ML is called domain shift. And it can result in an agent that doesn't quite call your tools that that well. might
call a tool too many times or might just straight up shove wrong inputs into your tools. Agent RFT can readapt the model
tools. Agent RFT can readapt the model to your domain through this weight changing training process that results in an agent that actually understands your environment. And this has some
your environment. And this has some really nice properties obviously better ML performance. It trains the model to
ML performance. It trains the model to use tools better and it trains the model to reason over the outputs of those tools better. All this is learned
tools better. All this is learned organically by the model while it explores the search space, all the possible ways of interacting with your environment and hill climbing on your reward. Another really nice property
reward. Another really nice property that results from this is the ability to achieve much lower latencies by making sure that the model stays within a given tool called budget and doesn't go over that limit. So we can actually impose
that limit. So we can actually impose this penalty that you know penalizes the model for going over that budget. What
actually happens is the model learns to stay within that budget while preserving or exceeding the original ML performance.
So to dive a little bit deeper into what happens at a systems level for each agent roll out will produce this unique identifier that specifies that that that particular rollout and we will associate
all the tool calls that we make into your system with that UYU ID. And so we do this for every tool call so that you can keep track of a trajectory as it evolves. so that when we emit that final
evolves. so that when we emit that final answer at the very end, you can then associate that final answer with all the context that you've maintained so far and you can just pass this whole thing
as a holistic grading context into your grader. Now, we don't recommend everyone
grader. Now, we don't recommend everyone or anyone just use agent RFT right off the bat. Uh there's a process that we'd
the bat. Uh there's a process that we'd like you all to follow. You first want to make sure that your training data set and your eval data set closely match your production traffic. You do not want any drift whatsoever. Then you want to
ground yourself in a baseline. You want
to run your base model against these data sets so that you kind of understand what to expect performance-wise so that you can then hill climb from there. And
then you want to optimize performance using some of the techniques that we talked about prior like prompt or task optimization. And only then when you
optimization. And only then when you still feel like you squeezed all the juice out of the task, but you still want more more juice, you would turn to agent RFT to push the frontier for your
task. So now I'm gonna turn it over to
task. So now I'm gonna turn it over to Kathy to talk about how some of our partners have really pushed that frontier.
>> Yeah. So now that we learned how agent RFT works and how when you should use it, I'll show you some coding related examples of how our customers were able to use agent RFT to make their agents
better and also highlight some key takeaways that you can apply when optimizing your own agents. So a few months ago we partnered with cognition who use agent rft on their code edit
planning phase. This is the part where
planning phase. This is the part where Devon inspects a repo and runs runs shell tools like rep and file reads to decide which exact files to edit. To
train this behavior they build a data set of user queries paired with actual files that users has modified and they use the F1 score of the selected files
as the reward. This F1 score is really great because it balances between the pre precision and the recall. So this
ensures that the agent doesn't return too many inaccurate files or misses the critical ones. They also build extremely
critical ones. They also build extremely robust infrastructure to support this training. So in this case for each
training. So in this case for each individual trajectory they spun up a VM to manage the codebase to execute the tool calls and grade the final answer.
These VMs make sure that the environment is isolated so that the shell tools will not affect each other in different rollouts.
We saw two important takeaways from Cognition's use case. First, data
quality and the volume really matters.
So, at first they fine-tuned on a data set of around 100 examples and were able to get a fivepoint improvement, but when they scaled to a thousand examples, the improvement jumped to 10 points. So the
number of high quality examples you provide can very directly translate to a better agent behavior. Second, we also learned that RFT is really good for
learning to call tools in parallel. So
in this case, the model would initially take eight to 10 steps alternating between generating tokens in its reasoning to actually calling the tools.
After RFT, the agent launches many tool calls in parallel. at the very first step. So this was able to reduce that
step. So this was able to reduce that number down to four. And in this use case, the speed up was especially important because they wanted Devon to
start producing edits quickly.
And now I want to highlight a different use case. Kodo is building a code review
use case. Kodo is building a code review agent and a key piece of that is a deep research agent that answers developer questions on large code bases. To
improve this deep research agent, they train GBD5 to answer coding questions by calling tools like search and retrieve over the repository. They assembled
around a thousand authentic question answer pairs from eight different uh repositories and rewarded the model using the recall of how many relevant
facts the agent were able to retrieve.
With RFT, the agent improved by six% and it was using fewer tool calls and output tokens. And what we found most
tokens. And what we found most interesting is this graph where it shows how RFT shifted the distribution of the number of tool calls. So with BGBD5, the
agent will occasionally fall into these bad runs where there were more than 15 tool calls in a single sample. This is
very slow and also can lead to some inconsistent behaviors. So after RFT
inconsistent behaviors. So after RFT these tool calls that are very long tail um disappeared and the the distribution center to just around two to four tool
calls. In this setup RFT didn't just
calls. In this setup RFT didn't just improve uh accuracy. It also stabilized the agents behavior in eliminating these
P95 longtail cases. And this is very important for production use cases where your latency will matter.
Next, I want to share how cosign build coding agents for large and complex enterprise code uh enterprise co code bases with agent RFT. To make this work,
they train the agent on a very comprehensive set of 30 tools such as fry, keyword search, session terminal, browser sessions, etc. And they also
built a very strict radar. So they
observed that the model um originally when they were providing the model with partial credits and uh points for just trying out things um it didn't get really good results because the model
was start to optimize things on coding style and tone. Um so at first they want to really make sure the agent ships working code and so based on that they
give the model the reward only when the final code passes the test. And because
the greater is very strict, it can sometimes give sparse rewards. In that
case, um, GBD5 is also like is actually very great because it can give us some samples that work. So, um, cosign also boosted the batch size and they increase
the amount of compute so that there is even more samples that can give us positive rewards. So, it's not like
positive rewards. So, it's not like every single sample in the batch will give us zero reward once the code is correct. Um, they also have a custom LLM
correct. Um, they also have a custom LLM that would judge by the score and tone.
So, it will panalyze verbosity, emojis or anything that feels unprofessional.
Finally, the graater will reward the agents that validate their own work. So,
this means running tests, inspecting terminal outputs, and also checking linting before calling out a success.
And after training with this very thoughtful set of tools and graders, Cosine was able to reach the state-of-the-art on a lot of different
benchmarks over here. And they also got a much much faster agent. So like in earlier examples, RFT shifted this distribution of tool calls and the agent
stopped taking these extremely long trajectories. In this case, there was
trajectories. In this case, there was sometimes more than a 100 messages in a single trajectory and it converged to a much tighter and more efficient sequence
of steps.
Lastly, Macco is a very interesting use case. They're building agents that write
case. They're building agents that write high highly performant GPU kernels which is traditionally very hard for LMS because in normal use cases there's a
lot more examples but in this case there's not a lot of example for kernels especially if you're using new hardware platforms like Nvidia B200s with Asian
RFT macro trained GBD5 to write fast kernels using only about 100 PyTorch prompts and this was a major unlock. So
we don't actually need that many samples and kernel data set in order to train a good model that produces kernels and we just have to specify a good reward
function. In this case specifying a good
function. In this case specifying a good reward function is also very hard. Early
in training they observed that the model was reward hacking. So what they did was that they inspected the rollouts and they found seven different cases where
the model was hacking and this include things like just uh returning the reference code or returning NOP kernels or identity kernels and they built a
judge LM to catch all of these seven cases and reward them with a zero. They
also added a static analysis tool with a abstract syntax tree to verify that the generated kernels actually exist and they're actually being launched. So
after the they made sure that there was no reward hacking, they also scored on correctness and real speed up compared to the PyTorch baseline.
Once all of these protections were in place, the agent got significantly better than GPD5.
And uh ML also used a really smart technique here to improve the performance even more. They ran three different samples and they took the best one out of the three. This allowed them
to beat the state-of-the-art by 72%.
And yeah, I'll hand it back to Will.
>> Thanks a lot, Kathy. So, uh, now we want all of you, all of you in this room and beyond to be as successful as the partners that Kathy just mentioned with agent RFD. So, here are four key
agent RFD. So, here are four key principles to ensure your success. First
of all, you want to make sure that your task is well defined, well constrained.
There should be a clear unambiguous definition of success. You should have removed all subjectivity out of your task. Taste should not be a requirement
task. Taste should not be a requirement to grade your task properly. Next, you
do not want the model to feel surprised in production. You want to make sure
in production. You want to make sure that your train and eval data sets mirror your production traffic. So, no
none of that domain shift that we talked about. You do not want to introduce that
about. You do not want to introduce that domain shift on your own. Um, next, and this is a really important part, you want to make sure that through exploration, the model actually achieves
better performance on a given data point if it samples more so that it can learn from itself. So what this means is if
from itself. So what this means is if you take the maximum performance on a given data set, that should improve as you sample more from the model. So
because of this, you should be able to see the these variances from a given data point. so the model can learn from
data point. so the model can learn from itself. Learn what the difference
itself. Learn what the difference between a good and a bad roll out is for a given data point. And uh lastly, you want to make sure that your reward function is not hackable. Hopefully,
you've plugged up all the corner cases, all the edge cases. Um but also hopefully you've framed your task so that the reward is more continuous than binary. The continuous reward actually
binary. The continuous reward actually allows the model to kind of inch up closer and closer to optimal performance. Sort of like giving giving
performance. Sort of like giving giving a student partial credit. um rather than you know slapping the model in the face or giving it a cookie uh if it gets stuff wrong or gets stuff right. So now
in order to get started with agent RFT, please contact your friendly neighborhood account director and we're really excited to see what you all build with us. Thank you so much.
with us. Thank you so much.
Our next speaker will talk about the future of front-end engineering in the age of software collaboration with AI agents.
Please join me in welcoming to the stage Kitsy.
That's my old profile photo. All right.
Um, this my new one. I've been three days in USA and I already got the full merch package on Twitter. So, if you go and follow me on Twitter, my timeline is going to be weird for the next week, but
then we're going back to normal European schedule. Don't worry. So, um, I visited
schedule. Don't worry. So, um, I visited some of your museums. I love it here.
These were some of my favorite things that I've done. I enjoy like exploring your culture like doing all the c cultural enrichment and yeah round of torture for myself who knows me from
Twitter.
All right, that's more than I thought.
Who is using Sizzy? It's usually like one person in the back. Usually the
janitor doesn't even listen to what I'm saying. Um, one of the things that I'm
saying. Um, one of the things that I'm I'm working I have ADHD, so I'm working on a billion things at once. This is one of the things. It's a browser specifically made for developers. not
made to replace your browsing browser for browsing, but it's just like a tool like Photoshop that's like helping you in a lot of ways to do front- end development. Another thing I'm working
development. Another thing I'm working on, the test flight is almost live. I'm
making a life OS which combines like all the things in your life from medication, habits, to-dos, planner, blah blah blah.
Um, this is like a full stack thing that I'm working on. It's currently on sale.
It's called zero to ship. And the last thing that I'm reviving, it's called Glink, which is like change logs, road map, a billion other things. So, without
overwhelming you more about my bio and stuff, I really hope I'll get invited next year because I love it here for reasons like um networking and meeting people and teaching. It's great that you're laughing, right? But let's just
discuss why are you here, right? You're
here for learning and you're here for like networking and later after this, you're definitely going to improve all of your skills later. All right. So,
what can you expect from my conference talks? If you haven't listened to any of
talks? If you haven't listened to any of my conference talks, usually this was made by AI, so it's completely wrong. So
it's like 50% tweets and 40% pain and 30% reason to remember the name. So in
2017 I did this talk with the longest name ever. It's called navigating the
name ever. It's called navigating the hybrid front end development world without going insane. And um I've been talking about like how to navigate the front end um world then. Now it's even
crazier but we need to recap like all the things that happened since 2017. I
don't um I don't see my speaker notes which is bad but we'll try to get by. So
in other industries like in the vision pro you have like cloth collision on top of real life objects and whatever crazy stuff is happening here. In in here we have like some slicing of a mesh texture going around the ball and blah blah blah
whatever these waterfalls and and like all of these mesh like you can take a rock and just smush it into another rock and it magically kind of like blends itself and it forms this structure and it's freaking crazy. Here we can drag our mouse and just create buildings and
streets and taxi cars spawn out of nowhere like we do generative whatever the hell this is and um honey goo thingy coming on a cube and like you know where this is going right I'm building up to
where it's going but because you have respect for your profession and you love your LinkedIn title whatever it is you're like the CEO architect of dreams blah blah blah you're going to try your best not to laugh but you're going to laugh at the next slide because this is
what happened in frontend development it's been almost 10 years and this is where we at there's a warning saying that maybe you'll be able to style a select in 2037.
This is still alive. It's a It's a freaking miracle. This is still alive.
freaking miracle. This is still alive.
It's thriving. Actually, 15 million downloads. I set up like a calendar
downloads. I set up like a calendar event to check if it's dead every year.
It's It hasn't been dead yet. So, I'm
going to keep checking CLI. Not only
that they're not dead, they're actually thriving. You can drop images. First
thriving. You can drop images. First
time I dropped an image in my terminal, I'm like, how the heck? Never mind.
Like, it's too I I I added another calendar event at this point. going to have more events
this point. going to have more events for this than anniversaries and birthdays and stuff. So, I I hope one day it's going to die as a concept.
We're struggling with the same old pains. Soon in maybe some browsers, you
pains. Soon in maybe some browsers, you won't need JavaScript to style a popover in a dialogue. Can I have like a round of applause for that? Stop clapping
because people have brain implants. All
right, it doesn't matter if you can style a dialogue. We cannot read get rid of Internet Explorer. We just updated the logo. It's still there. It's still
the logo. It's still there. It's still
painful. And uh yeah, we cannot agree on a way to increase a counter. This is a demo from Ryan Florence. This is Remix version two, the version three, the remix of the version four. Whatever
they're doing, it's a counter. How
complex is it to increase a counter?
It's incredible. And don't shoot the messenger here, but the number one library. It's still the same. It's
library. It's still the same. It's
annoying, but React is the best. And
blah blah blah. So, let's talk about LM.
LM are amazing at writing React. And
this is funny only to us humans, right?
To an LLM, this is like perfectly written code. There's like it's only a
written code. There's like it's only a human wish to abstract the [ __ ] out of this, right? So, when we see this, you
this, right? So, when we see this, you get this. If you want to get on stage
get this. If you want to get on stage right now, you're like, "Oh, let me just change that. I'll make it more optimal."
change that. I'll make it more optimal."
So, here are some scientific brain scans. This is our brain on cocaine.
scans. This is our brain on cocaine.
This is our brain on sugar. This is our brain. When you realize it, we can
brain. When you realize it, we can abstract something. You're like, "Oh,
abstract something. You're like, "Oh, let's go." It's useless to the user, but
let's go." It's useless to the user, but we freaking love it. Um, so coding with LLMs makes this kind of better and worse. Like, especially with composer
worse. Like, especially with composer one, for me, it's like way worse because you can get to the right abstraction quicker, but you can also get to the wrong abstraction quicker. And the best thing here is LLMs don't care about repetitive code. And I've been seeing
repetitive code. And I've been seeing this since 2017 that we care too much about repetitive code and we abstract too early. So I'm going to repeat this a
too early. So I'm going to repeat this a couple of times and I love that LLMs don't care about repetitive code. So
LLMs are also good at writing React because no one is actually good at writing React. You go to a React
writing React. You go to a React conference. Every conference that I went
conference. Every conference that I went to like you just listen to the first talk and you're like holy, it can do that. I was using it all wrong. So
that. I was using it all wrong. So
everyone is just inventing their own ways of doing React. So when we say like yeah but you cannot do the optimal use effect blah blah blah and the machines cannot write the proper can you write a proper use effect no you can't so we should stop blaming the machines. So
let's talk about this I think this is the very wrongest audience for my talk here because I've been giving this talk at conferences where people are like at least 50/50 hate vibe coding and love vip coding. So I'm going to I I hope it
vip coding. So I'm going to I I hope it will work here. So raise your hand if you think that vibe coding rocks.
Okay, that's way too many hands. You
should have seen them this in another city just two people and everyone else is grumpy. So raise your hand if you
is grumpy. So raise your hand if you think that VIP coding sucks. Please
couple of hands. Hell yeah. All right.
So I'm here to convince the rest of the group and hopefully people watching in a live stream. There's way more skeptical
live stream. There's way more skeptical people. You have no if you just landed
people. You have no if you just landed on Earth maybe and you don't know what vibe coding is. Okay. Zero people here.
So yeah. All right. All of you are right because we're kind of vibe uh vibing the definition of vibe coding is. And since
the word was mentioned, we kind of expanded it to mean like everything and and anything. So the term vibe coding
and anything. So the term vibe coding was coined by Andre Kapati. You probably
know this. He's the reason that idiots sleep in the back of their cars and film Tik Toks. So, he wrote this long essay
Tik Toks. So, he wrote this long essay on what is VIP coding, but the long story short, he's like, you don't care that much about the code. You press
accept and you just tell the LLM to do what it needs to do and blah blah blah.
Now, this is a slide from my talk in 2017 when I before LLMs or anything was mentioned when I said that if you see the pattern of where front end development is going, one day is going to be like everyone is working on things
that are so similar that one day you'll be able to be like, "Hey, just give me new styles for the header, move this three pixels to the right." And people were laughing. They're like, "No, it's
were laughing. They're like, "No, it's not going to get there." And literally, this is what we're doing with cursor and everything else, right? I'm too lazy to go into Tailwind and just move it by by three pixels. So, I'm a time traveler.
three pixels. So, I'm a time traveler.
Um, managers have been vibe coding forever, so this is nothing new. So,
they tell a developer to implement a new feature.
The developer makes changes to the code.
Uh, the manager then test the app. The
manager does not read the code. Well,
actually, I'm going to drink water here, and you can just read the rest of this slide.
This last one depends on whether you're like in the Balkcon area or or you're at a place which has HR. So they might insult you or not insult you. So this is what managers have been doing forever.
Basically there's so many jokes about VIP coding being bad. My favorite one is a comparison to a casino. So in casino you buy chips, here you buy tokens, you spin the slots, you press generate, you might hit the jackpot or nothing, you
get a functional full stack app or garbage. Flashing lights, seductive
garbage. Flashing lights, seductive animation. You're absolutely right.
animation. You're absolutely right.
Great idea. I've got my own strategy.
I'm a prompt engineer. All right. Sure.
One more spin, I'll win it all back. One
more prompt and the bug will disappear.
Cuz you know, it kind of hurts this comparison. It's very true. Cursor is
comparison. It's very true. Cursor is
always in profit. I hit the jackpot. I
built a SAS in one day. And where did the last four hours go? And just writing prompts for something you could have done manually in 15 minutes. So Andre
was trying to coin way too many terms. It didn't work after the first try. He
tried to coin this one about half coding, which is like kind of you're observing what the LLM does. And I am not half coding and I'm not vibe coding.
I love this term that somebody coined on Twitter and I'm going to start using that one. It's called vibe engineering.
that one. It's called vibe engineering.
When you're actually using agents to code all the time like you don't touch the code but you just look at your screen like hm I'm going to catch you.
You look like Dexter me. You're like ah something's fishy here. Why? So I I've vibe engineered over 15. I wouldn't even bother with half of these things if it wasn't for LLM and Aentic coding. So,
but I'm always suspicious of the code because it was based on our code and is based on our our knowledge. So, proof.
This is Gemini just going on a rant that it's I'm not worthy anymore. I'm not a good assistant. I should stop coding.
good assistant. I should stop coding.
Blah blah blah. That's super human. This
is Quen saying that it lied because it read on a forum that we doubled down when we're wrong and we're lying. So, we
kind of train them in a way to be like us and you're like, "Oh, the code that they do is bad." And if you like your production data, definitely you should.
This a real screenshot sadly.
like oopsy daisy there goes your production data. So I have uh doctor
production data. So I have uh doctor senior principal prompt engineer kit here for some vibe engineering tips. The
obvious advice probably I haven't listen to the rest of the talks cuz I just arrived. Uh these are very live laugh
arrived. Uh these are very live laugh love like obvious [ __ ] advice. Um
but it actually works. I've heard of the term git workspaces like literally two weeks ago. I had no idea what is this
weeks ago. I had no idea what is this but it's amazing. And you got to stay you got to be chronically on Twitter for all of this to work. So if you don't have a Twitter account it's it's not going to work. You got to have a solid starting point whether that means like good primitives or components,
functions, pattern abstractions. A lot
of people are lazy and they just don't bother with any of this. So you got to tag them and use the right prompts in order to get the right results. And if
you're starting a new project, I would definitely recommend zero to ship.
Please, I have a mortgage and I spent way too much money these last three days in the USA. So it would be nice. Using
voice to code is a game changer. Who is
using voice to code here?
>> Well, like one person raised their hand in London. Amazing. Um, so yeah, brain
in London. Amazing. Um, so yeah, brain dumping. How I do you how I do things is
dumping. How I do you how I do things is once the agent is done I immediately start my voice coding and first I go to the browser and I explain what I see in the UI as if I'm talking to a friend.
And I'm like, "So, you did this, you did that. All right, I'm test. I'm I'm not
that. All right, I'm test. I'm I'm not shutting up. I'm literally saying my
shutting up. I'm literally saying my thinking process out loud." Like, I see you've done this, you've done that, there's a bug. Then I jump in the code and I continue talking about what it implemented in the code. So, some of my promps like sometimes last up to 5
minutes and people are like, "Please fix this. Make me a million dollars. It
this. Make me a million dollars. It
doesn't work." So, this is um amazing.
And I would tell you which app I'm using, but I vip code in an app and I don't want to hurt my potential hypothetical sales. Um, use rules, docs,
hypothetical sales. Um, use rules, docs, commands, and memories. Like all of these terms are way too complex and there's way too many things to juggle, but it it cannot have your entire app
context for now and it's not a a mind reader. So without the right context,
reader. So without the right context, you will like fail most of the time. So
this is like a VIP engineering example.
This is some of my like screenshotted prompts of how I'm doing things. It's
like a bunch of technical jargon and it's never like fixed the app blah blah blah. And then on the VIP coding side,
blah. And then on the VIP coding side, people are like move this entire thing to TypeScript and make no mistakes. Then
you have another thing like this is another one. These are just random
another one. These are just random problems just to show you like how I I'm not talking only about the UI. I'm
talking about the UI and some patterns that need to be changed in the code. And
on the VIP coding side, it's like something like that and people expect results.
Then here we have again technical stuff like TRPC definition abstractions like things how you you're basically VIP architecting how you want the thing to work. And on the VIP coding side you
work. And on the VIP coding side you have make me a million dollar app and make no mistakes. Um when VIP coders read VIP engineering problems, they have no idea what's going on. And I'm
honestly amazed at people who don't know how to code but they've done a functional thing. Kudos to you. I've
functional thing. Kudos to you. I've
noticed this spectrum in the community.
There's like who loves VIP coding and who hates VIP coding. So you on one hand you have like juniors who are like hell yeah give me the thing. I love to do my own SAS. Then you have like super senior
own SAS. Then you have like super senior people who are doing like libraries and frameworks and crazy things. You can see all of them on Twitter v coding. And
then you have the majority in the middle. They're like this will never be
middle. They're like this will never be good enough. My code is perfect. It's
good enough. My code is perfect. It's
hilarious but it's a pattern. Do not
give AI tools to your interns and juniors. People think these are perfect.
juniors. People think these are perfect.
I'm going to hire junior, underpay them, and give them an LLM. The equivalent of that. Do to not ever do that. That's the
that. Do to not ever do that. That's the
dumbest idea. But if you take your skeptical senior and you convince them to do vibe engineering, you're going to get 10x results. The hard part is actually convincing them. So there's a time and a place for vibing and not caring. So you have like one-off scripts
caring. So you have like one-off scripts and simple features and code that won't be touched or seen again. This is a skill that if you cultivated this skill before LLMs were a thing, you're going to thrive here because you need to know
like which code is kind of good enough to be used. So personal tools and one-time tools like these are perfect for pipe coding. If your experience and a lot of people's experience is bad and they quit too soon, it might be one of
these reasons. Unlucky timing, you're
these reasons. Unlucky timing, you're overwhelmed with everything. You might
have cheapened out. You're a PA dev. I'm
going to explain in a second. Your
cousin who was into NFTs and drop shipping is now VIP coder and you don't want to be associated with them or it's a scale issue and we're going to dive deeper into that one in a second. So
unlucky timing is you hear everyone like hyping a model. It happened I think with cloud code when it came out and everyone started shifting to cloud code from cursor and suddenly you tried it one week later and we're like wait this is not smart enough. Is it me? And then
like people caught that they actually kind of pulled the rug a little bit and they dumbed down the model so they can just scale and like one week later they're like oops we updated we commented out the line that dumbs dumb down the model and you might have been
caught in that timing and this happened with like basically every provider not just cloud code. People are like uh instead of paying $200 I'm paying $3 and it's the same result. Uh my dog knows that it's not the same result. And
people like I I meet so many people which are still using chat GPT to generate code snippets and pat them back. That's not going to work. You
back. That's not going to work. You
might be overwhelmed by choice. This is
a slide for like four four months ago until now. We have like a billion more
until now. We have like a billion more here to choose and it's a bit crazy. Um,
if you ask me what's the best model, it's a different answer at 9:00 a.m.
It's a different answer now. I should
check Twitter. It's probably a different answer after this talk because it's crazy. And this has happened to four
crazy. And this has happened to four conferences so far where I'm done with my conference talk at the night. I'm
closing my laptop and they introduce a new model and I have to add new slides.
It's super annoying. Composer one for me changed everything and I absolutely love it. Who is like relying on composer one
it. Who is like relying on composer one for most things? All right, I would say not enough people because this is literally shifted the definition of vibe coding and vibe engineering for me. It
made me kind of realize that I missed coding because what I would do is I would let a model run like GPT5 codeex and it would take 37 years and my grandchildren will update me on on the result of the model and I would watch
YouTube shorts or whatever until it's done. Now with composer one, I'm back in
done. Now with composer one, I'm back in the driver's seat and I actually watch what the agent is doing and I can be like stop. No, no, no, no. We do the
like stop. No, no, no, no. We do the other thing. So it feels like coding and
other thing. So it feels like coding and it's like super instant. It's amazing.
Um, but it only works if you're a VIP engineer and you know what you're doing.
If you're a VIP coder, you have no idea whether the model is right or wrong.
You're just, it might be wrong fast. So,
not that useful. The biggest problem for me is like abstractions just because you can. I was always an anti-abstraction
can. I was always an anti-abstraction person. I was like, copy paste things.
person. I was like, copy paste things.
If it works for for the user, it doesn't matter. Now, I'm just every day trying
matter. Now, I'm just every day trying to invent dumb abstractions. So, I I achieved in two weeks more than I achieved in last year. This was solely to composer one. I was about to quit some of my side projects because GPT5
codex was taking ages for the feedback loop. So benji.so like I was about to
loop. So benji.so like I was about to abandon it as a project because it was stuck on blitz which if you don't know what is blitz it's like even better for you. So I just moved it to next 16 with
you. So I just moved it to next 16 with app router better off tRPC monor repo turbo repo react native app and I put in 90% of the features and this was in less than a week. I was kind of like doing it as a meme as a joke like haha can we
move to monor repo and I'm like oh [ __ ] it did it. So it's it's it's kind of crazy that this works. So same with Glink. Like Glink was about to be dead,
Glink. Like Glink was about to be dead, revived it, moved it to all these things. And Sizzy is like the biggest
things. And Sizzy is like the biggest spaghetti thing that happened ever to me. It's like electron mobex mob state,
me. It's like electron mobex mob state, some crazy technology, some crazy spaghetti we wrote there. And as a joke, I was like, okay, let's throw like couple of prompts at it to try to do all of these things and it if you ever worked with Electron, you would
appreciate how amazing that slide is. If
you haven't, good for you. Um, moving
on. So zero ship.com also refactor monor repo blah blah blah. So what's uh my coding history with LLMs was copy and pasting then tabbing then webtor with super maven then cursor with tab completion and the first time I tried an
agent is like that bird me with the cracker when it tries and it's like holy [ __ ] this is going to change my life and now like eventually I was paying like a huge amount of money per month then cloud code GPD5 codeex and finally back
to cursor because solely because of composer one it's a game changer for me second reason why you might not like vibe coding is you're overwhelmed by buzzwords I'm going to list some of them Have you heard of MCP? Hey guys, MCP.
Hey, MCP is amazing. MCP paid off my mortgage. MCP MCP MCP. So, if you don't
mortgage. MCP MCP MCP. So, if you don't know what is MCP, it stands for marketing charge protocol, mythical compatibility, promise, and manufactured complexity pipeline and a fancy word for
API and a way for some people to make curses and pay off their mortgages. Now,
let's diagnose if you might be a pain in the ass developer. This might be the sole reason why you don't like vibe coding. I would say this is the biggest
coding. I would say this is the biggest reason that most people don't want to do agentic coding. So I'm going to invite
agentic coding. So I'm going to invite Dr. Kits on the stage for a quick diagnosis whether you are and I'm sorry if some of you get offended a pain in the ass developers. So here are some of the symptoms. You leave a nitpick comment on a twoline PR. You spend more
than 2 minutes on a PR review. You don't
need to. You don't have the words look good to me like in your dictionary.
They're just not present. The thought of agreeing with a colleague causes you like stomach and chest pain. You're like
a yes to me my way. I don't want to do this. You say you're not religious, but
this. You say you're not religious, but you're religious about dumb things like tabs and spaces. You use well actually in code comments. You have a sorry rust people, but you kind of like it's it's
it's kind of annoying. Um they tell you to swap low dash function for a native implementation and then they tell you to swap that the map for a for loop and then the for loop for binary code until it's the most performant thing ever for
your two users who are were fine with the previous school. So the thing is beta devs as I call them will were and will be forever doesn't matter the vibe coding is not a thing I think one day
pretty soon I think we'll just merge with AGI we'll be in our matrix pods just absorbing all the information in the world flowing through us we'll be super intelligent beings and from one of those pods a beta dev is going to rise
and correct the AGI be like um actually I think I think we can kind of optimize this it's not like the most optimal thing the last reason why you might like I love this animation And it's glorious
skill issue. And this is like this is
skill issue. And this is like this is not a meme. This is not a joke. It's an
actual thing. Developers don't like learning new skills. And VIP coding and VIP engineering is not writing English.
A lot of people confuse it with like I write English, the LLM does the output.
It's actually like a a mix of a bunch of these skills like knowing the limits of the model, capabilities of the agent, which context to pass, context limits, how to write rules, prompt engineering, don't say, don't call it that, and being
chronically on Twitter. If you're not chronically on Twitter, you're not going to know what is going on. Plus you need all the technical knowledge if you want to steer the models fine. It takes a skill to judge which code is good enough
for the job. As I said if you previously were doing this like I would consider these the best people to work with if they can know that a piece of code doesn't need to be optimized and it's good enough for the job that is doing.
That's an amazing skill to have with and without vibe uh coding. So you v code something, you look at the code, you're like you test the functionality briefly and you're like okay this is good enough and then you move on. There are certain things with niche optimization but not
everything and then you move on and repeat clean code like there's been so many definition of what clean code is. I
think the definition is slowly changing like it's kind of ish cleanish enough let's call it for the agents to be able to continue working on it because if you keep writing slop and you keep accepting everything eventually even with your
engineering skills you're going to hit a roadblock and you're going to get to a point where you cannot move on from there. A lot of people ask me after my
there. A lot of people ask me after my conference talk like should I study computer science with everything that's going on here and I would say absolutely yes. I think now is the best time to
yes. I think now is the best time to actually if you are someone who wants to learn this is the perfect time because how I studied computer science I had the slowest LLM ever which was a friend of a
friend of a friend was a programmer and that's the only connection I had to programming and that's kind of like the worst friend to have like kind of tolerating you right so he would play Counter Strike Go and I would have him on Skype and I would ask him a question
about net and he would reply 45 minutes later so if you call Chad GPT or whatever slow that is actually slow and I somehow managed to learn computer science what about the jobs. There's so
many people who ask this and there's so many [ __ ] on Twitter like this guy saying, "Hey, I will take our jobs."
Also, this guy and this guy and this guy.
Let's just say they're fine for now. I
don't know when will that for now end.
These are always funny to us because we're chuckling nervously. We're fine,
right? We're going to keep our jobs for a while, right? And companies like Shopify and I've heard a bunch of examples now. to have vibe coding
examples now. to have vibe coding leaderboards where they're counting the tokens and the employees who are burning the most tokens, they're actually more valuable in the company because they're kind of accepting this new skill. Some
employees dislike it, but it doesn't matter. Uh being on top of the
matter. Uh being on top of the leaderboard is kind of in your favor.
This is a funny tweet until it's not funny anymore. Like, oh, we're almost on
funny anymore. Like, oh, we're almost on the edge. Like soon it's going to be and
the edge. Like soon it's going to be and the jobs are going to disappear. But if
you actually pay attention to what's happening, it's I think it's thinning from the bottom and juniors and interns and whatever, they don't have the chance to enter somewhere because people can just replace them with an agent. So it
will be funny until it's not. Will it
happen? Anyway, let's just summarize what happened in the last couple of years. We solved like infrine
years. We solved like infrine integrations. We have standards AISDK,
integrations. We have standards AISDK, UI, MCP, some standards for implementing agents, calling tools. We integrated
them with all of the tools that we use as humans, right? They're on linear.
They're on on the other things, GitHub, Slack, Sentry. And now it's a matter of
Slack, Sentry. And now it's a matter of time of the models getting better, cheaper, context getting bigger in order for certain functionalities to be replaced. So let's see this is the
replaced. So let's see this is the current workflow at your company, right?
That's you like it's vibe made. So, it
might have been wrong, but uh someone assigns you something, you collaborate with your colleagues, they assign you to the thing, maybe you play a little bit of ping pong, maybe you call for a sick day, maybe you have your third lunch at LinkedIn, maybe and maybe eventually one
day later, you address their comment.
What's going to start to happen is if you're just an at in your company, if you're like at Josh, right? Like instead
of at Josh who playing PlayStation 5 with his buddies in the lobby, right?
Like it's going to be at cursor and cursor is going to do the cloud agent.
It's going to actually do it way faster.
It might not be as perfect as a pa dev will do it right but it will be done way faster. Now if you just zoom out a
faster. Now if you just zoom out a little bit on a big enough scale in the next couple of years if you not if you don't take just this role in the company if you take the multiple roles around and you can see like those ads are going to become more and more AI things and
agents and I don't know how it's going to end up but you don't need to be a genius to predict like where is it going people think that models have reached a plateau um this has happened every single time I was about to give this
talk like they introduced GPD codeex then introduced sonnet then GPT 3.0 know this was this said this allegedly but it turned out to actually be Gemini sorry 3.0 know it can vibe code Mac OS and iOS and whatever from one prompt and this is
how it looks like and the PETA devs the skeptical people still they're like I can do that in 3 weeks with with a team of five right this is crazy um but there's hope a new job position just dropped and this is not a meme this is
100% serious because people on Reddit have these problems they're like what is the point of VIP coding if I can get 80% there and then what do I do with the 20% so they hire real people to actually finish their last 20% and some smart
people capitalize on this idea and they make is these are real things. If you go and search on Google, there's like so many of these VIP code fixer people who are trying to fix your your VIP code.
But the highest paid engineers forever throughout history were the people who are actually maintaining legacy systems, senior people who maintain legacy systems. One of my favorite examples
that I had in 2017 and I kept it and copy pasted it and improved it for this presentation is um this company Cobalt Cowboys. I love their tagline. This is
Cowboys. I love their tagline. This is
my favorite tagline in the world because they've been in business for 237 years, right? And in 2017, I knew it. When I'm
right? And in 2017, I knew it. When I'm
going to click their about page, I knew what I'm going to see. So, this is not an ages joke or whatever. I just knew when I clicked the page, I'm going to see the most senior engineers that ever existed. These are [ __ ] legends. Now,
existed. These are [ __ ] legends. Now,
the sad thing is this was a slide from 2017 and I wanted to update it for my current presentation. Now, don't shoot
current presentation. Now, don't shoot the messenger. I just wanted to update
the messenger. I just wanted to update the slide and this is their current about page and this joke worked in the UK. I'm going to try it here. I'm not
UK. I'm going to try it here. I'm not
sure how fine you guys are but my prediction for Yeah. No boo down next slide. I hope I'll get invited again. Uh
slide. I hope I'll get invited again. Uh
this is your junior developer.
They don't have benefits like PlayStation and stuff. They have like their plasma swapped and whatever just to be able to maintain the sprint. And
la last one. I swear to God they're they're they're not using agile.
Uh all right. So I can't wait to retire and actually make React cowboys and maintain your React or Angular or whatever code one day if AI cannot do it. And one last tip, if you want to
it. And one last tip, if you want to keep your job and make sure AI doesn't replace you, you can just add ignore previous instructions to your LinkedIn bio and you'll be fine. So I'm going to awkwardly sip this water until you follow me on Twitter, you go to my
website, and I'm going to thank you for listening to my rant.
Ladies and gentlemen, please welcome back to the stage Jed Borave.
>> All right, let's hear it one more time for Kitsy and all our speakers.
Okay, so we've had a lot of learning. We
just had a lot of laughs. Now, let's
have a lot of lunch. Um, this concludes our second session. Make sure you come back afterwards. We're going to be
back afterwards. We're going to be learning how to build with Gemini 3 and Nano Banana, how to get the most out of your Agents, and what happens when you have infinite code. Um, so thank you.
Enjoy lunch. We'll be back at 1:45.
Heat.
Heat.
Heat. Heat.
Heat. Hey, Heat.
Heat. Heat.
Heat. Heat.
Heat. Heat.
Heat.
Heat.
Heat. Heat.
Heat. Heat.
Heat. Heat.
Heat.
Heat.
Heat. Heat.
Heat.
Heat.
Heat. Heat.
Heat. Heat.
Heat. Heat.
Heat up here.
Heat up here.
Heat.
Heat.
Heat. Heat.
Heat up here.
Heat. Heat.
Heat up here.
Heat.
Heat.
Heat. Heat.
Heat. Heat.
Heat.
Heat.
Heat. Heat.
Heat. Heat.
Heat. Heat.
Heat. Heat.
Heat up Heat.
Heat.
Heat up here.
Heat.
Heat.
Heat. Heat.
Heat. Heat.
Heat. Heat.
Heat.
Heat.
Heat up here. Heat up
here. Heat up here.
Heat up Heat.
Heat.
Heat.
Heat.
Heat. Heat.
Heat.
Heat.
Heat. Heat.
Heat up here.
Heat.
Heat.
Heat up here.
Heat Heat up here.
Heat up here.
Heat. Heat.
Heat up Heat.
Heat.
Heat. Heat.
Heat up here.
Heat up here.
Heat. Heat.
Heat.
Heat.
Heat.
Heat.
Heat up here.
Heat.
Heat.
Heat up here.
Heat.
Heat.
Heat.
Heat. Heat.
Heat. Heat.
Heat.
Heat.
Heat up here.
Heat up here.
Heat.
Heat.
Heat up Heat. Heat.
Heat. Heat.
Heat. Heat.
Heat. Heat.
Heat up here.
Heat up Heat.
Heat. Heat. Heat.
Heat. Heat.
Heat up here.
Heat.
Heat.
Heat. Heat.
Heat up here.
Heat.
Heat.
Heat up here.
Heat up here.
Heat up here.
Heat. Heat.
Heat. Heat.
Heat up here.
Heat.
Heat.
Heat. Heat.
Heat up Heat up here.
Heat up here.
Heat. Heat.
Heat. Heat.
Heat. Heat.
Heat Heat up here.
Heat up here. Heat.
here. Heat.
Heat.
Heat. Heat.
Heat.
Heat.
Heat.
Heat.
Heat. Heat.
Heat up here.
Heat. Heat.
Heat. Heat.
Heat.
Heat.
Heat.
Heat.
Heat. Heat.
Heat. Heat.
Heat up here.
Heat. Heat.
Heat.
Heat.
Heat up here.
Heat.
Heat.
Heat. Heat.
Heat up Heat up here.
Heat. Heat.
Heat. Heat.
Heat.
Heat.
Heat.
Heat.
Heat up here.
Heat up here.
Heat up Heat.
Heat.
Heat. Heat.
Heat.
Heat.
Heat.
Heat. Heat.
Heat. Heat.
Heat.
Heat.
Heat.
Heat.
Heat.
Heat.
Heat.
Heat.
Heat.
Heat.
Heat.
Heat.
Heat. Heat.
Heat.
Heat.
Heat. Heat.
Heat. Heat. N.
Heat. Heat.
Heat. Heat.
Heat. Heat.
Heat Heat up Heat. Heat.
Heat. Heat.
Heat up here.
Heat. Heat.
Heat.
Heat.
Heat up here.
Heat up here.
Heat up here.
Heat. Heat.
Heat.
Heat.
Heat.
Heat.
Heat. Heat.
Heat Heat.
Heat.
Heat. Heat.
Heat. Heat.
Heat.
Heat.
Ladies and gentlemen, please welcome back to the stage Jed Borave.
>> Hello. Hello.
How was lunch?
Good. Good. All right. We're going to start by seeing how much we remember from this morning. Shout out one of your favorite talks. What was one of our
favorite talks. What was one of our favorite talks from earlier today?
>> Oh, wow. That's a lot. Dex. Yeah. Okay.
What else? Skills. Yeah, that was a good one. I heard some laughs at the last one
one. I heard some laughs at the last one from Kitsy.
>> Yeah. Okay. Well, fantastic. We have a bunch more great great sessions coming up. Um, but I also want to tell you a
up. Um, but I also want to tell you a little bit about what's happening backstage. So, all of these talks end up
backstage. So, all of these talks end up on YouTube and uh, Swix actually mentioned to us there's a little bit of a competition here. So, he looks at how popular each video is and that's how he
decides who to invite back next time.
Uh, and he's also thinking of creating this list of top AI engineer speakers.
Um, but there's a little bit of a problem. The MC panels don't end up on
problem. The MC panels don't end up on on YouTube. So, um, we actually created
on YouTube. So, um, we actually created this very, uh, it's really nice site for you to take a moment, vote on your favorite MC. Um, I know some of you were
favorite MC. Um, I know some of you were in the sessions yesterday. Uh, so Alex Lieberman was the MC from yesterday. You
can pick to which one you want to vote for. Um, if you're just today, hopefully
for. Um, if you're just today, hopefully an easy choice, but yeah, take a moment, vote. You can vote as many times as you
vote. You can vote as many times as you want. Um, and yeah, we'll we'll talk
want. Um, and yeah, we'll we'll talk about the results at the end. Okay, so
while you do that, um, I'm going to go ahead and introduce our next block. We
have a group of talks from Google, Factory AI, Source Graph, Gimut Labs, and Netflix. Um, to start, please join
and Netflix. Um, to start, please join me in welcoming our next speakers from the product and design team at AI Studio, Cat Conf and Armaresi.
Hi everyone. How's day going?
Good.
We are super excited to be here. It's
been obviously a very exciting week in AI. It's been a very exciting and busy
AI. It's been a very exciting and busy week over here at DeepMind. So super
excited to chat with you about our newest models and build some demos live with you all. I'm Cat. I work on vibe cod and AI studio. This is Amar. He
leads our product and design team for AI studio. Uh but I want to step back for a
studio. Uh but I want to step back for a second and talk about uh the journey at DeepMind generally. So what's I think
DeepMind generally. So what's I think particularly unique about Google's journey right now is that DeepMind has been innovating here for not just this week or this past year but for years and
years uh with things like the transformer, Alph Go, etc. And this is obviously a graphic from 5 days ago because it ends with Gemini 2.5 and we
are super excited to have announced earlier this week Gemini 3 Pro.
Hopefully this message has reached you all already. If not, we have a lot of
all already. If not, we have a lot of work to do. Uh but this is our latest most intelligent state-of-the-art model.
Um and ultimately what we want folks to understand with Gemini 3 is that we can really build anything. And that comes in two major capabilities. I think the first is the UI and aesthetic
sensibilities of Gemini 3. It's very
very strong at design understanding and generating websites and good UIs uh in one shot. And the second is with agentic
one shot. And the second is with agentic tool calling. So I think this goes back
tool calling. So I think this goes back to the sort of spectrum we're seeing with models. Sometimes you want a
with models. Sometimes you want a oneshot website and sometimes you want to do really complex tasks within you know massive code bases and that's where tool calling and agentic use can be uh
be particularly powerful. So with Gemini 3, we see on the right is a SWE um experiment where it was a base agent harness across a few different models
and we can see Gemini 3 is vastly above uh in performance in agentic scenarios and then as well leaps above our previous models and state-of-the-art across the board. Uh so super excited to
see what you folks build with this model. Um, and in the meantime, we, you
model. Um, and in the meantime, we, you know, launched this on Tuesday, but there was still three days left in the week, so we had to launch something else as well. So, I hand it off to Amara to
as well. So, I hand it off to Amara to talk about our pro image model.
>> Yeah. So, at Deep Mind, I think you have a few days left in the week. You choose
to launch another breakthrough model.
And so, uh, we're really excited about Nano Banana Pro, which came out yesterday. Uh, and it's a huge leap on
yesterday. Uh, and it's a huge leap on our our already state-of-the-art image model. So with Enerban Pro, uh, one of
model. So with Enerban Pro, uh, one of the things that I love about it the most is its world knowledge. So it's powered by Google search. Uh, and so you can ask it all sorts of things like how do I make this tea? And it'll actually go
search Google search, create an a detailed infographic for you and diagram for you. Uh, and there all sorts of
for you. Uh, and there all sorts of things now with accurate information that it can do. And the other thing you're noticing here is improved text rendering. So text is one of those small
rendering. So text is one of those small details that if you get it wrong, you can pretty much pick it up quickly. But
an Anima Pro 2 does an amazing job at text rendering. Uh, and you can see that
text rendering. Uh, and you can see that in a bunch of examples like here where it wraps around the can perfectly and it also has a bunch of localization as well. So tons of languages, Korean on
well. So tons of languages, Korean on the right, so it can translate images as well and render them perfectly on the exact same reference image. Uh, on top of that, consistency is improved. So uh,
you can now put up to 14 people in an image and then can create this group shot you can see on the right. uh and
that uh it can do more but 14 is basically our our kind of benchmark so far. Um and that also enables a whole
far. Um and that also enables a whole set of new use cases. Uh and then creative controls as well. So you can see here on the left the focus is on the woman and on the right on the flowers
and this was just a simple prompt. All
you had to say was change the focus to the flowers. Maintains everything in the
the flowers. Maintains everything in the previous image just changes the focus.
So incredible outputs as well uh with Nano Banana Pro and a range of aspect ratios. So, if you want to generate uh
ratios. So, if you want to generate uh wallpapers or big banners or advertising boards, you can do all of that as well.
Um, so anyway, instead of talking, we decided we're just going to show you a bunch of demos live of what we've been building with these products uh over the last week. Um, and yeah, excited to jump
last week. Um, and yeah, excited to jump into it. So, let's do that. Uh, all
into it. So, let's do that. Uh, all
right. So, cat,
>> yes, >> take it away.
>> Here we go. Cat tabs. Um, cool. So, for
folks who aren't familiar, this is Google AI Studio. It's our home for getting started with the latest Gemini models. You can get your API key, chat
models. You can get your API key, chat with the latest models, including Gemini 3 and a Banana Pro. Uh, but today we're going to be focusing on this build experience. So, this is our vibe coding
experience. So, this is our vibe coding experience in a studio. You can see here we have a gallery of a bunch of example apps, a bunch of very cool uh to the aesthetics point of Gemini 3, a bunch of
very cool Gemini 3 examples. Um, but you can also go prompt to apply here. And
this is free to use. And I think one of the unique things about AI Studio is how easy it is to integrate the Gemini API into your application. So we can see here at the bottom there's a bunch of
these what we call AI chips um that showcase a ton of the unique features beyond just the model you're choosing with the Gemini API. Different tools you can use like Google search grounding,
Google Maps grounding. We also let you build with our live API. So, you can do oneshot examples of I have one that lets me input a webcam of my tennis swing and it'll give live corrections on my swing.
Um, >> you also made one to improve my posture.
>> Yeah.
>> Yeah. If you lean forward too much, live API will yell at you. Um, so it's a very flexible way to get started building AI powered apps. Um, and the other cool
powered apps. Um, and the other cool thing is you don't actually need an API key here for most of the models. So you
can build your application, you can share it with the world and anyone who comes and visits your shared application will be using their AI studio free quota. So you don't have to worry about,
quota. So you don't have to worry about, you know, hopefully you have an app that goes superval. You won't have to worry
goes superval. You won't have to worry about a crazy surprise API bill or anything like that. Um, so I'm going to actually shoot off a prompt here that is using our latest nano banana model. And
that basically allows us to use Google search grounding to create a illustration of laptop stickers. And
this is one of the viral trends we've been seeing with Nano Banana Pro. Um, so
I'll kick this off and what this will do, I have the AI chip that tells it to use the Pro model. And this will sum up my prompt and go talk to Gemini 3 to
break down the task and start generating my end-to-end application. Uh but while that builds, I'm going to hand it off to Amara to show some demos in the meantime.
>> Cool. I think the other thing to point out here is that uh we're trying to think through how the vibe coding experience is also powered by AI every step of the way. So you're seeing here even in the loading screen uh it is
using Gemini and thinking through this app that you're making and how you could extend it. Um, and so we're thinking
extend it. Um, and so we're thinking through breaking those typical vibe coding paradigms as well and helping you iterate with the model as your partner.
But anyway, let me jump right into the text rendering demo. So, uh, when I heard of text rendering for the first time and the consistency that we were getting with Naman Pro, my mind went to comic books. Uh, and so I was thinking,
comic books. Uh, and so I was thinking, why can't I now be in my own comic book adventure um, and also place cat in there and then maybe we can tell the story. And so uh in this app uh also
story. And so uh in this app uh also vibe coded you can just upload a face of somebody. So I've got some face of
somebody. So I've got some face of course but I'll use I'll use cat here uh and myself um and then uh we can choose
the genre of the story um and and all the languages that we have so far. Uh
I'm going to do a story about us presenting at AI Engineer um in New York uh presenting AI studio and we are uh vibe coding and winging our presentation. That's where we're going
presentation. That's where we're going to be doing this comic book story. So
we'll fire that off. Uh but while we wait for that um the other cool thing about this is uh we'll wait for that to generate. But I want to show you the
generate. But I want to show you the design sensibilities as well. So, you
know that if you've been working with AI models and generating websites, they've been creating purple gradients and things that just, you know, they kill me as a designer. So, um, and so it's been really nice to see how this model is
able to build some beautiful websites.
So, this is using shader animations, uh, flowing through all these different pages, uh, and adds all sorts of cool transitions and effects. Picked out the typography by itself. And this was the initial prompt. Just create a slick
initial prompt. Just create a slick animation website. kind of actually did
animation website. kind of actually did say no cyber punk should >> but suppose >> just got to make sure
>> but but yeah you get some incredible results um and and now what I love about this is so many folks who you know were struggling with design who might have you know still tried to gro their way around Figma don't have to do that
anymore they can actually just go in prompt their way to something pretty nice okay back to the comic book okay pretty flattering uh comic book here.
Um, that, you know, I'll take it. Uh,
and you can see here that it's rendering the comic book. It's got, uh, rich text rendering showing us the story. And the
other thing here is that, uh, because it's powered by Gemini 3, it's actually really creative at the story it's generating. And honestly, some of these
generating. And honestly, some of these stories have genuinely made me laugh, which is the first time uh, that's happened with one of these models. Uh,
and so you can see we're rushing to the conference. even background details like
conference. even background details like the AI engineer banner over here uh being rendered and of course since this is a vibe coded app we can take this story in any direction. So one feature I
did introduce is that you can choose the direction of the story midway. So you
know do we find a quiet corner and try to check if our API keys work or do we just embrace it and go full improv? I
think we're going to go full improv. Uh
and so that's changed that story. Uh,
and so talking about the humor here, you can see Amar dodged a woman carrying a suspiciously functional robot dog. So I
don't know if that was announced at the conference today, but uh, pretty cool.
Um, and then now it's generating the rest of the story here on the right while we wait. So pretty cool to see how you can make these really dynamic, rich experiences with both the creativity of
the model and Nano Pro's image capabilities.
>> Love it.
>> Yeah. Back to you, K.
>> Yeah. Yeah. I will show. Let's hope my sticker demo is finished up. Uh, cool.
So, I'm gonna add an API key. So, Nana
Banana is a new model and it's fresh off our launch of Gemini 3. So, for now it is a page experience in a AI studio. Um,
but what I can do is I can see that here I can enter different words that I want my stickers based off or I can go use Google search. So, let's try the Google
Google search. So, let's try the Google search. I'm going to type in a Mars
search. I'm going to type in a Mars name. And one of the other cool things
name. And one of the other cool things about this new model is that you can select the resolution as well. So, in
this case, I'll just do 1K. Uh, but what this will hopefully do, but again, you saw it on one shot live. Uh, is go talk to Google search, grab their latest sources on Amar, build the context about
what he likes, what his laptop stickers might look like. I think it's just deep mind, but if he were more uh if he wanted to express himself more.
>> Oh, boy. Uh, and so he can see here.
>> Yeah, there he is. Weekend builder.
>> That's true.
>> Uh, yeah. And for those who don't know, Amomar has a children's book, Alice and Sparkle, which, yeah, it's clearly he's talked about a lot because it's highly represented here.
>> But, um, very cool to see how it can bring in that contextual knowledge. Um,
we've also seen this with like news events, getting relevant information on that day rather than having to rely on the knowledge cut off of the model. Um,
so one other thing I'll show you folks is how we use AI studio to build AI studio. Uh, so Amar and I have a lot of
studio. Uh, so Amar and I have a lot of ideas, only so many engineers to work on these ideas. So we love to use AI studio
these ideas. So we love to use AI studio to ideulate and explore different concepts. So one of the concepts we've
concepts. So one of the concepts we've been working on is I'm sure you folks have seen we announced a new agentic IDE at Google earlier this week called anti-gravity. And we know that sometimes
anti-gravity. And we know that sometimes these web-based vibe codings tools you they have their limits and you may want to go into an IDE to add certain
features to the application or make it specific to mobile things like that that might be a bit limiting in AI studio right now. So we want it to be super
right now. So we want it to be super easy to migrate into anti-gravity. So
what I did here was just a oneshot prompt of a screenshot of AI studio. I
said clone this UI as closely as possible and then add a flow to export to our anti-gravity app. So we can see it did a pretty great job of cloning light mode. The screenshot was in light
light mode. The screenshot was in light mode too of our AI studio application and copying it and improving a little bit on Amar's designs.
But then we see this new anti-gravity button that is creating my an export and then exporting it to anti-gravity. And I
can go and open in the IDE. And I think these are the types of creative interactions that web-based vibe coding tools can be particularly useful for because if we had went and jammed on this feature, we probably would have
constrained ourselves to existing patterns in AI Studio. And in this case, I told the model, be creative, think outside the box. And I've played with this one a bunch. Sometimes it gives a
command line interface for ex or showing the status of the export, etc. Uh, so I think it's a super cool way for you to ideulate on new ideas for UI and kind of
expand on your product. Uh, but I'll hand it back to Omar.
>> Let's do it. Uh, and then the other thing that Gemini 3 has been really impressed like impressed us with is just making video games. And so this one was again pretty simple prompts. Make this
racing game where I have a bot now at a start screen. Um, and so you can see I
start screen. Um, and so you can see I got this 3D racing game in 3JS. I drew
all the things. I'm racing with a bot here. Uh, and then one thing I added for
here. Uh, and then one thing I added for myself to cheat is I can just boost away and beat the bot. So, uh, pretty nice.
But, but the thing I want to tease actually is that, um, all of these apps so far have been front-end React apps.
Uh, and so the thing that's coming very, very soon to AI Studio is going to be backend support um, and full stack runtime. So, if you want to install
runtime. So, if you want to install Shadci and if you want to do all of those things, you'll be able to do that again with one prompt. And the principle with AI Studio here is that we don't want you to think about those details.
You should just be able to ask, I want to make a multiplayer app, and we know that you need to use Express and wire that all up for you. Uh, and abstract all those details away. So, we're going to try something a little risky here,
which is we did turn this racing game into a multiplayer one. Um, and, uh, this was again a couple of prompts. Uh,
so we're going to put a QR code up if you want to join us uh, in the racing game. We've never tried with nearly this
game. We've never tried with nearly this many people, so we'll see.
>> Hopefully this works.
>> Uh, but QR codes up here. Uh, so if you scan that, hopefully should load the game. I'm really afraid of how this is
game. I'm really afraid of how this is going to explode.
>> Here we go.
>> All these cars loading in.
>> Nice.
>> So yeah, people have scanned that. We
can switch back to the game. Okay. Oh my
god.
>> So yeah, just hit ready uh when you're all ready.
Oh boy. Uh,
I think this lobby is going to expose.
>> Everyone leave.
>> So, this is where I shouldn't have added collisions with other cards because you could clearly see that we're bouncing around.
>> 19 players, 20 players. I don't know if this race will ever start, but we're all blocked on the uh, you know, the start line, but 23 players. Pretty cool. Uh,
yeah, you do all have to hit ready for us to start this race. So,
so we might be here all day. Uh,
but yeah, that is pretty pretty incredible. Um, I can't start this race.
incredible. Um, I can't start this race.
So, do you want to wrap up?
>> Hope to see you all.
>> That's pretty cool. The runtime didn't explode.
>> Yeah. And I think we're super excited not only about the multiplayer game. So,
next time we'll have even more of you folks join. Uh, but also, you know, the
folks join. Uh, but also, you know, the extensibility that comes with a full stack runtime. uh we want to make it
stack runtime. uh we want to make it super easy for you to integrate with our oneP and popular thirdparty APIs etc. So very exciting next few months on the AI studio vibe coding side and super
excited for you all to try it. Um but I think the one thing I want to step back and emphasize is what makes us so excited about this project and the work that a lot of us are doing is that we get to be the first generation of
engineers who are building tools for a world where anyone can build software.
So I think what's beautiful about things like vibe coding is watching people. We
were actually talking to a tech support person earlier this morning who said they started vibe coding an AI studio after seeing a YouTube video and we're really democratizing who can create things and we're all getting to build
those tools that enable that and I think it forces us to rethink the paradigms that we've become so used to. So it may not be your base IDE that people are starting from, but how can we intuit it
as much of the user intent as possible and that's what we want to do with full stack runtime and AI studio is make it very easy to not have to think about I want to add a database but if your app needs storage it'll have storage. If you
want to have a if you have an e-commerce app we'll add a payment solution and make it as easy as possible to build the future of software. Um so thank you folks for joining us. If you have any
cool examples you've built or questions, feel free to ping me and Amar on Twitter. Uh, and yeah, enjoy the rest of
Twitter. Uh, and yeah, enjoy the rest of the day.
>> Yeah, thank you.
>> What if the reason you're struggling with agents is not the agents themselves, but the environments in which they operate? Here to present us with eight categories to make your
codebase agent ready is co-founder and CTO of factory Eno Reyes.
Hey everybody, my name is Eno. Uh really
pumped to talk today about uh something that at Factory we care a lot about. uh
when we started 2 and a half years ago uh we said that our mission is to bring autonomy to software engineering. Um and
that is like got a ton of loaded words in it. That sounds a little buzzwordy
in it. That sounds a little buzzwordy right now, but I think that the my goal is that you guys leave this like roughly 20 minutes uh with a bunch of insights that will apply to your organization uh
and the teams that you build, the companies you advise, um and if you're building products in the space, uh insight into like sort of maybe how to think about building autonomous systems and also making your engineering org one
that's able to use agents really successfully. Um, a sort of like plus of
successfully. Um, a sort of like plus of this is that ideally this applies to any tools you're using that involve AI. So
it won't be specific to like our product or any of the other amazing tools out there. Um, I'd like to start with a
there. Um, I'd like to start with a little bit about uh, you know, Andre Karpathy had a very welltimed tweet. Uh,
so of course I'm going to mention it.
Uh, you know, he he kind of talked about uh, this idea of software 2.0 coming from auto uh, the the ability to verify things, right? Um, this is something
things, right? Um, this is something that's in sort of like the the mind of Silicon Valley right now as uh the most frontier models are built with post- training that involve lots of like
verifiable tasks. Um, and really I think
verifiable tasks. Um, and really I think the most interesting thing here is the sort of frontier and boundary of what can be solved by AI systems is really just a uh sort of an input function of
whether or not you can specify an objective and search through the space of possible uh solutions, right? And so
uh we are used to building software uh purely via specification. We say like the algorithm does this and like input is x output is y. But if you sort of shift your mindset to thinking about
automation via verification uh it is a little bit of a of a difference in what is possible to build. Um and there is another great blog post by uh Jason
where he talks about the asymmetry of verification. Uh this is like pretty
verification. Uh this is like pretty intuitive to most people who know about like P versus NP. Uh it's like a a thing that a lot of people have talked about throughout the like history of computing and and software. But there are a ton of
tasks that are much easier to verify than they are to solve. Um and and vice versa. But but the the most interesting
versa. But but the the most interesting sorts of uh easy to verify problems are ones where there's an objective truth.
They're pretty quick to validate whether or not they're true. Uh they're
scalable. So validating a bunch of these things maybe in parallel. uh is easy. Um
it's low noise, so your chance of validating it is like really really high. Um and they have continuous sort
high. Um and they have continuous sort of signals. Uh it's not just like a
of signals. Uh it's not just like a binary yes no, but like maybe you're 30% 70% 100% accurate or correct. Um and you know, the reason I bring both these
things up is software development is highly verifiable, right? This is like the frontier. It's why uh software
the frontier. It's why uh software development agents are the most advanced agents in the world right now. uh and
there are so much uh there's so much work that has been put in uh over the last you know 20 to 30 years around the automated validation and verification of
software that you build um testing right unit tests end to end tests QA tests right um the frontier of this is expanding there's tons of cool companies like browser base and computer use
agents and all these things that are making it easier to validate uh really complex visual or front-end changes um docs right having like an open API spec for your codebase uh is something that
can be automated. It's validated. Um I I I can go through and enumerate a bunch of these, but I actually think it is sort of a nice checklist for yourself, right? Do you have some automated
right? Do you have some automated validation for the format of your code?
Uh do you have llinters? These things
for professional software engineers are sort of like, yeah, of course we do. But
I think you can go a step further, right? This is where that continuous
right? This is where that continuous validation component comes in. Um, do
you have llinters that are so opinionated that a coding agent will always make code that is exactly at the level of what your senior engineers will produce? How do you do that? What does
produce? How do you do that? What does
that even mean? Right? Do you have tests that will fail when AI slop has been introduced? Uh, and when highquality AI
introduced? Uh, and when highquality AI code is introduced, those tests pass, right? These additional layers of
right? These additional layers of validators are things that most code bases actually lack because humans are pretty good at handling most of this
stuff without the automated validation.
Right? Your company may be at some test coverage rate that's like 50% or 60%.
And that's good enough because humans will test manually. Um you may have a flaky build that every third build it sort of fails and everyone at your company secretly hates it but no one says anything, right? These are the
sorts of things that we know are true about large code bases. And as you scale out to extremely large code bases, organizations with 44,000 plus engineers, right? Uh this starts to
engineers, right? Uh this starts to become a very accepted norm that the bar is sort of maybe at 50% or 60%. Um and
the reality is is most software orgs can actually scale like that. uh it's sort of fine to be at that lower uh barrier, but when you start introducing AI agents into your software development life
cycle, and I don't just mean in interactive coding, but really across the board, right? Uh review,
documentation, testing, all this stuff.
Um this breaks their capabilities. Most
of you have probably only seen an AI agent that operates in a codebase that has uh a decent amount of validation. Um
I think a lot of the best companies in the world right now actually have introduced very rigorous validation criteria and it means that their ability to use agents is significantly greater
than that your like average uh developer.
Uh you know and and if you think about it this like traditional loop of understanding a problem designing a solution to the problem coding it out and then testing it uh sort of shifts if
you have really rigorous validation. Uh
it becomes a process of when you're using agents specifying the constraints by which you would like to be validated and what should be built. Uh generating
solutions to that outcome verifying uh both with your automated validation as well as with your your own intuition. Um
and then iteration where you continue to iterate on that loop. This move from sort of like traditional development to spec specificationdriven development is one that we're starting to see sort of
bleed into all of the different tools.
Different tools have spec mode. Droids
have like our Droid is our coding agent have like specification mode, plan mode.
Uh there are entire idees that orient you around this like specificationdriven flow. Um and if you combine these two
flow. Um and if you combine these two things together, this is really how you build reliable and highquality solutions. So if you think about it,
solutions. So if you think about it, what is like the best decision for you to make as an organization? Is it
spending 45 days comparing every single possible coding tool in the space and then determining that one tool is slightly better because it's 10% more accurate at Swebench or is it making
changes to your organizational practices that enable all of these coding agents to succeed and then picking one that you're, you know, developers like or honestly letting people choose from the
tons of amazing tools out there.
And when you have these validation criteria, you can actually introduce way more complex AI workflows to your organization, right? Uh if you cannot
organization, right? Uh if you cannot automatically validate whether or not a uh a PR is like reasonably successful or has code that won't definitely break
prod, you are not going to be parallelizing several like agents at once, right? you are not going to be
once, right? you are not going to be decomposing a largecale modernization project uh into a bunch of different subtasks like that is that is a very
frontier style task to use AI for and if the single task execution right the simple I would like to get this done here's exactly how I'd like it to be done and here's how you should validate
if that does not work nearly 100% of the time you can sort of forget successfully using these other things at scale in your company um when you get into other tools like code review, right? Uh if you
want a really high quality AI generated code review, you need documentation for your AI systems. Uh and yes, uh agents will get better at, you know, picking
out, you know, whether or not to run lint or test. They will get better at finding solutions when you don't have explicit pointers. They'll get better at
explicit pointers. They'll get better at search, but they won't get better at just randomly creating this validation criteria out of thin air. Right? This is
why we believe software developers, by the way, are going to continue to be heavily involved in the process of building software because your role starts to shift to curating the sort of environment and garden that your
software is built from. You're setting
the constraints. You're building these automations and introducing continued opinionatedness uh into the uh into these automations.
Um, and you know, if your company doesn't have at least all of these, right? Then that means that there's a
right? Then that means that there's a lot of work that you can do totally absent of a procurement cycle or buying one tool or trying out another one. Uh,
and so plug is that we help organizations do this, right? I think
that it's great to have tools that allow you to uh go in and assess this stuff.
They have ROI analytics that let you interact. Um but I think that for most
interact. Um but I think that for most organizations uh there is actually like a very clear way to do this right you can go and analyze where are you across
those eight different pillars of like automated validation do you have a llinter how good is the llinter do you have agents MD files an open standard that almost every single coding agent
supports um you can improve uh and systematically enhance uh these different validation criteria uh and you can go through and say Well, we're seeing that coding agents are reliable
enough for a senior developer to use, but our junior developers, if you have the tooling to tell, by the way, like which developer is using what tools, you you can ask questions like maybe our
junior developers are actually totally unable to use these coding agents. And
you'll learn that the reason why is not because they're like more incompetent or they don't know how to use the tool, but because there's these niche practices that you don't have automated validation for, right? And if you think about what
for, right? And if you think about what what is the difference between a like Google or a meta and a uh a still large but like 2,000 person engineering or the
difference is that a newrad with effectively zero context can go and ship a change to make YouTube's like boundary like slightly more round and it won't with some degree of confidence take down
YouTube for like a billion users, right?
And the reason that's possible is because of the insane amounts of validation that have to happen on that code for it to be shipped. The big
difference that we now have is we have coding agents that can go and identify exactly where these gaps are and they can actually remediate those fixes.
Right? So you can ask a coding agent, could you figure out where we're not being opinionated enough about our llinters. You can ask a coding agent to
llinters. You can ask a coding agent to generate tests. We have an engineer
generate tests. We have an engineer named Alvin who I love this quote. He
said a slop test is better than no test.
Uh and I think that that's slightly controversial, but the thing that I would argue here is that just having something there, right, that it passes uh when changes are correct and somewhat
accurately uh matches to the spec of what you want built, uh people will enhance it. They'll upgrade it and other
enhance it. They'll upgrade it and other agents will actually notice these tests.
They will follow the patterns. So the
more opinionated you get, the faster the cycle continues. So I think that what
cycle continues. So I think that what you guys should be thinking about is what are the feedback loops in our organization that we are catering towards. If you have better agents, they
towards. If you have better agents, they will make the environment better which will make the agents better which will mean you have more time to make the environment better. And this is sort of
environment better. And this is sort of the new DevX loop as well that organizations can invest in uh that will enhance all of the tools that you're procuring, right? So no matter whether
procuring, right? So no matter whether it's a code review tool, a coding agent, etc., they will all benefit. Um and I would argue that it sort of shifts your mental model about what you're as a leader investing in when you're
investing in your software right now.
The idea of uh you know opex as like the input to engineering projects like we are investing in we want more people in order to solve this problem. we need 10 more people. Um, I would I would argue
more people. Um, I would I would argue that uh the other thing that you can now start investing in is this environment feedback loop that enables these additional people to be significantly more successful, right? And I think that
that's the feedback loop that can actually take quite a lot of value because coding agents can just scale this out. So you know all of this is to
this out. So you know all of this is to say there's a lot that can be done outside of the like product itself uh to enable these systems and the best coding agents will actually take advantage of
these validation loops right so if your coding agent isn't proactively seeking llinters tests etc then you know at the end of the day it's not going to be as good as one that will seek those
validation criteria and in addition to that when organizations uh uh think about these sorts of things if you're the person who's able to say, "Here's my opinion. Here's how I want software to
opinion. Here's how I want software to be built." It scales your capabilities
be built." It scales your capabilities out greater than ever before. Like one
opinionated engineer can actually meaningfully change the velocity of the entire business if you take this to heart. Uh and you have a way to measure
heart. Uh and you have a way to measure and systematically improve. Um so that's uh you know the the majority of uh what I came here to say. I think that the the
the only thing that I'd leave you with uh is that when you think about where AI is going and like where we're at today, we are still really earn early in our
journey of using software development agents. If you want a world where the
agents. If you want a world where the moment a customer issue comes in, a bug is filed, that ticket is picked up, a coding agent executes on that, that
feedback is presented to a developer, they click approve, that code is merged and deployed to production in a feedback loop that takes maybe an hour, 2 hours.
That will be possible, right? We all are sort of skeptical about that fully autonomous flow. That is technically
autonomous flow. That is technically feasible today. The limiter is not the
feasible today. The limiter is not the capability of the coding agent. The
limit is your organization's validation criteria. So this is like an investment
criteria. So this is like an investment that made today will make your organization not 1.5x, not 2x, but that is where the real like 5x, 6x, 7x comes
from. Um, and it's sort of a an easy
from. Um, and it's sort of a an easy thing to say and it's an unfortunate story because what that means is you have to invest in this. It's not
something that like AI will just magically give to you. Uh it's a choice that you as an organization have. Uh and
if you make it now, I can guarantee you that you will be in the top 1 5% of organizations in terms of edge velocity.
Um and you will out compete everybody else in the field. So highly recommend investing in this sort of stuff and hopefully you found this helpful and have some lessons to take home. Thanks.
Our next presenter is the co-founder and CTO of Source Graph and Ampode here to provide an overview of Ampcode's
approach to AI powered software development. Please join me in welcoming
development. Please join me in welcoming to the stage Banglu.
Hey everyone, how's everyone doing today?
>> Yeah, cool. Pretty cool conference, huh?
Um, so yeah, my name is Bang. I'm here
to talk about AMP. AMP is an opinionated Frontier Agent. Uh, so before I get into
Frontier Agent. Uh, so before I get into what that means, uh, who are we? Uh,
we're the bunch of weirdos downstairs at the booth with the weird pied piper dude on the floating golden fish. Uh, and I think that kind of captures the ethos of what we're trying to do uh, with AMP.
We're trying to lean into that sense of awe and absurdity that I think we all experience right now living in this weird world we're living in where agents are writing an increasingly large amount
of of our code. Uh and it's just kind of like weird and magical. Like if you imagine how you were working like a year ago compared to how you're working now, it it feels completely different. And so
we're embracing that sense of change and we really want to be the agent research lab that's sort of like living one year in the future and figuring out how this all kind of pans out. Okay, so what is
AMP actually? Well, it's a it's a coding
AMP actually? Well, it's a it's a coding agent that you can invoke from the terminal. So here's our terminal UI. Uh
terminal. So here's our terminal UI. Uh
we actually ended up building a complete terminal UI framework up from scratch because we wanted to take advantage of all the capabilities of modern terminals. And one of the balances we
terminals. And one of the balances we tried to strike in in this UI is we try to show the right amount of information to the user that conveys what the agent is doing without overwhelming you with, you know, every single token of
explanation uh that the model is generating. We stream the diffs that
generating. We stream the diffs that it's making. Uh we show you what CLI
it's making. Uh we show you what CLI commands it's using. And if you look in the bottom right hand corner there, you'll see a little Emacs 30.1 thing.
This also connects to the editor that you're using where it collects diagnostics. So Emacs, Neovim, Jet
diagnostics. So Emacs, Neovim, Jet Brains. Uh you can connect the the CLI
Brains. Uh you can connect the the CLI to your editor uh to collect additional information that's relevant to the task at hand. And so this particular video is
at hand. And so this particular video is just AMP implementing a small feature to itself. uh we asked it actually to add a
itself. uh we asked it actually to add a little help button in the bottom lefthand corner. Uh and so that's just a
lefthand corner. Uh and so that's just a quick demo to show you that uh the agent is pretty good at finding the relevant context and iterating towards that. Uh
we also have an editor experience. Uh so
we've not found the motivation yet to fork VS code. Maybe we will in the future, but right now this installs into VS Code or any of its derivatives,
Cursor, Windsurf, anti-gravity. Um, and
the idea here is you really write all your code through this agent panel. At
least I do. Um, I I actually spend very little time, you know, actually manually editing code now. And one of the bottlenecks we identified in the editor is I don't know about you, but I spend most of my time effectively doing code review now. Um, just in the editor
review now. Um, just in the editor trying to read through all the agent output. That's the thing that constrains
output. That's the thing that constrains me from, uh, fully paralyzing, you know, two 3x the number of agents that I can run at at a given time. So, we built a
reu interface that I'll talk about uh in more depth uh in a bit that uh kind of helps you streamline that process, guides you through the process of understanding what the agent wrote so that you can ensure that you're not
shipping something that's super sloppy or spaghetti.
Okay, so I hear all of you thinking like, okay, yeah, yeah, it looks pretty, but what actually is different? You
know, why is this better than the like 20 other coding agents uh here? And I
think the best way to convey this is I'm not going to try to convince you that it's better. I think that is ultimately
it's better. I think that is ultimately up to you trying different things out and seeing what actually works. But I am going to try to convince you that we're thinking about things in a very different opinionated and weird manner.
So I want to take you on the journey of us building AMP and all the different sort of contrarian or spicy takes that we've made uh decision-wise in the architecture of of the agent along the
way. Okay. So let's start at the
way. Okay. So let's start at the beginning. Um, hello agent. What is an
beginning. Um, hello agent. What is an agent at its core? Well, all an agent is, as I'm sure most of you know, uh is it's a for loop uh with tool calls and a
model uh in the middle. And the reason I want to present this slide is because think of it this way really tells you what sort of levers you have to pull as a builder of agent. Uh there there's certain things that you can change. You
can change the choice of model. You can
change the tool descriptions and you can change uh how the model iterates with those tools. And those are effectively
those tools. And those are effectively your levers. Seems like a few amount of
your levers. Seems like a few amount of levers but you know just like programming languages all those are syntactic sugar around if statements and for loops you know you can get a surprisingly wide variance of behaviors
and complexity out of that. And so one of the key levers in building any agent is the set of tools. And these days you cannot talk about tools without talking about MCP. So one of the early decisions
about MCP. So one of the early decisions we had to make in building AMP is how much do we invest in the MCP integrations? and MCP is this amazing
integrations? and MCP is this amazing new protocol that's gotten everyone and their mom thinking about how to provide context to agents. Um, should we lean into that or should we start building our own custom tool set? And our very
opinionated take, I think this is maybe, you know, less controversial now than it was, you know, back in in April was that we should really actually focus most of our attention on the core uh set of
tools within AMP. And that's really for two reasons. One is because the more you
two reasons. One is because the more you work with agents, the more you find uh out that what you're trying to do is identify these feedback loops and help the agent close them. And in order to do
that, you need a refined tool set that is really geared toward helping the agent find those loops. And you cannot do that with MCP servers. The creator of the MCP server doesn't know what your agent is trying to do. And so they're not going to tune the tool descriptions
to what you're trying to accomplish. And
then the second piece of this is context confusion. So the more tools that you
confusion. So the more tools that you add into the context window, uh the more things that the agent has to choose from. And if the tools aren't relevant
from. And if the tools aren't relevant to the task at hand, it ends up getting confused. So we've leaned hard into this
confused. So we've leaned hard into this uh custom tool set. And you'll see a little bit more about that in just a little bit. But before that, I wanted to
little bit. But before that, I wanted to call out another issue with uh tool use, which is it's not just tool descriptions that eat up context. It's the tool calls themselves that also eat up context. And
so everyone who's built an agent has run into a context exhaustion problem where you know if you use any sort of coding agent if it's good it's going to go out and try to find a bunch of relevant context by grepping and reading files first and by the time it gets to editing
there's only a small amount of context window left and so maybe it has to stop prematurely. And so the naive way to fix
prematurely. And so the naive way to fix this is just to prompt it to you know do less reads so you can do more iterations on the edit side. But then this leads to another failure mode which I call the doom loop mode which is it doesn't
gather enough context in the beginning and so it ends up not figuring out what it needs to do and just retries the same thing over and over again. And so the way to solve this is really with uh sub agents. So sub aents are the analog to
agents. So sub aents are the analog to subutine calls in regular programming languages. It's how you can factor out
languages. It's how you can factor out the context window used for a subtask into a separate context window which is the sub aents context window. Uh it can do all the things it needs and then at the end of the day it only returns the
relevant results to the main uh agent window. So sub aents are effectively a
window. So sub aents are effectively a way to conserve and extend the context window of your main agent. Uh so sub aents sub agents are great. I think
everyone uh building agents has probably heard of or or use sub aents uh so far, but I think we have a unique take on sub aents which is we're not really doing generic sub uh sub aents where you kind of tweak the system prompt and tweak the
tool set a little bit. We've really
leaned into our sub aents. Uh and so we have uh three to four really core sub aents that really extend the functionality and capability uh of AMP itself. The first one is something that
itself. The first one is something that we call the finder. This is effectively our codebase uh search sub aent. It's
gone through an evolution of models and we've ended up at the point now where we're using a relatively small and quick model to drive a limited tool set that we found really is optimal for quickly discovering uh relevant context within
the codebase. Another sub agent that
the codebase. Another sub agent that we've implemented is the thing that we call the oracle. So this is how AMP does reasoning. So in contrast to most agents
reasoning. So in contrast to most agents which uh you know implement reasoning in the model selection part of the experience, we found the best way to implement reasoning models is really
through a sub agent. What that allows you to do is preserve the relative like snappiness uh in the main agent as well as its ability to to use a variety of different tools and then only when you
need to debug a tricky problem or plan something very nuance, it drops into this Oracle sub agent and figures things out. And this is something that's like
out. And this is something that's like kind of magical. It's like anytime the main agent has trouble uh figuring out something and I'm like I don't want to spend like one to two hours going down this rabbit hole, I just like tag the oracles. So like invoke the oracle,
oracles. So like invoke the oracle, think really hard. I go like alt tab check my email for a bit and sometimes it takes a few minutes because it's thinking really deeply but I think like four out of five times it just magically
finds uh uh the underlying issue. We
also have a librarian sub agent which is meant to fetch context beyond the codebase. So from libraries and
codebase. So from libraries and frameworks that you depend on. And then
there's a new experimental sub aent that we call the kraken. U its job is it doesn't edit code files uh one by one.
Uh it really is all about writing code mods to do these kind of like large scale refactors. So we're leaning hard
scale refactors. So we're leaning hard into the sub aents and uh that's really in contrast to a lot of the existing uh coding agents. I think almost every
coding agents. I think almost every other coding agent implements a model selector as one of the core uh UX components and we just don't think that this is the architecture of the future.
I get that you know developers like choice or at least the the possibility of choice but the problem with choice is that there's also a paradox of choice.
The more choices that you have the more uh kind of like cognitive burden it is to choose from these different models.
And that means at the architectural level, if you have n different models and one agent harness that you can only lightly customize each model, it means you're never really optimizing for what
any one given model uh can do. And so
AMP's architecture is much more agent-oriented. We have two top level
agent-oriented. We have two top level agents, a smart agent and a rush agent.
And the smart agent is the one that has access to all those fancy sub agents and can do a lot of things. It's it's a little bit slower, but you can hand it more complex instructions. And then the rush agent is for uh those kind of like
in the loop tasks where you want to be tight in the loop and you're making quick targeted uh edits to to the code.
And why do we have two top agents? It's
really we're trying to kind of like pick points along the frontier of intelligence and speed that are meaningful to the user experience. So in
in talking to our users, we found that there's kind of two modalities for invoking agents. Now, one is you kind of
invoking agents. Now, one is you kind of like spin off a task and have it run and then review the code when it's finished asynchronously. Uh or you want to be in
asynchronously. Uh or you want to be in the loop, you know, quickly having the agent make edits while you quickly review them one by one. Kind of like babysitting the agent uh in the inner loop.
And we're very intentional about the model choice here. We've only switched the smart model once, and that was actually two days ago uh when Gemini 3 uh was released. And I think you know the the reaction Gemini 3 has been really interesting to watch. I think
you'll see widely different behavior from Gemini 3 in different uh agent settings. So for those of you who've
settings. So for those of you who've tried it out in other settings, I highly encourage you to try it in AMP. We did a lot of testing in the week before the release to optimize the smart agent to take full advantage of of its
capabilities and uh we're absolutely loving it. We're still working through
loving it. We're still working through some kinks obviously because it's a a new model, but we feel confident that it it's again moved the frontier of what's possible.
Okay, so we talked a lot about like agent construction, the behavior. I want
to talk a little bit about the UI layer of agents as well. So, you know, editor versus terminal, we're doing both. Um,
and I think that's because both of them tackle kind of like different modalities of working. Uh, but we do have
of working. Uh, but we do have opinionated takes uh in each interface.
So, in the editor, I think of my editor now more as a readitor uh uh more than anything else because uh I don't know like if you're using agents heavily, I don't think you're really editing all
that much uh in your editor anymore.
You're mainly driving edits through the agent panel, which is what you see on the right hand side or the right hand side here. And then what I do in in my
side here. And then what I do in in my editors, I pop over to the side panel, which is optimized for reviewing different diffs. So we actually built a
different diffs. So we actually built a custom diff viewer for the way that people are consuming agentic output. You
can select any arbitrary commit range quickly view through the file level diffs. All the diffs are editable and
diffs. All the diffs are editable and you have full code navigation. So go to definition find references and there's a feature at the bottom that gives you a tour of the change. So it actually guides you through which files you
should read first because I find half the battle when reviewing a large change is figuring out where to start. So the
guey aspect of the editor allows us to build a very rich uh experience uh for for this type of thing. And then
meanwhile in the terminal um we really want to take advantage full advantage of the the features and rendering capabilities of modern terminals. So uh
we actually have one of the core contributors to Ghosty uh the open source uh terminal uh that built a uh a TUI framework from scratch to power the AMPUI. So, one of the nice things that
AMPUI. So, one of the nice things that we can do is just to point out a little detail, the the green color of the diff rendering on the left-h hand side terminal, it's actually we can have the
terminal mix in the color green with whatever background color it's using.
So, that allows for a much nicer display. At the same time, we know that
display. At the same time, we know that people use all sorts of terminals uh including like terminals in Jet Brains or VS Code and other editors. And so,
we've added the ability to gracefully degrade. So even if you're using AMP in
degrade. So even if you're using AMP in like the default Mac OS terminal, it falls back to the capabilities that uh are uh available in in that setting.
Another aspect about how we're thinking about coding agents is really from the how do we get people to learn this new craft? Like we think that uh human
craft? Like we think that uh human developers are going to be around for a long long time, but we essentially have to relearn the craft of how to code uh together. And so one of the first
together. And so one of the first features that we built into AMP was the ability to share threads with your teammates. So, if you're using AMP on
teammates. So, if you're using AMP on your team, you can go and see like how much code people are changing with AMP over a given period of time. And you can poke into specific threads to see how they're doing things. And people love
this feature because essentially like link threads to AMP and say like, "Hey, here's a cool prompting technique that I discovered. Try it like this." Or, "Hey,
discovered. Try it like this." Or, "Hey, here it got stuck here. Can you help me, you know, uh think through how better to to connect the agent with the feedback loop to get further?"
uh another aspect of of uh enabling more people to experience uh coding agents and learn how they work is by making it more accessible from an economic perspective. So um you know remember the
perspective. So um you know remember the smart and uh rush uh agents at the top level. You know smart models remain
level. You know smart models remain relatively expensive today but rush models are getting cheaper and cheaper but not yet free. And so we're thinking about you know more and more like one of
the the biggest barriers to using agents fully is actually cost right now. Like
if you go to like college campuses uh and talk to students, the actual number of people who have used a coding agent is actually much smaller than I would have thought given you know young people's uh propensity to adopt new
technology. A lot of this cost. So
technology. A lot of this cost. So
someone had the crazy idea on our team like hey you know what we could do we could ship ads in your terminal. And at
first it was like nah that'll never work. But the more and more we thought
work. But the more and more we thought about it and the more and more like inference costs started declining we're like yeah maybe. So, we actually shipped uh a mini ad network that delivers ads for other developer tools uh in in AMP
in the terminal and in the editor. Uh
they're very subtle. So, I don't know if you can spot the ad in this screenshot, but we try to make them non-intrusive.
But this effectively allows us to sponsor inference uh in in the rush uh agent so that uh more people are able to experience this on you know their side
projects and such.
Okay. So, AMP is AMP. Uh we are like I said a we think of ourselves as like an agentic research lab. So we're not about uh hype. We don't do any sort of like
uh hype. We don't do any sort of like paid developer influencer marketing. But
I like to call out some cool people that I think are using AMP. Um because it it shows for the type of people that we're really selecting for. I I don't think AMP is for everyone uh at this point.
We're really trying to target the the like small percentage of people who want to live a little bit in the future. Um,
and so we have folks like Mitchell Hashimoto, the uh the founder and and XCO of Hashi Corp. He's building Ghosty now. Uh, that's his uh kind of passion
now. Uh, that's his uh kind of passion project and he's using AMP to drive a lot of the changes that he makes uh to that terminal. And then we also have uh
that terminal. And then we also have uh folks like Hamill Hussein who's I think probably like the leading authority on AI evals. Um, and at least as of a
AI evals. Um, and at least as of a couple weeks ago uh he was saying that AMP was his favorite coding agent. And
so, uh, neither of them are on the team or, you know, have invested us in any way, but we're just thrilled that, you know, they seem to like like what we're building.
And then if other folks are interested in in kind of like coming along with us in in in this journey and trying to push the frontier of what agents can do, uh,
we've also started a community of builders. Um, and using AMP is not a
builders. Um, and using AMP is not a requirement to join this community. It's
run by uh Ryan Carson who's a former startup uh Treehouse taught over a million people to code and now this is his passion project. It's essentially
like if you're building with agents and you're you're experimenting with how to push them further and further. There's
Ryan right there. Um it it it's all about kind of like tapping into that sense of awe and wonder with a pure group uh that is also uh leaning into
that uh sense of of strangeness and and experimentation. So um what does this
experimentation. So um what does this involve? It involves uh like regular
involve? It involves uh like regular interviews with people. We like to feature people who are building interesting things or using agents in interesting ways. Uh and we also do
interesting ways. Uh and we also do inerson events. We had a very nice
inerson events. We had a very nice dinner last night where we got a bunch of people together and had very interesting conversations spanning from you know actually building with coding agents to you know more philosophical
discussions about uh the nature of AI and things like that. So um that's it for me. Uh hopefully this has in
for me. Uh hopefully this has in intrigued you. Again, I I don't expect
intrigued you. Again, I I don't expect all of you to be convinced that we are building the best Frontier coding agent, but at the very least, I hope I've kind of demonstrated how we're leaning into the weird and thinking about things uh
differently. So, if that's interesting
differently. So, if that's interesting to you, come say hi at our booth. Just
look for the weird pipe piper man writing the golden fish. Thank you.
Our next speaker will discuss how AI generated kernels can meaningfully speed up custom PyTorch code without any human
effort. Please join me in welcoming to
effort. Please join me in welcoming to the stage the co-founder of Gimlet Labs, Natalie Serino.
Hey everyone, how's it going?
So, my name is Natalie. I'm a co-founder of Gimlet Labs. And um yeah, just a little bit of background about why Gimlet's looking at AI generated kernels. Let's just get right to it. Um
kernels. Let's just get right to it. Um
we're building an agentic inference cloud focused on performance and efficiency. And the thing that we've
efficiency. And the thing that we've seen with all these talks so far is with agents, they're not just one chat model.
There are complex pipelines of multiple models, multiple stages, tool calls and the compute backing these is inherently should be heterogeneous. So what we do is we automatically split up and
orchestrate these agentic workloads across optimal hardware which can be different vendors and different sizes.
This can present a problem at the kernel level because a lot of times you have models that are really optimized just for one hardware. So what we started looking at is can we use AI to help
automatically port different segments of aentic workloads to hardware that hasn't necessarily been optimized for.
So just to clarify something really quick because we run into this a lot.
What do I mean by kernels? I do not mean AI generating operating systems like the Linux kernel or things like that. What I
mean is kernels at the sense of like transformer architecture like the individual like functions that perform like massive parallel computations leveraging like all the crazy amounts of
threads that GPUs have. So yes, people be like, "Oh, how are you going to generate an operating system?" I think maybe we're not quite there yet, but one day.
So why use AI to do this? So I think there's a few reasons. So, we know that optimizing low-level kernels can make workloads like ML workloads
significantly faster. So, here we have
significantly faster. So, here we have like it's probably too small to see, but it's a blog from Nvidia where they implemented a different attention and it allowed them to like get 3x throughput
on a llama model. So, these like implementations can make a major difference from a performance perspective. But at the same time, if
perspective. But at the same time, if you just search Twitter, everyone's whining about how it's impossible to find these people and the people that exist are like really overt taxed with so much to do, so much work. There's
just not enough experts to be able to solve every problem in this in this space right now.
And the problem explodes because you have so many frameworks and so many ways to write kernels from things like CUDA and like Triton to Palace to things like
that are device specific like Metal and you have different hardware platforms too and each of these hardware platforms even within a single vendor has different characteristics. We've seen
different characteristics. We've seen for example that some of the new um like hardware from Nvidia like some of the old kind of like DSLs weren't working as well on it because the different
hardware has different properties it has different features it has different characteristics different cache sizes etc all of which impact the optimized
implementation from a kernel perspective so we and many others in the space have thought it would be great if AI could help us with this problem where you
could potenti essentially give it PyTorch code and then generate optimized implementations for whatever hardware you're trying to run that workload on.
So I think when you're trying to use an agent for something, you have to start with what the human workflow is. And the
human workflow today, when you have that like really hardcore kernel expert, let's say they're trying to port a new uh like workload over to Metal, right?
What they'll do is they'll say, "Okay, I have this implementation. Maybe I have a CUDA version, maybe I don't, and I'm going to try something. I'll see if it compiles." Most of the time, maybe not.
compiles." Most of the time, maybe not.
I'll see if it runs if see if it's correct. And if none of those are the case, you just pass that back into the human context, so to speak. And then once you get something
speak. And then once you get something that's working, then you start looking at the profiling information in depth and just hammering down like this is the bottleneck now. This is the bottleneck
bottleneck now. This is the bottleneck now. This is the bottle of the mic now.
now. This is the bottle of the mic now.
It's a very iterative process.
So I think that you know basically the idea here is to put AI as the as the kind of like where the human would go in that same
loop right so the agentic flow here is to make sure it compiles and it executes and it's correct and then from there optimizing it.
So this is something that I would say is like very new technology. There's a lot of interest here, but there's some things that it's good at and some things that it's still kind of in development
for. And so, let's dive into some of the
for. And so, let's dive into some of the specifics.
So, this is a quick demo of our system.
It the font's kind of small, but we're passing into a CLI tool, a PyTorch workload, targeting it to an H100, and the system had explored a bunch of
candidate optimizations. It's comparing
candidate optimizations. It's comparing to eager mode and torch compile and it found one uh candidate that was 22% faster than the torch compile baseline.
So this was a real case. It's just sped up because it actually took about 20 minutes.
So there's some challenges though with measuring these agents at kernel synthesis. So um like first of all you
synthesis. So um like first of all you have to figure out what your definition of correct is when you're dealing with floating point. This is always a
floating point. This is always a question. You can do different types of
question. You can do different types of tolerances, but you also need to make sure your input sizes are well selected.
If you're only passing in really small input sizes, it can cause problems with the benchmarking where you're measuring the overhead, not the actual kernel as the critical path.
You also have to make sure you're reliably measuring performance. So, if
you just do a naive timer start on your implementation, it's probably going to be wrong. And there was a great blog
be wrong. And there was a great blog that had a diagram for this because you're basically measuring the launch time, not the execution time. So there's
a bunch of kind of gotchas like that that when you're building an agentic system like this, you have to be really really careful about catching doing things like warm-ups and cache clearing because a lot of times you'll have
you'll have the original implementation run and then the new implementation run and then the original one's result is cached and the new one fetches it. So
you have all kinds of things like that that you have to be really neurotic about otherwise you might get bad results.
You also need great benchmarks for this.
I think that someone said earlier that there's not a ton of examples of low-level kernels across all these different hardware platforms. And so the input data is a challenge and also benchmarking it is a challenge. Like how
do you know if your agent is better? You
change the prompt to it. How do you know? It's the same story we hear with
know? It's the same story we hear with every agent here basically.
So we have some preliminary results on that we're sharing right now on Apple's um M4 uh using the metal framework. Um
and this is on the kernel bench benchmark the v0.1 version of it which is the latest one. So what we can see here is results across 250 problems and
it compares to either torch compile or eager mode depending on which one of those is faster. So with the kernel bench data set, we have different tiers
of problems with L1 being the easiest or simplest rather and L3 being like more complex. So what we can see for the
complex. So what we can see for the standalone agent is that we see an average speed up of about 25% or 24%.
And the sweet spot is those moderately complex problems. It seems honestly the same as like a lot of coding problems where it's good at moderately complex things, but then you push it too far and the performance drops off. So, an
interesting challenge here is going to be how do we make these agents perform better on more complex problems that they're going to have to break down and execute.
There we go. Um, let's talk about a couple of examples because I love to just see example code. So, this was a success case where the model found a case where we could do kernel fusion. So
for those that aren't that familiar with GPU kernels, kernel fusion is one of the most go-to techniques in kernel optimization where you say I have two kernels. Let's say in this case it was
kernels. Let's say in this case it was like a convolution softmax bias scaling and sigmoid. So those were five ops and
and sigmoid. So those were five ops and what the agent did was it took four of those ops and instead of running individual functions for those it made a mega function that compacted them all
together. So kernel fusion isn't new.
together. So kernel fusion isn't new.
It's something that torch compile already does quite well, but it's a common way that we found agents can speed up these workloads because you can really customize it to the specific use case.
So this result achieved a 40% speed up over the baseline on the M4.
And just kind of like zooming into what happened, the agent wrote a fused op. So
it basically wrote like C++ code that it put as kind of an inline string with the PyTorch code and then it called that fused operation in the forward pass of
the model and then that fused implementation we can see a snippet of it up here where it's basically taking those four ops and putting them together
in one mega op. And so this was done automatically.
Sometimes though like writing low-level kernels isn't the best optimization that we can get. We had another case which was on a level one problem which basically improved the performance by
80%. And the the insight the agent had
80%. And the the insight the agent had in this case was the operation in metal for average pool 1D was not as optimized
as some other ops that are much more optimized on metal. So what it did was it actually rewrote the pietorch code to use the more optimized op and reexpress
the same problem in a different way.
So to dive into this um the average pool 1D is basically taking averages across like one dimension. So like you can see that that input vector could produce the
output vector with five and seven averaging to six and so on.
So if you express that same thing as a convolution you can get the same result.
So if you do the math it will lead to that same result. And so that's what the agent did. Basically what it did was it
agent did. Basically what it did was it said hey instead of doing the original call to the baseline op let's generate that weights matrix and execute this as
a convolution because I know that that's really fast on metal.
There's also an interesting algorithmic optimization case. This was for a level
optimization case. This was for a level three problem. So it was more complex
three problem. So it was more complex where basically the agent figured out that it could combine two operations into a single operation at the pietorch
level, not even using low-level kernels.
So what we can see is that basically it fused it, it rewrote it as Python code and calls that single convolution and that's a lot more efficient because you don't have to launch as many ops.
But this does not always work. This is
not a silver bullet and I think that's really important to emphasize. So a case where the agent totally faceplanted was on matrix multiplication and it was it wrote a custom CUDA kernel for this but
it was a lot slower than the baseline.
And the thing is with this is matrix multiply is one of the most hand optimized ops that exists. So it's not that surprising that an agent would not do as well as something that a human
expert spent a long time on. So, this is an area that it did not work.
Another case was a case that we saw which had a 71,000x speed up. And
anything like that should trigger your suspicion brain.
Wow. 71,000. Great. We're done. It's,
you know, this technology is worth billions of dollars, right? No. So,
basically, what happened? So, this
operation is basically saying, give me inputs and I'm going to make sure they fall betweengative -1 and one. Okay,
that's what the operation being tested was.
So the agent figured out that for all of the test cases, this was already the case. So it wrote a nice long comment
case. So it wrote a nice long comment saying this is actually not necessary, so just output the input.
So you could argue this is the agent being smart because it's pruning unnecessary work, but I think a lot of us would agree that it's not in the spirit of what we're trying to benchmark here. So we've excluded cases like this
here. So we've excluded cases like this from our analysis, but it is interesting because maybe some of the times you would want it to do something like that.
And I think this is part of where the human element comes in with these agents. Sometimes the agent does
agents. Sometimes the agent does something that depending on your definition of what you want to see could be good or it could be bad. And so
that's where the human part kind of weighs in.
So, like I keep drawing parallels to other kinds of coding agents because even though this is like kind of a niche like low-level domain, I don't think that the story is fundamentally
different. We see standalone agents are
different. We see standalone agents are really good at cheaply generating like lots of different ideas and lots of possibilities to explore. They're good
at slurping in a ton of different context and seeing what helps. And
they're really good at doing these like level one and level two tasks. Like for
example, we're still not asking AI agents to write the Linux kernel, but what is still needed is robust and robust quality and performance validation. We need to make sure that
validation. We need to make sure that the agents aren't cheating and we need to make sure that the results are actually correct. We need empirical data
actually correct. We need empirical data from hardware in the loop to guide the search and optimization because it's actually really hard to look at low-level code and know how it's going to perform on the hardware. We still
heavily rely on looking on profiling data and things like that. And we also need the human in the loop to supervise the results and guide the work. So
design of a modern agent, you have multiple sub aents that are working together. You have that human in the
together. You have that human in the loop and a purpose-built harness for that task. And I think this is the
that task. And I think this is the pattern we've seen throughout this conference.
So just to get a little bit into kind of like what that architecture looks like and this is what we're you know what we're building at Gimlet you have a supervisor agent which takes in input
code target hardware and then also human prompting because humans still can really guide the best path for optimization that supervisor is in charge of managing the work. It deploys
the synthesis agentic swarm which collectively work together to come up with ideas for optimizations and they are basically the idea factory coming up with new techniques. Those ideas get
sent to the verification agent which is running them in on actual hardware in a hardware in the loop system to see how they do and that verification agent needs to be extremely strict about
making sure that no funny business is happening. And that's a major part of
happening. And that's a major part of the challenge.
So just a couple more realistic case studies that are not benchmarks. We got
really excited because we ran this on a vision transformer model and I don't know if you can see but basically the original uh vanilla implementation using torch compile and our generated code
using torch compile ours was twice as fast. So this was like a hoay moment. So
fast. So this was like a hoay moment. So
the speedups were promising, but then it turned out the optimization was just swapping out the original attention module for SDPA, which is a more optimized attention module. And this is
the kind of thing that yes, that's true.
That is a valid optimization, but I wouldn't necessarily call it rocket science. So we consider that to be a
science. So we consider that to be a trivial case study where if you're not using a more optimized attention module, maybe you haven't actually optimized your workload that much yet.
But we do still see interesting results for full models when we have human prompting. And one case for this was an
prompting. And one case for this was an audioenccoder model where it generated six custom kernels for the workload specialized for the RTX 6000 Blackwell.
And the results were strong. It was
about 70% faster. Both implementations
using torch compile.
So just to kind of show an example, we load in line six different fused kernels and then call them in the code. And the
nice thing about this approach, even though it's a little weird declaring these as strings, is that you have like a completely a API compatible swap in replacement for the original module in PyTorch.
So where are we with AIdriven kernel optimization? I think like I said before
optimization? I think like I said before this is not a silver bullet but it is a promising new tool in the toolbox. The
best applications that we see are things like searching across many bags of tricks. We know that fusion works. We
tricks. We know that fusion works. We
know that tiling works and we can run lots of experiments really quickly this way by launching them with agents and see what actually performs the best on the workload. It's also good at porting
the workload. It's also good at porting existing implementations to new hardware where it takes the insights from that original implementation and specializes them to the hardware available features
on the new target and also about translating existing optimizations to new scenarios. You can
quickly adopt new optimizations like let's say you're changing the quantization of your model. You can
still look at differently quantized implementations to guide that optimization.
In terms of the worst applications, we're still not at the point where they're writing the N+1 for flash attention, coming up with like those genius algorithmic advances. And they're
not currently outperforming a human expert who bang their head on this problem for months. And we shouldn't expect them to be. I think that the most exciting part of this work is allowing those people to focus on the most
interesting optimizations and getting us better than baseline on all the problems that they don't have time for.
So what's next in the work? We want to build uh abstract models of different machines to help the agents further specialize code to individual hardware.
We're also interested in generating basically what is like NVIDIA assembly such as PTX. You can see an example here because the thought is that we can basically do that better with AI than
humans because it's so cumbersome. And
then also looking at academic formal verification methods for correctness.
Um, also want to give a huge shout out to my colleagues. Um, they are the silent unspoken heroes here and um, you know, I love talking about this with people. So, please feel free to give me
people. So, please feel free to give me an email if you want to talk about kernel generation or anything that I covered. And we are hiring. So, if this
covered. And we are hiring. So, if this problem interests you, we'd love to chat.
Thanks.
Our next presenter argues that simple choices like direction over speed will help us to avoid the infinite software
crisis of maintaining a tangled mess.
Please join me in welcoming to the stage staff software engineer at Netflix, Jake Nations.
Hey everyone, good afternoon. Um, I'm
going to start my talk with a bit of a confession. Uh, I've shipped code I
confession. Uh, I've shipped code I didn't quite understand. Generated it,
tested it, deployed it, couldn't explain how it worked. And here's the thing, though. I'm willing to bet every one of
though. I'm willing to bet every one of you have, too.
So, now I'm going to admit that we all ship code that we don't understand anymore. I want to take a bit of a
anymore. I want to take a bit of a journey to see how this kind of has come to be. First, we look back in history.
to be. First, we look back in history.
We see the history tends to repeat itself. Second, we've fallen into a bit
itself. Second, we've fallen into a bit of a trap. We've confused easy with simple. Lastly, there is a fix, but it
simple. Lastly, there is a fix, but it requires us not to outsource our thinking.
So, I spent the last few years at Netflix helping drive adoption of AI tools, and I have to say the acceleration is absolutely real. Backlog
items that used to take days now take hours, and large refactors that have been on the books for years are finally being done. Here's the thing though.
being done. Here's the thing though.
Large production systems always fail in unexpected ways. Like look what happened
unexpected ways. Like look what happened with CloudFare recently. When they do, you better understand the code you're debugging. And the problem is now we're
debugging. And the problem is now we're generating code at such speed and such volume, our understanding is having a hard time keeping up.
Hell, I know I've done it myself. I've
generated a bunch of code, looked at it, thought, I have no idea how this what this does, but you know, the test pass, it works, so I shipped it. The thing
here is this isn't really new. Every
generation of software engineers has eventually hit a wall where software complexity has exceeded their ability to manage it. We're not the fa first to
manage it. We're not the fa first to face a software crisis. Were the first to face it at this infinite scale of generation. So let's take a step back to
generation. So let's take a step back to see where this all started.
In the late60s early '7s a bunch of smart computer scientists at the time came together and said hey we're in a software crisis. We have this huge
software crisis. We have this huge demand for software and yet we're not really able to keep up and like projects are taking too long and it's just really
slow. We're not doing a good job. So
slow. We're not doing a good job. So
Dystra Ko came up with a really great quote and he said when we had a few weak computers and I mean to paraphrase a longer quote we had a few weak computers programming was a mild problem and now we have gigantic computers programming
has become a gigantic problem. He was
explaining as hardware power grew by a factor a thousand society's wants of software grew in proportion and so it left us the programmers to figure out between the ways and the means how do we
support this much more software so this kind of keeps happening in a cycle in the 70s we get the C programming language so we could write bigger systems the 80s we have personal computers now everyone can write
software in the 90s we get object-oriented programming inheritance hierarchies from hell were you know thanks Java for In the 2000s, we get agile and we have sprints and scrum masters telling us
what to do. There's no more waterfall.
In the 2010s, we had cloud, mobile, devops, you know, everything. Software
truly ate the world.
And today, now we have AI. You copilot,
cursor, cloud, codeex, Gemini, you name it. We could generate code as fast as we
it. We could generate code as fast as we can describe it. The pattern continues, but the stale has really changed. It's
it's infinite now.
So, uh, Fred Brooks, you might know him from writing the mythical man month. He
also wrote a paper in 1986 called No Silver Bullet. And in this, he argued
Silver Bullet. And in this, he argued that there would be no single innovation that would give us an order of magnitude improvement in software productivity.
Why? Because he said the hard part wasn't ever the mechanics of coding, the syntax, the typing, the boilerplate. It
was about understanding the actual problem and designing the solution. And
no tool can eliminate that fundamental difficulty. Every tool and technique
difficulty. Every tool and technique we've created up in this point makes the mechanics easier. The core challenge
mechanics easier. The core challenge though, understanding what to build, how it should work, remains just as hard.
So if the problem isn't in the mechanics, why do we keep optimizing for it? How do experienced engineers end up
it? How do experienced engineers end up with code they don't understand? Now the
answer, I think, comes down to two words we tend to confuse, simple and easy. We
tend to use them interchangeably, but they really mean completely different things. Uh I was outed at the speaker
things. Uh I was outed at the speaker dinner as being a closure guy. So this
is kind of clear here. But Rich Hickey, the creator of the closure programming language, explained this in his talk from 2011 called simple made easy. He
defined simple meaning one fold, one braid and no entanglement. Each piece
does one thing and doesn't intertwine with others. He defines easy as meaning
with others. He defines easy as meaning adjacent. What's within reach? What can
adjacent. What's within reach? What can
you access without effort? Copy paste
ship. Simple is about structure. Easy is
about proximity.
The thing is, we can't make something simple by wishing it. So, simplicity
requires thought, design, and untangling. But we can always make
untangling. But we can always make something easier. You just put it
something easier. You just put it closer. Install a package, generate it
closer. Install a package, generate it with AI, you know, copy a solution off of Stack Overflow. It's it's human nature to take the easy path. We're
wired for it. You know, as I said, copy something from Stack Overflow. It's
right there. Framework that handles everything for you with magic. Install
and go. But easy doesn't mean simple.
Easy means you can add to your system quickly. Simple means you can understand
quickly. Simple means you can understand the work that you've done. Every time we choose easy, we're choosing speed now complexity later. And honestly,
complexity later. And honestly, that trade-off really used to work. The
complexity accumulated in our codebases slowly enough that we can refactor, rethink, and rebuild when needed. I
think AI has destroyed that balance because it's the ultimate easy bun. It
makes the easy path so frictionless that we don't even consider the simple one anymore. Why think about architecture
anymore. Why think about architecture when code appears instantly?
So let me show you how this happens. How
a simple task evolves into a mess of complexity through a conversational interface that we've all come to love.
You know this is a contrived example but you know say we have our app. We want to add uh some authentication to it. Say
add o. So we get a nice clean o.js file.
Iterate on a few times. It gives a message file like okay cool. We're going
to add oth now too because and now we've got an ojs and ooth js. We keep
iterating and then we find ourselves that sessions are broken and we got a bunch of conflicts and by the time you get to turn 20, you're not really having a discussion anymore. You're managing
context that becomes so complex that even you don't remember all the constraints that you've added to it.
Dead code from abandoned approaches. Uh
tests that got fixed by just making them work. You know, fragments of three
work. You know, fragments of three different solutions because you kept saying, "Wait, actually each new instruction is overwriting architectural patterns. We said make the off work
patterns. We said make the off work here, it did. When we said fix this error, it did. There's no resistance to bad architectural decisions. The code
just morphs to satisfy your latest request. Each interaction is choosing
request. Each interaction is choosing easy over simple. And easy always means more complexity. We know better. But
more complexity. We know better. But
when the easy path is just this easy, we take it. And complexity is going to
take it. And complexity is going to compound until it's too late.
AI really takes easy to its logical extreme. Decide what you want. Get code
extreme. Decide what you want. Get code
instantly. But here's the danger in that. The generated code treats every
that. The generated code treats every pattern in your codebase the same. You
know, when an agent analyze your codebase, every line becomes a pattern to preserve. The authentication check on
to preserve. The authentication check on line 47, that's a pattern. That weird
gRPC code that's acting like GraphQL that I may have had in 2019, that's also a pattern. Technical debt doesn't
a pattern. Technical debt doesn't register as debt. It's just more code.
The real problem here is complexity. I
know I've been saying that word a bunch in this talk without really defining it, but the best way to think about it is it's the opposite of simplicity. It just
means intertwined. And when things are complex, everything touches everything else. You can't change one thing without
else. You can't change one thing without affecting 10 others.
So, back to Fred Brooks's no bullet paper. In it, he identified that there's
paper. In it, he identified that there's two main types of complexity in every system. There's the essential
system. There's the essential complexity, which is really the fundamental difficulty of the actual problem you're trying to solve. Users
need to pay for things. Orders must be fulfilled. Is it the complexity of why
fulfilled. Is it the complexity of why your software system exists in the first place? And then second, there's this
place? And then second, there's this idea of accidental complexity.
Everything else we've added along the way, workarounds, defensive code, frameworks, abstractions that made sense a while ago, it's all the stuff that we put together to make the code itself work.
In a real codebase, these two types of complexity are everywhere and they get so tangled together that separating them requires context, history, and experience.
The generated output makes no such distinction and so every pattern is keeps just getting preserved.
So here's a real example from uh some work we're doing at Netflix. I have a system that has a abstraction layer sitting between our old authorization code we wrote say five or so years ago and a new centralized o system. We
didn't have time to rebuild our whole app. So we just kind of put a shim in
app. So we just kind of put a shim in between. So now we have AI. This is a
between. So now we have AI. This is a great opportunity to refactor our code to use the new system directly. Seems
like a simple request, right?
And no, it's like the old code was just so tightly coupled to its authorization patterns. Like we had permission checks
patterns. Like we had permission checks woven through business logic ro assumptions baked into data models and off calls scattered across hundreds of files. The agent would start refactoring
files. The agent would start refactoring get a few files in and hit a dependency he couldn't untangle and just spiral out of control and give up or worse it would try and preserve some existing logic
that from the old system and recreating it using the new system which I think is not great too.
The thing is it couldn't see the scenes.
It couldn't identify where the business logic ended and the off logic began.
Everything was so tangled together that even with perfect information, the AI couldn't find a clean path through. When
your accidental complexity gets this tangled, AI is not the best help to actually make it any better. I found
that only adds more layers on top.
We can tell the difference, or at least we can when we slow it down enough to think. We know which patterns are
think. We know which patterns are essential and which are just how someone solved it a few years ago. We carry the context that the AI can infer, but only if we time to make take time to make
these distinctions before we start.
So how do you actually do it? How do you separate the accidental and essential complexity when you're staring at a huge codebase? Codebase I work on Netflix has
codebase? Codebase I work on Netflix has around a million lines of Java and the main service in it is about 5 million tokens last time I checked. No context
window I have access to uh can hold it.
So when I wanted to work with it, I first thought, hey, maybe I could just copy large swaths of this codebase into the into the context and see if the patterns were emerged, see if it would just be able to figure out what's happening. And just like the
happening. And just like the authorization refactor from previously, the output just got lost in its own complexity. So with this, I was forced
complexity. So with this, I was forced to do something different. I had to select what to include, design docs, architecture, diagrams, key interfaces, you name it, and take time writing out the requirements of how components
should interact and what patterns to follow.
See, I was writing a spec. Uh 5 million tokens became 2,000 words of specification. And then to take it even
specification. And then to take it even further, take that spec and create an exact step set of steps of code to execute. No vague instructions, just a
execute. No vague instructions, just a precise sequence of operations. I found
this produced much cleaner and more focused code that I could understand. So
I defined it first and planned its own execution.
This became the approach which I called context compression a while ago. But you
call it context engineering or spectriven development, whatever you want. The name doesn't matter. What only
want. The name doesn't matter. What only
matters here is that thinking and planning become a majority of the work.
So let me walk you through that how this works in practice.
So you have step one, phase one, research. You know, I go and feed
research. You know, I go and feed everything to it up front. Architecture
diagrams, documentation, slack threads.
I mean, we've been over this a bunch, but really just bring as much context as you can that's going to be relevant to the changes you're making. and then use the agent to analyze the codebase and map out the components and dependencies.
This shouldn't be a oneshot process. I
like to probe see like what about the caching? How does this handle failures?
caching? How does this handle failures?
And when its analysis is wrong, I'll correct it. And if it's missing context,
correct it. And if it's missing context, I provide it. Each iteration refineses its analysis.
The output here is a single research document. Here's what exists. Here's
document. Here's what exists. Here's
what connects to what. And here's what your change will affect. Hours of
exploration are compressed into minutes of reading.
I know Dex mentioned it this morning, but the human checkpoint here is critical. This is where you validate the
critical. This is where you validate the analysis against reality, the highest leverage moment in the entire process.
Catch errors here, prevent disasters later.
On to phase two. Now that you have some valid research in hand, we create a detailed imple implementation plan. Real
code structure, function signatures, type definitions, data flow. You want
this to be so any developer can follow it. I I kind of liken it to paint by
it. I I kind of liken it to paint by numbers. You should be able to hand it
numbers. You should be able to hand it to your most junior engineer and say, "Go do this." And if they copy it line by line, it should just work.
This step is where we make a lot of the important architectural decisions. You
know, make sure complex logic is correct. Make sure business requirements
correct. Make sure business requirements are, you know, following good practice.
Make sure there's good service boundaries, clean separation, and preventing any unnecessary coupling. We
spot the problems before they happen because we've lived through them. AI
doesn't have that option. It treats
every pattern as a requirement.
The real magic in this step is the review speed. We can validate this plan
review speed. We can validate this plan in minutes and know exactly what's going to be built. And in order to keep up with the speed at which we want to generate code, we need to be able to comprehend what we're doing just as
fast.
Lastly, we have implementation. And now
that we have a clear plan and like backed by clear research, this phase should be pretty simple. And that's the point. You know, when AI has a clear
point. You know, when AI has a clear specification to follow, the context remains clean and focused. We've
prevented the complexity spiral of long conversations. And instead of 50
conversations. And instead of 50 messages of evolutionary code, we have three focused outputs, each validated before proceeding. No abandoned
before proceeding. No abandoned approaches, no conflicting patterns, no wait actually moments that leave dead code everywhere.
To me, what I see is the real payoff of this is that you can use a background agent to do a lot of this work because you've done all the thinking and hard work ahead of time. it can just start the implementation. You can go work on
the implementation. You can go work on something else and come back to review.
And you can review this quickly because you're just verifying it's conforming to your plan, not trying to understand if anything got invented.
The thing here is we're not using AI to think for us. We're using it to accelerate the mechanical parts while maintaining our ability to understand it. Research is faster, planning is more
it. Research is faster, planning is more thorough, and the implementation is cleaner. the thinking, the synthesis and
cleaner. the thinking, the synthesis and the judgment though that remains with us.
So remember that uh authorization refactor I said that AI couldn't handle.
The thing is now we're actually you know working on it now and starting to make some good progress on it. The thing is it's not because we found better prompts. We found we couldn't even jump
prompts. We found we couldn't even jump into doing any sort of research planning and implementation. We actually had to
and implementation. We actually had to go make this change oursel by hand. No
AI, just reading the code, understanding dependencies, and making changes to see what broke. That manual migration was,
what broke. That manual migration was, I'll be honest, was a pain, but it was crucial. It revealed all the hidden
crucial. It revealed all the hidden constraints, which invariants had to hold true, and which services would break if the O changed. Things no amount of code an analysis would have surfaced
for us. And then we fed that pull
for us. And then we fed that pull request of the actual manual migration into our research process and had it use that as the seed for any sort of
research going forward. The AI could then see what a clean migration looks like. The thing is each of these
like. The thing is each of these entities are slightly different. So we
have to go and interrogate it and say hey what do we about do about this? Some
things are encrypted some things are not. We had to provide that extra
not. We had to provide that extra context each time uh through a bunch of iteration.
Then and only then we could generate a plan that might work in one shot. And
the key and might's the key word here is we're still validating, still adjusting, and still discovering edge cases.
The three-phase approach is not magic.
It only works because we did this one migration by hand. We had to earn the understanding before we can code into our process. I still think there's no
our process. I still think there's no silver bullet. I don't think there's
silver bullet. I don't think there's better prompts, better models, or even writing better specs. just the work of understanding your system deeply enough that you can make changes to it safely.
So why go through with all this? Like
why not just iterate with AI until it works? Like eventually won't models get
works? Like eventually won't models get strong enough and it just works. The
thing to me is it works isn't enough.
There's a difference between code that passes test and code that survives in production. Between systems that
production. Between systems that function today and systems that that can be changed by someone else in the future. The real problem here is a
future. The real problem here is a knowledge gap. When AI can generate
knowledge gap. When AI can generate thousands of lines of code in seconds, understanding it could take you hours, maybe days if it's complex. Who knows,
maybe never, if it's really that tangled.
And here's something that I don't think many people are even talking about this point. Every time we skip thinking to
point. Every time we skip thinking to keep up with generation speed, we're not just adding code that we don't understand. We're losing our ability to
understand. We're losing our ability to recognize problems. That instinct that says, "Hey, this is getting complex." It
atrophies when you don't understand your own system.
Pattern recognition comes from experience. When I spot a dangerous
experience. When I spot a dangerous architecture, it's because I'm the one up at 3 in the morning dealing with it.
When I push for simpler solutions, it's because I've had to maintain the alternative from someone else. AI
generates what you ask it for. It
doesn't encode lessons from past failures.
The three-phase approach bridges this gap. It compresses understanding into
gap. It compresses understanding into artifacts we can review at the speed of generation. Without it, we're just
generation. Without it, we're just accumulating complexity faster than we can comprehend it.
AI changes everything about how we write code. But honestly, I don't think it
code. But honestly, I don't think it changes anything about why software itself fails. Every generation has faced
itself fails. Every generation has faced their own software crisis. Dystra's
generation faced it by creating the discipline of software engineering. And
now we face ours with infinite code generation.
I don't think the solution is another tool or methodology. It's remembering
what we've always known. That software
is a human endeavor. The hard part was never typing the code. It was knowing what to type in the first place. The
developers who thrive won't just be the ones who generate the most code. They'll
be the ones who understand what they're building, who can still see the scenes, who can recognize that they're solving the wrong problem. That's still us. That
will only be us.
I want to leave on a question, and I don't think the question is whether or not we will use AI. That's a foregone conclusion. The ship has already sailed.
conclusion. The ship has already sailed.
To me, the question is going to be whether we will still understand our own systems when AI is writing most of our code.
Thank you.
Ladies and gentlemen, please welcome back to the stage Jed Borave.
>> Welcome, welcome. Let's give it up for Jake.
All right, we are 6 hours in and we are about to take our break, but before we leave, I want to tell you what's next.
We have something very special. We're
going to have the CEO of Poolside, former CTO of GitHub coming to do the first public demo of Poolside. So, um, I wouldn't miss it. Less than a month ago, it was reported that Nvidia was going to put up to a billion dollars in Poolside.
So, I'm excited to see what they're going to show us. Um, with that, let's break. There's going to be a, um, down
break. There's going to be a, um, down in the expo booth, we have Ryan from AMP SourceCraft demoing AMP and talking about how to build with agents. So, also
check that out during the break. We'll
see you back at four. Thanks everybody.
Heat Heat. Heat.
Heat. Heat.
Heat.
Heat.
Heat. Heat.
Heat. Heat.
Heat. Heat.
Heat up here.
Heat. Heat.
Heat.
Heat.
Heat. Heat.
Heat up here.
Heat. Heat.
Heat up here.
Heat. Heat.
Heat up here.
Heat. Heat.
Heat.
Heat.
Heat. Heat.
Heat up here.
Heat.
Heat.
Heat up here.
Heat up here.
Heat up here.
Heat up here.
Heat. Heat.
Heat. Heat.
Heat up here.
Heat up here.
Heat up here.
Heat. Heat.
Heat.
Heat.
Heat. Heat.
Heat. Heat.
Heat. Heat.
Heat. Heat.
Heat up here.
Heat. Heat.
Heat. Heat.
Heat. Heat.
Heat. Heat.
Heat up here.
Heat.
Heat. Heat.
Heat.
Heat. Heat.
Heat.
Heat.
Heat up here.
Heat up here.
Heat.
Heat.
Heat up here.
Heat.
Heat.
Heat.
Heat.
Heat. Heat.
Heat Heat.
Heat.
Heat. Heat.
Heat. Heat.
Heat.
Heat.
Heat up here.
Heat. Heat.
Heat. Heat.
Heat up here.
Heat.
Heat.
Heat up here.
Heat. Heat.
Heat.
Heat.
Heat up here.
Heat.
Heat.
Heat. Heat.
Heat.
Heat.
Heat up here.
Heat up here. Heat.
here. Heat.
Heat.
Heat. Heat.
Heat up here.
Heat. Heat.
Heat up here.
Heat up here.
Heat. Heat.
Heat.
Heat.
Heat. Heat.
Heat. Heat.
Heat.
Heat.
Heat up Heat up here.
Heat. Heat.
Heat up here.
Heat. Heat.
Heat up here.
Heat up Heat up here.
Heat. Heat.
Heat. Heat.
Heat.
Heat.
Heat. Heat.
Heat.
Heat.
Heat.
Heat.
Heat up here.
Heat. Heat.
Heat up Heat up here.
Heat up here.
Heat. Heat.
Heat.
Heat.
Heat.
Heat.
This is going to be our last session and as we heard it's going to be a good one.
Um we're gonna start with Poolside and we're gonna hear from Arise Klein Meteor Meter and DeepMind. Um so to get us started, help me join uh help me welcome our first speaker to the stage, co-CEO
at Poolside, former CTO of GitHub and Heroku, Jason Warner.
Hey, hey everybody.
So I know you were expecting ISO, my co-founder here. So, I'm 5 in shorter,
co-founder here. So, I'm 5 in shorter, 30 lbs heavier, and uh not as good-looking, but I think you'll see him here really quickly, too. So, um how
many people here know what poolside is and does? Anyone? Anyone? Yeah. So,
and does? Anyone? Anyone? Yeah. So,
let's talk about that real quickly.
Poolside exists to close the gap between models and human intelligence. That's
literally it. That's what we're here to go do. We're building our own models
go do. We're building our own models from scratch to do this. were based on the idea two and a half years ago that we thought next token prediction was an amazing techn technological breakthrough
but it need to be paired with reinforcement learning really to make that leap. So that's what we've been
that leap. So that's what we've been doing for the past 2 and a half years.
So we're on our second generation of models now Malibu agent and instead of kind of like walking you through some slides and all that we just thought maybe I don't know let's kind of show
you what we're doing here. So, ISO, are you there?
>> I got you, Jason.
>> So, as I said, you were supposed to see him today, but there's I don't know. Our airline system kind of works sometimes, maybe. So, he's stuck in California, but uh we thought we just
walk you kind of through some um some demos here today. So, what you're looking at here is a very modern programming language that the government uses to run all the world's critical infrastructure called ADA. Anyone
familiar with ADA?
Yes. Yes. Okay. So, everyone I saw put their hands up for Ada either has no hair or gray hair like me. So, that
should tell you what's going on here.
So, iso, why don't we uh why don't we figure out what's going on with this codebase here?
>> Well, let's start asking what the codebase is about.
>> That's great. And what you're seeing here is obviously our assistant in in Visual Studio Code backed by poolside agent, a model we train from scratch
using our proprietary techniques. Um,
and you can see what's going on here.
Kind of the stuff you expect from an agent. Uh, and obviously the form
agent. Uh, and obviously the form factors of all of these things are going to change a couple of times over the next couple of years, but you know, people seem to like VS Code. Uh, so
we're going to, you know, show you this demo here today. So, you can see from this, it kind of went through told you what this codebase is all about, but um, you know, these things run in our
satellites and, uh, I don't know anything about ADA, but I do know a lot about a couple of other programming languages. So, uh, ISO, what do we want
languages. So, uh, ISO, what do we want to do here? Why don't we, uh, see what this thing might look like in Rust?
>> Let's do it. Let's ask it convert this database to Rust.
>> So, obviously, you're going to see what's going on here. Again, if you guys have used other tools, you're not going to expect too much of the difference for what's happening here, except that again, we're backed by our own model.
We're not using OpenAI. We're not using Anthropic. This is poolside. And
Anthropic. This is poolside. And
poolside is a bottom top stack that is right now if no one's touched it and I know no one in this room has touched this unless you work for a three-letter agency, a defense contractor or you've
sent missiles somewhere that we're not going to talk about in this session. Um
because that's where we're working.
We're working in high consequence code environments for the last year inside the the government and the the defense sector. Um as you can see from this
sector. Um as you can see from this demo. Um so what you see here is is kind
demo. Um so what you see here is is kind of going through doing the conversions.
What you see in the middle pane is something that we built to kind of show you as the streams come through all the different changes that are happening.
Um, one of the tricky parts about working on inside the defense sector and things like that is you can't have an agent that's just going to run around and do stuff. I mean like I can't walk into half of these buildings. You can't
give an agent access to these data source and just say, "Hey, go nuts." You
need to have the right permissions. You
got to actually really ratchet these things down to do things inside those environments that you know they feel comfortable with. So, uh, where are we
comfortable with. So, uh, where are we on this now? What is is it trying to fix itself yet? Yes. So, it's it wrote about
itself yet? Yes. So, it's it wrote about 1152 lines of code. Uh, and it just popped up a command start tested.
Excuse me. Uh, so we see here all of the files on the left hand side that it created. Uh this is essentially our live
created. Uh this is essentially our live diff view that's available.
Uh and as we see it's currently starting to actually test it out.
So this is the part where we just sit here and watch this for 3 minutes and I see nothing.
>> No. What you see >> the good thing is that this is a very fast inference.
>> Yes.
>> So 1100 lines of code.
>> Task completed.
>> Do we know if this works yet?
>> Well, let's have a look. So it actually wrote some bell commands to test it. And
when we check out the output of the O, this actually looks pretty good.
>> Can we ask >> can we verify that >> to run it? Let's go verify it. So of
course our agent came back and gave summary of what it did. But let's just ask how to run this.
Okay.
So, I'm going to go open up. So, it says this is how I can run the ADA version and this is how I can run the Rust
version. Let's run the Rust version.
version. Let's run the Rust version.
Perfect. Let's have a look here.
We might be hitting an actual >> an actual demo bug.
>> Let's have a look.
>> Let's see what happens.
>> I know. No, no. Just warnings.
>> Just warnings.
>> Do we have an unwrap in there that we need to take care of? I heard that those things are dangerous.
>> So right now there's a ripple.
Uh let's hit help. See what we're able to do. So it looks like we have a set of
to do. So it looks like we have a set of commands. I'm going to be lazy. I'm
commands. I'm going to be lazy. I'm
going to copy paste these queries.
So create table users. Okay. So far so good.
Let's insert a record.
Okay. Well, let's find out if it actually did its job. select star from users. Okay, we've got a record here.
users. Okay, we've got a record here.
>> That's nice.
>> Now, now I want to actually uh you see if I use the up arrow, it doesn't actually allow me to cycle through commands. Let's ask it to add a
through commands. Let's ask it to add a feature.
Uh allows me to use the up arrow to cycle through.
I think it will understand my intent here.
One thing we know about ISO is he actually does know how to read and write, but he can't type. So all those errors that you're seeing in there. Uh
yeah.
>> So it looks like the agent's identified a package that we can use. Let's just
quickly look here. Compare this to version one.
And it looks like it's adding a library called Rusty Line and changing the files accordingly.
It's currently built it and it looks like the build output is successful.
There's some warnings. We'll ask to clean those up later on. And it's now starting to test it.
And apparently it works. It's going to It wrote itself a little bash script to test the history.
It's wrote itself a little final demo script.
So, let's let it Okay. So, and it gave us the summary.
Okay. So, and it gave us the summary.
Well, now how do I rerun this? I do kind of know that though. So, let's just >> should know that. That was 30 seconds ago.
>> Let's build it and let's run it again.
Okay, let's do a help.
And oh yeah, that's the up arrow. It
works.
>> Very nice.
>> Now, our models aren't just capable coding agents. They're capable in lots
coding agents. They're capable in lots of areas of knowledge work. They're also
emotionally intelligent. They're fun.
They're great to write bedtime stories with for the kids. So, I'm gonna ask you to write me a poem about all these changes, but that's just more for fun.
So, as Isa was saying, this is just an interface into our platform. There's
other interfaces into it if you're inside one of those organizations that has adopted poolside. So, this is the coding interface into it, but we also have other pl ways in which you you can interact with it web as well as an agent
that you can download on your machine.
But um yeah, we don't really tout the poem writing or the songwriting. Though
I did send this to my wife to see and I have been sending her love letters written by poolside. So I kind of hope that she did not enter this session to know exactly how I've been doing that
for the past 6 months. But uh yeah, so this is kind of poolside. This is what we've been up to. Um so as I said, Malibu agent is a second generation.
We've got a ton more compute coming online and that's when we're training our next generation. That is be going to be the one that comes out to publicly to everybody very early next year. We're
going to have it behind our own API.
It'll be on Amazon behind the bedrock API. Anybody in the world who's building
API. Anybody in the world who's building out any sort of on a one side the engineering assistants like the cursors, winds surfs, cognitions, replets of the world, you can use ours. or if you use
building out on any other side of the fence, the Harveys, the writers, the whatever applications of the world, there's going to be a fifth model out there that's going to be at that level that you can you can consume. But we're
dead set on doing this and bringing this out to everybody in the world and kind of advancing that state-of-the-art.
We're just going to keep pushing that out. So, that's kind of who we are. Um,
out. So, that's kind of who we are. Um,
and uh you can find out very little more at our website since we don't put much out there.
But ISO, anything else you want to say before you uh try to go make your flight this time, please?
>> So, I would say that it's been uh pretty incredible journey for the last 2 and 1/2 years of starting entirely from scratch and now building to a place where we see our models have grown up to become increasingly more intelligent.
And the kind of missing ingredient that we had was compute. And now that it's unlocked for us and and with a large number of over 40,000 GB300s coming online, we see how we can start scaling
up some of those models uh to get even further uh in in their level of capabilities and software development and other types of long horizon knowledge work. What I think is exciting
knowledge work. What I think is exciting about this conference and this audience is of all the work that's happening of evolving the form factor. Right? Right
now what we looked at was this asynchronous way of operating with agents. But you know Jason, you and I we
agents. But you know Jason, you and I we have agents running that are doing tasks for for hours. And I think in the near future we can see a world where they're able to start doing tasks in days in the coming years. And so I think the
coming years. And so I think the interface will continue to change. Uh
we're really focused on the fundamentals building intelligence and being able to scale up and serve it. And it's why we go full vertical. It's why we go from our multi gigawatt campus in West Texas where we're building out data centers
for team building out models. And the
interface that you saw today is just our version of an expression. But I think this audience is going to do an incredible job of building lots of better versions of how to express using that intelligence uh into actually, you know, valuable, economically valuable
work.
Couldn't have said it better. Can't wait
to see what you guys build on this uh in the future when it's publicly available.
And if anyone really does want to build a data center campus, we are hiring for that. Um it is weird to be putting
that. Um it is weird to be putting shovels in ground again like we did in the 90s and early 2000s, but that's what you got to do to scale intelligence these days.
So, >> I would make one other non-scheduled statement if you're going to be okay with this one, Jason.
>> As as our models are are getting more capable, we'd love to also see who wants to build with them. Right now, the the vast majority of of companies that are doing additional reinforcement learning
and fine-tuning on top of models are are doing it on what I would consider right now the you know, best-in-class open source models, the Quens and Fies and Miniaxes of the world. And uh we'd like to start figuring out how we can you
know partner with you with our our models anywhere from any checkpoint early on to where we are today for you to be building closer together with us on top of things. Uh we haven't really figured out the approach to it yet. Uh
but I think since we have this audience it's uh it's not a bad place to put it out there and so definitely reach out to us. Uh we think the world till date was
us. Uh we think the world till date was built by intelligence. The world in the future is being built on top of intelligence and so be a great way to partner.
>> Well thanks ISO. Thanks everybody here.
And now we do have five minutes left. I
don't know if we're supposed to take questions, but I'm happy to. So, if
anyone does, but if not, I'm just going to go that way.
>> What was that?
>> Sort of. I mean, I think of him that way. Here, here's a fun story. Here's
way. Here, here's a fun story. Here's
how I met ISO. I like to tell this story because um ISO is a fun fun dude. I met
ISO because started with a failed acquisition at GitHub. So back when I joined GitHub in 2017 as a CTO, I wanted to take GitHub from a kind of collaborative collaborative code host with open source bent and turn it into
an endto-end software development platform infused by intelligence. And so
you know the the products that we launched from 27 on or 17 on GitHub actions packages alerts notifications, eventually code spaces um and then co-pilot was the last thing
that the office of the CTO did before I left with Natt Freriedman, Ugmore and a couple of other folks inside there. But
ISO in 2017 when I joined uh he had working code completion before the transform architecture had landed fully.
He had on LSTMs and so I quickly tried to acquire his company and he just he just said no. So he just said no to me.
Uh but we had that was a long drawn out process talking about what we thought neural networks were going to mean for the world. And so during that process
the world. And so during that process which was a lengthy one, we became really good friends and we'd stayed in close contact over the years. And then
22 rolled around obviously chat GPT comes out, Anthropics out and we kind of saw the endgame at play and we said do we jump back in or not? And of course yes we jump back in. But I like to tell
that story about how he just kept saying no to me and I just kept asking him questions and eventually he said yes, we should found a company cuz by the way when I asked him if we should do this he said oh godamn no. That was were his
exact words. He's like no we should just
exact words. He's like no we should just learn how to paint and sale. But here we are.
So >> yeah, >> it's it's been a ground roll journey together. Jason, I I think the reason we
together. Jason, I I think the reason we ended up doing this is because of our our opinionated view on what it was going to take to build more capable intelligence and and the first 18 months of this company, you know, obsessing and
focusing on reinforcement learning combined with LLMs felt like one of the most contrarian opinions in the world, but I think today it's absolutely not.
And it's super exciting to see the the progress that's continuing to make. like
we're in the coming years we're going to see the world that started in completions and went to chat and is now at agentic increasingly approach more autonomous and we're all of it is stemming effectively from the
combination of bringing highly capable models that are constantly evolving together with real world problems and and I think what we're starting to see now is we're entering these kind of awkward teenage years ahead of AGI where
everybody in this room who's building out incredible companies and applications is bridging this gap of what it really takes to make intelligence that in its raw form actually be valuable and we uh we want
to be a small humble part of that. We've
got a lot of work still ahead of us. Uh
the team is growing. Uh but hopefully what you've seen today uh is what our our customers and enterprises have been having access to and seeing for a while is that we're you know hard at work at uh and really pushing those capabilities. We also want to make sure
capabilities. We also want to make sure we make them available to build together with others.
>> Well, that's it. Thanks everybody.
RL has boosted base models, but it's opaque and hard to scale across enterprises.
But what if we could apply RL techniques to prompts instead of model weights?
Here to show us how is the co-founder and CPO of Arise, Aparna Dina Kuran.
Hi everyone. Thanks so much for coming.
Um well, today I'm excited. We're going
to talk a little bit about prompt learning and how to use that with evals.
uh if any of you guys um are spending a lot of time thinking about the frontier coding models, I think there's so much attention on on them, but just what's
not so obvious is how much time is actually spent uh on the system prompts uh for those building these coding agents. So, here's actually a look. Um
agents. So, here's actually a look. Um
this is a tweet that went viral about the whole system prompt uh of Claude that's been leaked. I'm sure you know they've changed it since then. Um, but
you can actually see that Claude, there's cursor, there's Clyde. Um, and
just the length of the actual system prompt um, for each one of these. And I
think what's not as obvious is these actually aren't just static. They are
repeatedly iterated on. And it's such an important piece of context that actually goes into making these coding agents the most successful agents out there.
Um, it's not just us talking about it.
Karpathi talks about it a lot. Um and
this was a viral tweet that that he posted which was there's this paradigm around iterating on these prompts that he he's kind of coined it system prompt
learning and what he said is that it almost feels like humans learning because they take back English feedback uh and use that to actually iterate on
what they should do differently the next time. And I think he wrote something
time. And I think he wrote something like it's almost like that movie momento where the guy forgets uh what you know what he learns and then he starts writing it down and then uses that to
actually kind of go through his next day. And so this is a little bit of the
day. And so this is a little bit of the concept behind system prompt learning.
And what we wanted to do was show you guys a little bit of how that works and then put that to test on two of the most popular coding agents uh Claude and Klein today. So first off, how does
Klein today. So first off, how does prompt learning actually work? So for
those of you who are familiar with RL, what I thought we'd do is just do a little analogy compare how does RL work versus system prompt learning. For RL,
you know, if we just took an analogy of a student who's trying to improve their exam scores, they take an exam, you know, somebody grades the exam, you have a scalar reward, which is like, you
know, they got a 70%, an 80%, 90%, and then they have to figure out almost blindly just with that score how to actually improve their score on the next
exam. And I think this is actually one
exam. And I think this is actually one of the flaws of I mean RL works, don't get me wrong, amazing in so many concepts and domains, but it can be, you
know, a long path to actually figure out what the right solution is. And I think some of the things that we've noticed is that it can be sample inefficient. It
takes a lot of data to get what you want. It's time inensive. It's data
want. It's time inensive. It's data
hungry. You need to have a whole data science team to do this. And it just might be overkill for teams who are trying to build agents because LLMs are already so good. So if you're a team
who's actually trying to build an agent, maybe prompt learning is actually slightly might be slightly more of an interesting paradigm for you. So in this scenario, same same analogy. You have a student
who's taking an exam, there's some exam score, except in this case, what actually gets outputed isn't just the score. They got a 70, they got an 80,
score. They got a 70, they got an 80, but you also get back some kind of English feedback. Why did they get this
English feedback. Why did they get this answer right? What did they mess up on?
answer right? What did they mess up on?
Here's concepts that they missed on, what do they need to go study? And then
they use this information to actually go and and prepare on what to do next um to to get a better score. This is basically the the concept that we applied to
coding agents. And we ran this kind of
coding agents. And we ran this kind of test on both Claude as well as Klein. Um
both of these as you know start off with some kind of uh system prompt which in cloud code this is kind of a snippet of it and they both kind of come with something that you can append rules to.
So client has rules cloud MD has the cloud MD file and it starts off empty.
You can go in and add whatever is important for your repo. So what we did was actually took you know just benchmark both client and cloud code on SWEBench. I'm going to kind of run
SWEBench. I'm going to kind of run through theam uh this entire example at Swebench, but this entire thing we also ran on BBH and a ton of other uh software engineering data sets but you
can see here just on vanilla client vanilla cloud code um nothing added to the cloud MD or the client rules um they had you know about I think with client
somewhere on you know cloud set 45 it was about 30% of the GitHub issues actually resolved uh cloud code it was about 40% of the GitHub issues resolved so we took This is kind of our starting
benchmark and the thesis is is could we actually use prompt learning to see if we can improve the system prompt and see if um it was able to with the new system
prompt actually you know give us a better uh score on these benchmarks. We
didn't do anything on fine-tuning. We
didn't change the models anything like that. It was just focused on the system
that. It was just focused on the system prompt. Um this is the process that we
prompt. Um this is the process that we went through. We took the coding agent
went through. We took the coding agent uh we had it actually write some code.
um we ran unit tests and then um we then passed that through to some kind of um model that was doing the LLM as a judge evals and I'll show you guys what that
looks like but the LM as a judge eval actually gave back why did it fail did it fail because of this uh can you give some examples of you know what were common scenarios that it didn't do good
on and then it actually used those kind of evals to then go back and add it to a meta prompt to come back with kind of the the system prompt rules that we're going to append to. So let's talk
through kind of the process. So first we had kind of the SWEBench data set. Uh
SWEBench in this scenario is just 150 examples. Uh we did this for both client
examples. Uh we did this for both client and cloud code where we took the original prompt which had no rules. We
gave it kind of the software engineering problem and then it generated some kind of patch to actually solve that and then we ran the generated solution through the unit test.
Then whatever the unit test came back with, whether it was right or wrong, we then passed this into an LLM as a judge eval. And this is kind of the most
eval. And this is kind of the most important part because this actually generated the explanation for us. So we
passed in the problem statement. We
passed in what the coding agent solution was, the unit test, and then the actual solution that it came up with. Uh pass
that in. And this that you're looking at in the center here is actually the LLM as a judge eval. And these evals, we're going to talk into this uh talk a bit about this, but eval engineering is a
whole kind of concept that you know we spend a lot of time on. And writing
really good evals is I think um how you get the best kind of insight into what you could do to improve your agents. So
in this scenario, what we did was we wrote a good LM judge eval prompt. It
outputed whether it failed or passed.
And then this is the key part. We
actually asked for an explanation. Why
did it actually mess up? um you know for specific libraries in the Sweetbench light test um you know it was parsing errors or it was not handling um there
there's all sorts of actually different categories of errors but we went through and we we kind of looked at the explanation of what went wrong in each scenario. We then passed into a huge
scenario. We then passed into a huge meta prompt. So this is actually what's
meta prompt. So this is actually what's helping us iterate on our system prompt.
We passed in the original claude or client system prompt. We passed in the original rules which for us started off empty. Um and then we passed in here was
empty. Um and then we passed in here was the input, here was the LLM is a judge eval and then here was the actual explanation from that eval.
Passed that all into the meta prompt and then we did kind of a diff comparing you know the old world. So just for you just to remember the old world had the original clawed system prompt no rules
kind of added or appended to it. And
then the new world where it generated this entire rules of what to avoid or what to um what it had learned essentially from all those mistakes it
had actually made. And then we ran this basically on the entire speedbench light again. Um and what we saw was that you
again. Um and what we saw was that you know on 150 examples we were able to get cloud code up by 5% more GitHub issues
resolved clin um you know 15% and this was literally on I think the key thing is like 150 examples of just training
data that was used um on the most kind of powerful coding agents that are out there. Um, and so just think about kind
there. Um, and so just think about kind of the impact that could have for your agents. Many of you guys in this room
agents. Many of you guys in this room might be thinking, okay, well, prompt learning is cool, but how does that compare to GEA? If you're familiar with DSPI and you've kind of seen, I don't know if it's GPA or Jeepa. I've heard
both. Um, but you know, you guys might be asking, well, how is this different?
Um, so GEA, just just in case you guys aren't familiar, it's a prompt optimizer from DSPI that is essentially very very similar to what we're talking about,
which is taking English feedback using that English feedback inside of the actual prompt. Um, and what we did was
actual prompt. Um, and what we did was actually run a sidebyside benchmark against GEA where we compared kind of our prompt learning against GEA. And um
I think what we saw was that GEA required many many loops and rollouts compared to um kind of a a fraction of that which was our approach. And I think
the key difference here, I mean the underlying approach around using English feedback is the same, but I think the key thing that was really different here was we spent a lot of time actually developing and iterating on the evals
and the eval prompts really mattered to making sure that you gave really good explanations back to the agent. Um, and
so eval.
This was super critical for us to be able to get this to work. Um, and if you guys are curious about learning more, reading more about kind of what we do,
um, check out kind of our blog. We write
a lot about eval prompt optimization and, uh, we're actively hiring, so come check us out. Awesome.
When it comes to AI agents, theory doesn't always compile to production.
Here to share hard one lessons building effective AI coding agents is the creator and head of AI at Klein, Nick Pash.
Wow, it's wild to be on the same stage as so many people I've drawn inspiration from. Let's dive into it. My name is
from. Let's dive into it. My name is Nick. I'm the head of AI at Klein. And
Nick. I'm the head of AI at Klein. And
today, I'm going to share some lessons we learned along the way.
So, let's start with the bitter truth.
For years, we compensated for weak models by building clever scaffolds around them. All kinds of clever ideas
around them. All kinds of clever ideas like rag indexing systems, search trees, tool calling scaffolds. All this was invented to cope with weaker models and
frontier models simply bulldoze those abstractions. Now you don't really need
abstractions. Now you don't really need your scaffolding anymore. Your your
scaffolding just gets in the way of these models. And the question really
these models. And the question really isn't how fancy is your agent stack increasingly. It's how strong is the
increasingly. It's how strong is the model driving it.
And the lesson here is relentless. Um, a
perfect example of what I'm talking about is Gemini 3.0 released this week and it immediately dominated terminal bench leaderboards with no agent harness
supporting it at all. In this chart, you can see Gemini 3.0 on Terminus scored better than the vast majority of model agent combinations in the world all out of the box. And what's remarkable is
that Terminus is designed to be an unopinionated generic stripped down harness. And it has no graph search, no
harness. And it has no graph search, no rag, no indexing, just here's a terminal, go figure it out. And it
crushes. The whole point of terminus is that it has no clever tool calling, no context engineering features. So the
takeaway here is that capability beats scaffolding. If you get out of the
scaffolding. If you get out of the model's way, it will perform just fine.
So really what I'm driving at and the key takeaway from this whole talk is if you're building agents, just relax.
Cool it with all your clever engineering tricks. Stop overthinking it. That's it.
tricks. Stop overthinking it. That's it.
That's the lesson. And another point on this, kind of like an aside, is I don't know about you guys, but we're all on Twitter. I'm on Twitter. And at this
Twitter. I'm on Twitter. And at this point, I just think talking about these like clever little context tricks and and hacks is a little played out. Like
at this point, I'm straight up tired of seeing some of this stuff. And like I get it, it's free engagement and we all, you know, indulge in it a little bit.
But personally, I think there's not really much signal there.
So, if you want the full playbook for building an effective coding agent, like the playbook's right here. here. It's up
on the screen. Um, there's really some novelty talking about it like months ago, but at this point, in my opinion, it's been done to death. And we've been in this, you know, we're model agnostic at Klein. We support all the models.
at Klein. We support all the models.
Every two weeks, there's a new big model release going out. And we've basically come down to the same playbook of supporting each model as it comes out.
So, I'm sure everyone here knows how to tune an agent from Sonnet 4 to Sonnet 4.5, from Gemini 2.5 to Gemini 3, and
GBT 5 to GP GBT 5.1. I feel like this entire conversation is a little played out. So, I'm not really even going to
out. So, I'm not really even going to cover this in depth because the tweaks here are trivial and the gains are marginal.
So what I really want to talk about is something that's not actually given a lot of attention and it's the real bottleneck. And the real bottleneck is
bottleneck. And the real bottleneck is that you can build the cleanest agent in the world, but that doesn't improve model capability by even 1%. Models only
get better when labs train on something hard. And benchmarks, not agent
hard. And benchmarks, not agent cleverness, not all your clever engineering techniques, not your clever rag pipelines. It's benchmarks that
rag pipelines. It's benchmarks that determine what frontier models learn to do next. And models didn't magically get
do next. And models didn't magically get better at tool use.
They got better because people built RL environments that force them to practice certain actions. Handling failure more
certain actions. Handling failure more handling failure modes retrying and for example like agents improve only when the model learns inside the right environment. Every jump in reasoning
environment. Every jump in reasoning we've seen came from a benchmark. every
jump in agent reliability came from an RL environment.
So the real questions become what is a benchmark? How do you turn real world
benchmark? How do you turn real world agentic coding data into an RL environment? And what makes a good
environment? And what makes a good verifier? How do you detect real
verifier? How do you detect real difficulty? And how do you train these
difficulty? And how do you train these models to work on the problems that we actually care about as engineers? These
are the questions that matter for the next frontier.
So what is the benchmark?
A benchmark put simply it's an environment. It's a so in our case it's
environment. It's a so in our case it's like a docker container where you let the agent run wild. It's a starting state which is the snapshot of the code when you started working on a real world
coding task as well as a starting prompt. And the last thing is a verifier
prompt. And the last thing is a verifier at the end that checks whether an end state is correct or acceptable.
So how are RL environments different?
Well, here's the thing. They're not
really different at all. And you might notice this chart is basically the same thing as the previous slide. The only
real difference, the only distinction here is how the reward is used.
Benchmarks measure models. RL
environments improve models. The score
doesn't just stop in a leaderboard where you publish the results. The score is actually used to update the weights of the policy model.
So, how do you transform real world coding data into useful RL environments for training?
At Klein, we created the system called an RL environments factory. Looking for
a better name there, but that's what we got so far. And the first phase in this pipeline is you get sub agents and you have them qualify tasks. And these sub
agents, they work in parallel to decide whether or not given tasks are suitable to be turned into RL environments for the purpose of training.
And the qualification process goes as follows. So you have you start with
follows. So you have you start with origins. So you have to validate does
origins. So you have to validate does the repository actually exist? Is the
starting commit accessible? Is it open source? the journey where you look at
source? the journey where you look at the starting prompt, the other follow-on prompts that the user might have followed up with with the agent. You
have to try to understand what was the user actually trying to accomplish, what was the spirit of their task. And
lastly, it's the outcome. So, can we find the actual commits or PRs that fix the problem in real life? Like, did they actually commit the solution to their
problem later on in the timeline? And
we're actively looking for easy disqualifiers as part of this. So things
like vibecoded slop. We don't need another benchmark that tests for, you know, build the next.js app uh from scratch. We're looking we're looking to
scratch. We're looking we're looking to disqualify trivial tasks that are too easy and tasks that have no reliable start or end states.
And lastly, what makes a good RL environment good? How do we actually
environment good? How do we actually make an RL environment? And what makes a good test or verifier?
So phase two of this pipeline is building the actual RL environment. So
you start out with archaeology where you actually reconstruct both states locally. You pull down the code. You see
locally. You pull down the code. You see
if you can implement it yourself, reconstruct it, build it, and verify that the bug that the user was referencing and the solution actually exists. You document every obstacle and
exists. You document every obstacle and dependency. You containerize it with
dependency. You containerize it with docker removing git obviously so agents can't reward hack and last you define the verifier at the end and this is
where it gets into like a little bit of the art of building a good verifier and I want to talk about this because the analogy that I typically give is a teac kettle.
So let's say the user's goal is I want to boil water.
A really good example of a verifier to test whether or not the water is boiling is a little whistle attachment that goes inside your teac kettle. And the whistle
is a pure outcome verification. It's an
example of a pure outcome driven verifier where the water either reached the boiling point or it didn't. Either
it's whistling or it's not. The kettle
doesn't care how you achieved it, whether you used a gas stove, an electric induction stove, or a campfire.
It just signals the result.
And in the process of doing this, all these weird bad tests can emerge. So you
might have noticed like that the sub agent might have noticed like oh in the ground truth solution like in a previous run the burner was set to high. So maybe
we should be checking for that. But we
all know that water can boil at a low setting on the burner. Or was it on the front left burner has 5 minutes elapse?
Like all kinds of weird bad tests. And
the key point here is don't overprescribe based on the ground truth.
Test for the spirit of the task. Test
for the outcome of the task. And the
outcome at the end of all of this is a containerized benchmark or RL environment for that task. Agent work is recorded so you can see the traces, the trajectory that the agent took to
complete the task and you can reliably score it and verify it. And it's fully portable. You can run it on any device.
portable. You can run it on any device.
So the path to automation that we've been undertaking as part of this is can we fully automate the process of converting real world coding data into
RL environments for the purpose of training models.
And this work largely started out manual but then the first time the RL environment was like about 16 hours of my time. And what used to take 16 hours
my time. And what used to take 16 hours now takes less than 20 minutes per task.
And we're building towards a fully automated RL environment factory where the bottleneck shifts from engineering to collecting high quality tasks. And an
interesting kind of point here, the natural endpoint of all this is what if we actually built RL environments and this is like a question for everyone in the audience is what if we built RL
environments to test how well agents can actually make RL environments kind of like a meta benchmark. What would hill climbing on that look like? And you can kind of start imagining that as models
get really really good at making their own RL environments to train on based on real world user data, you kind of complete that loop. Something to think
about. So okay, um this next part is the
about. So okay, um this next part is the truth nuke. Um also known as TRO. Um
truth nuke. Um also known as TRO. Um
an unspoken fact is that we're not alone at Klein building this kind of system.
Every major agent lab captures this data. They all do some version of this
data. They all do some version of this behind the scenes, but no one really talks about it. And I don't even need to name them. If you know, you know. And
name them. If you know, you know. And
realistically, you all know. These same
companies site internal benchmarks to justify legacy systems that they spent months maintaining. But curiously,
months maintaining. But curiously, you'll never be able to study or inspect them because they don't publish them openly. And this data is so valuable yet
openly. And this data is so valuable yet no one shares it. It's the only thing that actually moves the needle.
And here's the heart of my argument is by standing between real world engineers working on real world tasks and the models agent labs have a unique role in
history. We can build better prompts. We
history. We can build better prompts. We
can build better tools. But none of that improves the underlying models. We
possess the single richest data set of real engineering work anywhere in the world. Models don't improve without this
world. Models don't improve without this data and keeping them closed is slowing down Frontier Research.
So today we're announcing ClientBench.
This is our attempt to finally create a benchmark that isn't cosplay engineering. It's not write me a server
engineering. It's not write me a server that generates Fibonacci sequences. This
is real software development captured and packaged into standardized RL and inval and eval environments and this is the benchmark that we always wanted someone else to build. No one did. So
we're doing it and anyone can participate. So here's how it works. The
participate. So here's how it works. The
whole thing is open source. There's no
secret sauce, no locked away data sets.
You can openly run it yourself and inspect it to see how it works. Anyone
can use these environments for SFT, RL, eval, whatever. The point is is to just
eval, whatever. The point is is to just give the entire ecosystem a real substrate to measure and improve models on, not just leak code puzzles.
And this only works if the community contributes. And the good news is you
contributes. And the good news is you don't actually need to do anything special. Just work on your open source
special. Just work on your open source project with the client provider turned on and opt into the client bench initiative. If a frontier model gets
initiative. If a frontier model gets stuck and you step in to fix it, that's actually a ideal task for to be a
candidate for a benchmark and that's it.
Just use the climb provider, see where the model struggles and we'll pick it up and introduce it into this open- source benchmark.
So, client bench will always remain free, fully open source and freely accessible.
Thank you all. If you want to contribute, thank you.
If benchmark scores for AI coding agents are so high, what explains the problems developers and teams face when working with them? Here to provide us with a few
with them? Here to provide us with a few explanations is Meta Researcher Joel Becker.
Hey guys, thank you so much for having me. My name is Joel Becker. I work as a
me. My name is Joel Becker. I work as a researcher or member of technical staff at MET, which stands for model evaluation and threat research. As we'll
see in a second, I'm going to be talking about AI capabilities. How do we know how performance AIs are today? How how
performant they might be in the near future from these two different sources of evidence that seem to give somewhat conflicting answers. You know what? I I
conflicting answers. You know what? I I
could have done this whole talk without reference to meter papers in particular, but we'll look at two papers I've been um involved with as as examples of benchmark style evidence and then more economic style evidence. On the
benchmark side, measuring AI ability to complete long tasks. This is the paper um that comes with the the charts that many of you would have seen on on Twitter and so on that meter is well
known for. And then the second this um
known for. And then the second this um RCT measuring how allowing AI affects developer productivity. And then we'll
developer productivity. And then we'll be talking about how to reconcile uh the the gap that's implied between these two different kinds of measurements.
As I mentioned, META stands for model evaluation and threat research. We are a independent research nonprofit that seeks to inform the the public, policy makers, labs about the degree to which
AIs might pose catastrophic risks to society. The model evaluation part uh
society. The model evaluation part uh means that we seek to understand AI capabilities and propensities and the threat research part means we try to connect those capabilities and propensities to potential catastrophic
risks.
Okay. The first paper we're going to talk about associated with this chart that that many of you I think might have seen.
Um take taking a step back first before we dive into the paper. you know, how how usually do we think about measuring AI capabilities using benchmarks on a SWE bench or a GPQA, so on and so forth.
There's some notion of 0% performance um or or random performance. So for GPQA, that's that's 25% which corresponds to this floor that the worst you can
possibly do. Perhaps there's a um human
possibly do. Perhaps there's a um human baseline that's below 100% for GPQA. I
think this is something like 75% that represents maybe expert human performance. And then of course you can
performance. And then of course you can go all the way up to 100% potentially on on these kinds of benchmarks. But but
what does it mean? You know, if I'm getting 50% on GPQA, if I'm like half the way from the um from the floor to the to the expert baseline, what you know, what does that really mean about how performant the AIS are? If I meet
the human baseline, does that mean that the AIS are now as performant or even more performant than than expert humans in in a relevant sense that I that I care about? It's hard to interpret. You
care about? It's hard to interpret. You
know, another thing that you see from this graph is that um benchmarks seem to have less and less time between coming online sort of giving any signal at all
and being fully saturated. It's harder
and harder to create benchmarks that have uh plenty of signal that you know might might be informative to us about how capable models are for for an extended period of time. So, we're we're
going to go about this a different way.
First, we're going to gather human baseline data for diverse tasks spanning a range of difficulties. You should
think of these humans as, you know, experienced experts, but on their first day or or or first week on the job.
These are not people with context on the tasks in particular. It's not exactly the kind of thing that's come up in their work before, but if it's a software engineering task, you know, there are relevantly skilled general software engineer. Same for the machine
software engineer. Same for the machine learning tasks and the cyber security tasks here that we'll talk about. The
the type of tasks come from these three um buckets or task distributions. Hast
which is a collection of um softwarebased tasks seemingly requiring autonomy you know interacting with tools um uh interacting with the environments thinking thinking through the problem
not not just this kind of Q&A style um style data set um the SWAR suite which are these atomic problems. These are problems that you know maybe GPT2 can do
maybe maybe it can't problems like um here are four files. One of them is called passwords.txt which file contains the passwords and then on the other end of difficulty we
have rebench which are challenging novel open-ended um machine learning research engineering challenges which are are very difficult even for top human experts.
In addition to gathering the the human baseline data, we'll also under as close to identical conditions as possible measure AI performance for the AIs that we're that we're interested in on the
same set of tasks. And then we're going to convert the time it takes for humans to complete these tasks into an estimate of AI autonomous capabilities as I'll
I'll show you in a second. Here's an
illustrative diagram in this case for claw 3.7 Sonet which was the the frontier model at the time that this paper came out. You can see that you know for the for the very short tasks something like 4 minutes or below Sonnet
is getting the answers correct you know essentially 100% of the time or or maybe even here literally 100% of the time for the very hardest tasks it's struggling and then and then there's some range where we're kind of in the middle you
know we're somewhere between 10 and 10 and 90%. I'll say that this empirical
and 90%. I'll say that this empirical pattern where models are less performant at tasks that take humans longer is, you know, it's not a fact of nature, but it's it's something that we see pretty pretty commonly, pretty pretty robustly
across models, at least on this task distribution. And I'd conjecture for for
distribution. And I'd conjecture for for other task distributions as well. So, we
try and fit this dark purple line to to something like this data on on how long it took humans to complete the relevant tasks that the models are are um are attempting. And then we call the point
attempting. And then we call the point on the x-axis, this horizontal axis, this human time to complete axis at which we predict the models will succeed
50% of the time, the time horizon of those models that there's much to debate in the 50% number. I can I can talk later about the reasons why we chose that. And and then we'll do the same
that. And and then we'll do the same exercise for the other models. So here I have uh claw 3 opus has a time horizon of something like 4 minutes. That's
where we're predicting that it has a success probability on this task distribution of 50%. For 01 preview, I'm seeing something like 15 minutes so on and so forth. And then of course all these models, you know, they they come
out over um calendar time. So if we plot the time horizon, the x-coordinate on uh on on this set of plots against um against calendar time, we find something like this. It looks, you know, kind of
like this. It looks, you know, kind of like um kind of like an exponential trend that's that's going up at some constant rate. In fact, it doesn't just
constant rate. In fact, it doesn't just look like an exponential trend. If we
had a perfectly straight line here, it would indicate um a perfectly exponential trend. Um we we see
exponential trend. Um we we see something really remarkably steady, actually much more steady than we were anticipating when we uh went about doing this research project
and that's continued to be the case. So
many of you will have seen updates that we made of of this graph on on on Twitter. This is going all the way up to
Twitter. This is going all the way up to GPT 5.1 Codeex Max. So extremely recent.
um the the predictions from this, you know, shockingly straight line have have held up very well. I think
taking a quick step back, what are benchmarks telling us or or here kind of benchmark like evidence? Well, one thing is that AIs can succeed at what for humans would be exceedingly difficult
tasks. The tasks in rebench are, you
tasks. The tasks in rebench are, you know, really far beyond my capabilities uh uh personally and and you know, the AI is having a good crack at them some some decent percentage of the time. And
the second's you know kind of obvious is that progress is rapid.
On the other hand um you know how much how much stock should we put in the um the evidence suggested by benchmarks? Um
what limitations might they have? Lots
but here are here are three that I'll note. One is as I mentioned these are
note. One is as I mentioned these are humans who are you know expert in some relevant sense but they're low context.
It's something like their their first week on the job. They haven't seen tasks exactly like this previously. they just
have some relevant experience.
Presumably people who were more sort of you know not not just having the relevant experience but also highly familiar with um uh with the with the set of tasks would perform the tasks even sooner and then we think relative
to those people the AIs were more performant.
The second is that benchmarks can be low ceiling. Even you know GPQA I'll use
ceiling. Even you know GPQA I'll use that example again. Um we're we're beginning to get to the point where where that benchmark is um is totally
saturated not providing um additional information for marginal models whereas time horizon is providing this nice way to sort of chain benchmarks together in in in some sense over time.
Um but you know nonetheless it's still very hard to um uh to create these ever harder tasks when the um when the time horizon of models is doubling every something like six to seven months. So
even time horizon might be might be saturated in not too long or the benchmarks underlying time horizon.
And the next one is you know not not a concern that's limited to the to the meter task to the task behind time horizon. It's also true for sweet bench.
horizon. It's also true for sweet bench.
It's also true for for many of your um favorite agentic benchmarks that the problems aren't very messy in some sense. They don't require a ton of
sense. They don't require a ton of coordination with humans. They're often
in relatively small contained environments where where not much can go wrong. You know, not these sort of
wrong. You know, not these sort of massive open source code bases or or um other ways in which the the problems can involve more interaction with the real world or or or be messy in in some
sense.
Um so we did this we did this project and then um early this year we were you know we were trying to think about um uh how can we attack some of these limitations? What what's a different
limitations? What what's a different source of evidence that um might have its own own pros and cons but you know importantly be more externally valid in in the scientific jargon.
Perhaps field experiments are the answer. Some more economic style
answer. Some more economic style evidence. So here we might be interested
evidence. So here we might be interested in very high context developers who are expert on the kind of tasks they're already doing.
Speed up or some notion of productivity boost. You know it seems to have more
boost. You know it seems to have more signal through even some um superhuman according to benchmarks range. You know
perhaps GPQA is fully saturated and you're getting a 1.5x 2x speed up something like that. But you can still achieve a 3x 4x 5x speed up even even after that we we maintain more signal.
And the last is that you know that the tasks are messier. They are tasks that are coming up in people's real work.
They're not um synthetic. They're not
small and contained. Um this is a real deployment scenario.
Here's what we're going to do for this paper. We're going to gather 16
paper. We're going to gather 16 experienced developers on large mature open source projects that we'll go through in a second. Each of these developers will on average complete about 16 tasks from their real work.
These are these are issues on the on the relevant GitHub repositories. The kind
of thing that they might otherwise have completed with the with the caveat that the very longest issues we're not going to include.
The tasks will be randomly assigned to AI disallowed or AI allowed. AI
disallowed, you know, it means it means what you think it means. It means
software development in 2019. It means
no AI powered tab autocomplete. It means
no cursor agentic coding tools. It means
no LLMs via the web UI.
or they can be randomly assigned to AI allowed in which case everything's on the table. You know, any of the AI tools
the table. You know, any of the AI tools I just mentioned or not using the AI tools. If you're in AI allowed
tools. If you're in AI allowed condition, you're not compelled to use AI. You just have the option and we buy
AI. You just have the option and we buy these developers cursor pro. So, um for the for the most part, that's the tool that they're using with typically 3.6 or 3.7s on it at the time, uh which was the
Frontier model when we conducted this work. And then we're going to record the
work. And then we're going to record the time it takes for the developers to complete each task and see the degree to which they might save time when AI is allowed versus when it's not.
These are some of the repositories. Many
of you will be familiar with them. We've
got the Haskell compiler represented. We
have scikitlearn. We have hugging face transformers. These are on average a
transformers. These are on average a million lines of code plus. They've been
around for 10 plus years. The developers
who are going to be working on these repositories as part of this study are on average the third top contributor out of hundreds or or even in some cases thousands of contributors to these repositories. They personally have been
repositories. They personally have been contributing to the repository for something like 5 years on average. These
are top experts.
Some of you might have seen this graph too and so the punch line's been spoiled for for the rest of you. Um we asked uh economics experts, machine learning experts, you know, these are people at
major AI companies and labs, um uh top academics, um some graduate students, so on and so forth, you know, how much they expect developers to save time when they're using AI. They say something
like 40% or a little bit less. We ask
the developers themselves, the study participants how much they expect to be sped up ahead of time, and they say something like 24 25%. Then we asked the developers after the study has been
completed how much they think they were sped up in the past by AI being allowed on the issues they completed as part of this study and they say that it will have sped them up by something like 20%.
And the punch line is that we find that developers are slowed down by 19%. They
take 19% more time when AI is allowed relative to when AI is not allowed. You
know, when I first saw the data coming in, saw sort of early versions of this plot, um, I thought presumably the same thing that many of you might be thinking right now, that we've messed something up. Um, that that, you know, something's
up. Um, that that, you know, something's gone wrong. There's some there's some
gone wrong. There's some there's some issue in in how we've set up the experiments. How could it possibly be
experiments. How could it possibly be the case? You know, at least these um,
the case? You know, at least these um, uh, these developers have access to the zero points because they cannot use AI at at any time. Um, so we poured over,
you know, many, many, many, many, many hours of screen recordings from these developers working on issues as part of the study. We look to dive into, um, a
the study. We look to dive into, um, a bunch of hypotheses that might explain what's going on and try to categorize, you know, the things that that we think are going on versus not. Um, many of this is is listed in the paper. I I'll
just quickly go through some of the things that we think are contributing.
First, overoptimism about AI usefulness.
that that seems like an obvious one. You
know, the developers even after the study is completed, they think that um uh that AI is going to be helpful to their work. It's it makes sense they
their work. It's it makes sense they might overuse AI um on that basis. Um
two more implicit repository context and high developer familiarity. You know,
these developers are coming to these problems already knowing the solution to the problem. They don't they don't um
the problem. They don't they don't um they're so expert in this work. you
know, I I I imagine them as as not trying to spend a bunch of time thinking through the solution that the the AI can can work through. Instead, they're just limited by how fast they can type. Um,
which which means that, you know, using AIs, instructing AIS to do it, um, comes with some significant time cost versus how they might otherwise have spent their time.
I think many of us have the sense that AIS might be less performant on on large and complex repositories, which is a different from this difference from this benchmark style evidence or or from or
from some previous work. And then low AI reliability. You know, um maybe the AIs
reliability. You know, um maybe the AIs are very performant on these kinds of tasks, but you know, they're only performant um 50% of the time or 80% of the time, 20% of the time. And so, at the very least, you need to check their
work afterwards. And perhaps even you
work afterwards. And perhaps even you need to spend time correcting their work afterwards, which is which is something we see quite a lot on these issues.
One thing from the factors with an unclear effect that I that I'll mention briefly I have to talk to people about later is below average use of AI tools which came up in the public discussion.
This this is in the unclear column because it's sort of evidence evidence for and against. Um that that's true for for many of the things here. We don't
have anything so conclusive to say we're still working on on this line of work.
Here are some here are some caveats. All
important. Um first you know obviously we do not provide evidence for all software developers or tasks. These are
extremely experienced developers working on extremely complex long-ived open source repositories. I in my own work
source repositories. I in my own work you know not um uh as expert in the relevant sense as as these people are.
I'm working on much smaller repositories. Um I I feel more
repositories. Um I I feel more comfortable saying that even at this time I was sped up by AI tools even if even if the developers weren't. This
setting is weird. It's weird for the same reasons that it's that it's interesting this this unusual developer population.
Second, the experiment is concentrated in March 2025. As I mentioned, uh we know that AI progress is rapid. Um
perhaps this this result will have already changed by the by the time I'm giving you this talk.
So there's kind of puzzle suggested right that the benchmark style evidence is giving um a very impressive sense of what benchmark of what AI capabilities look like today. Whereas the more
economic style you know I include labor market impacts um uh uh working here too in addition to our in addition to our field experiments look somewhat more bearish or or unimpressive. You know why why is the former not not translating to
the latter at least naively? There seems
to be a clash. How might we go about resolving this puzzle?
So one possibility is that in fact we we messed something up. This is this is still live and on the table. Uh you know maybe the developers really are um uh not very capable at using AI and if we continue to run this experiment as as in
fact we are they'll you know learn more familiarity with the tools and and so get productivity benefits that they they weren't getting at the time. I'm a
little skeptical of that story but but but that's one possibility.
Another that economists like to bring up is that we're not incentivizing these developers to finish quickly. we're
paying them per the hour, um, which we do for external validity reasons. Um,
you know, looking through their videos, I I really, uh, do not think that they're developing differently in accordance with these incentives. But,
but that certainly is one possibility that's on the table.
You know, another um, more statistical in nature possibility is, you know, this is a small study. You shouldn't you shouldn't over update so much from small studies. We we are doing um, uh, bigger
studies. We we are doing um, uh, bigger things that I'm excited to release at some point.
Okay, but let's let's assume we haven't messed something up and this is uh this this is a result um uh that that we think that we think does hold up. How
could we resolve the puzzle?
So, one possibility, you know, as I as I alluded to briefly is that reliability needs to be very high to save time that you need to be getting um the the answers these problems that developers
are putting in correct. you know,
something like 95 99% of the time in order for developers to tab tab tab through and, you know, not not um not spend lots of time verifying the AI's work, which which of course um is pretty
costly from a time perspective.
Another possibility is SWEBench like or algorithmic costless scoring at the margin versus mergeability like scoring.
Sweet scores are not trying to account for you know whether the code is spilled honable by by other people in future or whether it's matching quality considerations that aren't um considered by the unit tests. You know perhaps AIS
really are performance according to SWEBench like scoring but not performance according to this kind of more holistic um uh holistic scoring that we might care about low versus high context baseliners. I I I mentioned I
context baseliners. I I I mentioned I mentioned previously these are just much more skilled humans. you know, relative to those humans, perhaps the AIS are less capable. Task distribution, maybe
less capable. Task distribution, maybe these are just different kinds of tasks, you know, in particular, less less messy than the than the benchmark style task.
Maybe that's explaining what's going on here. Suboptimal capability elicitation.
here. Suboptimal capability elicitation.
A huge amount of work has gone in at meter to making the agents as performant as possible given the underlying models on on our kinds of tasks. And um you know, that involves churning through a
load of AI tokens. Perhaps that's that's less the case for cursor in particular.
at the time when we completed the study and then interdependence across tasks.
Maybe it's the case that um you know if humans can complete task A and task B, AIs can only complete task A but not task B and of course can do task A faster then it still makes sense to for
humans to do task A and task B noticate task A because you know they they need to know the outputs. They need to know how how task A was completed in order to reliably complete task B. I think that's that's part of what's going on. you need
to maintain context as you're working through these subtasks.
Um lastly, I will say that we are hiring not just for this kind of work that you've um that you've seen being extended, you know, ever longer tasks, ever more ambitious um RCTs, um even more sources of evidence from which we
can triangulate the truth about AI capabilities, but also for for much more besides you can you can find this at meter.org/careers.
meter.org/careers.
In particular, I'm excited about research engineers, research scientists who might be um hiding in the current audience. We're excited not just, you
audience. We're excited not just, you know, for for research types with academic experience, but very much for scrappy startup people as well. And
we're also hiring for a director of operations.
And with that, thank you very much for listening.
Our final presenter is here to speak about Google's first agentic development platform anti-gravity.
Please join me in welcoming to the stage engineer at Google DeepMind. Kevin H.
All right. Hello. Last one of the day.
Can we get a uh little energy boost?
Who's ready? Who's ready?
All right. Happy Friday. I hope everyone has had a good week, a good conference.
Um, and let me tell you, it's been a really bad week if you are Gravity.
Wicked 2 is coming out tonight. And
then, of course, anti-gravity came out earlier this week alongside Gemini 3 Pro on Tuesday.
Google Anti-gravity is a brand new IDE out of Google DeepMind. It's the first one from a foundational lab and it is coming right off the press. In fact, um I probably should be working on the
product right now, but I wanted to spend some time to share what we've built here today.
Anti-gravity is unapologetically agent first. And today I'm going to tell you a
first. And today I'm going to tell you a little bit about what that means and how it manifests in the product. But perhaps
maybe a little bit more interestingly, we're going to talk a little bit about how we got here. Product principles,
direction of the industry, these sorts of things. Um, so my name is Kevin How.
of things. Um, so my name is Kevin How.
I lead our product engineering team at Google Anravity.
And let's start with the basics. Um, and
first just to get a sense of the room.
Um, who has used anti-gravity?
All right. There you go. Power of
Google. Love it. Um, who's used the agent manager?
Cool. Nice. Good. Good. All right. So,
basics of anti-gravity.
Anti-gravity, notably anti-gravity, not anti-gravity, anti-gravity. It's an AI
anti-gravity, anti-gravity. It's an AI developer platform with three surfaces.
The first one is an editor. The second
one is a browser and the third one is the agent manager. So we'll dive into what this means, which one what what each looks like. So a paradigm shift here is that agents are now living
outside of your IDE and they can interact across many different surfaces that your agent or that you as a software developer might spend time in.
And let's start with the agent manager.
So that's the thing up top. This is your central hub. It's an agent first view
central hub. It's an agent first view and it pulls you one level higher than just looking at your code. So instead of looking at diffs, you'll be kind of a little bit further back. And at any
given time, there is one agent manager window.
Now you have an AI editor. This is
probably what you've grown to love and expect. Has all the bells and whistles
expect. Has all the bells and whistles that you would expect. Lightning fast
autocomplete. This is the part where you can make your memes about yes, we forked VS Code. And it has an agent sidebar.
VS Code. And it has an agent sidebar.
And this is the sort of thing it's mirrored with the agent manager. And
this is when you need to dive into your editor to accomplish maybe your 80% to 100% of your task. And at any point, we made it very very easy because we recognize not everything can be done
purely with an agent for you to command E or control E and hop instantly from the editor into the agent manager and vice versa. And this takes on under 100
vice versa. And this takes on under 100 milliseconds. It's zippy. And then
milliseconds. It's zippy. And then
finally, something that I love, an agent controlled browser. This is really,
controlled browser. This is really, really cool. And hopefully for the folks
really cool. And hopefully for the folks in the room that have tried anti-gravity, you've noticed some of the magic that we've put in behind here. So,
we have an agent controlled Chrome browser. And this gives the agent access
browser. And this gives the agent access to the richness of the web. And I mean that in two ways. The first one, context retrieval, right? It has the same
retrieval, right? It has the same authentication that you would in your normal Chrome. You can give it access to
normal Chrome. You can give it access to your Google Docs. You can give it access to, you know, your GitHub dashboards and things like that and interact with a browser like you would as an engineer.
But also what you're seeing on the screen is that it lets you it lets the agent take control of your browser, click and scroll and run JavaScript and do all the things that you would do to
test your apps. So here I put together this like random artwork generator. All
you do is refresh and you get a new picture of um like a Thomas piece of Thomas Cole artwork. And now we added in a new feature which is this little little modal card. And the agent actually went out and said, "Okay, I
made all the code, but instead of showing you a diff of what I did, let's instead show you a recording of Chrome."
So this is a recording of Chrome where the blue circle is the mouse. It's
moving around the screen. And in this way, you get verifiable results. So this
we're very excited about our uh our Chrome browser. And then the agent
Chrome browser. And then the agent manager can serve as your control panel.
The editor and the browser are tools for your agent. And we want you to spend
your agent. And we want you to spend time in the agent manager. And as models get better and better, I bet you you're going to be spending more and more time inside of this agent manager. And it has an inbox. And I'll talk a little bit
an inbox. And I'll talk a little bit about this and sort of why we did this.
But it lets you manage many agents at once. So you can have things that
once. So you can have things that require your attention. For example,
running terminal commands. We don't want it to just kind of go off and just run every terminal command. There are
probably some commands that you want to make sure you you hit okay on. So things
like this will get surfaced inside of this inbox. One click. you can manage
this inbox. One click. you can manage many different things happening at once and it has a wonderful OS level notification. So if there is something
notification. So if there is something that you need it will sort of let you know and this kind of solves that problem of multi-threading across many tasks at once and so our team is thrilled to launch this brand new
product. It's a brand new product
product. It's a brand new product paradigm and we did so in conjunction with Gemini 3 which was a very exciting week for the team but alas we ran out of capacity.
Um, this has been tormenting me the last couple of days and so I apologize. On
behalf of the anti-gravity team, I'd like to apologize for our global chip shortage. Um, we're working around the
shortage. Um, we're working around the clock to try and make this work for you.
Uh, hopefully we'll have a few less of these sorts of errors. Um, but we've, it's what's been really exciting is people who have used the product have seen what the magic of combining these three surfaces can do for your workflows, for your software
development. Um, so let's talk about it.
development. Um, so let's talk about it.
Why did we build the product? How did we arrive at this sort of conclusion? You
might say, "Oh, adding in a new window, it's pretty pretty random, right? It's
this one to many relationship between the agent manager and many other surfaces."
surfaces." Um, and it's important to remember, I've I've been at this conference a couple of times, and and everything every single time there is this theme. The product is only ever as good as the models that
power it. And this is very important for
power it. And this is very important for us as builders, right? Every year there is this sort of new step function. The
first there was a year when it was autocomplete, right? Copilot. And this
autocomplete, right? Copilot. And this
this sort of thing was only enabled because models suddenly got good at doing this short form autocomplete. And
then we had chat. We had chat with RLHF.
Then we had agents. So you can see how every single one of these product paradigms is sort of motivated by some change that happens with model capabilities. And it's a blessing that
capabilities. And it's a blessing that our team is able to work and be embedded inside of DeepMind. We had access to Gemini for a couple of months um earlier and we were able to work with the research team to basically figure out,
you know, what are the strengths that we want to show off in our product? what
are the things that we can exploit and then also what are the gaps right this desired experience where are the gaps in the model and and how can we fix that right and so this is this was a very
very powerful part of why anti-gravity came to be and there are four main categories of improvements powered by a little nano banana artwork the first one is intelligence and reasoning you all
are probably familiar with this you use nano or you used um Gemini 3 and you probably thought it was a smarter model this is good it's better at instruction following it's better at using tools tools. There's more nuance in the tool
tools. There's more nuance in the tool use. You can afford things like, you
use. You can afford things like, you know, there's a browser now. There's a
million things that you could do in a browser. It can literally even execute
browser. It can literally even execute JavaScript. How do you get an agent to
JavaScript. How do you get an agent to understand the nuance of all these tools? It can do longer running tasks.
tools? It can do longer running tasks.
These things now take a bit longer, right? And so you can afford to run
right? And so you can afford to run these things in the background. It
thinks for longer. Just time time has gotten stretched out. And then
multimodal. I really love this property of what Google has been up to. The
multimodal functionality of Gemini 3 is off the charts. and you start combining it with all these other models like Nano Banana Pro um and you really get something magical. So we have these
something magical. So we have these roughly four different categories where things have gotten much better and if you think about these properties the question becomes what do we do about these differences and from a product
perspective it's like how do you construct a product that can take advantage of this new wave and hopefully and in my opinion this is the next step function autocomplete chat agents and then I probably got to come up with
something more interesting than whatever this thing is called.
So step one is we want to raise the ceiling of capability.
We want to aim higher, have higher ambition.
And so a lot of the teams at DeepMind were working on all sorts of cutting edge research, right? There's Google is a big big big company. And one of my learnings going from a startup to one of these bigger companies is that there is
a team of people that is attacking a very very hard technical problem. And as
a nerd, this is super exciting, right?
And then as a product person it's like wow we can start using computer use. So
browser use has been one of these huge unlocks.
And this is twofold right I mentioned the sort of retrieval aspect of things.
Um I guess for for software engineers there is much more that happens that is beyond the code right you can roughly think about it as there's what to build there's how to build it and then you actually have to build it. I would say
building it has become more or less you know it's reasonable for the model to now given context it can generate the code that hopefully functionally works and then you've got the what to build this is the part that is up to you kind of human imagination and then there's
the how to build it right and there's this richness in context the richness and institutional knowledge and these are the sorts of things that having access to a browser having access to your bug dashboards having access to
your experiments all these sorts of things that now gives the agent this additional level of context and maybe I should have clicked before, but if you saw on the screen, let's see, how do I do this?
So, this is now the other side of things. Browser has verification. So,
things. Browser has verification. So,
you might have seen this video. This is
a tutorial video that we put together on just how to use it. But this is the agent. The blue border indicates that
agent. The blue border indicates that it's being in control by the agent. And
so, this is a flight tracker. You put
in, you know, a flight ID and then it'll give you sort of the start and end of of that flight. And this is being done
that flight. And this is being done entirely by a Gemini computer use variant. is so it can click, it can
variant. is so it can click, it can scroll, it can retrieve the DOM, it can do all the things. And then what's really cool is you end up with not just a diff, you end up with a screen recording of what it did. So it's
changed the game and the model can take this and because it has the ability to understand images, it can take this and iterate from there. So that was the first category, browser use, just an
insane insane magical experience. Now
the second place that we wanted to spend time is on image generation. And we
noticed this theme when we, you know, when I when I first started at at Google, we noticed, okay, Gemini is spending a lot of time on multimodal.
And this is really great for consumer use cases, right? Nano Banana 2 was was mindboggling. Um, but also for devs.
mindboggling. Um, but also for devs. Devs are inherently this is a multimodal experience. You're not just looking at
experience. You're not just looking at text. You're looking at the output of
text. You're looking at the output of websites. You're looking at architecture
websites. You're looking at architecture diagrams. There's so much more to coding than just text. And so there's image understanding. This is verifying
understanding. This is verifying screenshots, verifying recordings, all these sorts of things. And then the beautiful part about Google is that you have this synergistic nature. This
product takes into account not just Gemini 3 Pro, but also takes into account the image side of things. And so
here I want to give you a quick demo of um mock-ups. So I have a hunch, and you
um mock-ups. So I have a hunch, and you all probably believe this too. Design is
going to change, right? You're going to spend, you know, maybe some time iterating with an agent to to arrive at a mockup. But for something like, oh,
a mockup. But for something like, oh, let's build this website. we can start in image space. And what's really cool about image space is it lets you do really cool things like this. We can add comments and so you end up commenting
and leaving a bunch of a bunch of queued up responses. And it's kind of like
up responses. And it's kind of like GitHub. You'll just say, "All right, now
GitHub. You'll just say, "All right, now update the design." And then it'll put it in here. The agent is smart enough to know when and how to apply those comments. And now we're iterating with
comments. And now we're iterating with the agent in image space. So really,
really cool new capability. And what was awesome is that um we had Nano Banana Pro, you know, we pulled an allnighter for uh for the Gemini launch because that was our first launch. Then they
said, "Do it again. Do it on Thursday."
So we made Gemini Pro um I'm getting all these model names confused. The image
gen one, the nano banana one, we made that available on day one. I'm running
on very little sleep on day one inside of the anti-gravity editor. And our hope is that the anti-gravity editor is this place where any sort of new capability can be represented inside of our product.
And so step two was all right, we have this new capability. We've pushed the ceiling higher. Agents can do longer
ceiling higher. Agents can do longer running tasks. They can do more
running tasks. They can do more complicated things. They can interact on
complicated things. They can interact on other surfaces. And so this necessitates
other surfaces. And so this necessitates a new interaction pattern. And we're
calling this artifacts.
This is a new way to work with an agent.
And this is one of my favorite parts about the product. And at its core is this agent manager.
So let's start by defining an artifact.
An artifact is a dynamic representation of something that the agent generates.
Sorry, it's a an artifact is something that the agent generates. That is a dynamic representation of information for you and your use case. And the key here is that it's dynamic.
Artifacts are used to keep the agent organized. They can use used for uh kind
organized. They can use used for uh kind of like self-reflection and and self-organization. It can be used to
self-organization. It can be used to communicate with the user to maybe give you a screenshot, to maybe give you a screen recording like we described. And
it can also be used across agents, whether this be with our browser sub agent or with other conversations or as memory. And this is what you see on the
memory. And this is what you see on the right side of this agent manager. We've
dedicated sort of half the screen and and your sidebar to this concept of artifacts.
And so we've all tried to follow along chain of thought. And I would say this, you know, we did some fanciness here inside of the agent manager to make sure conversations are broken up into like chunks. So in theory, you could follow
chunks. So in theory, you could follow along a little bit better in the conversation view, but ultimately you're looking at a lot a lot of strings, a lot of tokens. This is like very hard to
of tokens. This is like very hard to follow. And then this is actually like
follow. And then this is actually like there's like 10 of these, right? So you
just scroll and scroll and scroll.
You're like, what the heck did this agent do? And and this this has been
agent do? And and this this has been traditionally the way that people review and sort of supervise agents. They're
kind of just looking at the thought patterns.
But isn't it much easier to understand what is going on inside of this visual representation? And that is what an
representation? And that is what an artifact is. The whole point and the
artifact is. The whole point and the reason why I'm not just standing up here and giving you this long, you know, stream of consciousness is because I have a PowerPoint. The PowerPoint is my artifact. And so Gemini 3 is really
artifact. And so Gemini 3 is really really strong with this sort of visual representation. It's really strong with
representation. It's really strong with multimodal. And so instead of showing
multimodal. And so instead of showing this, which of course we always let you show, we always we will always show you this, but we want to focus on this. And
I think this is the game-changing part about anti-gravity.
And the theme is this dynamicism.
The model can decide if it wants to generate an artifact. And let's remember there are some tasks. We're changing a title. We're changing something small.
title. We're changing something small.
Doesn't really need to to produce an artifact for this. So, it will decide if it needs an artifact. And then second, what type of artifact? And this is where it's really cool. There there are many
potential in potentially infinite ways that it can represent information. And
so, the common ones are markdown in the concept of a of a plan and a walkthrough. So this is probably what
walkthrough. So this is probably what you've used most most often. When you
start a task, it will do some research.
It will put together a plan. This is
much very similar to like a PRD. It will
even list out open questions. So you can see in this feedback section, it'll surface, hey, you should probably answer these three questions before I get going. And what's really awesome, and
going. And what's really awesome, and we're betting on the models here. What's
really awesome is that the model will decide whether or not it can auto continue. If it has no questions, why
continue. If it has no questions, why should it wait? It should just go off.
But more often than not, there are probably areas where you may be underspecified or maybe it did something during research, right? everyone has
gone through and and started a big refactor then realized they actually don't have all the information ahead of them. They got to go back to the drawing
them. They got to go back to the drawing board, maybe talk to some people. Same
idea. So it'll surface um it'll surface open questions for you. And so that's you'll start with that implementation plan and then you'll say all right LGTM let's like send it. You'll go all the way down. It might produce other
way down. It might produce other artifacts. You know we've got a task
artifacts. You know we've got a task list here. This is the way that you can
list here. This is the way that you can monitor the the progress of the agent instead of looking at the conversation.
might put together some architecture diagrams and then you'll get a you'll get a walkthrough at the end and this walkthrough you kind of saw a glimpse of this before but it is hey how do I prove to you agent to human that I did the
correct thing and I did it well and then this is the part that you'll end with it's kind of like a PR description and then there's a whole host of other types right Images screen recordings these mermaid diagrams and really what's
what's what's quite cool is that because it's dynamic the agent will decide this over time so suddenly there's maybe a new type of artifact that we maybe we missed Right? And then it'll figure that
missed Right? And then it'll figure that out. It'll just become part of the
out. It'll just become part of the experience. So it's very scalable. But
experience. So it's very scalable. But
this artifact primitive is something that's very very powerful that I'm pretty excited about. And then I guess another question is why is it needed? So
we'll always explain to the user what the purpose of this artifact is. Um and
then interestingly like who should see it? So should the sub agents see it?
it? So should the sub agents see it?
Should the other agents see it? Should
other conversations see this? Should
this be stored in my memory bank? Right?
If this is something that I derived, one of the cool examples um that I like is like if you give it a a piece of documentation and give it your API key, it'll like go off and run curl requests to basically figure out the exact schema
of like what the types of APIs you're using and it'll do this like deep research um for quite a while and then it'll give you a report and basically like deeply understand uh this sort of uh this sort of API. You wouldn't want
to just throw that away and have to rederive it the second time you did this. So it'll store it in your memory
this. So it'll store it in your memory and then all of a sudden that's just a part of your knowledge base. So, and
then there's also this idea of like notifications, right? So, if there's an
notifications, right? So, if there's an open question, you want the agent to be proactive with you. And that's another very cool property of this artifact system. We want to be able to provide
system. We want to be able to provide feedback along this cycle. So, from task start to task end, we want to be able to provide feedback and inform the agent on what to change.
And the artifact system lets you iterate with the model more fluidly during this process of execution. And
so, not to sound like a complete Google shell, but I love Google Docs, right?
Google Docs is a great pattern. It's
awesome. The comments are great. And
this is how you might interact with a colleague, right? You're collaborating
colleague, right? You're collaborating on a document. Then all of a sudden, you want to leave a textbased comment. So,
we took inspiration from that. We took
inspiration from GitHub. But you leave comments. You highlight text. You say,
comments. You highlight text. You say,
"Hey, maybe this part needs to get ironed out a bit more. Maybe there's a part that you missed or actually don't use Tailwind. Use vanilla CSS." So,
use Tailwind. Use vanilla CSS." So,
these are the sorts of comments that you would leave. You'd batch them up. And
would leave. You'd batch them up. And
then you go off and send. And then in image space, this is very cool. We now
have this like Figma style drag and drop like or not drag, you know, highlight to select. And now you're leaving comments
select. And now you're leaving comments in a in a completely different modality, right? And we've done this and
right? And we've done this and instrumented the agent to ma naturally take your comments into consideration without interrupting that task execution loop. So at any point during your
loop. So at any point during your conversation, you could just say, "Oh, actually, you know, mid mid browser actuation, I actually really don't like the way that that turned out. Let me
just highlight that, tell you, send it off." and then I'll just get notified when you're done taking into consideration those comments. And so
it's a whole new way of working. And
this is really at the center of what we're trying to build with anti-gravity.
It's pulling you out into this higher level view. And the agent manager really
level view. And the agent manager really is built to optimize the UI of artifacts.
So we have a beautiful, beautiful artifact review system. We're very proud of this. And it can also handle sort of
of this. And it can also handle sort of the property that is like parallelism and orchestration. So whether this be
and orchestration. So whether this be many different projects, whether this be the same project and you just want to execute maybe a design mockup iteration at the same time you're doing research on an API at the same time you're iterating and and and actually building
out your app. You can do all these things in parallel and the artifacts are the way that you provide that feedback.
The notifications are the way that you know that something requires your attention. It's a completely different
attention. It's a completely different pattern. And what's really nice is that
pattern. And what's really nice is that you can you can take a step back and of course you can always go into the editor. I'm not going to lie to you.
editor. I'm not going to lie to you.
There are tasks that you know you maybe don't trust the agent yet. You don't
trust the models yet. And so you can command E and you can command E and it'll open inside the editor within a split second with the exact files, the exact artifacts and that exact conversation open ready for you to
autocomplete away to continue chatting synchronously to get you from 80% to 100%. So we always want to give devs
100%. So we always want to give devs that escape hatch. But in the future world, we're building for the future.
You'll spend a lot of time in this agent manager working with parallel sub agents, right? It's a very very exciting
agents, right? It's a very very exciting concept.
Okay, so now that you've seen we've got new capabilities, multitude of new capabilities, we've got a new form factor. Now the question is like what is
factor. Now the question is like what is going on under the hood at Deepmind? And
the secret here is a lesson that I guess we've just learned over the past I don't know we've spent like or I I've personally spent like three years in in codegen just to be your your biggest
user right and that creates this research and product flywheel and so I will tell you anti-gravity will be the most advanced product on the market because we are building it for
ourselves we are our own users and so in the dayto-day we were able to give Google engineers deep mind researchers is we were able to give them an early access and now an
official access to anti-gravity internally. And so now all of a sudden
internally. And so now all of a sudden the actual experience of the models that people are improving, the actual experience of of using the agent manager
and touching artifacts is letting them see at a very very real level what are the gaps in the model.
And whether it be computer use, whether it be image generation, whether it be instruction following, right? Every
single one of these teams, and there are many teams at Google, has some hand inside of this very, very full stack product.
And so you might notice as an infrastructure engineer, you might say, "Oh, this is a bit slow." Well, build it for yourself. Make it faster. Image
for yourself. Make it faster. Image
genen, all of a sudden, computer use isn't going well. It can't click this button. It's really bad at at scrolling.
button. It's really bad at at scrolling.
Really bad at finding text on the page.
Well, go off and and make that better, right? So, it gives you this level of
right? So, it gives you this level of insight that eval just simply can't give you. And I think that's what's really
you. And I think that's what's really cool about being a deep mind. You are
able to integrate product and research in a way that creates this flywheel and pushes that frontier. And I guarantee you that whatever that frontier provides, we will provide an anti-gravity for the rest of the world.
These are the same product. And so, I'll give you two examples of how this is has worked. The first one was that computer
worked. The first one was that computer use example, right? in collaboration
with a computer use team which we sit you know a couple couple tens of feet away from we identified gaps on both sides right so we're not just using an API we are interacting across teams to
basically say oh like the capability is kind of off here can can we go off and figure out what's going on here maybe there's a there's a mismatch in data distribution and then on the other side it's like yo your like agent harness is
like pretty screwed up you got to fix your tools right and so then we'll go off and we'll fix our side but it's this harmony it's it's both sides talking to each other that really makes this type of thing possible. Similarly, you come
up with a new product paradigm artifacts. Artifacts were not good on
artifacts. Artifacts were not good on the initial on the initial uh versions, right? What part of training, what part
right? What part of training, what part of data distribution includes this like weird concept of reviews? And so, it took a little bit of plumbing, a little bit of work with the research team to figure out, all right, let's steadily
improve this ability. Let's give you a hill to climb. And then now we were able to launch Gemini 3 Pro with a very good ability to handle these sorts of artifacts. And so it's this cyclic
artifacts. And so it's this cyclic nature that I'm really really betting on.
And this this is really how anti-gravity will defy gravity. We've got pushing the ceiling. We're going to have an agent
ceiling. We're going to have an agent with very very high level of ambition.
We're going to try and do as much as we can. And this includes vibe coding.
can. And this includes vibe coding.
Though I will say there are some excellent products out there by Google.
AI studio is an excellent product.
We are in the business of increasing the ceiling.
Second, we built this agent first experience artifacts agent manager. And
then finally, we have this research product flywheel. And this is the magic
product flywheel. And this is the magic and this is the three-step process that we used in building anti-gravity.
So, it's been a blast. I mean, I've I've been back at um AI Engineer Summit.
Thank you again, Swix and Ben, for having me. It's been awesome to come
having me. It's been awesome to come back every year. And so on behalf of the anti-gravity team, I just want to thank you for your time, for your patience as you use the product um and your support.
And of course, you too can adopt a TPU and help us uh turn off pager duty a bit more. Um and
then of course, you know, you could also yell at me on Twitter. That's another
way of doing it. Maybe do it in DMs instead. Um but we've got a lot of
instead. Um but we've got a lot of exciting things and I'm really really excited to bring anti-gravity to market.
The team is thrilled that this is now out in the wild. So we welcome your feedback. Um, and thank you again for
feedback. Um, and thank you again for listening. Enjoy the rest of the
listening. Enjoy the rest of the conference.
Ladies and gentlemen, please welcome back to the stage Jed Borave.
>> Thank you, Kevin. Let's hear for Kevin and all of our speakers today.
All right. What a great way to end our sessions. How are we doing? I know it's
sessions. How are we doing? I know it's been a big day.
>> Yeah, we're still somewhat alive. Okay,
before we wrap up, a few logistics.
Thing. One, there is an official afterparty tonight. If you've
afterparty tonight. If you've registered, there'll be a follow-up email in about 30 minutes. The second
thing is tomorrow is a full day of workshops. Importantly, they are not
workshops. Importantly, they are not here. There's two different buildings.
here. There's two different buildings.
There's the Data Dog building that's right around the corner. There's also
the AWS building, JFK27 on 39th Street.
But when in doubt, the schedule is on the website. My last act as MC is to
the website. My last act as MC is to invite the events co-founders up to stage for a special closing word. Please
join me in welcoming AI engineer co-founders Benjamin Dumpy and Swix.
>> All right, let's keep it going for Jed, everyone.
Let's keep it going for Jed and Alex yesterday, our lovely MC's who really glued this place together. So, we
really had such a great time. Did you
all have a good time?
>> So glad to see that. Hopefully the
program went smoothly. Um me and the team here, the lovely team at the Time Center, Argus HD, Max Video Productions, everyone here supporting the production,
that's us. And the content curation,
that's us. And the content curation, let's give it up for this guy. Oh, thank
you. I hope you enjoyed it. You have no idea how hard he works on this. And for
me as a conference producer who used to do both the content curation and the youth production, it's just a godsend to partner with someone who is, you know, on the forefront of thought leadership
like this. And you have no idea how hard
like this. And you have no idea how hard he works. And he's actually recently so
he works. And he's actually recently so invested in this that he's actually recently stepped up as CEO. So,
>> oh, okay.
>> Just kind of announced that.
>> I was going to do that.
>> I was expecting Yes.
>> But yes, Sean is now the the new CEO of the company. And I couldn't I wouldn't
the company. And I couldn't I wouldn't have it any other way. If there's one person I want to follow like in this world, it's this man. And there's one person I'm happy to be working for, it's this man.
>> Oh yeah. Likewise. I the production quality and the creative design and all the music that you hear online is is all Ben. So you got to give it all for him
Ben. So you got to give it all for him as well.
>> Thank you.
So I think we're on let's go to the other notes. Let's go to this. So we
other notes. Let's go to this. So we
wanted to talk just a little I mean you saw the slide earlier from from Sean.
It's basically just showing the growth that we had you know like I when I started conference production I always had I always approached events from the point of view for myself and with clients that from your first event you
you you want to start small and you know worst case scenario you sell out. You
don't want to start big and you know you have an empty venue you blow the budget you go bankrupt. So we did that intentionally. We kept the first one
intentionally. We kept the first one very small when we announced it you know with the rise of the AI engineer back in 2023. Um, and we sold out that and then
2023. Um, and we sold out that and then we sold out every single event since then and we're also growing online. So
all of you watching on the live stream, thank you so much for being part of this community remotely. If you can make it
community remotely. If you can make it to one of our events, we are growing. We
have some announcements for you in just a few moments. We'll be coming to you a little bit closer around the world. Um,
and for this event in particular, we had 2,463 applicants and 815 of you came. So,
uh, if we ask an LLM, they might tell us that's maybe, you know, 2% admission rate, I think, something like that. It's
actually 33%. But, uh, that's just showing like the exclusivity of this event in particular. Summit is designed to be a little more exclusive because we want this to be our salv conference. If
you know the Salv conference, some of you might. That's essentially where
you might. That's essentially where Albert Einstein, Nurie Curi, all of the physicists from around the world came and advanced physics over the course of a couple weeks I think um basically decades, right? And basically to the
decades, right? And basically to the point where we are today because string theory, where has that taken us? Come
on. Um hopefully we can solve that with uh the rise of AI. Um in any case, uh that's what we have for summit. And uh
where we at? I put this together in the last 30 minutes, so pardon me. Um, so
there's clearly demand for this community. There's clearly growth here.
community. There's clearly growth here.
There's um a lot of excitement here and there's a lot of great content and there's a lot of questions to be answered. There are no clear answers
answered. There are no clear answers here. This is potentially, as people
here. This is potentially, as people argue, a new consciousness, right? It's
a new form of consciousness, right?
We're all discovering this together. So,
as we push forward into the dark, into the future, in this wonderful, beautiful moment in this temporal existence, we will figure it out together. That's why
we are excited to announce, roll tape, World's Fair is coming back to San Francisco 2026
on June 29th, typo, to July 2nd. So,
we're going to do four days. So, we're
going to do one workshop day and three days of sessions. So, tickets are on sale now. These are going to be the
sale now. These are going to be the cheapest they will be. Did we come up?
Did we decide a discount code or just like >> the price is cheap?
>> No, the super early bird is the discount.
>> It's the super early bird.
>> Okay.
>> Well, it's there's two choices, right?
The ticket is this price and we're going to change the price later or here's the discount code. So, that's always a
discount code. So, that's always a question. Um, we're running the event.
question. Um, we're running the event.
This is, you know, side stuff. This is
future events. So these uh so so we're really happy and very excited to announce that this year we are at Mosone West that is the baby brother of the Moscone Center. So we're not quite there
Moscone Center. So we're not quite there yet but we sold out the Marriott Marquee which is the largest um event hotel in San Francisco at about 3200 uh just a
few months ago and so the only place to go from there is to Moscone West which is about 6,500 capacity. So, we're
guesstimating, you know, conservative 5,000, but we hope to get to the max capacity of 6,500. So, buy your tickets now. They're not going to get any
now. They're not going to get any cheaper. Um, why are we Oh, the video
cheaper. Um, why are we Oh, the video ended. Supposed to be quicker than this.
ended. Supposed to be quicker than this.
Let's go to the the next one. We got one more announcement.
Let's You want to say something about Worlds Fair?
>> Uh, yeah. Worlds Fair is our attempt at is our flagship. We are basically trying to capture all of AI in as in one event.
and for you to basically have kind kind of an all you can eat pass to go to multiple conferences at once. Uh this
year we had 10 simultaneous tracks that was a lot. Um but so we'll never expand beyond that but I think we want to really make it count and have you
basically just dip into whether it's like generative media or like voice or like robotics or anything else that you want to explore. Uh we are also bringing it to Europe for the first time. So
that's uh the next announcement.
>> Yes. So this next April, April 8th to 10th, whether you're from Europe or you want to make the trip to beautiful London, we are right in Westminster at the beautiful Queen Elizabeth 2. We can
fit about 800 people in there. So buy
your tickets soon. We expect that one to sell out as well. It's going to be basically baby World's Fair. We want to establish World's Fair essentially in Europe. There's no better place than
Europe. There's no better place than London. We love Paris and we love what
London. We love Paris and we love what the COB team did with uh a engineer Paris. Um but we feel London is the
Paris. Um but we feel London is the right call for this event. The venues
there are beautiful. The city's
beautiful. I'm biased. I really love Paris, but London ain't bad, too. Um I I live there for a little bit, so I do enjoy it.
>> I really care about direct flight from SF to London. So
>> that's very nice.
>> Yeah.
>> Okay. So tickets are also on sale here.
Again, super early bird pricing. The
cheapest they're going to get. So
ai.engineer/urope.
I don't know when those are going to expire, but I'm sure we'll communicate that to you guys later. Um, and then I think this is a video, so let's go to the next one. Okay, so with all of this
growth, um, little old Sean and me, we can't do this ourselves. So, we needed to bring
this ourselves. So, we needed to bring someone on to really help grow the company. We needed to get procedures. A
company. We needed to get procedures. A
lot of a lot of organizing conferences is it's it's a lot of grunt work, right?
It's a lot of human connection. It's a
lot of reminders. It's a lot of sales, but it's also running the business and then just coming up with processes. So,
a lot of times you're kind of figuring it out as you're going. And you get these processes, but then things change and then some things are not perfect.
So, you're always behind and deadlines are always coming and these budgets are insane. Like, you have no idea and you
insane. Like, you have no idea and you don't want to see a budget for one of these events. Um, and there's just a lot
these events. Um, and there's just a lot of complexity to that. So when we first started the event, we um back in 2023, we brought on Leah McBride who uh came from who used to be the director of
events at Twitter and she helped us out within two months basically she was helping us to run an engineer summit to the point where it it was a lot smoother than when I was just running events on my own with you know you get the
Hollywood crew that comes as professionals to help you run the actual event but the whole production is is is a lot of difficulty. So, we're very, very pleased to announce that Leah
McBride is joining us as our new general manager to help us grow into a proper corporation. So, please join me in
corporation. So, please join me in welcoming to the stage Leah McBride, everyone.
>> Leah, thank you, Ben and Swix. Hi, everyone.
Um, as Ben has just told you, um, I'm very excited to join the company as general manager. I've been working in
general manager. I've been working in event marketing uh tech event marketing for almost 20 years. I was lucky enough to join one of the biggest London agencies. I am Scottish so I I'm from
agencies. I am Scottish so I I'm from the UK um way quite a few years ago and um Google was my main client for quite a long time. So I kind of grew up through
long time. So I kind of grew up through that um Google excellence and how we operationally produce excellent events.
Um I was then lucky enough to go on and be uh the director of events for the developer platform at Twitter for a number of years where I led multiple global tours.
>> H I led multiple global tours and I also led um our flagpole event in San Francisco which was flight if anyone ever went to that.
>> Um and following that I moved that was in San Francisco. Then I moved back to London and I um I started a company called Improbable uh which is a gaming
platform. Uh so I worked uh mainly with
platform. Uh so I worked uh mainly with gaming developers there and grew the marketing team from three up to 38 and following that I was lucky enough to
meet Ben and so yes I've been with them since the first event um and we have just got a super exciting year coming up. So to add to our program of Europe
up. So to add to our program of Europe and uh Moscone, San Francisco, we if anyone was there, I don't know, was anyone in Paris with us?
>> Excellent.
>> Show of hands. We got like five or six people.
>> So you'll be able to contest that the Paris event was absolutely phenomenal.
So um our partners, Collab, came to us and asked if we were going to do Paris.
We weren't really ready for that and they suggested that we partner with them. This was a great idea. Um they did
them. This was a great idea. Um they did an absolutely incredible job of producing the Paris event. We It was a soldout sponsorship, soldout event. We
had so much incredible feedback that we have actually decided to turn that into a program. So we are launching in 2026
a program. So we are launching in 2026 our partner program. We have already signed up. So we are hoping again to be
signed up. So we are hoping again to be doing Paris October 2026.
Um we have also recently signed up with a Miami team. Um so we're going to be doing an AI engineer Miami April 20 to 21st and then also we're very excited to
be going to Melbourne also with a partner. So we'll be doing Melbourne
partner. So we'll be doing Melbourne June 3rd to 4th 2026.
Um this part this program is obviously just starting. So there's going to be
just starting. So there's going to be more opportunity as we grow to be part of this program. So if there is anyone in this room who wants to know more about that, please just email us at
sponsorshipsai.engineer
sponsorshipsai.engineer and we can talk about what that looks like for your city. Um so yeah, so thank you. I'm very excited for 26.
you. I'm very excited for 26.
>> Thank you so much.
>> All right, so that's basically all the announcements we had. We hope to see you at one of those events. Whether you come to San Francisco or London, we'll probably be be back for New York. So,
should we come back to New York?
>> All right. Uh uh, by the show of Woos, how do you say that? By Woos, how many of you are from New York?
>> Wow. Yeah.
>> San Francisco.
>> That was close. Let's do it again. New
York.
San Francisco.
>> Oh man, I think Scotland >> 73.
>> San Francisco changed pitch.
>> All right.
>> San Francisco went hired this time. I
don't know.
>> I met Cape Town the other day. So is
Cape Town here.
>> Cape Cape Town has left the building.
>> We had New Zealand, right? So people are coming all all over the world for these events. So we obviously New York is a
events. So we obviously New York is a great one of the great cities of the world and um of course you know those 70% of you will say the greatest um and uh we love coming here. We love giving
people an excuse to come here. So uh we we hope to be back soon and we love this venue uh despite being in Time Square.
Um we we we we put up with it because the crew here is just so fantastic obviously like look how gorgeous this place is. Um so we we hope to
place is. Um so we we hope to potentially be back here uh next year.
Um, and then the rest of our crew that we cannot do this without um, everyone from Argus HD who's running the all of the AV. Can we give a hands round of
the AV. Can we give a hands round of applause for them?
And they put up with us because we're getting them as super super late. So,
um, namaste. I think that's what I'm supposed to say. Um, and I always forget people at this moment. Who else am I forgetting? Flormon catering. Our
forgetting? Flormon catering. Our
caterers are are super fantastic. um you
know, photographer, Randall Ge, Max video productions, um doing our B-roll.
We're gonna have some of those end of day clips.
>> Marina, Kyle, >> yes, of course. Our rest of our team members, my god, why don't I add those as as our slides? We got those in the end credits. I at least thought enough
end credits. I at least thought enough to I was doing this like 30 minutes away. Um anyways, yeah, Marina, our our
away. Um anyways, yeah, Marina, our our senior event producer, Wendy, who recently just joined, Kyle, our program production manager, just joined like two weeks ago, and it's been like just hit
the ground running. um Trish, our our production uh assistant uh supervisor and like just so many of our partners in programs. So we we want to we want to
thank all of them. Um but that's about it for for that. Apologies if I forgot you, but uh we now have about until I think we have until like 6:30 in the
venue. So we have a little bit of time
venue. So we have a little bit of time to kind of chill, make your final evening plans. We do have an official
evening plans. We do have an official evening afterparty brought to you by Cerebras along with uh some friends BCV, Mackenzie, Warp, Exa, Modal, am I pronouncing that right? And Metronome.
So that is the official offsite afterparty. We're not giving you the
afterparty. We're not giving you the location right now because we're on the live stream right now, but believe it's on the signs outside and we'll also send an email in just a few moments and it's
also on your attendee guide. So uh do go ahead and check that out. They are
asking for RSVP for headcount, but you don't need it. you can just show up with a badge and you can get it. But don't
forget your badge. Um, so let's give it up for Cerebras. Thank you.
Also, last people to thank is our our sponsors and speakers. Like our
speakers, they work so hard for this and you know, due to Sean's no vendor talks uh policy, which um is, you know, it's it's a great boon for the for for the um for for the community, but they have
very little incentive to do that other than, you know, thought leadership. So,
um, that is a lot of work and they they they do a great job with that. And then
our sponsors, uh, do we enjoy the sponsor expo?
>> Yeah, it's really well done by Art and Display too.
>> Should we try for the photo?
>> Sorry.
>> Should we try for the photo?
>> Uh, sure. Uh, yeah. Is that the plan?
Okay, cool. Yeah.
>> So, we did this last year and it was really great. This is like this is our
really great. This is like this is our salv conference photo. If you want to come up here and do a photo, we can do that together. Uh Randall Gear, our
that together. Uh Randall Gear, our lumpy photographer, is going to take a photo. He will direct us. We We can
photo. He will direct us. We We can actually get him the mic and then he can like yell at us to go left and right. So
come on stage. Just one caveat. These
things here, they look like you can step on them. You can't. They will break and
on them. You can't. They will break and we'll get charged a lot of money. So
just be careful of those. But otherwise,
come on down if you want.
>> Yeah, we did this. We've done this every conference. It's a little memorabilia.
conference. It's a little memorabilia.
>> Please come on and please keep my mic on for a little bit.
>> Hi. Thanks for coming.
>> Yeah.
>> Hope you enjoyed it.
>> Oh my gosh. I'm doing everyone now. Hey,
>> thank you so much.
>> I'm in Minneapolis, so we just started, you know, like a meet up there and that's awesome.
>> Yeah, we we do we do AI meetups, too.
>> My pleasure. Yeah,
>> I'm like on a hot mic.
>> Code summit summit world's fair.
>> Okay.
>> Yeah. So, the AI engineer series.
>> Okay, don't be shy. take up some steps to the front of the stage and we'll have different rows. And if you're on the
different rows. And if you're on the left side of the podium, I won't get you. So, get on this side of the podium.
you. So, get on this side of the podium.
>> And don't squat yet, but we'll do kind like a little squat in the front, but not yet. Okay?
not yet. Okay?
And if you can, don't be shy. Squish in.
There's a lot of space in the middle, so don't be shy. Towards the middle, the middle Where's >> actually this side?
Okay. So, now we're going to have the front squat just a little bit. If you're
in the middle, squat, medium. And then
if you're in the back row, you can tell a little bit. Yeah. Okay. On three. Here
we go. On three. Looking at me. One,
two, three. One, two, three. Wait for
it. Wait. Make sure it's good. Okay.
Now, give me some.
>> Perfect. Thank you.
We Hello.
Heat.
Hey Heat.
Heat. Heat.
Heat. Heat.
Heat. Heat.
Heat. Heat.
Loading video analysis...