No Vibes Allowed: Solving Hard Problems in Complex Codebases – Dex Horthy, HumanLayer
By AI Engineer
Summary
## Key takeaways
- **AI Ships Slop in Brownfield Codebases**: A survey of 100,000 developers found most AI use in software engineering leads to rework and codebase churn, especially in complex brownfield codebases, while it works great for greenfield projects. [01:02], [01:13]
- **Intentional Compaction Avoids the Dumb Zone**: Compress the context window into a markdown file with exact files and line numbers before starting new agent sessions, staying in the smart zone under 40% capacity to prevent diminishing returns. [03:47], [06:09]
- **RPI Workflow Conquers 300k LOC of Rust**: Research objectively, plan with file names and code snippets, then implement, shipping a week's work in a day on a 300,000-line Rust codebase with PRs passing expert review. [07:46], [08:41]
- **Don't Outsource Thinking to AI**: AI amplifies your thinking or the lack of it; humans must review research and plans to catch errors early, since a bad line of research hoses the entire process. [10:07], [17:12]
- **Plans Enable Mental Alignment**: Code review is primarily for mental alignment, keeping the team on the same page about codebase changes; reviewing detailed plans with snippets scales better than reading thousands of lines of code. [15:03], [15:20]
Topics Covered
- AI Coding Creates Slop in Brownfields
- Intentional Compaction Beats Restarting
- Dumb Zone Ruins Agent Performance
- Sub-Agents Control Context Scope
- Plans Enable Mental Alignment
Full Transcript
[music] Hi everybody. How y'all doing?
>> It's exciting. I'm Dex. As they said in the great intro, I've been hacking on agents for a while. Our talk, 12-Factor Agents, at AI Engineer in June was one of the top talks of all time, I think top eight or something, one of the best ones from AI Engineer in June. I may or may not have said something about context engineering. So why am I here today? What am I here to talk about?

I want to talk about one of my favorite talks from AI Engineer in June. I know we all got the update from Yegor yesterday, but they wouldn't let me change my slides, so this is going to be about what Yegor talked about in June: basically, they surveyed 100,000 developers across all company sizes, and they found that most of the time you use AI for software engineering, you're doing a lot of rework and a lot of codebase churn, and it doesn't really work well for complex tasks or brownfield codebases. You can see in the chart that you're shipping a lot more, but a lot of it is just reworking the slop you shipped last week. The other side was that if you're doing greenfield, a little Vercel dashboard or something like this, it's going to work great. If you're going into a 10-year-old Java codebase, maybe not so much.
And this matched my experience personally, and in talking to a lot of founders and great engineers: too much slop, a tech debt factory, it's just not going to work for our codebase. Like, maybe someday when the models get better. But that's what context engineering is all about: how can we get the most out of today's models? How do we manage our context window? So we talked about this in August.

I have to confess something. The first time I used Claude Code, I was not impressed. It was like, okay, this is a little bit better, I get it, I like the UX. But since then, we as a team figured something out, and we were actually able to get, you know, 2 to 3x more throughput. We were shipping so much that we had no choice but to change the way we collaborated. We rewired everything about how we build software. It was a team of three, it took eight weeks, and it was really freaking hard, but now that we've solved it, we're never going back. This is the whole "no slop" thing. I think we got somewhere with this; it went super viral on Hacker News in September. We have thousands of folks who have gone on to GitHub and grabbed our research-plan-implement prompt system.

So the goals here, which we kind of backed our way into: we need AI that can work well in brownfield codebases and can solve complex problems. No slop, right? No more slop. We had to maintain mental alignment; I'll talk a little bit more about what that means in a minute. And of course we want to spend as many tokens as possible: what we can offload meaningfully to the AI is really, really important. It's super high leverage.
So, this is advanced context engineering for coding agents. I'll start by framing this. The most naive way to use a coding agent is to ask it for something, then tell it why it's wrong and re-steer it, and ask and ask and ask until you run out of context, or you give up, or you cry.

We can be a little bit smarter about this. Most people discover this pretty early on in their AI exploration: if you've started a conversation and you're off track, it might be better to just start a new context window. You say, "Okay, we went down that path. Let's start again. Same prompt, same task, but this time we're going to go down this path, and don't go over there because that doesn't work." So how do you know when it's time to start over? If you see this, it's probably time to start over, right? This is what Claude says when you tell it it's screwing up.
We can be even smarter about this. We can do what I call intentional compaction. Whether you're on track or not, you can take your existing context window and ask the agent to compress it down into a markdown file. You can review this, you can tag it, and then when the new agent starts, it gets straight to work instead of having to do all that searching and codebase understanding and getting caught up.

What goes into compaction? Well, the question is what takes up space in your context window: it's looking for files, it's understanding code flow, it's editing files, it's test and build output. And if you have one of those MCPs that's dumping JSON and a bunch of UUIDs into your context window, you know, God help you. So what should we compact? I'll get into more specifics here, but this is a really good compaction: it's exactly what we're working on, and the exact files and line numbers that matter to the problem we're solving.
we're solving. Um, why are we so obsessed with context? Because LMS are actually got roasted on YouTube for this one. And they're not pure functions cuz
one. And they're not pure functions cuz they're nondeterministic, but they are stateless. And the only way to get
stateless. And the only way to get better better performance out of an LLM is to put better tokens in and then you get better tokens out. And so every turn of the loop when Claude is picking the next tool or any coding agent is picking
the next and there could be hundreds of right next steps and hundreds of wrong next steps. But the only thing that
next steps. But the only thing that influences what comes out next is what is in the conversation so far. So we're
going to optimize this context window for correctness, completeness, size, and a little bit of trajectory. And the
trajectory one is interesting because a lot of people say, "Well, I I told the agent to do something and it did [clears throat] something wrong. So, I
corrected it and I yelled at it and then it did something wrong again and then I yelled at it." And then the LM is looking at this conversation says, "Okay, cool. I did something wrong. The
"Okay, cool. I did something wrong. The
human yelled at me and then I did something wrong and the human yelled at me." So, the next most likely conver
me." So, the next most likely conver token in this conversation is I better do something wrong so the human can yell at me again. So, mind be mindful of your trajectory. If you were going to invert
trajectory. If you were going to invert this, the worst thing you can have is incorrect information, then missing information, and then just too much noise. Um, if you like equations,
noise. Um, if you like equations, there's a dumb equation if you want to think about it this way. Um, Jeff
Huntley uh did a lot of research on coding agents. Uh, he put it really
coding agents. Uh, he put it really well. Just the more you use the context
well. Just the more you use the context window, the worse outcomes you'll get.
This leads to a very, very academic concept I call the dumb zone. So, you have your context window. You have roughly 168,000 tokens; some are reserved for output and compaction. This varies by model, but we'll use Claude Code as an example here. Around the 40% line is where you're going to start to see some diminishing returns, depending on your task. If you have too many MCPs in your coding agent, you are doing all your work in the dumb zone and you're never going to get good results. People have talked about this; I'm not going to talk about that one. Your mileage may vary: 40% depends on how complex the task is, but it's kind of a good guideline.
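To put rough numbers on that guideline (the ~168,000-token figure and the 40% line are the speaker's rule of thumb, not a hard limit): 0.40 x 168,000 is roughly 67,000 tokens of smart-zone budget, and a pile of verbose MCP tool definitions or dumped JSON can burn through much of that before any real work starts.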
So, back to compaction, or as I'll call it from now on, cleverly avoiding the dumb zone. We can do sub-agents. If you have a front-end sub-agent and a backend sub-agent and a QA sub-agent and a data scientist sub-agent, please stop. Sub-agents are not for anthropomorphizing roles. They are for controlling context. So what you can do, if you want to find how something works in a large codebase, is steer the coding agent to do this, if it supports sub-agents, or you can build your own sub-agent system. Basically you say, "Hey, go find how this works," and it can fork out a new context window that does all that reading and searching and finding and reading entire files and understanding the codebase, and then just returns a really, really succinct message back up to the parent agent: "Hey, the file you want is here." The parent agent can read that one file and get straight to work. And so this is really powerful. If you wield these correctly, you can get good responses like this, and then you can manage your context really, really well.
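As a concrete sketch of the kind of context-scoped sub-agent described here (the agent name, prompt wording, and report format are hypothetical; adapt them to whatever sub-agent mechanism your coding tool supports):

```markdown
# Sub-agent: codebase-locator (hypothetical)

You are a read-only search agent. Given a question about how something works,
explore the codebase (grep, read files, follow imports) and answer with ONLY:

1. The 3-5 most relevant file paths, with line ranges
2. One sentence per file on why it matters
3. Open questions you could not resolve

Do not paste file contents back. Do not propose fixes. Keep the report under
~200 tokens so the parent agent's context stays small.
```

The parent agent gets the short report, reads only the files it names, and all the noisy searching stays in the sub-agent's disposable context window.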
What works even better than sub-agents, or rather a layer on top of sub-agents, is a workflow I call frequent intentional compaction. We're going to talk about research, plan, implement in a minute, but the point is you're constantly keeping your context window small. You're building your entire workflow around context management. It comes in three phases: research, plan, implement. And we're going to try to stay in the smart zone the whole time.

The research is all about understanding how the system works, finding the right files, staying objective. Here's a prompt you can use to do research. Here's the output of a research prompt. These are all open source; you can go grab them and play with them yourself.
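To make the research output concrete, here is a trimmed-down sketch of what such a research document tends to contain (the paths and findings are invented for illustration; the real prompts and examples are in the open-source repo mentioned above):

```markdown
# Research: how webhook delivery works (hypothetical example)

## Question
Where are webhooks sent, and where are failures handled?

## Findings (grounded in code, with file:line references)
- `src/webhooks/deliver.ts:42` - `deliverWebhook()` builds the payload and POSTs it
- `src/webhooks/queue.ts:15` - failures mark the job failed; there is no retry today
- `src/config/webhooks.ts:8` - timeout is 5s, configured per environment

## How the pieces connect
The API enqueues a job; the worker calls `deliverWebhook()`; errors bubble up to
`queue.ts`, which is the only place delivery outcomes are recorded.

## Not investigated
Rate limiting on the receiving side; metrics for failed deliveries.
```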
For planning, you're going to outline the exact steps. You're going to include file names and line snippets. You're going to be very explicit about how we're going to test things after every change. Here's a good planning prompt. Here's one of our plans; it's got actual code snippets in it.
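A heavily abbreviated sketch of a plan in that style (the file, snippet, and verification commands are hypothetical; real plans from the open-source prompt system are much longer):

```markdown
# Plan: retry failed webhook deliveries (hypothetical example)

## Step 1: add retry loop in the queue worker
In `src/webhooks/queue.ts` (around line 15), wrap the delivery call:

    for (let attempt = 0; attempt < 3; attempt++) {
      try { await deliverWebhook(job); break; }
      catch (err) { await backoff(attempt); }
    }

Verification: `npm test -- webhooks/queue` passes; new test covers 3 attempts.

## Step 2: surface permanent failures
Mark the job failed only after the final attempt, in the same function.

Verification: manual check - point a webhook at a dead URL, confirm 3 attempts in logs.
```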
And then we're going to implement. If you read one of these plans, you can see very easily how even the dumbest model in the world is probably not going to screw this up. So we just go through, we run the plan, and we keep the context low. The implement step, like I said, is the least exciting part of the process.

I wanted to put this into practice. I do a podcast with my buddy Vaibhav, who's the CEO of a company called Boundary ML, and I said, "Hey, I'm going to try to one-shot a fix to your 300,000-line Rust codebase for a programming language." The whole episode is like an hour and a half; I'm not going to talk through it right now, but we built a bunch of research and then we threw it out because it was bad. Then we made a plan; we made a plan without research and with research and compared all the results. It's a fun time. That was Monday night. By Tuesday morning, we were on the show, and the CTO had seen the PR, didn't realize I was doing it as a bit for a podcast, and basically said, "Yeah, this looks good. We'll get it in the next release." I think he was a little confused. Here's the plan. But anyways, yeah: confirmed, it works in brownfield codebases, and no slop.
But I wanted to see if we could solve complex problems. Vaibhav was still a little skeptical. We sat down for like 7 hours on a Saturday and we shipped 35,000 lines of code to BAML. One of the PRs got merged like a week later. I will say some of this is codegen: you know, you update your behavior, all the golden files update and stuff. But we shipped a lot of code that day. He estimates it was about 1 to 2 weeks of work in 7 hours. So cool, we can solve complex problems.

There are limits to this. I sat down with my buddy Blake; we tried to remove the Hadoop dependencies from Parquet Java. If you know what Parquet Java is, I'm sorry for whatever happened to get you to that point in your career. It did not go well. Here are the plans, here's the research. At a certain point, we threw everything out and actually went back to the whiteboard. Once we had learned where all the footguns were, we went back to: okay, how is this actually going to fit together?
And this brings me to a really interesting point that Jake's going to talk about later: do not outsource the thinking. AI cannot replace thinking. It can only amplify the thinking you have done, or the lack of thinking you have done. So people ask, "So, Dex, this is spec-driven development, right?" No, spec-driven development is broken. Not the idea, but the phrase. It's not well defined. This is Birgitta from Thoughtworks. A lot of people just say "spec" and they mean a more detailed prompt.

Does anyone remember this picture? Does anyone know what this is from? All right, that's a deep cut. There will never be a year of agents, because of semantic diffusion. Martin Fowler said this in 2006: we come up with a good term with a good definition, then everybody gets excited and starts using it to mean a hundred things to a hundred different people, and it becomes useless. We had "an agent is a person," "an agent is a microservice," "an agent is a chatbot," "an agent is a workflow." And thank you, Simon, we're back to the beginning: an agent is just tools in a loop.
This is happening to spec-driven dev. I used to have Sean's slide at the beginning of this talk, but it caused a bunch of people to focus on the wrong things. His thing of, like, forget the code, it's like assembly now and you just focus on the markdown: very cool idea. But people say spec-driven dev is writing a better prompt, a product requirements document. Sometimes it's using verifiable feedback loops and back pressure. Maybe it's treating the code like assembly, like Sean taught us. But for a lot of people it's just using a bunch of markdown files while you're coding. Or my favorite, which I just stumbled upon last week: a spec is documentation for an open source library. So it's gone. Spec-driven dev is overhyped. It's useless now. It's semantically diffused.
So I want to talk about four things that actually work today: the tactical and practical steps that we found working internally and with a bunch of users. We do the research; we figure out how the system works. Remember Memento? This is the best movie on context engineering, as Peter puts it. Guy wakes up, he has no memory, he has to read his own tattoos to figure out who he is and what he's up to. If you don't onboard your agents, they will make stuff up.

And so, if this is your team (this is very simplified; most of you have much bigger orgs than this), let's say you want to do some work over here. One thing you could do is put onboarding into every repo. You put in a bunch of context: here's the repo, here's how it works. This is a compression of all the context in the codebase that the agent can see ahead of time, before actually getting to work. This is challenging because sometimes it gets too long. As your codebase gets really big, you either have to make this longer or you have to leave information out. And so as you read through this, you're going to read the context of this big 5-million-line monorepo, you're going to use up all the smart zone just learning how it works, and you're not going to be able to do any good tool calling without dropping into the dumb zone.
So you can shard this down the stack; this is just progressive disclosure. You could split this up, right? You could put a file in the root of every repo, and then at every level you have additional context: if you're working here, this is what you need to know. We don't document the files themselves, because they're the source of truth. But then as your agent is working, you pull in the root context and then you pull in the sub-context. We won't talk about anything specific; you could use CLAUDE.md for this, you can use hooks for this, whatever it is. But then you still have plenty of room in the smart zone because you're only pulling in what you need to know.
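As an illustration of that sharded layout, one way it can look on disk (the directory names and the choice of CLAUDE.md at the root are hypothetical; any context file your agent reads works the same way):

```text
repo/
  CLAUDE.md            # one page: what the service does, how to build and test, where things live
  billing/
    CONTEXT.md         # read only when working under billing/: invariants, gotchas, owners
  webhooks/
    CONTEXT.md         # the same idea for the webhooks slice
```

The agent pulls in the root file plus the one sub-context for the area it is touching, rather than a monolithic document covering the whole monorepo.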
The problem with this is that it gets out of date. And so every time you ship a new feature, you need to kind of cache and validate and rebuild large parts of this internal documentation. You could use a lot of AI and make it part of your process to update this. But I want to ask a question: between the actual code, the function names, the comments, and the documentation, does anyone want to guess what is on the y-axis of this chart?

>> Slop.

It's actually the amount of lies you can find in any one part of your codebase. So you could make it part of your process to update this, but you probably shouldn't, because you probably won't. What we prefer is on-demand compressed context. So if I'm building a feature that relates to SCM providers and Jira and Linear, I would just give it a little bit of steering. I would say, hey, we're going over into this part of the codebase over here. And a good research prompt, or slash command, or even a skill, might launch a bunch of sub-agents to take these vertical slices through the codebase and then build up a research document that is just a snapshot, actually true because it's based on the code itself, of the parts of the codebase that matter. We are compressing truth.
Planning is leverage. Planning is about compression of intent. In the plan, we're going to outline the exact steps. We take our research and our PRD or our bug ticket or whatever it is, and we create a plan, and we create a plan file. So we're compacting again.

And I want to pause to talk about mental alignment. Does anyone know what code review is for?

>> Mental alignment.

Mental alignment. It is about making sure things are correct and all that, but the most important thing is: how do we keep everybody on the team on the same page about how the codebase is changing and why? And I can read a thousand lines of Golang every week. Uh, sorry, I can't read a thousand. It's hard. I can do it; I don't want to. And as our team grows, all the code still gets reviewed; we don't skip reading the code. But I, as a technical leader on the team, can read the plans and keep up to date, and that's enough. I can catch some problems early and I maintain an understanding of how the system is evolving.
Mitchell had this really good post about how he's been putting his Amp threads on his pull requests, so that you can see not just, hey, here's a wall of green text in GitHub, but here are the exact steps, here are the prompts, and hey, I ran the build at the end and it passed. This takes the reviewer on a journey in a way that a GitHub PR just can't. And as you're shipping more and more, two to three times as much code, it's really on you to find ways to keep your team on the same page and show them: here are the steps I did and here's how we tested it manually.

Your goal is leverage. You want high confidence that the model will actually do the right thing. I can't read this plan and know what's actually going to happen and what code changes are going to be made. So over time we've iterated towards plans that include actual code snippets of what's going to change. So your goal is leverage: you want compression of intent and you want reliable execution. And, I don't know, I have a physics background; we like to draw lines through the centers of peaks and curves. As your plans get longer, reliability goes up and readability goes down. There's a sweet spot for you and your team and your codebase. You should try to find it, because when we review the research and the plans, if they're good, then we can get mental alignment.
Don't outsource the thinking. I've said this before: this is not magic. There is no perfect prompt. This still will not work if you do not read the plan. So we built our entire process around you, the builder, being in a back-and-forth with the agent, reading the plans as they're created. And then if you need peer review, you can send it to someone and say, "Hey, does this plan look right? Is this the right approach? Is this the right order to look at these things?"

Jake, again, wrote a really good blog post about how the thing that makes research-plan-implement valuable is you, the human in the loop, making sure it's correct. So if you take one thing away from this talk, it should be that a bad line of code is a bad line of code; a bad part of a plan could be a hundred bad lines of code; and a bad line of research, like a misunderstanding of how the system works and where things are, means your whole thing is going to be hosed. You're going to be sending the model off in the wrong direction. And so when we're working internally and with users, we're constantly trying to move human effort and focus to the highest-leverage parts of this pipeline. Don't outsource the thinking. Watch out for tools that just spew out a bunch of markdown files to make you feel good. I'm not going to name names here.
Sometimes this is overkill. The way I like to think about it is: you don't always need a full research, plan, implement. Sometimes you need more, sometimes you need less. If you're changing the color of a button, just talk to the agent and tell it what to do. If you're doing a small feature, maybe just a simple plan. If you're doing medium features across multiple repos, then do one research pass, then build a plan. Basically, the ceiling on the hardest problem you can solve goes up the more of this context engineering and compaction you're willing to do. And so if you're in the top right corner, you're probably going to have to do more. A lot of people ask me, "How do I know how much context engineering to use?" It takes reps. You will get it wrong. You have to get it wrong over and over and over again. Sometimes you'll go too big, sometimes you'll go too small. Pick one tool and get some reps. I recommend against min-maxing across Claude and Codex and all these different tools. So, I'm not a big acronym guy.
We said spec-driven dev was broken. Research, plan, and implement: I don't think those will be the steps. The important part is compaction and context engineering and staying in the smart zone. But people are calling this RPI and there's nothing I can do about it. So just be wary: there is no perfect prompt, there is no silver bullet. If you really want a hypey word, you can call this harness engineering, which is part of context engineering, and it's how you integrate with the integration points on Codex, Claude, Cursor, whatever, and how you customize it for your codebase.
So what's next? I think the coding agent stuff is actually going to be commoditized. People are going to learn how to do this and get better at it. The hard part is going to be: how do you adapt your team and your workflow and the SDLC to work in a world where 99% of your code is shipped by AI? And if you can't figure this out, you're hosed, because there's kind of a rift growing: staff engineers don't adopt AI because it doesn't make them that much faster; junior and mid-level engineers use it a lot because it fills in skill gaps, and it also produces some slop; and then the senior engineers hate it more and more every week, because they're cleaning up slop that was shipped by Cursor the week before. This is not AI's fault. This is not the mid-level engineers' fault. Cultural change is really hard, and it needs to come from the top if it's going to work. So if you're a technical leader at your company, pick one tool and get some reps.

If you want to help, we are hiring. We're building an agentic IDE to help teams of all sizes speedrun the journey to 99% AI-generated code. We'd love to talk if you want to work with us. Go hit our website, send us an email, come find me in the hallway. Thank you all so much for your energy.
[music]