Why RL Won — Kyle Corbitt, OpenPipe (acq. CoreWeave)
By Latent Space
Summary
## Key takeaways - **Fine-tuning's initial value prop: distilling expensive GPT-4**: OpenPipe initially focused on distilling expensive GPT-4 workflows into smaller, cheaper models, finding early traction with customers paying hundreds of thousands monthly to OpenAI. [03:31] - **LoRAs are underrated for production deployments**: LoRAs offer attractive properties for fine-tuning, especially at inference time, allowing multiplexing of many LoRAs on a single GPU and providing deployment flexibility. They remain a viable option for lightweight model customization. [09:07] - **90% of AI projects fail due to reliability, not capability**: Kyle Corbitt believes that the majority of AI projects get stuck in proof-of-concept due to reliability issues, not inherent capability limitations. Solving this through continuous learning from real-world experience is key. [24:53], [01:05:05] - **RULER: LLMs as judges for accessible RL rewards**: RULER, OpenPipe's library, leverages the GRPO insight that LLMs can act as relative judges, ranking agent behaviors without needing complex absolute reward engineering. This makes RL training more accessible. [52:02] - **GRPO's parallel rollout requirement is a dead end**: While GRPO offers operational simplicity and relative scoring, its requirement for perfectly reproducible parallel rollouts makes data generation complicated and is seen as a potential dead end for widespread adoption. [22:49] - **Sandboxing is the hardest part of agent deployment**: Building realistic, reproducible training environments for agents is significantly harder than the AI training itself. This involves replicating not just functionality but also failure modes and edge cases of real-world systems. [23:35]
Topics Covered
- Distillation's Value Prop Eroded by Frontier Model Cost Drops
- LoRAs: A Practical Fine-Tuning Tool Underrated by Marketing
- Fine-tuning is only cost-effective when forced to smaller models
- RL's potential is huge, but environment simulation is the bottleneck
- LLM judges are effective for RL, but environments remain the challenge
Full Transcript
Hey, everyone. Welcome to the Layton Space podcast. This is Alessio, founder of Kernel Labs,
and I'm joined by Swix, editor of Layton Space. ALESSIO SILESIO SILESIO- Hello, hello. And
we're so excited to have Kyle finally in the studio. Welcome. KYLE BASSETT- Hey. I'm
very excited to be here. ALESSIO SILESIO- Kyle, you're CEO, founder? KYLE BASSETT- Yeah. ALESSIO
SILESIO- Co-founder, CEO, yeah. ALESSIO SILESIO- Of OpenPipe, which started two years ago and recently
got acquired by CoreWeave. Congrats. KYLE BASSETT- Thanks. ALESSIO SILESIO- Where I think you might
be our first, like, started and exited founder that we've had on the pod? Maybe-ish?
I don't know. I'm not keeping true. Especially on that timeline. Well, I don't think
I was exited when we, I don't remember if we set this up before or
after we announced we were getting acquired. I specifically pinged you because you
got, I think you got acquired. You've been on my list to watch. Obviously you've
spoken three times at AIE and you've been on my list of like, when is
it a good time to have an open pipe or fine tuning, or all discussion.
And then you got acquired and I'm like, okay, yeah, that's a good, that's a
good time to talk about it. Also because I think like it gives us a
window to talk about acquisitions, consolidation, like what should be an independent company, what, what
maybe doesn't have to be anyway, but we'll maybe do this chronologically. So we don't,
we don't get too far ahead of ourselves. You were famously director of startup school.
Yes. Maybe for people who don't know, like what is Startup School? Did that make
you fall in love with the color orange? Yes, I'm wearing an orange shirt for
those who are listening. A very bright orange shirt. This is my conference shirt, and
I felt like it was appropriate for the pod as well. So yes, I was
at Y Combinator for about four and a half years and led the Startup School
team there. So Startup School, it's changed over the years. It meant one thing before
I was there. It means another thing now. But during the time I was at
YC, Startup School was Basically all of the external facing a lot of the
content, certainly all of the tech. So it was things like we had a MOOC,
effectively, where founders could come in, they could learn about how to start a company,
they could get advice from YC founders, YC partners. We had a co-founder matching service
that we built, which actually worked really well. We got a lot of people through.
Our total, I guess, Technically, I can't. That probably doesn't matter anymore. But a very
large fraction of the batches that went through YC while I was there were directly
attributable to people that we found and ended up recruiting to YC through their experience,
too, at startup school. So that was kind of what we were working on. Yeah,
I always kind of consider it as like the scout program for YC. Yeah. Right.
Like the YC before the YC. Any notable, like famous people that met? as part
of your co-founder matching? Because I'm always very negative on those things because it's like
online dating. The chances of success is super low. But when it works, it's really
nice. You know, that's a great question. I left, so we launched that product probably
nine months before I left. And so I don't know what the long-term outcomes were
of that specifically. Yeah. So you left YC. You spent a year in kind of
the wilderness. You went through YC S23. What's that journey like? You know, I was
very excited about it. AI things in general. So I left YC, I guess, beginning
of 2022. And I was trying out a bunch of different things. Ended up landing
on what turned into OpenPipe in early 2023. This was, let's see, so
I'd been working, so my co-founder is my brother, my little brother, which has been
a fun journey on its own. We were looking at different ideas, and one thing
we realized was we actually started the company immediately after the GPT-4 launch. And what
we saw as the opportunity in the market at the time, which has changed since
then, was GPT-4 was insanely expensive and extremely powerful. But there was an opportunity to
distill specific workflows from GPT-4 down to much smaller, much cheaper models. And
there was a very clear value prop there. Given how expensive GPT-4 was, it was
hard to deploy in production. But you could take those abilities and deploy them much
more cheaply. So that was the first thing we built, was this very managed, very
clean distillation flow. What was that process like in the beginning to get people to
actually care? Because I'm assuming most people are doing experimentation, but they don't really have
these large production workflows that they needed to distill down. And then I think maybe
once we got there, the models get cheaper and faster. So what was the initial
six, nine months of the company through the evolution of the model? Yeah, so it
worked. It was great. So, I mean, it did take us a while. I guess
we formed the company early, maybe March of 2023. By the time we launched... Our
product, it was August, I want to say. There were some different things we were
trying in between. And actually, it was not hard to find people and get them
excited. There weren't very many. I mean, this was even late 2023. There weren't very
many people in production. But anyone who did have production workflows, it was extremely painful.
They were paying hundreds of thousands of dollars a month to open AI. So it
was very easy to convince them to try this out. And so we got our
first three customers after launching probably within a month. And we were doing significant revenue
over the next six months. We actually got to a million in ARR over about
an eight-month period following that launch, so by the latter part of 2024. So actually,
yes, initial traction was super strong, very clear value prop. But then, as you were
alluding to, there was just this slow march of the Frontier Model token prices just
dropping over and over by 3, 5x over and over again, which kind of ate
away at our value prop over time. What was the process of fine tuning the
model? Because even the open models were not that great. And so what were maybe
the bottlenecks? Instead of having three to get to 30 customers, did you feel like
in the beginning it was a matter of just the market growing, the open source
models not being good enough, the fine tuning not being simple, efficient enough? The pain
point, I guess, repeating what I said before, was the price was too high on
the closed models. But you couldn't just drop in an open model and replace them
because, like you're saying, the quality was quite bad. especially as you're moving to smaller
model sizes, but larger models, open models, weren't even available at that time. So that's
kind of where the value prop was, was like, hey, the closed models are too
expensive, at least the ones that are performance enough to do the job. The open
ones are not good enough. We have a very clear managed flow. The way the
flow worked was quite simple. You simply put in our SDK as a drop-in replacement
for the OpenAI SDK. It's capturing. You continue to use GPT-4 in production for a
period of time. We're capturing the requests and responses. And then we had just a
very clean managed flow where it's like, OK, at some point you say, hey, I
want to distill this down. And you train on that. And then we provided an
API that was a direct drop in replacement. You would just change the inference URL.
And you were using your own model in it. Your app continued working. Yeah, I
think the market analysis here, because I was also exploring starting a business around that
at the time. And that's why I ended up not investing. was basically you get
squeezed between the GPU providers who also want to do fine tuning as a service,
because then that makes people more sticky, and the labs who keep putting out distilled
versions of something, whatever mini versions of their models. What was the analysis on the
NeoCloud side? Because you also want to host the inference. Yeah. Honestly,
we, like I said, felt very squeezed from the frontier labs that were putting out
just more capable models at lower cost. I did not see the competition ever really
materialize from the NeoClouds, from the GPU providers. Everybody had an offering in fine tuning.
When we talked to customers, nobody used them because they just were really hard to
use. So I do think that like, you know, call it a product thing, I
guess. Like it's not their focus. Yeah. Who cares? Yeah. Interesting. developer experience matters. It
does. Yeah, it still does. Did. I don't know. Maybe it doesn't matter anymore. Now
we just have coding models do everything for us. No, it still does. When you
have thinking machines launching an API and people getting excited about the API, you're like,
yeah, okay, that's just pure developer experience there. That's fair. Yeah. Yeah. What's the, I'm
just going through the chronological list here. What's like the Mistral 7B Find 2 and
kind of like one of the big inflection points like in the history of the
company. It's like, okay, this is like a good open model and like the 7B
size or Is that just real stinting line? Yeah, Mistral and Mixtral. That was a
golden period of fine-tuning startups because Mistral was a credible open source model. Yeah,
they were really strong models, better than the Llama 2 that they were effectively replacing.
And they also had a super open license, which I think the licensing has become
maybe less of a concern over time at the margin because people are getting used
to maybe. But at the time, that was like a pretty big deal that they
had this fully open Apache 2 license. And, you know, yeah, maybe they have their
own... issues with how they train it. I don't know. I have no inside information
there. But at least the guarantee they're making to people using their model is you
can use this. Yeah, I call this mistral washing. As long as it's like, it's
a constant sparkling region of France called mistral. It's OK. Don't ask about what goes
into it. There's plausible deniability. Exactly. Arms-lith connection there, yeah. OK, there was this mistral
period. Jan 2024, you talked about S-Laura. And there was a period of time where
Lauras became more important. I feel like they then became less important. And I don't
know what's like the rise and fall of LORAs for you as a business. Yeah.
So LORAs have really, really. So if you're predicated on the fact that you're doing
fine tuning at all, LORAs have very, very attractive properties relative to doing a full
fine tune, right? If you're doing a LoRa, you can add training time. It helps
some. You're using less memory to train. But where it really helps you out is
at inference time. Because if you're doing LoRa's, then when you deploy it for inference,
you can multiplex basically an arbitrarily large number of LoRa's on the same GPU deployment.
That lets you do things like do per token pricing as opposed to GPU hour
pricing. It just gives you much more flexibility at deployment time. I'm actually still a
Laura Bowl, for the record. You're talking about the rise and fall. I think Laura's,
their future is still out there. I mean, they're cool again, because of Thinking Machines.
Yeah. I felt very vindicated by that blog post, for the record. Just, I guess,
for listeners, Thinking Machines put out a week or two ago a blog post doing
quite a lot of research on the trade-offs between Laura's and full fine-tuning in various
different training regimes. I think the reason Lora's were uncool for a while was mostly
just because fine-tuning was uncool. I think if you're doing fine-tuning anyway, Lora's are still,
in many cases, the way you want to do it. But not that many people
are doing fine-tuning. As a marketing guy, Lora's had bad marketing. They were just like,
oh, you can't afford full fine-tuning? Here's the Walmart store brand fine-tuning. No,
that's fair. There is some of that. I think we didn't have a huge issue.
We've had to do some user education, like, hey, just try it. I think for
the training runs that, like the types of training runs that we're interested in, where
it's like, hey, I'm doing a relatively lightweight customization of an existing model for a
specific task, there's really no downside to using Allura. And there's a lot of like
upsides from an like infrasimplicity point of view. I agree that there's like a branding
issue around that. Hopefully the Thinking Machines blog post kind of like, you know, like,
yeah, rank one. And like, you know, I think there's, there's different hyperparameters, Allura's that
you can use to, to make yourself happy. The fact that John Schumann was like,
no, like we're actually, banking the company on this, at least for now, is a
pretty big vote of confidence. I think it's surprising that no one's done the research
prior to them. And I was talking to someone at Thinking Machines prior to their
launch who had come from one of the big labs. And what that research role
was like, oh, no, everyone doing post-trainer research inside this big lab uses LoRa's. I
mean, not for the full run, but when they're doing their experiments, they'll just use
LoRa's on a base model to run the experiments. It works fine. For listeners of
the pod, that was leaked in one of the pods that we released, but it's
up to you to find it. Cool. And then, so then it was the first
World's Fair. You talked about you probably don't need fine tuning as a fine tuning
founder. Basically, I think your talks are really good. I would recommend people watch all
of them. What I pulled out was you had a piece of advice. So your
talk title was obviously... somewhat intentionally clickbaity. But your actual advice on when people should
fine tune is when it's cost, latency, or quality consistency that you really care about.
Yeah, I mostly stand by that. I don't think it's changed. And the biggest one
we see today, and this is true for classical SFT, it's also true for the
RL stuff we're doing today. Crossing my fingers is not always the thing. But the
main one I see that really drives fine tuning is if you have to move
to a smaller model, and it's typically for latency reasons, and this is usually like
real-time voice. So if you're sort of forced into a smaller model anyway, then there's
a very high chance that doing some tuning on that model is going to get
you, like it will be necessary basically to have a successful deployment. So we see
that a lot coming from customers that again have those latency requirements. There's other reasons
as well. Sometimes for whatever reason, you really have to deploy on a single GPU,
you have to deploy within your own cloud. And you want a, you know, you
basically have to use a smaller model to do that. So basically in the case
where you're forced to a smaller model anyway, then fine tuning it is often necessary.
I would say for 90% of use cases where you aren't forced to a smaller
model, then it's still not a good ROI. And you probably shouldn't invest in it
today. How do you quantify these things? So costs, right? Could always be lower. So
is there kind of like a threshold of like, cost to ROI, like, because it's
also hard to figure out how much it's gonna cost you to fine tune because
you need to get the data and all of that. Like, do you have a
mental model of that? This is sort of like a function of the total amount
of overhead required. I'd say there's two parts on the cost side and then, you
know, there's multiple parts on the benefit side. On the cost side, the main things
you have to think about are the upfront effort required to get an actual like
training system set up for your task. And that can be quite variable, but I
would say at a minimum, you're going to have to dedicate a couple of weeks
of a fairly competent engineer's time. And if you have a very complex system and
you're doing RL and you need to set up a whole environment, it could be
a lot longer. It could be a couple of months of time. So that's just
a fixed cost you have to pay. There's also an ongoing carrying cost where once
you've committed to doing fine-tuning, it does make other parts of your stack easier. less
flexible, less nimble, because whenever you're updating your prompt or like you're adding new context
or whatever, like now you have to like, you know, spend a few hours training
a model and that's just going to like slow down your iterations, like which is
a real cost. And in many cases, that's the larger cost. So you only want
to do that if like the benefits are large enough. The dollar cost, I would
say, is basically never a factor. It's just so much less than the time, the
amount you're spending this engineer to do the work that it's not. I mean, it's,
you know, each of these runs is between five and a couple hundred dollars. And
it's just, you don't have to do that many of them. Yeah, because most of
the data is like first party. Yeah. Right. Okay. When was the switch to RL?
Was it when 01 preview came out? You were maybe like, okay, it's time to
move on from SFT or? Yeah. So that was a big moment for us with,
you know, there's all the leaks before that about strawberry and all this. And like,
you know, a lot of people talking about, okay, how are they doing it? We
realized through that that like, okay, Someone's figured out how to make RL actually work
with LLMs, which was not a thing. I mean, it was a thing that some
people had played around with before that, but it wasn't like I think many people
were thinking about. And so our bet at that point was, yes, let's figure out
whether this works for task-specific things. And the space we just, I think it's important
to kind of tease out different parts of the market. I think with the release
of 01, and this has been proved out many times with releases since then, I
think there's now a very strong consensus that, okay, On the frontier model, like
general purpose model side, investments in RL are paying off. I think I don't think
most people would argue with that. You're, especially as you're getting into these agentic tasks
and training them to do that, like it seems very clear. Well, obviously the big
lives are paying like ridiculous amounts of money for these environments and everything, but also
like they're actually getting really good results. The model's coming out, you know, we're seeing
it, especially on the coding model side, but like in other contexts as well, we're
seeing the sort of, especially agentic uses working way better because of this. So I
think like even, late 2024, it was pretty clear that like RL was going to
work in that context. And then the question in our mind was like, can we
apply this in a different segment of the business, which is kind of like task
specific customization. And so the question is like, does that work well? How much effort
does that take? Is it going to be something that ends up being unnecessary because,
Oh, the, the big labs can just like train on every single task and the
base models are going to be just good at everything. And so there's, you know,
no benefit to it. So those were kind of the open questions in our mind,
but it seemed like there was like at least a good enough bet that, you
know, We wanted to try it out. Yeah. And you had this agent reinforcement training
framework. And you did the email agent. That's kind of like the first proof of
concept. Was that obvious to do email? Was it obvious to call it that way?
What was the behind the scene? How should we package this? So what I told
our team, and this was we decided to go all in on RL. in January
of 2025. And we've been doing some experience before that. We released before that kind
of like an RL model that had, you know, would generate like Hacker News titles
from articles, which is a fun project. So we'd done a little bit before that,
but that was kind of like, hey, we're going to bet the company on, not
in the literal sense, like we could have done something else later, but like, this
is like the thing that we're going to spend all of our time working on
for at least a few months. And like what I told our team at that
time in January 25 was like, there's probably like a 25% chance
that this is the right direction in the sense that like a year or two
years from now, all the companies, you know, everyone doing inference should be doing RL
and task specific training so that like their models are just way, way better at
their task is a relatively low chance. But it was sort of like one of
those big, if true things, like if that is true, if it turns out that
like just doing RL on your task is just like something everyone should be doing
and it's and it's just, you know, teaching these agents continually, teaching them through experience
is just going to be a huge benefit than like being the first people working
on that. would be a really, really like awesome position to be in. So that's
how we thought about it is like, you know, less than 50% chance, but really
big outcome. If not, if so, I think since that time, and I've been very
transparent with this, like with our team and like when I'm talking to other people,
like, I don't think the chance that that is the right approach is a hundred
percent yet. I think that we're still in the process, even after going through this
and, and, you know, doing that of like figuring out, but the probabilities in my
mind are going in the right direction. Like now I think they're actually like, Today,
I was actually just thinking about this with another conversation. I think that the chances
that everyone who's deploying an agent at scale should be doing RL
with it, either as part of a pre-deployment or even continuously as it's deployed, that
that's the pattern that that's going to get to. I'd say there's a 55%, 60%
chance that that's just the better thing to do. And that's informed by our experiments
working with customers. So anyway, not 100%, but like, Going all the way back to
your question, like, no, it was not obvious. It was an informed bet. It's still
a bet, but one that I'm feeling pretty good about right now. One thing I
think that is tricky about just onboarding onto this space is all the math.
I remember reading the DPO paper. I think they were at NeurIPS for 2023, and
people were very excited about it. Some of it's just being pretentious for a paper,
but some of it's actually real complexity. You don't have a PhD, a prior ML
background. How do you come to grips with it? What were the best ways to
get around it for you? I would probably push back on that a little bit.
I don't think the math is actually that complicated. I think that when you see
the PPO equation or something with all the symbols, if that's your first intro to
it, then it feels very complicated. But I think if you were to show that
exact same equation, just code, maybe not PyTorch code, because you also have to understand.
But if you just did the naive implementation in Python and showed someone, hey, this
is how we're computing the loss here, who was a strong engineer, I think it's
actually quite grokkable. So yeah, I mean, I don't think the barrier to entry is
that high. I think you just have to believe you can do it and then
spend some time staring at it. That would be what I would recommend, is You
know, you can read the papers and look at the equation. I think actually this
is one area where OMs have been super helpful. If I'm reading a new paper
and I look at one of those equations and I'm like, I don't understand how
this new term they introduced like corresponds to like these other terms, then I can
like dump like all the context around it into, you know, GPT-5 and say like,
hey, can you like write this out of Python for me and show me what
they're doing differently? And that's super helpful for kind of like my background, I guess.
Yeah. The way I put it is I wish that all these papers would just
publish with pseudocode or just straight up Python instead of math. Because you actually just
need to look at the implementation. Yeah, totally. I know Jeremy Howard's been beating this
drum for years. I honestly agree with him. I mean, there's a little website called
Papers with Code. And people just keep not following it. I remember interviewing the DPO
guys when they were at NeurIPS. And it was just like they were just very
obsessed with proving in principle equivalence to PPO.
And it was very hard to follow. I'll definitely say that. And I think now,
obviously, at some point, GRPO kind of took over the general consensus. It was very
strange because I think when DeepSeek first started talking about it, it was viewed
as an optimization. They tend to just generally coach everything as an optimization. But I
think the later insight, which I think you touched on in one of your blog
posts, was that, no, it actually makes comparisons independent rather than global.
And that's actually what unlocks some amount of self-supervised
RL. Yeah. I mean, it's interesting. There's real pros and cons.
If you're moving from PPO or something similar to it to GRPO, there are some
big pros. I mean, one pro is just sort of like operational simplicity. There's a
whole extra model you need for this value model you need for PPO that you
can throw away with GRPO. And that just makes your life easier. You don't have
to train that model, but also there's no hyperparameters around that model that you have
to configure. So that's nice. Another thing is the benefit that you're talking about, which
we've observed. So the way GRPO works is you have to do a set of
different trajectories or set of different rollouts all in parallel with the exact same environment,
the exact same conditions. And then you score each of them. And GRP OO uses
the differences in those scores to promote the trajectories that did better and sort of
like decrease the probability of the ones that did worse because they do it in
sort of a group relative way. The only. it lets you be a little bit
looser with how you score them potentially. Like you don't have to necessarily have a
globally aware scoring function. You just need some scoring function that is able to distinguish
between this small set of things you have in front of you. And then that's
easier. That's easier for a human. You know, if you, if you tell a human,
which of these who choose, which of these is better, it's easier for them to
do than say like, is this one good or bad in terms? So that's nice.
The big downside, the huge downside of GRPO. And I think actually the reason why
GRPO actually is, is, is likely to be a dead end and we probably will
not, continue using it indefinitely. The fact that you need to have these parallel rollouts
in order to train on it is actually like that makes the data generation much
more complicated because you need a fully reproducible environment to be able to do these
sort of parallel rollouts. And it turns out in practice, that's like getting that set
up is the hardest challenge today with getting RL working is like actually designing this
robust reusable environment that you can run all of this training in. Most companies,
and that's not true, like sometimes that's easy to do. Like there's certain situations where
you can do that, but for the work we do at least, where we're training
agents on real code bases to like operate like, you know, real applications, it turns
out it's like really, really hard to sandbox those things in a way that's like
totally reproducible. And PPO, now in practice, a lot of times when you're training with
PPO, you also will use an environment like that because it lets you do a
bunch of runs and be more data efficient. But at least in principle, you have
the option with PPO. You can actually purely train on, say, real production traces of
real people interacting with your app. And so you don't have to have a simulated
environment at all, which makes the deployment much easier. Can you double click on
why it's hard to do the sandboxing? Because in principle, we just capture all the
inputs? Yeah. Well, you don't need to just capture all the inputs. You need a
system that reacts the same way your production system does. That's and in many different
ways. And so let's say you're your Airbnb. Right. And I'm bringing this up because
this is like an example of one that, like, you know, companies have gone out
and built sandboxes. Like if you're Airbnb and you're trying to you want to train
an agent to like maybe you're not Airbnb. Fine. You're a company like us. It's
trying to train an agent to like do really well at operating Airbnb and booking
on your behalf. Right. Like. You have to build a copy of the Airbnb website
that reacts to you as the user the exact same way that the real one
does with the same failure modes. Because if you don't include the same failure modes
and bugs they have, then when one of those bugs comes up in production, your
agent's going to have no idea what to do with it. It's just going to
fall over. You also need to simulate, if this is a sort of cooperative agent,
where it's getting human input as well and working with the human to get something
done, which in practice is the way a lot of these are deployed, you also
need to simulate the user. And I mean, you can do the naive thing and
just say, oh, we're going to have a separate LLM with a system prompt that
is like the user simulator. And we do that. But it's like, OK, but the
breadth of ways a user might respond, there's a lot more diversity in that than
the actual diversity you'll get in practice when you have this simulated user. And so
then it's like, OK, well, is this environment close enough to how a real user
would interact that if a user says something different, that it's going to know what
to do? And the answer in many cases is no. If you're just purely training
on an LLM user simulator, it's going to have its own idea of what
the correct way to answer is. And the breadth of a way a human might
respond in this situation is wider, and your agent just may not be able to
deal with that. Do you feel like it's hard to build the simulations as a
company that needs to build the product that lets everybody do it? Or do you
feel like even for the individual companies that own the code base that are domain
experts in their own product, It's still just like a very hard infrastructure problem. I
think it's still very hard. You know, like ideally all companies should have this anyway,
because they're getting, you know, if you're doing end-to-end testing, like theoretically, if you're following
best practices, you would have one of those set up. When we talk to enterprises
almost universally, that's like not something that really exists. So there are some startups, like
there's some companies we've talked to that do have it and we can just like
use that, but it's a very, very small number that actually have an environment like
that. And I think it's hard to do. And like, there's lots of like weird
bugs that, don't show up in an environment like that. And even if they do
have a testing environment, they don't have it populated with full realistic data, which is
also important so that it understands how to interact. So I think in practice, it's
hard in both cases. Maybe it's easier for the company, but at the same time,
depending on the quality of the company's engineers, it might not be easy for them
either. Yeah. How do you classify the types of environments? So you have formal environments
like a compiler, you know, you can put in there. It's like, you don't need
to do any work. They just work. Then you have this kind of like RL
environment startups in a way that are building a bank environment. They're building these things
that are not digital twins or whatever term of like the actual environments, but they're
like close to it. And then on top of it, you have helping people trying
to build the exact replica of their thing. There's obviously value in like the formally
verified ones. We verified that. Do you think there's value in this like RL environment
startups that are building like somewhat generic but test specific environments? And then if
none of those work, then what do we do instead of GRPO? I guess the
question. Yeah, I suspect there is value in that. You know, I think the, you
know, the folks buying those environments and training on them in the big labs would
have the best knowledge on how well they work. I think they probably work okay.
I think they, probably also are like, you know, and we'll see maybe with the
next generation of models released, like how well they transfer. I would say so far
it seems like they don't train well enough. Like if you, if you use, you
know, open AI's agent interface, it's like, okay. Or if you use the computer use
products that everybody's putting out, they're like, okay, but like not reliable enough to like
actually like let go do something interesting unsupervised in the world. And I think if
the environments they were training in were high enough fidelity, then they would be good
enough in the same way that coding agents can go much further. Because I think
that in that case, we do have environments that are much higher fidelity because it's
a much simpler environment in a lot of ways. It's a code base. It's like
maybe running a web browser. It's much easier to capture the full realistic environment in
that context. For those who are interested, when you make a reference to
RL environment startups selling to the big labs, they're selling it for a lot of
money. Yeah. Like at least seven figures. Right. I don't know. That's my understanding. I
don't know. I'm not a buyer. Please, please like drop data points because like people
who are not in Silicon Valley don't know this. And like, it's like probably the
current thing in VC, which is our environment startups. Um, anyway, I, I, a
lot of them, there's like 20 of them apparently. Yeah. But it's like a small
number. I know that, yeah, all the labs are buying. Ad hoc. But in a
way it's almost like they don't even care. It's not a product. It's like, they're
basically like paying the company to build an environment ad hoc for them. It's a
very services business at the moment. Services business. But I mean, if you're spending like
a billion dollar in a trading run. Yeah, but like you can specialize in like,
we are the one that does e-commerce. Like we are the e-commerce experts. So come
to us for e-commerce. Go to the other guys for like social media. Go to
the other guys for like, I don't know. But I'm curious, your take is like,
how do you need to get the data out? to make it fit in your
training run. Especially when you get to like these larger labs, I think they're like
very sophisticated post-training pipelines. And I don't know if there's like a way to just
build a company where it's like, you just send them a CSV of like data.
It needs to be very integrated in it. But I'm curious what you've seen working
with customers too. So for RL, like, the whole way this works is is you
know it has to sort of be getting feedback from the real environment so i
don't i don't see a world where it's as simple as like hey you can
you know there's like a csv type approach i guess you could code anything as
a csv but if you try hard enough um for our alter work you have
to be looking at real runs ideally of your actual agent in its current state
across within an environment as real as possible. So you have to look at actually.
And the data format's actually super simple. It's just basically a list of chat completion
messages. It's effectively whatever. Tool calls. Yeah, exactly. Yeah, it's whatever your agent will be
seeing and doing when it's running. So getting the data is not hard. But what's
hard is like, When you're doing one of these runs and your agent makes a
tool call, OK, now that tool call has to connect. Somehow it's got to get
data back from something, and that data has to look like it will look in
real usage. So setting up that whole part of the system is the challenge. And
then for just a reference job for more people, Web Arena is my first instance
of this kind of thing where you literally have a Docker container that has like
a clone of Reddit, a clone of Wikipedia, a clone of GitLab, a clone of
CMS, and a clone of an e-commerce place. And I think since then there's like
Mind2Web maybe. I don't know if there's other large, well-known academic
environments where people are basically using these as benchmarks. Probably also it's pretty useful for
trading. So if you want to check out those things, you can definitely check there.
I think the question for you is as someone who bet on SFT, then you
bet on RLFT and then now you see these guys making a lot of money.
Why didn't you go there? It seems to me like that definitely is a services
heavy business at the moment as it's presently constituted. I'm sure that these companies are
all developing different kinds of secret sauce on like how to do this like more
quickly. So that's part of it. I don't particularly enjoy services businesses. But... You know,
I also kind of feel like we will move towards a world where either the
big labs, like it's one of those businesses where like the only customers right now
are like whatever, four big, maybe, maybe, maybe six big labs that like, you know,
are training these models on environments. And I don't think I'm a little, what's the
tab? Yeah. Um, but you know, like, look, you can say the same about scale
AI and all of their competitors that are like, you know, many billion dollar companies
that have basically the exact same customer set. So, so yeah. It may work out.
Yeah. And Lesio, I don't know if you want to do a small shameless plug
for Varys. Oh, yeah. I mean, so Varys, one of our portfolio companies, they work
with the people building the agents, not with the model, on their internal tool call
loop. So they can observe all the internal traces and build the data to then
have an open pipe do the RFT on the thing. I think in the enterprise,
we've seen a lot of that, especially for chatbots. It's the less sexy use case,
but they work with a lot of financial services companies where their customers go in
there and say, what's my balance? Like, when did I do this transaction? And those
are all tool calls, you know, and they need a way to test and improve
that behavior. And the models haven't gotten that much better because these tools are like
badly documented. They're like badly named. I think that's kind of like the problem with
a lot of the agent builders that are not AI native companies. It's like, they
just put this like very generic tools in the thing and then They expect it
to work like magic and the simulations kind of help them also have the usual
compliance things. It's like before shipping this, we tested that it doesn't give financial advice.
We tested that, you know, there's all these different things. So I'm curious to see
how much the companies generalize, you know, I think like there's a lot of success
in like highly regulated environments because of different. But I'm curious if you have a
different way to segment the market of like when you think about RL, there's like
environments that are like low stakes. There's like environment that are like high stakes. There's
environment that have implicit rules that are made by the SEC or other government agencies.
How do you think about it? Yeah, I don't know that that segmentation is
necessarily the most relevant. I'd have to think more about that segmentation, whether it's, you
know, there's like a strong difference in how useful RL is across those sectors. Where
I see the segmentation is something basically just like capabilities based, where it's like, hey,
if I'm trying to do something that's like much more advanced and, you know, maybe
like long horizon, then RL can probably give me a much better behavior. And I
might almost think that like, yeah, those sort of like more compliance, like I feel
like in those kind of environments, you probably don't want your agent doing very much
because then it's like you can't make any guarantees about what it might do. And
so you're probably not doing these long horizon things and maybe RL is not gonna
get you what you want, but I don't know. Yeah, I haven't thought about it
too much. Yeah, I think like a lot of the customers don't necessarily end up
doing RL anyway. It's almost like the simulation and the environment. It's like a way
for them to understand the paths that the agent can take and less about we
need to then use that data to do fine tuning. But I think it's like
a, it's gonna be a spectrum. Yeah. What replaces your PO? Yeah, it's a good
question. We need the alpha. Yeah, I mean, I don't know is the short answer.
I do think this is like a fairly high salience question in the research community.
I think there's a lot of folks like trying to figure that out. Every paper
has a variance. Yeah, but a lot of, but I think the big question is,
are we doing normalization based on grouping or in some other way?
I would claim we're just going to keep calling it GRPO as long as the
normalization is done within a group, even though there's a lot of things that probably
should get their own names. A lot of things that have tried to get their
own names and have failed on the marketing side. I think something that doesn't require
group-level normalization, which a lot of older things didn't, probably works, but I think that
the older things also are really finicky. So there may be other kinds of simplification.
And I don't know exactly what those will be. Where do you put the prompt
optimization thing? We did a DevDay episode, and we mentioned JEPA. And then everybody came
out of the woodwork on Twitter. DS5 Bros. Yeah, exactly. OK, tell me, have you
or people you've talked to tried JEPA? I want to know what. I read the
paper. I'm just like, look, the prompt layer updates are not the same as weights
updates, which they're just comparing apples and oranges and i i talked with a few
people i respect on on this on on the rl side and they kind of
validated like the way that these grad students market their papers is their thing beats
the current hot thing and the current hot thing is grpo but like i they're
just not that comparable i disagree with that like i actually think they are comparable
in the sense that like but it depends on for what purpose right but like
If I'm a company and trying to like get the best performance out of my
agent, like I don't care if you're changing my prompt or if you're changing my
way. So if you get better performance on my agent, you know, I'm, I'm, I'm
happy on that front. I do think they're comparable and we've evaluated, I mean, we
evaluated like, like the, so their answer was you are going to do both. If
you really want max performance, you're going to do both. Yeah. We've evaluated everything from
dispute and we, we evaluated JEP as well. And it's like, it just doesn't work.
Okay. Like, okay. That's going to be the whole boat. Fighting words. JEPPA doesn't work.
It didn't work on the problems we tried it on. It just didn't. It got
like a minor boost over the sort of like more naive prompt we had and
was just like, it was like, okay, just kind of like our naive prompt with
our model gets maybe like 50% on this benchmark and like JEPPA got to 56
and we do our own, we get to like 96. I mean, it was just
like not even. Yeah. comparable. And so maybe we were holding it wrong. You see,
both sides are claiming skill issue, right? So what they would say is you probably
used it wrong. And then people are saying that probably JEPA guys, when they set
up the GRPO benchmark, it wasn't a very fair comparison, which is exactly what my
source said. It's hard to tell. Everyone is trying to get to some version of
the truth. Yeah. But what I will say is we want it. I mean, I
don't know if I would say it goes so far as to say we want
it to work, but we certainly want to know if it works. Like that's like
actually very relevant to like the power. Yeah. And if it's more efficient to get
there, uh, then you should be working. That's yeah. It's actually kind of more credible
now that you're like, you know, you, you're part of a larger core weave that
you're not obviously cause I think JEPA maybe is, uh, makes a open pipe like
less relevant. I totally would disagree with that because like the level we are see
ourselves operating at is actually, we're not like, RL bros trying to figure out the
use case for RL. We're like, hey, we're working with all these enterprises, all these
big companies we're talking to, and we're trying to figure out how we make their
stuff work better. And so I personally am very motivated. If something like JEPA works,
OK, let's build a product around that. That's how I think about OpenPipe, at least.
No, I mean, that's a good clarification to make. Even more so, you actually took
a sincere look at it, and you concluded that there was nothing to do, nothing
to build. Well, maybe we were holding it wrong. So we had Shen Yu on
the podcast a while ago and like, I think he's been a proponent of automatic
prompt optimization and this idea that like you can do a lot more in the
prompts than you can do in the weights. And in principle, I'm biased inclined to
believe that something like a DSPY, something like a JEPA works. So I'm very surprised
to hear this. Yeah, like we keep trying it, you know? We tried the Mipro
V2 stuff that was hyped before that. Also, okay, I should not bury the lead
on the best argument for this, which is it basically JEPA models how the big
labs do their system prompts. It's genetic evolution, you know, and they sort of
incrementally evolve based on like the overall evals that they have. It's slow because it's
done by humans, but JEPA theoretically improves it. It automates this. Okay, hold on.
Is the clinic of the big labs have something? This is news today. No, no,
no, no, no. This is philosophically the same. I'm not saying like... Oh, sure. But
like you're injecting a whole lot of human intuition and kind of like potentially out-of-band
information. We have the best model in the world, which is humanity or like smart
humans. And now we're doing JEPA using dumb... LMs. Right. But they're also like the
humans can bring in out of bad information that like maybe is not captured in
the actual like, you know, the eval. Like they can be like, oh, yes, technically
this did well on the eval, but it's like not really, you know, like I
would suspect that a lot of that ends up getting injected through that human being
in the loop. Yeah. I've always been very surprised at how these guys work on
their system prompts, which are tens of thousands of words long. And there's no ablations.
They just kind of pick what seems to work and then chuck it in there.
And that is the Cloud system prompt. Can't argue with success. Is GPT-5
the first model that had a prompt optimizer by one of the large labs? I
believe so, but I don't remember. Cloud Workbench had this like a year and a
half ago, if you see it that way. It just wasn't like fully automated, but
it was extremely good for its time. I kept telling people about it and nobody
believed me. Do we know if they used it internally? Cloud Workbench? Yeah. Okay. Why
not? Oh, I don't know. Like I, I just, my experience, you know, knowing a
lot of people at these labs is like they launch a lot of products because
like some team is super excited about this product, but that, I wouldn't put that
much weight on it just because they launched it. For some measure of use internally,
I, I, I'm sure I'm, I'm talk the guy, the people I talk to are
biased. I don't know if you fully explored that thread. Yeah, no, I think that's
a, it just interesting that now it's been a knowledge that like the LLM can
improve your prompt. And so I think like Japan always also writing this way of
like, okay, Maybe we can do this programmatically. But I also think the long tail
of people just prompts really badly. And so I think there's some value there. Versus
once you go into RL, you already have a more sophisticated audience. Like who gets
to do GRPO? People that are really smart. Who gets to do prompt optimization? Everybody's
trying to do it. So yeah, that's fair. Maybe our baseline was too high. I
know. Your naive prompt is probably like top 10 percentile of prompts that people put
in these LMs. I'll take it. Yeah. And then the other thing that comes to
mind as you were talking about injecting things out of ban and all that, I
think there's a broader trend that I'm tracking for WorldSphere 26, which is the move
to online evals. The way that we do evals today is probably too locked down.
You're kind of fighting the war that you already know should be fought, and you're
not fighting the wars that you don't about because he didn't plan for it, whatever.
How can we move more online evals into our JEPA process? Maybe that's what it
is. That part I'm much more bullish on. And we can make the analogy. We
can pull in RL intuition here, which is if you're doing JEPA on a static
data set of like, oh, this is the input. This is what makes a good
or bad output. Then as you're updating your prompt, your information, the data you're training
on, becomes less useful, right? Because it's generated by, you know, because it's based on
kind of like the problems you were running into before. And that's the same problem
you have with RL, where you have this concept of being off policy, where it's
like, as you're doing training, you really want to be training on rollouts that came
from the latest version of your model. Because if you train on some that came
from further back, then it's like, it's sort of still data. And it's like not,
it's no longer representing the current issues with your model. And so if you try
and correct for the issues that existed back then, it may not actually be helping
you that much. And I think, you know, for either RL or prompt optimization, that's
definitely true. I think that like one way to apply that in practice is exactly
what you're saying, where you're using the actual data from your, your real evals. You
have some way of saying like, Hey, either people flagging these or no, I'm flagging
these or some way of saying like, this was a good or bad output. I
totally agree with you that like, if you're bringing that into your process, I'm like
much more optimistic that you're going to get good results. Yeah. And the pipelines are
not set up. This is like analytics and UX people being drawn into the ML
process, which they've never been done before. If I had to make a bet as
a big theme for next year, this is going to be it. No, I agree.
And I mean, I think that all of the sort of observability people, like platforms,
see that and are trying to figure out what the right shape is. I haven't
seen the right shape yet, but yes, it seems like a theme for next year.
Statsig. Maybe. Yeah. I haven't, I haven't used them, but opening eye seems to like
them. Yeah. I mean, like, uh, I do think like buying, you know, an
experimentation platform makes sense. And like, you know, I think it's sort of like I've
said before on the podcast, I think that I'm very bullish on model routing as
a feature, but less bullish on model routing companies because of exactly stuff like this,
where like, it is just going to get, get absorbed into the model. It's, it's
a very big part of building the process. You probably don't want to And it's
not that hard. Like it's not rocket science. It's you're just like connecting pipes and
making sure things are set up so that it's easy to use that data. I
have a question for you, a general question. So what fraction of tokens generated by,
say, like the end of 2026, do you think are going to come from open
source models versus proprietary models? Oh, that's a fun question. So we have an
answer from Ankur, from Free Interest, where he was like, it's 5% and going down.
I think it's going to go up because of the amount of enterprise adoption
of open models that I'm seeing. And also... Because there's a lot of demand. The
enterprises would much rather be on open models if they actually could get the performance
they're looking for. Yeah. For cost, for privacy, all that stuff. And I think basically,
honestly, it's just literally... We may have hit... quote unquote, AGI in a sense of
like, it is the average LLM is capable of the work of the average human,
not the best human, but the average human, sure. Like it's actually pretty decent at
customer service. And it's actually pretty decent in like, I don't know, transcribing things from
PDFs, whatever. So like, yeah, I mean, totally, I think that should rise, but people
who believe that it should rise to like 50% are out of their minds. And
I think it's a true question. We should take coding out. I think once you
take coding out, I think, yeah, it can be like 15%, 20%. But I think
with coding, it's still going to be very low. Because these max plans are so
subsidized and so many tokens are being generated. Like Anthropic is like 50% of the
revenue is like- Is your claim that coding will mostly be closed models because
the tokens are subsidized or because the models are just so much better than people
are using anyway? I think as long as- I mean, I'm paying 200 bucks a
month and it's like I'm spending- thousands of dollars. Like by accident, by accident, I
pay with like my credit card and I spend like a hundred bucks in like
an hour. And it's like, this is like the thing that no one wants to
talk about for a topic. Like a topic went from like 1 billion in revenue
to 5 billion. And it was like, Ooh, yay. And then like, what's the margins?
You have this like goose me and going like, what's the margins? Um, they say
it's like 6%. You are part of the 6% that is abusing everything. So everyone
else. I'm not abusing. You're the loss leader. It's not like I'm rotating accounts. I'm
just using the one that I paid for. You know, it's like, yeah. Yeah. But
like through you, people like hear about cloud code, they pay the $200 a month
and then they don't use it and they pay for your inputs. Yeah. Thank you.
Thank you everyone. Keep doing it. Right. So I don't want to have to go
away. But I think like, I don't really see, it's hard to see a world
in which QuantCoder or whatever model replaces that. between quality
and cost. It's like to generate this amount of tokens for 200 bucks a month,
I don't know how anybody can offer together fireworks. They cannot really offer it at
that price. And the quality is not as good. But the reason they can't offer
it at that price is because of the subsidies, right? Which is not like the
long-term sustainable dynamic. I mean, it's interesting because both
Anthropic and OpenAI are building their own infra, right? And they're going to get to
a place where they're going to have idle GPUs that they own. And so they
will also be incentivized to have 100% utilization. And so they will subsidize some of
it. Just the same way if you go on SF Compute, you pay $1.40 for
an H100 instead of the $2.20 listed price on AWS. So I think it
will continue. But again, it depends on whether or not they actually have the 500
billion, like they were saying, which I think they do. You know, just to be
clear, I think Stargate will go online. But once it goes online, then it's like,
well... If they figure out how to pay for $500 billion worth of compute, then
they probably can subsidize for a while. I think they have the 500B, they're going
bigger. Isn't it obvious? What do we mean by have? At the start of this
year, when they announced Stargate, people were like, oh, you don't even have 10. Elon
was like, you don't even have 10. whatever. And then Satya's like, I'm good for
my 80. But like now, now we're seeing all the money start coming in and
like, Probably it's in the order of like 200, 300 billion, like that you could
probably get raised and committed and they're going to get the rest. Like it's, it's
fine. Like I think that the plan is actually a lot bigger. Can I just
say, I love this industry. It's like, yeah, they've got like two or 300 billion
and like what's another couple hundred billion? There's no other industry in the history of
the world where you can say. Yeah. Yeah. It is stupid, but like also like,
do you doubt it? Like I don't, I like. Yeah. That's fair. No, like I
literally like after last week, I think maybe maybe two weeks ago with the whole
Oracle, NVIDIA, and then even AMD deal. I'm like, oh, like these guys, not only
they've locked down Stargate one, they're working on Stargate two, whatever that, that is. And,
and like the sheer ambition is like freaking crazy. There is still one more shoe
to drop, which is the non-sovereign wealth funding that OpenAI needs to get, which
they've promised to drop by the end of this year. And my money is on,
they have to do a coin. Like, I'm not a crypto guy at all. But
like, you know, this is going to be like an open AI coin. This is
the one AI founder that has his own coin already. Yeah. And like, he needs
more money. And he said that they will come up with new innovative financing methods.
What else is there? Yeah, I mean. They're already in a token selling business. Like,
but you got to. That's a great line. Like, buy an open AI token. It
translates to a GPT-5 token. Like, you sure? It's a stable coin.
You'd have to get a lot of political buy-in, I think, to take that level
of risk. What, the White House that is most crypto-friendly since the dawn of time?
Well, I guess Elon's out of there now, so maybe they can make the friends.
I think it's doable. We'll see. Who knows? For what it's worth,
this is a me theory. I don't have any insider information. Should we go back
to Ruler? Yeah, sorry. Right. Open fly. Anyways, we were saying. I think this story
takes us to July 25 when you released Ruler, which you call Easy Mode for
RL Rewards. And then, I mean, shortly after, you get acquired in September. So maybe
you just want to talk through the summer. What was the vision? Then maybe how
the acquisition came together. Yeah, absolutely. So I mentioned my initial
opinion of how likely this direction was to work was maybe 25%. We're up to
55% or so. And Ruler is actually a big update that got me from the
25 to the 50. So let me, I guess, just for context there. So basically,
there are several problems you have to solve if you want to use RL successfully.
The problems you have to solve, I mean, some of them are just really dumb,
basic, like, hey, you got to get the infra and the libraries. have all really
sucked and been built by PhD students who don't know how to build reliable software.
So there's all these practical issues that we're working through. That's one thing. And that's
kind of what we're trying to solve with art. But even after you've got that
solved, you've got major issues, which is you've got to know if your agent is
actually, or whatever system you're using on RL, is doing a good job. That's fundamental.
You have to have a reward. You have to know it's doing well or poorly.
Sometimes that's easy to do. If you're solving a math problem or something, you can
come up with a data set of math problems and the known solution and check
if it's the same. On the coding side, there's been a lot of innovative work
around, I mean, there's, first of all, a lot of open data and a lot
of, I think the approach a lot of companies take is you find existing test
cases and then you break them, but there's sort of a way to figure out
if, you know, you can run the test case, right, and see if your code
fixes it or not. In a lot of other domains, it's much more murky. It's
like, what is a good job versus a bad job? How do I know if
I did a good job? And you really need that information. So we've tried a
bunch of different things. Ruler is a library that we released. Which, let me, relative
universal LLM elicited rewards. Thank you. Yes. And the way it works is, basically, this
depends on the sort of GRPO insight, which I was mentioning earlier, that you actually
don't, with GRPO, it has this nice property where you don't have to have like
an absolute judge the truth. You just have to judge relatively. And so simplifying it
a lot, it's basically just LLMS judge on a whole group. So you say, okay,
this is the task I'm trying to achieve. Here's four different runs of an agent
trying to achieve it. Which of these did best? And it stack ranks them. And
it turns out that works phenomenally well with GRPO, like way better than I expected,
way better than, you know, anyone who kind of like I talked to before we
actually tried this expected. Because it's sort of in, in, in the, the LM years
in his judge, it can have sort of like self ground because it's, it's just
getting these relative ranks. Right. So it doesn't have to like, have like an omniscient
view of like what good or bad looks like. So that, has worked at basically
everything we threw it at. We've done it with a bunch of client projects. We've
done a bunch of our own customers. It basically just works. I honestly kind of
feel like the reward assignment problem is fairly solved. Yeah, it's fantastic.
Just any LMS judge off the hook? We've tried it with so many things. So
one of the results we published was we used QN 2.5 14B as the model
we're training. And as the judge, we used QN 2.5 32B, which is like, not,
I mean, it's fine, but it's like not a, it's, it's much worse than any
frontier model. Right. And even with that combination, we were able to get our, our
agent doing like state of the art better than any frontier model on, on the
tasks we tried it on, even with like an extremely weak judge model. So it's,
it really doesn't depend on having like a really great judge model, um, in practice.
So yeah, it's, it's just like, it's just not something we've had to worry about
since then at all. So that's kind of like checked off. So that's sort of
like got me a significant increase in like, OK, this is actually something people can
apply. This is now something that's packaged up. People can just use our, we open
sourced everything. You can use it off the shelf. If you stick in your train
to run, it will probably just work. So that leaves the remaining problem, which I
guess we were talking about them out of order, but that leaves the environment problem.
That's the one big remaining piece that we don't know yet how to automate or
remove and requires a lot of manual work for every single task. For listeners, you
know, this is why I kind of refer to it as self-supervised because it is
like removes more and more of the human judgment and like the history of machine
learning all the way from like, I guess the, the, the, the, the start of
like, uh, uh, image net and everything, uh, is, is really like that, that insight
of like, you should just take humans increasingly out of it and scale up the
data. You can just throw in there with no supervision. Yeah. Yeah, totally. Yeah. It's,
it's really awesome. Are you bullish on, um, dedicated LMS judge models? Have you looked
at those? Bespoke Labs, we did an episode with them, and they're really trying to
carve out a niche in there. We've looked into it. We've trained some ourselves. We've
also used some off the shelf. There's an evaluation benchmark that the AI2 people put
together, a reward bench. And so reward bench is kind of like trying to benchmark
models on serving as LMS. And reward models are LMS judges in your mind. It's
the same thing. Yeah, yeah, yeah. Mildly different. It depends on the task. LM is
judged is usually more product-facing, and reward modeling is much more
specific within a chat task. That used to be the old meaning of reward model.
I don't know. Maybe terminologies change. I think they're pretty equivalent. I understand that. Yeah.
I can see you guys side. Anyway, so yeah, RewardBench is kind of like, and
so we've tried a bunch off that. The thing is, I guess my maybe meta
take on this is any task that is extremely common is
going to end up in like as a specific, like part of the training data
for the frontier labs. And LMS judge is just something everybody's doing in so many
different contexts that you have to assume that all of the frontier labs have a
bunch of like LMS judge style tasks that they're training their models on. And I
do believe that if something does kind of like make it in, in a like
more than minor way into their training data, that like they're going to do at
least as good a job as, as a dedicated model. So I don't think there's
probably a lot of alpha in dedicated LMS judges just because it's something that like
the, let me caveat that and say, like, if you've got like a very, very
specific task that's like weird and has weird requirements and you have a lot of
data on what's good or bad, then like training a reward model for your specific
task, I think could still work. Um, or, you know, fine tuning an LMS judge
on your specific task could work. I'm pretty bearish on like a, hey, this is
a model that is trained as an LMS judge, but it's a generic LMS judge
that can be used to judge anything. I just don't think you're going to beat
the Frontier Labs on that. Yeah. One other version of this that is not quite
an LLM, but some people are thinking about it, is something that we're working on
for a future episode, which is world models. Sexy. Yeah, very sexy. First
applied in video, as far as I can tell, for Genie 1, 2, 3, and
now with code, and potentially with virtual cells for AI Bio. Any
exploration there that's interesting to you? Yeah. So we've been playing around with it a
little bit. It's one of the directions that I'm fairly optimistic on for solving the
environment problem specifically. Because if you think about it, like a world model, it's a
simulated environment. That's its whole purpose, right? But in an LLM-like thing, not like a
Docker. Yes. Yeah, yeah. you know, whatever, hallucinating, generating, imagining
the responses you'll get from the world. So you can imagine, right, if you had
like a really, really great world model that you're training on. Yeah, it's like your
agent that you're using, it would go and make some tool call. And then this
world model model would generate, hey, this is like probably what the tool call. And
if you have a smart enough, strong enough one, then it could keep its own,
you know, effective internal state of like the changes you made so far and how
that affects. So we've played around with it some. I think if we can get
it to work really well, then that could be a solution for the environment problem,
where you just take a bunch of production traces and use those to condition your
world model so it understands your specific system and what its failure modes are, and
then train against that world model. And the resultant
agent that you train with that would then be able to perform in your real
environment. So I do think it's a really interesting area of research. Yeah. And did
you see the meta-code world model work? I don't think I saw that one. OK.
Yeah, it was like two weeks ago. We just confirmed that the guy for AIU
code in November. And it's really interesting. Like the world model is. Oh, sorry. You're
talking about the meta one? Yeah. OK. Yes, I did. I saw that one. I
said a lot of syllables. I may not have parsed. But like, yeah, it's literally
like having a debugger as the environment, as the world model, and opening up the
execution trace to the model to see what's going on and see the state and
track the state as the code executes. seems to be smart and exploits the unique
situation of code environments where we can actually do these things. Yeah, I think the
way they envision that model being used is a little different.
Actually, I'm curious. I'll have to see the talk. But my understanding from that paper
is the goal they're imagining is this is almost sort of like a pre-training step.
And then now that this model understands code really, really well, we can then use
it as basically like a code generation or a coding agent of some kind. OK,
yeah, which I think makes sense. That's almost more like a different kind of pre-training,
I would say. The way I'm interested in applying world models is basically as its
own end, right, where it's like, actually, the goal is to come out of this
with something that simulates the world, which is not something you really need in code
at all because it's so easy to run code. And you don't need to model
what will happen if you execute this code, typically, because you can just execute the
code and see what happens for training purposes. But it closely models how we think
about code when we code is we kind of mentally execute. the model as we
type. And we go like, is that what we really want? Yeah, I don't know.
Anyway, it's the first model that Meta's released since the MSL reorganization. We know, just
based on our context, that they're very, very interested in code models as a path
to AGI, which I'm also, of course, very interested in. I know we kept in
here for a while. Let's wrap up on the acquisition. So a lot of people
say companies are not sold, they're bought. What was that process like for you? Did
it just Like what was the behind the scenes? Yeah, so that was driven by
actually mostly the weights and biases founding team. Lucas. Yeah. So, so yeah, Lucas and
Sean, particularly. So they, you know, had recently been acquired by CoreWeave and
CoreWeave was looking to, you know, continue growing up the stack. And so, yeah, they
approached me were like, hey, you know, like no pressure, but like this is like
an area that we think is really promising and we, you know, would you like
to work here? And so that's how the conversation started. It was like long. It
was pretty painful. There were points as late as, you know, like the week before
we actually signed where it was like unclear if it was actually going to happen.
So that part was super painful. However, we've been there a month now. We just
shipped a product yesterday, which I'm super excited about. It's been fantastic working there so
far. Like I was like very concerned. I was like, okay, yes, this is great.
We make a lot of money by selling our company, but like is the work
environment going to like really, really suck? And I was like, well, I guess that's
just a risk we'll have to take. It's been fantastic. Like it's honestly been great.
way, way better than I could have imagined. Do you go down to the office,
the one down here? I was there today. We work for it, so I'm based
in Seattle. And they have a small office up there that we work for. Ways
and Biases office in San Francisco is fantastic. If you have the chance, go visit.
They do a lot of hackathons and co-working things. Yeah, there's a hackathon going on
in a month or so. Every week there's a hackathon. But yeah, I mean, so
do you consider yourself working for Ways and Biases or Core Reef? Or both?
No, yeah. So we, so we, I report to the weights and biases, like, yeah,
founders. So we're within that organization and in the org chart, we're there. I don't
know, like branding wise, they're trying to say everything kind of that's not being
sold to like big labs is kind of weights and biases. So like our stuff
we're launching is weights and biases branded. It's not, yeah, not, not core we've branded
as much. I don't know. It's still like, they're still figuring it out. And what's
the product you launched? We launched serverless reinforcement learning. Basically, it lets you offload all
of the GPU management. You don't have to worry about crashes and out of memories
and scaling up and down. We handle all that for you. And you just define
your environment. You define your reward function. And then you just, every time you run
a step, you kind of ship back to our back end. Hey, these are the
trajectories. These are the rewards. Now update my model. And we just make it work
for you. It makes it way easier. Yeah. OK. Very thinky. It is very thinky-like.
I love the thinking machines launch. I think they have a really good idea. It's
also very validating. How did this take so long to appear? Like, it seems... I
don't know. Yeah, we would know. But that's... I felt this way about everything. Like,
there's so many things that should exist. Like, clearly. I just think there's, like, still
not enough people, like, smart people working in this space. Like, honestly, we need... Like,
I realize that there's, like, you know, like, a lot of people. It just feels
like there's still a lot of low-hanging fruit nobody's doing. Okay. One thing I saw
from your... post was your North Star as the RL team at CoreWeave is to
build an old world where every agent learns continually from his real world experience. So
you're touching on the hot topic of the moment, continual learning. What else do we
need to get there? I super believe that. And like, that's basically the vision where
I'm like, you know, I keep talking about these percentages, 25, 50. Like if we
get to the world where we build that, then I think it's just like the
advantages are huge. They're clear. Everyone should just deploy their agents that way. We want
to be like the team that builds the software that makes that easy to do.
So I talked to a lot of engineers at our customers and they're trying to
deploy agents and it's so easy to get the initial prototype and like something that
like kind of works well. It is so hard to get from that to something
that like you are confident is reliable enough to actually deploy in production. And when
you actually look at what those failure modes look like, it's like, oh yeah, like
we know if it gets in this situation or if it gets like these kind
of like inputs, like it behaves funnily. But then it's like, yeah, you can update
your problem to address that. But that's not scalable, because at a certain point, it's
going to start breaking other things. You don't know what it's breaking. You really want
some way to just say, OK, look, this thing you did there, that was the
wrong thing. Just adjust this behavior when you get into this, and then otherwise carry
on. And that's what we can do with RL. And that's what we can do
with continual learning, is we don't have to have this concept of, oh, up front,
I'm trying to make the perfect model that solves everything. It's like, I'm trying to
make a model that's good enough, I can deploy it in production. And then when
these errors come in, I'm going to say, oh, you know, exactly. I mean, very
analogous to how you train a human employee. Like, oh, no, actually, that's not what
you should do in that situation. All right. Fix that and carry on. And that's
just going to make this whole process so much easier. And I think that, you
know, like I think that there is today like 10 times as much inference
that could exist than is existing right now, just purely with projects that are like
sitting in the proof of concept stage and have not been deployed. because there's like
huge bucket of those. And it's, it's all about this kind of like reliability issue
where it's like, okay, like it, it works in controlled circumstances. There's other areas where
it doesn't work. And so if we can solve this problem, there's that, that like
90% of the like inference market, like addressable market today, that's just going to like
come online because we've solved that problem. So, um, that's what we want to do.
Um, I'm super excited about it. And like, I think we have very concrete ideas
on like the specific pieces we need to make that work. Uh, and we just
have to execute against them. Do you feel like the online RL is more susceptible
to like the reward hacking? Especially as you're like short in this loop and like
you don't spend as much time like looking at the different checkpoints. I'm not that
worried about it. And the reason why is because it is reward hacking is quite
easy to detect once it starts happening because once the models found some hack, it
just starts like doing it all the time. It's like, oh yes, this worked great.
I'm just going to keep doing it. And so you like notice very quickly, whoa,
it's doing this thing. And assuming you're using, at least in part, an LLM as
judge to determine which ones are good and bad, it's so easy to just throw
in an extra term and be like, hey, that weird thing that you keep doing,
if it does that, that's bad. Give it a low reward. So we've done this
with a bunch of customers. And reward hacking does happen, but you just see it.
And you adjust your reward prompt, and it just goes away. What's a thing from
YC that guided you through your entrepreneurship journey? And what's one thing that maybe you
find that you disagree with YC on? Oh, that's a good question. One thing that
I really identify with and I've tried to do a good job is kind of
like, you know, sort of, I think they say like, hold your problem tight and
your solution loosely, right? Where it's like, That's what you did. Yeah. Spend a lot
of time thinking about what is the problem people are trying to solve. And then
it's like, don't be too bought into like the way you're solving it today. I
think that's super important. Everyone, you know, it's very easy to get that balance wrong
if you're not thinking about it very consciously. Something I disagree with, that's a good
question. I think there's lots of things I disagree with, but I don't have it
like cached in that direction in my brain. I don't know. Like
I definitely have disagreed with lots of specific piece of advice, but yeah, I don't
have like a great answer right now. I'll bridge it for you in case something
comes up. Sam Altman's like, you know, everything I said as president of YC was
wrong for OpenAI, right? Like do B2B, ended up doing B2C. you should ship products
often, like ended up being installed for two years. Yeah. Yeah. Actually, I
think that second one does resonate with me a lot. We have tried to ship
really quickly and just kind of like, sort of like follow the gradient of the
market. I think if I do another startup, like, and I don't know, maybe this
is just me like being beat up by the market too much. If I do
another startup, like I would like, I think at least some points I probably would
have done better to be like heads down and execute on my vision for longer.
and like kind of like go for the more ambitious thing, but that would take
longer to sort of like prove value, which is definitely not the YC way. But
I think if you have like, I don't know, a good vision and good taste,
then like that, that can like work quite well. Yeah. We'll see what that is
whenever that comes out. But thanks for your time. This is a great overview of
everything. This has been a super fun conversation. Thanks to both of you. Awesome.
Loading video analysis...