TLDW logo

Why RL Won — Kyle Corbitt, OpenPipe (acq. CoreWeave)

By Latent Space

Summary

Key takeaways

  • **Fine-tuning's initial value prop: distilling expensive GPT-4**: OpenPipe initially focused on distilling expensive GPT-4 workflows into smaller, cheaper models, finding early traction with customers paying hundreds of thousands monthly to OpenAI. [03:31]
  • **LoRAs are underrated for production deployments**: LoRAs offer attractive properties for fine-tuning, especially at inference time, allowing multiplexing of many LoRAs on a single GPU and providing deployment flexibility. They remain a viable option for lightweight model customization. [09:07]
  • **90% of AI projects fail due to reliability, not capability**: Kyle Corbitt believes that the majority of AI projects get stuck in proof-of-concept due to reliability issues, not inherent capability limitations. Solving this through continuous learning from real-world experience is key. [24:53], [01:05:05]
  • **RULER: LLMs as judges for accessible RL rewards**: RULER, OpenPipe's library, leverages the GRPO insight that LLMs can act as relative judges, ranking agent behaviors without needing complex absolute reward engineering. This makes RL training more accessible. [52:02]
  • **GRPO's parallel rollout requirement is a dead end**: While GRPO offers operational simplicity and relative scoring, its requirement for perfectly reproducible parallel rollouts makes data generation complicated and is seen as a potential dead end for widespread adoption. [22:49]
  • **Sandboxing is the hardest part of agent deployment**: Building realistic, reproducible training environments for agents is significantly harder than the AI training itself. This involves replicating not just functionality but also failure modes and edge cases of real-world systems. [23:35]

Topics Covered

  • Distillation's Value Prop Eroded by Frontier Model Cost Drops
  • LoRAs: A Practical Fine-Tuning Tool Underrated by Marketing
  • Fine-tuning is only cost-effective when forced to smaller models
  • RL's potential is huge, but environment simulation is the bottleneck
  • LLM judges are effective for RL, but environments remain the challenge

Full Transcript

Hey, everyone. Welcome to the Latent Space podcast. This is Alessio, founder of Kernel Labs,

and I'm joined by swyx, editor of Latent Space. SWYX- Hello, hello. And

we're so excited to have Kyle finally in the studio. Welcome. KYLE CORBITT- Hey. I'm

very excited to be here. ALESSIO- Kyle, you're CEO, founder? KYLE CORBITT- Yeah.

ALESSIO- Co-founder, CEO, yeah. Of OpenPipe, which started two years ago and recently

got acquired by CoreWeave. Congrats. KYLE CORBITT- Thanks. ALESSIO- Where I think you might

be our first, like, started and exited founder that we've had on the pod? Maybe-ish?

I don't know. I'm not keeping track. Especially on that timeline. Well, I don't think

I was exited when we, I don't remember if we set this up before or

after we announced we were getting acquired. I specifically pinged you because you

got, I think you got acquired. You've been on my list to watch. Obviously you've

spoken three times at AIE and you've been on my list of like, when is

it a good time to have an OpenPipe, or fine-tuning, or RL discussion.

And then you got acquired and I'm like, okay, yeah, that's a good, that's a

good time to talk about it. Also because I think like it gives us a

window to talk about acquisitions, consolidation, like what should be an independent company, what, what

maybe doesn't have to be anyway, but we'll maybe do this chronologically. So we don't,

we don't get too far ahead of ourselves. You were famously director of startup school.

Yes. Maybe for people who don't know, like what is Startup School? Did that make

you fall in love with the color orange? Yes, I'm wearing an orange shirt for

those who are listening. A very bright orange shirt. This is my conference shirt, and

I felt like it was appropriate for the pod as well. So yes, I was

at Y Combinator for about four and a half years and led the Startup School

team there. So Startup School, it's changed over the years. It meant one thing before

I was there. It means another thing now. But during the time I was at

YC, Startup School was basically all of the external-facing, a lot of the

content, certainly all of the tech. So it was things like we had a MOOC,

effectively, where founders could come in, they could learn about how to start a company,

they could get advice from YC founders, YC partners. We had a co-founder matching service

that we built, which actually worked really well. We got a lot of people through.

Our total, I guess, Technically, I can't. That probably doesn't matter anymore. But a very

large fraction of the batches that went through YC while I was there were directly

attributable to people that we found and ended up recruiting to YC through their experience,

too, at startup school. So that was kind of what we were working on. Yeah,

I always kind of consider it as like the scout program for YC. Yeah. Right.

Like the YC before the YC. Any notable, like, famous people that met as part

of your co-founder matching? Because I'm always very negative on those things because it's like

online dating. The chances of success is super low. But when it works, it's really

nice. You know, that's a great question. I left, so we launched that product probably

nine months before I left. And so I don't know what the long-term outcomes were

of that specifically. Yeah. So you left YC. You spent a year in kind of

the wilderness. You went through YC S23. What's that journey like? You know, I was

very excited about it. AI things in general. So I left YC, I guess, beginning

of 2022. And I was trying out a bunch of different things. Ended up landing

on what turned into OpenPipe in early 2023. This was, let's see, so

I'd been working, so my co-founder is my brother, my little brother, which has been

a fun journey on its own. We were looking at different ideas, and one thing

we realized was we actually started the company immediately after the GPT-4 launch. And what

we saw as the opportunity in the market at the time, which has changed since

then, was GPT-4 was insanely expensive and extremely powerful. But there was an opportunity to

distill specific workflows from GPT-4 down to much smaller, much cheaper models. And

there was a very clear value prop there. Given how expensive GPT-4 was, it was

hard to deploy in production. But you could take those abilities and deploy them much

more cheaply. So that was the first thing we built, was this very managed, very

clean distillation flow. What was that process like in the beginning to get people to

actually care? Because I'm assuming most people are doing experimentation, but they don't really have

these large production workflows that they needed to distill down. And then I think maybe

once we got there, the models get cheaper and faster. So what was the initial

six, nine months of the company through the evolution of the model? Yeah, so it

worked. It was great. So, I mean, it did take us a while. I guess

we formed the company early, maybe March of 2023. By the time we launched... Our

product, it was August, I want to say. There were some different things we were

trying in between. And actually, it was not hard to find people and get them

excited. There weren't very many. I mean, this was even late 2023. There weren't very

many people in production. But anyone who did have production workflows, it was extremely painful.

They were paying hundreds of thousands of dollars a month to open AI. So it

was very easy to convince them to try this out. And so we got our

first three customers after launching probably within a month. And we were doing significant revenue

over the next six months. We actually got to a million in ARR over about

an eight-month period following that launch, so by the latter part of 2024. So actually,

yes, initial traction was super strong, very clear value prop. But then, as you were

alluding to, there was just this slow march of the Frontier Model token prices just

dropping over and over by 3, 5x over and over again, which kind of ate

away at our value prop over time. What was the process of fine tuning the

model? Because even the open models were not that great. And so what were maybe

the bottlenecks? Instead of having three to get to 30 customers, did you feel like

in the beginning it was a matter of just the market growing, the open source

models not being good enough, the fine tuning not being simple, efficient enough? The pain

point, I guess, repeating what I said before, was the price was too high on

the closed models. But you couldn't just drop in an open model and replace them

because, like you're saying, the quality was quite bad, especially as you're moving to smaller

model sizes, but larger models, open models, weren't even available at that time. So that's

kind of where the value prop was, was like, hey, the closed models are too

expensive, at least the ones that are performant enough to do the job. The open

ones are not good enough. We have a very clear managed flow. The way the

flow worked was quite simple. You simply put in our SDK as a drop-in replacement

for the OpenAI SDK. It's capturing. You continue to use GPT-4 in production for a

period of time. We're capturing the requests and responses. And then we had just a

very clean managed flow where it's like, OK, at some point you say, hey, I

want to distill this down. And you train on that. And then we provided an

API that was a direct drop-in replacement. You would just change the inference URL.

And you were using your own model in it. Your app continued working.
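As a rough illustration of that kind of drop-in swap (the endpoint URL and model name below are hypothetical, not OpenPipe's actual API):

```python
from openai import OpenAI

# Before: calls go to OpenAI's hosted GPT-4.
client = OpenAI()  # defaults to https://api.openai.com/v1
resp = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Summarize this support ticket..."}],
)

# After distillation: point the same SDK at the fine-tuned model's endpoint.
# Only the base URL and model name change; the calling code stays identical.
client = OpenAI(
    base_url="https://example-inference-host.com/v1",  # hypothetical endpoint
    api_key="MY_KEY",
)
resp = client.chat.completions.create(
    model="my-distilled-model",  # hypothetical fine-tuned model ID
    messages=[{"role": "user", "content": "Summarize this support ticket..."}],
)
```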

Yeah, I think the market analysis here, because I was also exploring starting a business around that

at the time, and that's why I ended up not investing, was basically you get

squeezed between the GPU providers who also want to do fine tuning as a service,

because then that makes people more sticky, and the labs who keep putting out distilled

versions of something, whatever mini versions of their models. What was the analysis on the

NeoCloud side? Because you also want to host the inference. Yeah. Honestly,

we, like I said, felt very squeezed from the frontier labs that were putting out

just more capable models at lower cost. I did not see the competition ever really

materialize from the NeoClouds, from the GPU providers. Everybody had an offering in fine tuning.

When we talked to customers, nobody used them because they just were really hard to

use. So I do think that like, you know, call it a product thing, I

guess. Like it's not their focus. Yeah. Who cares? Yeah. Interesting. Developer experience matters. It

does. Yeah, it still does. Did. I don't know. Maybe it doesn't matter anymore. Now

we just have coding models do everything for us. No, it still does. When you

have thinking machines launching an API and people getting excited about the API, you're like,

yeah, okay, that's just pure developer experience there. That's fair. Yeah. Yeah. What's the, I'm

just going through the chronological list here. Was the Mistral 7B fine-tune

kind of like one of the big inflection points in the history of the

company? It's like, okay, this is like a good open model at the 7B

size, or is that just me restating the timeline? Yeah, Mistral and Mixtral. That was a

golden period of fine-tuning startups because Mistral was a credible open source model. Yeah,

they were really strong models, better than the Llama 2 that they were effectively replacing.

And they also had a super open license, which I think the licensing has become

maybe less of a concern over time at the margin because people are getting used

to maybe. But at the time, that was like a pretty big deal that they

had this fully open Apache 2 license. And, you know, yeah, maybe they have their

own... issues with how they train it. I don't know. I have no inside information

there. But at least the guarantee they're making to people using their model is you

can use this. Yeah, I call this Mistral-washing. As long as it, like, comes from

the sparkling region of France called Mistral, it's OK. Don't ask about what goes

into it. There's plausible deniability. Exactly. Arm's-length connection there, yeah. OK, there was this Mistral

period. Jan 2024, you talked about S-LoRA. And there was a period of time where

LoRAs became more important. I feel like they then became less important. And I don't

know what's like the rise and fall of LoRAs for you as a business. Yeah.

So LoRAs have really, really... So if you're predicated on the fact that you're doing

fine-tuning at all, LoRAs have very, very attractive properties relative to doing a full

fine-tune, right? If you're doing a LoRA, at training time, it helps

some. You're using less memory to train. But where it really helps you out is

at inference time. Because if you're doing LoRAs, then when you deploy it for inference,

you can multiplex basically an arbitrarily large number of LoRAs on the same GPU deployment.

That lets you do things like per-token pricing as opposed to GPU-hour

pricing. It just gives you much more flexibility at deployment time.
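As a concrete illustration of the multiplexing idea, here is a minimal sketch using vLLM-style multi-LoRA serving; the adapter names and paths are made up, and the exact arguments may differ across vLLM versions:

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# One base model is loaded once on the GPU; many lightweight adapters share it.
llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2", enable_lora=True)
params = SamplingParams(max_tokens=128)

# Each request can name a different customer's adapter (hypothetical paths).
out_a = llm.generate(
    ["Classify this support ticket..."],
    params,
    lora_request=LoRARequest("customer-a", 1, "/adapters/customer-a"),
)
out_b = llm.generate(
    ["Extract fields from this invoice..."],
    params,
    lora_request=LoRARequest("customer-b", 2, "/adapters/customer-b"),
)

# Because adapters are small, many can be hot-swapped per request,
# which is what makes per-token pricing (vs. renting a whole GPU) workable.
```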

I'm actually still a LoRA bull, for the record. You're talking about the rise and fall. I think LoRAs,

their future is still out there. I mean, they're cool again, because of Thinking Machines.

Yeah. I felt very vindicated by that blog post, for the record. Just, I guess,

for listeners, Thinking Machines put out a week or two ago a blog post doing

quite a lot of research on the trade-offs between LoRAs and full fine-tuning in various

different training regimes. I think the reason LoRAs were uncool for a while was mostly

just because fine-tuning was uncool. I think if you're doing fine-tuning anyway, LoRAs are still,

in many cases, the way you want to do it. But not that many people

are doing fine-tuning. As a marketing guy, LoRAs had bad marketing. They were just like,

oh, you can't afford full fine-tuning? Here's the Walmart store brand fine-tuning. No,

that's fair. There is some of that. I think we didn't have a huge issue.

We've had to do some user education, like, hey, just try it. I think for

the training runs that, like the types of training runs that we're interested in, where

it's like, hey, I'm doing a relatively lightweight customization of an existing model for a

specific task, there's really no downside to using a LoRA. And there's a lot of

upsides from, like, an infra-simplicity point of view. I agree that there's like a branding

issue around that. Hopefully the Thinking Machines blog post kind of, like, you know, like,

yeah, rank one. And like, you know, I think there are different hyperparameters for LoRAs that

you can use to, to make yourself happy. The fact that John Schulman was like,

no, like we're actually, banking the company on this, at least for now, is a

pretty big vote of confidence. I think it's surprising that no one's done the research

prior to them. And I was talking to someone at Thinking Machines prior to their

launch who had come from one of the big labs. And what that researcher

said was like, oh, no, everyone doing post-training research inside this big lab uses LoRAs. I

mean, not for the full run, but when they're doing their experiments, they'll just use

LoRAs on a base model to run the experiments. It works fine. For listeners of

the pod, that was leaked in one of the pods that we released, but it's

up to you to find it. Cool. And then, so then it was the first

World's Fair. You talked about you probably don't need fine tuning as a fine tuning

founder. Basically, I think your talks are really good. I would recommend people watch all

of them. What I pulled out was you had a piece of advice. So your

talk title was obviously... somewhat intentionally clickbaity. But your actual advice on when people should

fine tune is when it's cost, latency, or quality consistency that you really care about.

Yeah, I mostly stand by that. I don't think it's changed. And the biggest one

we see today, and this is true for classical SFT, it's also true for the

RL stuff we're doing today. Crossing my fingers is not always the thing. But the

main one I see that really drives fine tuning is if you have to move

to a smaller model, and it's typically for latency reasons, and this is usually like

real-time voice. So if you're sort of forced into a smaller model anyway, then there's

a very high chance that doing some tuning on that model is going to get

you, like it will be necessary basically to have a successful deployment. So we see

that a lot coming from customers that again have those latency requirements. There's other reasons

as well. Sometimes for whatever reason, you really have to deploy on a single GPU,

you have to deploy within your own cloud. And you want a, you know, you

basically have to use a smaller model to do that. So basically in the case

where you're forced to a smaller model anyway, then fine tuning it is often necessary.

I would say for 90% of use cases where you aren't forced to a smaller

model, then it's still not a good ROI. And you probably shouldn't invest in it

today. How do you quantify these things? So costs, right? Could always be lower. So

is there kind of like a threshold of like, cost to ROI, like, because it's

also hard to figure out how much it's gonna cost you to fine tune because

you need to get the data and all of that. Like, do you have a

mental model of that? This is sort of like a function of the total amount

of overhead required. I'd say there's two parts on the cost side and then, you

know, there's multiple parts on the benefit side. On the cost side, the main things

you have to think about are the upfront effort required to get an actual like

training system set up for your task. And that can be quite variable, but I

would say at a minimum, you're going to have to dedicate a couple of weeks

of a fairly competent engineer's time. And if you have a very complex system and

you're doing RL and you need to set up a whole environment, it could be

a lot longer. It could be a couple of months of time. So that's just

a fixed cost you have to pay. There's also an ongoing carrying cost where once

you've committed to doing fine-tuning, it does make other parts of your stack less

flexible, less nimble, because whenever you're updating your prompt or like you're adding new context

or whatever, like now you have to like, you know, spend a few hours training

a model and that's just going to like slow down your iterations, like which is

a real cost. And in many cases, that's the larger cost. So you only want

to do that if like the benefits are large enough. The dollar cost, I would

say, is basically never a factor. It's just so much less than the time, the

amount you're spending this engineer to do the work that it's not. I mean, it's,

you know, each of these runs is between five and a couple hundred dollars. And

it's just, you don't have to do that many of them. Yeah, because most of

the data is like first party. Yeah. Right. Okay. When was the switch to RL?

Was it when o1-preview came out? You were maybe like, okay, it's time to

move on from SFT or? Yeah. So that was a big moment for us with,

you know, there's all the leaks before that about strawberry and all this. And like,

you know, a lot of people talking about, okay, how are they doing it? We

realized through that that like, okay, Someone's figured out how to make RL actually work

with LLMs, which was not a thing. I mean, it was a thing that some

people had played around with before that, but it wasn't like I think many people

were thinking about. And so our bet at that point was, yes, let's figure out

whether this works for task-specific things. And the space we just, I think it's important

to kind of tease out different parts of the market. I think with the release

of o1, and this has been proved out many times with releases since then, I

think there's now a very strong consensus that, okay, on the frontier model, like

general purpose model side, investments in RL are paying off. I think I don't think

most people would argue with that. You're, especially as you're getting into these agentic tasks

and training them to do that, like it seems very clear. Well, obviously the big

labs are paying like ridiculous amounts of money for these environments and everything, but also

like they're actually getting really good results. The model's coming out, you know, we're seeing

it, especially on the coding model side, but like in other contexts as well, we're

seeing the sort of, especially agentic uses working way better because of this. So I

think like even, late 2024, it was pretty clear that like RL was going to

work in that context. And then the question in our mind was like, can we

apply this in a different segment of the business, which is kind of like task

specific customization. And so the question is like, does that work well? How much effort

does that take? Is it going to be something that ends up being unnecessary because,

Oh, the, the big labs can just like train on every single task and the

base models are going to be just good at everything. And so there's, you know,

no benefit to it. So those were kind of the open questions in our mind,

but it seemed like there was like at least a good enough bet that, you

know, We wanted to try it out. Yeah. And you had this agent reinforcement training

framework. And you did the email agent. That's kind of like the first proof of

concept. Was that obvious to do email? Was it obvious to call it that way?

What was the behind the scene? How should we package this? So what I told

our team, and this was we decided to go all in on RL. in January

of 2025. And we've been doing some experiments before that. We released before that kind

of like an RL model that had, you know, would generate like Hacker News titles

from articles, which is a fun project. So we'd done a little bit before that,

but that was kind of like, hey, we're going to bet the company on, not

in the literal sense, like we could have done something else later, but like, this

is like the thing that we're going to spend all of our time working on

for at least a few months. And like what I told our team at that

time in January 25 was like, there's probably like a 25% chance

that this is the right direction in the sense that like a year or two

years from now, all the companies, you know, everyone doing inference should be doing RL

and task specific training so that like their models are just way, way better at

their task is a relatively low chance. But it was sort of like one of

those big, if true things, like if that is true, if it turns out that

like just doing RL on your task is just like something everyone should be doing

and it's and it's just, you know, teaching these agents continually, teaching them through experience

is just going to be a huge benefit than like being the first people working

on that. would be a really, really like awesome position to be in. So that's

how we thought about it is like, you know, less than 50% chance, but really

big outcome. If not, if so, I think since that time, and I've been very

transparent with this, like with our team and like when I'm talking to other people,

like, I don't think the chance that that is the right approach is a hundred

percent yet. I think that we're still in the process, even after going through this

and, and, you know, doing that of like figuring out, but the probabilities in my

mind are going in the right direction. Like now I think they're actually like, Today,

I was actually just thinking about this with another conversation. I think that the chances

that everyone who's deploying an agent at scale should be doing RL

with it, either as part of a pre-deployment or even continuously as it's deployed, that

that's the pattern that that's going to get to. I'd say there's a 55%, 60%

chance that that's just the better thing to do. And that's informed by our experiments

working with customers. So anyway, not 100%, but like, Going all the way back to

your question, like, no, it was not obvious. It was an informed bet. It's still

a bet, but one that I'm feeling pretty good about right now. One thing I

think that is tricky about just onboarding onto this space is all the math.

I remember reading the DPO paper. I think they were at NeurIPS for 2023, and

people were very excited about it. Some of it's just being pretentious for a paper,

but some of it's actually real complexity. You don't have a PhD or a prior ML

background. How do you come to grips with it? What were the best ways to

get around it for you? I would probably push back on that a little bit.

I don't think the math is actually that complicated. I think that when you see

the PPO equation or something with all the symbols, if that's your first intro to

it, then it feels very complicated. But I think if you were to show that

exact same equation just as code, maybe not PyTorch code, because then you also have to understand PyTorch.

But if you just did the naive implementation in Python and showed someone, hey, this

is how we're computing the loss here, who was a strong engineer, I think it's

actually quite grokkable. So yeah, I mean, I don't think the barrier to entry is

that high. I think you just have to believe you can do it and then

spend some time staring at it. That would be what I would recommend, is You

know, you can read the papers and look at the equation. I think actually this

is one area where LLMs have been super helpful. If I'm reading a new paper

and I look at one of those equations and I'm like, I don't understand how

this new term they introduced like corresponds to like these other terms, then I can

like dump like all the context around it into, you know, GPT-5 and say like,

hey, can you like write this out in Python for me and show me what

they're doing differently? And that's super helpful for kind of like my background, I guess.

Yeah. The way I put it is I wish that all these papers would just

publish with pseudocode or just straight up Python instead of math. Because you actually just

need to look at the implementation. Yeah, totally. I know Jeremy Howard's been beating this

drum for years. I honestly agree with him. I mean, there's a little website called

Papers with Code. And people just keep not following it. I remember interviewing the DPO

guys when they were at NeurIPS. And it was just like they were just very

obsessed with proving in principle equivalence to PPO.

And it was very hard to follow. I'll definitely say that. And I think now,

obviously, at some point, GRPO kind of took over the general consensus. It was very

strange because I think when DeepSeek first started talking about it, it was viewed

as an optimization. They tend to just generally couch everything as an optimization. But I

think the later insight, which I think you touched on in one of your blog

posts, was that, no, it actually makes comparisons relative rather than global.

And that's actually what unlocks some amount of self-supervised

RL. Yeah. I mean, it's interesting. There's real pros and cons.

If you're moving from PPO or something similar to it to GRPO, there are some

big pros. I mean, one pro is just sort of like operational simplicity. There's a

whole extra model you need, the value model, for PPO that you

can throw away with GRPO. And that just makes your life easier. You don't have

to train that model, but also there's no hyperparameters around that model that you have

to configure. So that's nice. Another thing is the benefit that you're talking about, which

we've observed. So the way GRPO works is you have to do a set of

different trajectories or set of different rollouts all in parallel with the exact same environment,

the exact same conditions. And then you score each of them. And GRPO uses

the differences in those scores to promote the trajectories that did better and sort of

like decrease the probability of the ones that did worse because they do it in

sort of a group relative way. The only. it lets you be a little bit

looser with how you score them potentially. Like you don't have to necessarily have a

globally aware scoring function. You just need some scoring function that is able to distinguish

between this small set of things you have in front of you. And then that's

easier. That's easier for a human. You know, if you, if you tell a human,

to choose which of these is better, it's easier for them to

do than to say, is this one good or bad in absolute terms? So that's nice.
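A rough sketch of what group-relative scoring means in practice (hand-rolled illustration, not the actual OpenPipe or DeepSeek implementation):

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: score each rollout against its own group,
    so the reward only has to rank siblings, not be globally calibrated."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four parallel rollouts from the same prompt/environment, scored by any judge.
rewards = [0.2, 0.9, 0.4, 0.7]
advantages = group_relative_advantages(rewards)
# Positive advantage -> push the policy toward that trajectory's tokens,
# negative -> push away; the absolute scale of the rewards never matters.
print(advantages)
```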

The big downside, the huge downside of GRPO, and I think actually the reason why

GRPO is likely to be a dead end and we probably will not

continue using it indefinitely, is the fact that you need to have these parallel rollouts

in order to train on it. That makes the data generation much

more complicated because you need a fully reproducible environment to be able to do these

sort of parallel rollouts. And it turns out in practice, getting that set

up is the hardest challenge today with getting RL working: actually designing this

robust, reusable environment that you can run all of this training in. For most companies,

and that's not universally true, like sometimes that's easy to do, like there's certain situations where

you can do that, but for the work we do at least, where we're training

agents on real code bases to like operate like, you know, real applications, it turns

out it's like really, really hard to sandbox those things in a way that's like

totally reproducible. And PPO, now in practice, a lot of times when you're training with

PPO, you also will use an environment like that because it lets you do a

bunch of runs and be more data efficient. But at least in principle, you have

the option with PPO. You can actually purely train on, say, real production traces of

real people interacting with your app. And so you don't have to have a simulated

environment at all, which makes the deployment much easier. Can you double click on

why it's hard to do the sandboxing? Because in principle, we just capture all the

inputs? Yeah. Well, you don't just need to capture all the inputs. You need a

system that reacts the same way your production system does, and in many different

ways. And so let's say you're Airbnb, right? And I'm bringing this up because

this is like an example of one that, like, you know, companies have gone out

and built sandboxes. Like if you're Airbnb and you want to train

an agent to, like, maybe you're not Airbnb, fine, you're a company like us that's

trying to train an agent to like do really well at operating Airbnb and booking

on your behalf. Right. Like. You have to build a copy of the Airbnb website

that reacts to you as the user the exact same way that the real one

does with the same failure modes. Because if you don't include the same failure modes

and bugs they have, then when one of those bugs comes up in production, your

agent's going to have no idea what to do with it. It's just going to

fall over. You also need to simulate, if this is a sort of cooperative agent,

where it's getting human input as well and working with the human to get something

done, which in practice is the way a lot of these are deployed, you also

need to simulate the user. And I mean, you can do the naive thing and

just say, oh, we're going to have a separate LLM with a system prompt that

is like the user simulator. And we do that. But it's like, OK, but the

breadth of ways a user might respond, there's a lot more diversity in that than

the actual diversity you'll get in practice when you have this simulated user. And so

then it's like, OK, well, is this environment close enough to how a real user

would interact that if a user says something different, that it's going to know what

to do? And the answer in many cases is no. If you're just purely training

on an LLM user simulator, it's going to have its own idea of what

the correct way to answer is. And the breadth of a way a human might

respond in this situation is wider, and your agent just may not be able to

deal with that.
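The naive user simulator being described is essentially just another LLM call behind a persona prompt; a minimal sketch, with a made-up model name and prompt:

```python
from openai import OpenAI

client = OpenAI()

SIMULATED_USER_PROMPT = (
    "You are simulating a customer talking to a travel-booking assistant. "
    "You want a 2-bedroom place in Lisbon under $150/night. "
    "Answer the assistant's questions tersely, and occasionally be vague "
    "or change your mind, the way a real user would."
)

def simulated_user_reply(conversation):
    """Given the conversation so far, produce the next simulated 'user' turn.
    (Role flipping between the agent and the simulated user is glossed over here.)"""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # any capable chat model works here
        messages=[{"role": "system", "content": SIMULATED_USER_PROMPT}, *conversation],
    )
    return resp.choices[0].message.content

# The catch raised above: this simulated user only has the diversity its prompt
# gives it, so an agent trained purely against it can be brittle with real humans.
```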

Do you feel like it's hard to build the simulations as a company that needs to build the product that lets everybody do it? Or do you

feel like even for the individual companies that own the code base that are domain

experts in their own product, it's still just like a very hard infrastructure problem? I

think it's still very hard. You know, like ideally all companies should have this anyway,

because they're getting, you know, if you're doing end-to-end testing, like theoretically, if you're following

best practices, you would have one of those set up. When we talk to enterprises

almost universally, that's like not something that really exists. So there are some startups, like

there's some companies we've talked to that do have it and we can just like

use that, but it's a very, very small number that actually have an environment like

that. And I think it's hard to do. And like, there's lots of like weird

bugs that, don't show up in an environment like that. And even if they do

have a testing environment, they don't have it populated with full realistic data, which is

also important so that it understands how to interact. So I think in practice, it's

hard in both cases. Maybe it's easier for the company, but at the same time,

depending on the quality of the company's engineers, it might not be easy for them

either. Yeah. How do you classify the types of environments? So you have formal environments

like a compiler, you know, you can put in there. It's like, you don't need

to do any work. They just work. Then you have this kind of like RL

environment startups in a way that are building a bank environment. They're building these things

that are not digital twins or whatever term of like the actual environments, but they're

like close to it. And then on top of it, you have helping people trying

to build the exact replica of their thing. There's obviously value in like the formally

verified ones. We verified that. Do you think there's value in this like RL environment

startups that are building like somewhat generic but task-specific environments? And then if

none of those work, then what do we do instead of GRPO? I guess the

question. Yeah, I suspect there is value in that. You know, I think the, you

know, the folks buying those environments and training on them in the big labs would

have the best knowledge on how well they work. I think they probably work okay.

I think they, probably also are like, you know, and we'll see maybe with the

next generation of models released, like how well they transfer. I would say so far

it seems like they don't transfer well enough. Like if you, if you use, you

know, OpenAI's agent interface, it's like, okay. Or if you use the computer use

products that everybody's putting out, they're like, okay, but like not reliable enough to like

actually, like, let it go do something interesting unsupervised in the world. And I think if

the environments they were training in were high enough fidelity, then they would be good

enough in the same way that coding agents can go much further. Because I think

that in that case, we do have environments that are much higher fidelity because it's

a much simpler environment in a lot of ways. It's a code base. It's like

maybe running a web browser. It's much easier to capture the full realistic environment in

that context. For those who are interested, when you make a reference to

RL environment startups selling to the big labs, they're selling it for a lot of

money. Yeah. Like at least seven figures. Right. I don't know. That's my understanding. I

don't know. I'm not a buyer. Please, please like drop data points because like people

who are not in Silicon Valley don't know this. And like, it's like probably the

current thing in VC, which is RL environment startups. Um, anyway, I, I, a

lot of them, there's like 20 of them apparently. Yeah. But it's like a small

number. I know that, yeah, all the labs are buying. Ad hoc. But in a

way it's almost like they don't even care. It's not a product. It's like, they're

basically like paying the company to build an environment ad hoc for them. It's a

very services business at the moment. Services business. But I mean, if you're spending like

a billion dollars on a training run. Yeah, but like you can specialize in like,

we are the one that does e-commerce. Like we are the e-commerce experts. So come

to us for e-commerce. Go to the other guys for like social media. Go to

the other guys for like, I don't know. But I'm curious, your take is like,

how do you need to get the data out to make it fit in your

training run. Especially when you get to like these larger labs, I think they're like

very sophisticated post-training pipelines. And I don't know if there's like a way to just

build a company where it's like, you just send them a CSV of like data.

It needs to be very integrated in it. But I'm curious what you've seen working

with customers too. So for RL, like, the whole way this works is, you

know, it has to sort of be getting feedback from the real environment. So I

don't, I don't see a world where it's as simple as, like, hey, you can,

you know, there's like a CSV-type approach. I guess you could encode anything as

a CSV if you try hard enough. Um, for RL to work, you have

to be looking at real runs, ideally of your actual agent in its current state,

within an environment as real as possible. So you have to look at actually.

And the data format's actually super simple. It's just basically a list of chat completion

messages. It's effectively whatever. Tool calls. Yeah, exactly. Yeah, it's whatever your agent will be

seeing and doing when it's running. So getting the data is not hard.
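Concretely, a captured trajectory is roughly just an OpenAI-style message list, tool calls included; the field values below are invented for illustration:

```python
# One captured trajectory, in ordinary chat-completions format.
trajectory = [
    {"role": "system", "content": "You are a booking assistant..."},
    {"role": "user", "content": "Find me a 2-bedroom in Lisbon under $150/night."},
    {
        "role": "assistant",
        "content": None,
        "tool_calls": [{
            "id": "call_1",
            "type": "function",
            "function": {
                "name": "search_listings",  # hypothetical tool
                "arguments": '{"city": "Lisbon", "bedrooms": 2, "max_price": 150}',
            },
        }],
    },
    {"role": "tool", "tool_call_id": "call_1", "content": '[{"id": 42, "price": 139}]'},
    {"role": "assistant", "content": "I found a place at $139/night. Want me to book it?"},
]
# The hard part isn't this format; it's making the tool responses above come
# from something that behaves like production when you re-run the agent.
```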

But what's hard is like, when you're doing one of these runs and your agent makes a

tool call, OK, now that tool call has to connect. Somehow it's got to get

data back from something, and that data has to look like it will look in

real usage. So setting up that whole part of the system is the challenge. And

then just for reference for more people, WebArena is my first instance

of this kind of thing where you literally have a Docker container that has like

a clone of Reddit, a clone of Wikipedia, a clone of GitLab, a clone of

CMS, and a clone of an e-commerce place. And I think since then there's like

Mind2Web maybe. I don't know if there's other large, well-known academic

environments where people are basically using these as benchmarks. Probably also it's pretty useful for

training. So if you want to check out those things, you can definitely check there.

I think the question for you is as someone who bet on SFT, then you

bet on RLFT and then now you see these guys making a lot of money.

Why didn't you go there? It seems to me like that definitely is a services

heavy business at the moment as it's presently constituted. I'm sure that these companies are

all developing different kinds of secret sauce on like how to do this like more

quickly. So that's part of it. I don't particularly enjoy services businesses. But... You know,

I also kind of feel like we will move towards a world where either the

big labs, like it's one of those businesses where like the only customers right now

are like whatever, four big, maybe, maybe, maybe six big labs that like, you know,

are training these models on environments. And I don't know, I'm a little, what's the

TAM? Yeah. Um, but you know, like, look, you can say the same about Scale

AI and all of their competitors that are like, you know, many billion dollar companies

that have basically the exact same customer set. So, so yeah. It may work out.

Yeah. And Alessio, I don't know if you want to do a small shameless plug

for Varys. Oh, yeah. I mean, so Varys, one of our portfolio companies, they work

with the people building the agents, not with the model, on their internal tool call

loop. So they can observe all the internal traces and build the data to then

have OpenPipe do the RFT on the thing. I think in the enterprise,

we've seen a lot of that, especially for chatbots. It's the less sexy use case,

but they work with a lot of financial services companies where their customers go in

there and say, what's my balance? Like, when did I do this transaction? And those

are all tool calls, you know, and they need a way to test and improve

that behavior. And the models haven't gotten that much better because these tools are like

badly documented. They're like badly named. I think that's kind of like the problem with

a lot of the agent builders that are not AI native companies. It's like, they

just put this like very generic tools in the thing and then They expect it

to work like magic and the simulations kind of help them also have the usual

compliance things. It's like before shipping this, we tested that it doesn't give financial advice.

We tested that, you know, there's all these different things. So I'm curious to see

how much the companies generalize, you know, I think like there's a lot of success

in like highly regulated environments because of different. But I'm curious if you have a

different way to segment the market of like when you think about RL, there's like

environments that are like low stakes. There's like environment that are like high stakes. There's

environment that have implicit rules that are made by the SEC or other government agencies.

How do you think about it? Yeah, I don't know that that segmentation is

necessarily the most relevant. I'd have to think more about that segmentation, whether it's, you

know, there's like a strong difference in how useful RL is across those sectors. Where

I see the segmentation is something basically just like capabilities based, where it's like, hey,

if I'm trying to do something that's like much more advanced and, you know, maybe

like long horizon, then RL can probably give me a much better behavior. And I

might almost think that like, yeah, those sort of like more compliance, like I feel

like in those kind of environments, you probably don't want your agent doing very much

because then it's like you can't make any guarantees about what it might do. And

so you're probably not doing these long horizon things and maybe RL is not gonna

get you what you want, but I don't know. Yeah, I haven't thought about it

too much. Yeah, I think like a lot of the customers don't necessarily end up

doing RL anyway. It's almost like the simulation and the environment. It's like a way

for them to understand the paths that the agent can take and less about we

need to then use that data to do fine tuning. But I think it's like

a, it's gonna be a spectrum. Yeah. What replaces GRPO? Yeah, it's a good

question. We need the alpha. Yeah, I mean, I don't know is the short answer.

I do think this is like a fairly high salience question in the research community.

I think there's a lot of folks like trying to figure that out. Every paper

has a variant. Yeah, but a lot of, but I think the big question is,

are we doing normalization based on grouping or in some other way?

I would claim we're just going to keep calling it GRPO as long as the

normalization is done within a group, even though there's a lot of things that probably

should get their own names. A lot of things that have tried to get their

own names and have failed on the marketing side. I think something that doesn't require

group-level normalization, which a lot of older things didn't, probably works, but I think that

the older things also are really finicky. So there may be other kinds of simplification.

And I don't know exactly what those will be. Where do you put the prompt

optimization thing? We did a DevDay episode, and we mentioned GEPA. And then everybody came

out of the woodwork on Twitter. DSPy bros. Yeah, exactly. OK, tell me, have you

or people you've talked to tried GEPA? I want to know what. I read the

paper. I'm just like, look, the prompt-layer updates are not the same as weight

updates, which means they're just comparing apples and oranges. And I talked with a few

people I respect on this, on the RL side, and they kind of

validated, like, the way that these grad students market their papers is their thing beats

the current hot thing, and the current hot thing is GRPO. But like, they're

just not that comparable. I disagree with that. Like, I actually think they are comparable

in the sense that, like, it depends on for what purpose, right? But like

If I'm a company and trying to like get the best performance out of my

agent, like I don't care if you're changing my prompt or if you're changing my

way. So if you get better performance on my agent, you know, I'm, I'm, I'm

happy on that front. I do think they're comparable and we've evaluated, I mean, we

evaluated like, like the, so their answer was you are going to do both. If

you really want max performance, you're going to do both. Yeah. We've evaluated everything from

DSPy and we, we evaluated GEPA as well. And it's like, it just doesn't work.

Okay. Like, okay. That's going to be the pull quote. Fighting words. GEPA doesn't work.

It didn't work on the problems we tried it on. It just didn't. It got

like a minor boost over the sort of like more naive prompt we had and

was just like, it was like, okay, just kind of like our naive prompt with

our model gets maybe like 50% on this benchmark and like GEPA got to 56

and we do our own, we get to like 96. I mean, it was just

like not even. Yeah. comparable. And so maybe we were holding it wrong. You see,

both sides are claiming skill issue, right? So what they would say is you probably

used it wrong. And then people are saying that probably the GEPA guys, when they set

up the GRPO benchmark, it wasn't a very fair comparison, which is exactly what my

source said. It's hard to tell. Everyone is trying to get to some version of

the truth. Yeah. But what I will say is we want it. I mean, I

don't know if I would say it goes so far as to say we want

it to work, but we certainly want to know if it works. Like that's like

actually very relevant to like the power. Yeah. And if it's more efficient to get

there, uh, then you should be doing that. That's, yeah. It's actually kind of more credible

now that you're, like, you know, you're part of a larger CoreWeave, that

you're not, obviously, cause I think GEPA maybe, uh, makes OpenPipe like

less relevant. I totally would disagree with that because, like, the level we see

ourselves operating at is actually, we're not like, RL bros trying to figure out the

use case for RL. We're like, hey, we're working with all these enterprises, all these

big companies we're talking to, and we're trying to figure out how we make their

stuff work better. And so I personally am very motivated. If something like GEPA works,

OK, let's build a product around that. That's how I think about OpenPipe, at least.

No, I mean, that's a good clarification to make. Even more so, you actually took

a sincere look at it, and you concluded that there was nothing to do, nothing

to build. Well, maybe we were holding it wrong. So we had Shen Yu on

the podcast a while ago and like, I think he's been a proponent of automatic

prompt optimization and this idea that like you can do a lot more in the

prompts than you can do in the weights. And in principle, I'm inclined to

believe that something like a DSPy, something like a GEPA, works. So I'm very surprised

to hear this. Yeah, like we keep trying it, you know? We tried the Mipro

V2 stuff that was hyped before that. Also, okay, I should not bury the lead

on the best argument for this, which is that GEPA basically models how the big

labs do their system prompts. It's genetic evolution, you know, and they sort of

incrementally evolve based on like the overall evals that they have. It's slow because it's

done by humans, but GEPA theoretically improves it. It automates this. Okay, hold on.

Is the claim that the big labs have something like this? This is news to me. No, no,

no, no, no. This is philosophically the same. I'm not saying like... Oh, sure. But

like you're injecting a whole lot of human intuition and kind of like potentially out-of-band

information. We have the best model in the world, which is humanity or like smart

humans. And now we're doing GEPA using dumb... LLMs. Right. But they're also like the

humans can bring in out-of-band information that like maybe is not captured in

the actual like, you know, the eval. Like they can be like, oh, yes, technically

this did well on the eval, but it's like not really, you know, like I

would suspect that a lot of that ends up getting injected through that human being

in the loop. Yeah. I've always been very surprised at how these guys work on

their system prompts, which are tens of thousands of words long. And there's no ablations.

They just kind of pick what seems to work and then chuck it in there.

And that is the Claude system prompt. Can't argue with success. Is GPT-5

the first model that had a prompt optimizer by one of the large labs? I

believe so, but I don't remember. Claude Workbench had this like a year and a

half ago, if you see it that way. It just wasn't like fully automated, but

it was extremely good for its time. I kept telling people about it and nobody

believed me. Do we know if they used it internally? Claude Workbench? Yeah. Okay. Why

not? Oh, I don't know. Like I, I just, my experience, you know, knowing a

lot of people at these labs is like they launch a lot of products because

like some team is super excited about this product, but that, I wouldn't put that

much weight on it just because they launched it. For some measure of use internally,

I, I, I'm sure. I mean, the people I talk to are

biased. I don't know if you fully explored that thread. Yeah, no, I think that's

a, it's just interesting that now it's acknowledged that, like, the LLM can

improve your prompt. And so I think, like, GEPA is also arriving this way of,

like, okay, maybe we can do this programmatically. But I also think the long tail

of people just prompt really badly. And so I think there's some value there. Versus

once you go into RL, you already have a more sophisticated audience. Like who gets

to do GRPO? People that are really smart. Who gets to do prompt optimization? Everybody's

trying to do it. So yeah, that's fair. Maybe our baseline was too high. I

know. Your naive prompt is probably like top 10 percentile of prompts that people put

in these LLMs. I'll take it. Yeah. And then the other thing that comes to

mind as you were talking about injecting things out of band and all that, I

think there's a broader trend that I'm tracking for the World's Fair '26, which is the move

to online evals. The way that we do evals today is probably too locked down.

You're kind of fighting the war that you already know should be fought, and you're

not fighting the wars that you don't know about because you didn't plan for them, whatever.

How can we move more online evals into our GEPA process? Maybe that's what it

is. That part I'm much more bullish on. And we can make the analogy. We

can pull in RL intuition here, which is if you're doing GEPA on a static

data set of like, oh, this is the input. This is what makes a good

or bad output. Then as you're updating your prompt, your information, the data you're training

on, becomes less useful, right? Because it's generated by, you know, because it's based on

kind of like the problems you were running into before. And that's the same problem

you have with RL, where you have this concept of being off policy, where it's

like, as you're doing training, you really want to be training on rollouts that came

from the latest version of your model. Because if you train on some that came

from further back, then it's like, it's sort of stale data. And it's like not,

it's no longer representing the current issues with your model. And so if you try

and correct for the issues that existed back then, it may not actually be helping

you that much.
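In loop form, the on-policy point is simply that each update should use rollouts freshly generated by the current model rather than older ones. Here is a toy, self-contained sketch (the policy is just a number and the rollouts are random samples, purely to show the loop shape):

```python
import random

def collect_rollouts(policy_quality, n=8):
    """Stub rollout generator: rewards are noisy around the current policy quality."""
    return [policy_quality + random.gauss(0, 0.1) for _ in range(n)]

def update_policy(policy_quality, rewards):
    """Stub update: nudge the policy toward the best rollout it just produced."""
    return 0.9 * policy_quality + 0.1 * max(rewards)

policy_quality = 0.2
for step in range(50):
    # On-policy: rewards are always measured on the *current* policy's rollouts,
    # so each update targets the model's current weaknesses, not stale ones.
    rewards = collect_rollouts(policy_quality)
    policy_quality = update_policy(policy_quality, rewards)

print(round(policy_quality, 3))  # drifts upward as the loop self-improves

# Reusing rollouts gathered many updates ago ("off-policy" data) would mean
# optimizing against problems the model may no longer have.
```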

And I think, you know, for either RL or prompt optimization, that's definitely true. I think that like one way to apply that in practice is exactly

what you're saying, where you're using the actual data from your, your real evals. You

have some way of saying like, Hey, either people flagging these or no, I'm flagging

these or some way of saying like, this was a good or bad output. I

totally agree with you that like, if you're bringing that into your process, I'm like

much more optimistic that you're going to get good results. Yeah. And the pipelines are

Yeah. And the pipelines are not set up. This is like analytics and UX people being drawn into the ML

process, which they've never done before. If I had to make a bet as

a big theme for next year, this is going to be it. No, I agree.

And I mean, I think that all of the sort of observability people, like platforms,

see that and are trying to figure out what the right shape is. I haven't

seen the right shape yet, but yes, it seems like a theme for next year.

Statsig. Maybe. Yeah. I haven't, I haven't used them, but OpenAI seems to like

them. Yeah. I mean, like, uh, I do think like buying, you know, an

experimentation platform makes sense. And like, you know, I think it's sort of like I've

said before on the podcast, I think that I'm very bullish on model routing as

a feature, but less bullish on model routing companies because of exactly stuff like this,

where like, it is just going to get, get absorbed into the model. It's, it's

a very big part of building the process. You probably don't want to... And it's

not that hard. Like it's not rocket science. It's you're just like connecting pipes and

making sure things are set up so that it's easy to use that data. I

have a question for you, a general question. So what fraction of tokens generated by,

say, like the end of 2026, do you think are going to come from open

source models versus proprietary models? Oh, that's a fun question. So we have an

answer from Ankur, from Free Interest, where he was like, it's 5% and going down.

I think it's going to go up because of the amount of enterprise adoption

of open models that I'm seeing. And also... Because there's a lot of demand. The

enterprises would much rather be on open models if they actually could get the performance

they're looking for. Yeah. For cost, for privacy, all that stuff. And I think basically,

honestly, it's just literally... We may have hit... quote unquote, AGI in a sense of

like, the average LLM is capable of the work of the average human,

not the best human, but the average human, sure. Like it's actually pretty decent at

customer service. And it's actually pretty decent in like, I don't know, transcribing things from

PDFs, whatever. So like, yeah, I mean, totally, I think that should rise, but people

who believe that it should rise to like 50% are out of their minds. And

I think it's a true question. We should take coding out. I think once you

take coding out, I think, yeah, it can be like 15%, 20%. But I think

with coding, it's still going to be very low. Because these max plans are so

subsidized and so many tokens are being generated. Like Anthropic is like 50% of the

revenue is like- Is your claim that coding will mostly be closed models because

the tokens are subsidized or because the models are just so much better than people

are using anyway? I think as long as- I mean, I'm paying 200 bucks a

month and it's like I'm spending- thousands of dollars. Like by accident, by accident, I

pay with like my credit card and I spend like a hundred bucks in like

an hour. And it's like, this is like the thing that no one wants to

talk about for Anthropic. Like Anthropic went from like 1 billion in revenue

to 5 billion. And it was like, Ooh, yay. And then like, what's the margins?

You have this like goose meme going like, what's the margins? Um, they say

it's like 6%. You are part of the 6% that is abusing everything. So everyone

else. I'm not abusing. You're the loss leader. It's not like I'm rotating accounts. I'm

just using the one that I paid for. You know, it's like, yeah. Yeah. But

like through you, people like hear about Claude Code, they pay the $200 a month

and then they don't use it and they pay for your inputs. Yeah. Thank you.

Thank you everyone. Keep doing it. Right. So I don't want to have to go

away. But I think like, I don't really see, it's hard to see a world

in which Qwen Coder or whatever model replaces that, between quality

and cost. It's like to generate this amount of tokens for 200 bucks a month,

I don't know how anybody can offer it. Together, Fireworks. They cannot really offer it at

that price. And the quality is not as good. But the reason they can't offer

it at that price is because of the subsidies, right? Which is not like the

long-term sustainable dynamic. I mean, it's interesting because both

Anthropic and OpenAI are building their own infra, right? And they're going to get to

a place where they're going to have idle GPUs that they own. And so they

will also be incentivized to have 100% utilization. And so they will subsidize some of

it. Just the same way if you go on SF Compute, you pay $1.40 for

an H100 instead of the $2.20 listed price on AWS. So I think it

will continue. But again, it depends on whether or not they actually have the 500

billion, like they were saying, which I think they do. You know, just to be

clear, I think Stargate will go online. But once it goes online, then it's like,

well... If they figure out how to pay for $500 billion worth of compute, then

they probably can subsidize for a while. I think they have the 500B, they're going

bigger. Isn't it obvious? What do we mean by have? At the start of this

year, when they announced Stargate, people were like, oh, you don't even have 10. Elon

was like, you don't even have 10, whatever. And then Satya's like, I'm good for

my 80. But like now, now we're seeing all the money start coming in and

like, Probably it's in the order of like 200, 300 billion, like that you could

probably get raised and committed and they're going to get the rest. Like it's, it's

fine. Like I think that the plan is actually a lot bigger. Can I just

say, I love this industry. It's like, yeah, they've got like two or 300 billion

and like what's another couple hundred billion? There's no other industry in the history of

the world where you can say that. Yeah. Yeah. It is stupid, but like also, like,

do you doubt it? Like I don't, I like. Yeah. That's fair. No, like I

literally like after last week, I think maybe maybe two weeks ago with the whole

Oracle, NVIDIA, and then even AMD deal. I'm like, oh, these guys have not only

locked down Stargate one, they're working on Stargate two, whatever that is. And

the sheer ambition is like freaking crazy. There is still one more shoe

to drop, which is the non-sovereign wealth funding that OpenAI needs to get, which

they've promised to drop by the end of this year. And my money is on,

they have to do a coin. Like, I'm not a crypto guy at all. But

like, you know, this is going to be like an OpenAI coin. This is

the one AI founder that has his own coin already. Yeah. And like, he needs

more money. And he said that they will come up with new innovative financing methods.

What else is there? Yeah, I mean. They're already in a token selling business. Like,

but you got to. That's a great line. Like, buy an OpenAI token. It

translates to a GPT-5 token. Like, are you sure? It's a stablecoin.

You'd have to get a lot of political buy-in, I think, to take that level

of risk. What, the White House that is most crypto-friendly since the dawn of time?

Well, I guess Elon's out of there now, so maybe they can make friends.

I think it's doable. We'll see. Who knows? For what it's worth,

this is a me theory. I don't have any insider information. Should we go back

to Ruler? Yeah, sorry. Right. OpenPipe. Anyways, we were saying. I think this story

takes us to July '25, when you released Ruler, which you call Easy Mode for

RL Rewards. And then, I mean, shortly after, you get acquired in September. So maybe

you just want to talk through the summer. What was the vision? Then maybe how

the acquisition came together. Yeah, absolutely. So I mentioned my initial

opinion of how likely this direction was to work was maybe 25%. We're up to

55% or so. And Ruler is actually a big update that got me from the

25 to the 50. So let me, I guess, just for context there. So basically,

there are several problems you have to solve if you want to use RL successfully.

The problems you have to solve, I mean, some of them are just really dumb,

basic, like, hey, you got to get the infra and the libraries, which have all really

sucked and been built by PhD students who don't know how to build reliable software.

So there's all these practical issues that we're working through. That's one thing. And that's

kind of what we're trying to solve with ART. But even after you've got that

solved, you've got major issues, which is you've got to know if your agent is

actually, or whatever system you're using RL on, is doing a good job. That's fundamental.

You have to have a reward. You have to know it's doing well or poorly.

Sometimes that's easy to do. If you're solving a math problem or something, you can

come up with a data set of math problems and the known solution and check

if it's the same. On the coding side, there's been a lot of innovative work

around, I mean, there's, first of all, a lot of open data and a lot

of, I think the approach a lot of companies take is you find existing test

cases and then you break them, but there's sort of a way to figure out

if, you know, you can run the test case, right, and see if your code

fixes it or not. In a lot of other domains, it's much more murky. It's

like, what is a good job versus a bad job? How do I know if

I did a good job? And you really need that information. So we've tried a

bunch of different things. Ruler is a library that we released. Which, let me: Relative

Universal LLM-Elicited Rewards. Thank you. Yes. And the way it works is, basically, this

depends on the sort of GRPO insight, which I was mentioning earlier, that you actually

don't, with GRPO, it has this nice property where you don't have to have like

an absolute judge of the truth. You just have to judge relatively. And so simplifying it

a lot, it's basically just LLM-as-judge on a whole group. So you say, okay,

this is the task I'm trying to achieve. Here's four different runs of an agent

trying to achieve it. Which of these did best? And it stack ranks them. And

it turns out that works phenomenally well with GRPO, like way better than I expected,

way better than, you know, anyone who kind of like I talked to before we

actually tried this expected. Because the LLM you're using as judge can sort of

self-ground, because it's just getting these relative ranks. Right. So it doesn't

have to have like an omniscient

view of like what good or bad looks like. So that, has worked at basically

everything we threw it at. We've done it with a bunch of client projects. We've

done a bunch with our own customers. It basically just works. I honestly kind of

feel like the reward assignment problem is fairly solved. Yeah, it's fantastic.
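
To make that concrete, here is a minimal sketch of the group-relative judging idea described above, assuming a generic chat-completions client; the prompt wording, judge model, and helper function are illustrative placeholders, not OpenPipe's actual RULER implementation.

```python
import json
from statistics import mean, pstdev

from openai import OpenAI  # any chat-completions-compatible client would do

client = OpenAI()

JUDGE_PROMPT = """You are grading attempts at the following task:
{task}

Here are {n} agent trajectories attempting it:
{trajectories}

Compare them against each other and return JSON of the form
{{"scores": [s_1, ..., s_n]}}, where each score is between 0 and 1 and
reflects how well that trajectory did relative to the others."""


def group_relative_rewards(task: str, trajectories: list[str],
                           judge_model: str = "gpt-4o-mini") -> list[float]:
    """Score a group of rollouts against each other, then center the scores
    within the group, mirroring the normalization GRPO applies to advantages."""
    numbered = "\n\n".join(f"[{i + 1}]\n{t}" for i, t in enumerate(trajectories))
    response = client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            task=task, n=len(trajectories), trajectories=numbered)}],
        response_format={"type": "json_object"},
    )
    scores = json.loads(response.choices[0].message.content)["scores"]
    mu, sigma = mean(scores), pstdev(scores) or 1.0
    return [(s - mu) / sigma for s in scores]  # group-centered relative rewards
```

The key point is that the judge only ever compares trajectories within one group, so it never needs an absolute notion of "good"; the centered scores can then drop into a GRPO-style update.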

Just any LLM-as-judge off the shelf? We've tried it with so many things. So

one of the results we published was we used Qwen 2.5 14B as the model

we're training. And as the judge, we used Qwen 2.5 32B, which is like, not,

I mean, it's fine, but it's like not a, it's, it's much worse than any

frontier model. Right. And even with that combination, we were able to get our, our

agent doing like state of the art better than any frontier model on, on the

tasks we tried it on, even with like an extremely weak judge model. So it's,

it really doesn't depend on having like a really great judge model, um, in practice.

So yeah, it's, it's just like, it's just not something we've had to worry about

since then at all. So that's kind of like checked off. So that's sort of

like got me a significant increase in like, OK, this is actually something people can

apply. This is now something that's packaged up. People can just use our, we open

sourced everything. You can use it off the shelf. If you stick it in your training

run, it will probably just work. So that leaves the remaining problem, which I

guess we were talking about them out of order, but that leaves the environment problem.

That's the one big remaining piece that we don't know yet how to automate or

remove and requires a lot of manual work for every single task. For listeners, you

know, this is why I kind of refer to it as self-supervised, because it

removes more and more of the human judgment. And the history of machine

learning, all the way from, I guess, the start of ImageNet and everything,

is really that insight of, you should just take humans increasingly out of it and scale up the

data you can just throw in there with no supervision. Yeah. Yeah, totally. Yeah. It's,

it's really awesome. Are you bullish on, um, dedicated LLM-as-judge models? Have you looked

at those? Bespoke Labs, we did an episode with them, and they're really trying to

carve out a niche in there. We've looked into it. We've trained some ourselves. We've

also used some off the shelf. There's an evaluation benchmark that the AI2 people put

together, RewardBench. And so RewardBench is kind of like trying to benchmark

models on serving as LLM judges. And reward models are LLM judges in your mind? It's

the same thing. Yeah, yeah, yeah. Mildly different. It depends on the task. LLM-as-judge

is usually more product-facing, and reward modeling is much more

specific within a chat task. That used to be the old meaning of reward model.

I don't know. Maybe terminologies change. I think they're pretty equivalent. I understand that. Yeah.

I can see you guys' side. Anyway, so yeah, RewardBench is kind of like, and

so we've tried a bunch off that. The thing is, I guess my maybe meta

take on this is any task that is extremely common is

going to end up as like a specific part of the training data

for the frontier labs. And LLM-as-judge is just something everybody's doing in so many

different contexts that you have to assume that all of the frontier labs have a

bunch of like LLM-as-judge style tasks that they're training their models on. And I

do believe that if something does kind of like make it in, in a like

more than minor way into their training data, that like they're going to do at

least as good a job as, as a dedicated model. So I don't think there's

probably a lot of alpha in dedicated LLM judges just because it's something that like

the, let me caveat that and say, like, if you've got like a very, very

specific task that's like weird and has weird requirements and you have a lot of

data on what's good or bad, then like training a reward model for your specific

task, I think could still work. Um, or, you know, fine-tuning an LLM judge

on your specific task could work. I'm pretty bearish on like a, hey, this is

a model that is trained as an LLM judge, but it's a generic LLM judge

that can be used to judge anything. I just don't think you're going to beat

the Frontier Labs on that. Yeah. One other version of this that is not quite

an LLM, but some people are thinking about it, is something that we're working on

for a future episode, which is world models. Sexy. Yeah, very sexy. First

applied in video, as far as I can tell, for Genie 1, 2, 3, and

now with code, and potentially with virtual cells for AI Bio. Any

exploration there that's interesting to you? Yeah. So we've been playing around with it a

little bit. It's one of the directions that I'm fairly optimistic on for solving the

environment problem specifically. Because if you think about it, like a world model, it's a

simulated environment. That's its whole purpose, right? But in an LLM-like thing, not like a

Docker. Yes. Yeah, yeah. You know, whatever, hallucinating, generating, imagining

the responses you'll get from the world. So you can imagine, right, if you had

like a really, really great world model that you're training on. Yeah, it's like your

agent that you're using, it would go and make some tool call. And then this

world model would generate, hey, this is like probably what the tool call would return. And

if you have a smart enough, strong enough one, then it could keep its own,

you know, effective internal state of like the changes you made so far and how

that affects things. So we've played around with it some. I think if we can get

it to work really well, then that could be a solution for the environment problem,

where you just take a bunch of production traces and use those to condition your

world model so it understands your specific system and what its failure modes are, and

then train against that world model. And the resultant

agent that you train with that would then be able to perform in your real

environment. So I do think it's a really interesting area of research.
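
As a rough sketch of the idea being described, the hypothetical class below uses an LLM conditioned on production traces to imagine tool-call results, with the chat history serving as the simulated system's internal state; the class, prompt, and model name are invented for illustration, not a shipped product.

```python
from openai import OpenAI  # stands in for any chat-completions client

client = OpenAI()


class WorldModelEnv:
    """Hypothetical sketch: an LLM imagines tool-call results so an agent can
    be trained without touching the real system."""

    def __init__(self, production_traces: str, model: str = "gpt-4o"):
        # Real traces give the world model its "physics": typical responses,
        # failure modes, and edge cases it should reproduce.
        self.model = model
        self.history = [{
            "role": "system",
            "content": (
                "You are simulating the backend system this agent talks to. "
                "Here are real traces of how that system behaves, including "
                "its failure modes:\n" + production_traces +
                "\nFor each tool call you receive, reply with a plausible raw "
                "response, staying consistent with everything simulated so far."
            ),
        }]

    def step(self, tool_call: str) -> str:
        """Return an imagined observation; the accumulated chat history is the
        world model's effective internal state across turns."""
        self.history.append({"role": "user", "content": tool_call})
        reply = client.chat.completions.create(
            model=self.model, messages=self.history
        ).choices[0].message.content
        self.history.append({"role": "assistant", "content": reply})
        return reply
```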

Yeah. And did you see the Meta Code World Model work? I don't think I saw that one. OK.

Yeah, it was like two weeks ago. We just confirmed the guy for AIE

Code in November. And it's really interesting. Like the world model is. Oh, sorry. You're

talking about the meta one? Yeah. OK. Yes, I did. I saw that one. I

said a lot of syllables. I may not have parsed. But like, yeah, it's literally

like having a debugger as the environment, as the world model, and opening up the

execution trace to the model to see what's going on and see the state and

track the state as the code executes. seems to be smart and exploits the unique

situation of code environments where we can actually do these things. Yeah, I think the

way they envision that model being used is a little different.

Actually, I'm curious. I'll have to see the talk. But my understanding from that paper

is the goal they're imagining is this is almost sort of like a pre-training step.

And then now that this model understands code really, really well, we can then use

it as basically like a code generation or a coding agent of some kind. OK,

yeah, which I think makes sense. That's almost more like a different kind of pre-training,

I would say. The way I'm interested in applying world models is basically as its

own end, right, where it's like, actually, the goal is to come out of this

with something that simulates the world, which is not something you really need in code

at all because it's so easy to run code. And you don't need to model

what will happen if you execute this code, typically, because you can just execute the

code and see what happens for training purposes. But it closely models how we think

about code when we code, which is we kind of mentally execute the code as we

type. And we go like, is that what we really want? Yeah, I don't know.

Anyway, it's the first model that Meta's released since the MSL reorganization. We know, just

based on our context, that they're very, very interested in code models as a path

to AGI, which I'm also, of course, very interested in. I know we've kept you

here for a while. Let's wrap up on the acquisition. So a lot of people

say companies are not sold, they're bought. What was that process like for you? Did

it just... Like, what was the behind the scenes? Yeah, so that was driven by

actually mostly the Weights & Biases founding team. Lukas. Yeah. So, so yeah, Lukas and

Shawn, particularly. So they, you know, had recently been acquired by CoreWeave and

CoreWeave was looking to, you know, continue growing up the stack. And so, yeah, they

approached me were like, hey, you know, like no pressure, but like this is like

an area that we think is really promising and we, you know, would you like

to work here? And so that's how the conversation started. It was like long. It

was pretty painful. There were points as late as, you know, like the week before

we actually signed where it was like unclear if it was actually going to happen.

So that part was super painful. However, we've been there a month now. We just

shipped a product yesterday, which I'm super excited about. It's been fantastic working there so

far. Like I was like very concerned. I was like, okay, yes, this is great.

We make a lot of money by selling our company, but like is the work

environment going to like really, really suck? And I was like, well, I guess that's

just a risk we'll have to take. It's been fantastic. Like it's honestly been great.

way, way better than I could have imagined. Do you go down to the office,

the one down here? I was there today. We work from it, so I'm based

in Seattle, and they have a small office up there that we work from. The

Weights & Biases office in San Francisco is fantastic. If you have the chance, go visit.

They do a lot of hackathons and co-working things. Yeah, there's a hackathon going on

in a month or so. Every week there's a hackathon. But yeah, I mean, so

do you consider yourself working for Weights & Biases or CoreWeave? Or both?

No, yeah. So we, I report to the Weights & Biases, like, yeah,

founders. So we're within that organization, and in the org chart, we're there. I don't

know. Like, branding-wise, they're trying to say everything that's not being

sold to like big labs is kind of Weights & Biases. So like our stuff

we're launching is Weights & Biases branded. It's not, yeah, not CoreWeave branded

as much. I don't know. They're still figuring it out. And what's

the product you launched? We launched serverless reinforcement learning. Basically, it lets you offload all

of the GPU management. You don't have to worry about crashes and out of memories

and scaling up and down. We handle all that for you. And you just define

your environment. You define your reward function. And then you just, every time you run

a step, you kind of ship back to our back end. Hey, these are the

trajectories. These are the rewards. Now update my model. And we just make it work

for you. It makes it way easier.
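
The product's actual API isn't spelled out in the conversation, so the loop below is only a hypothetical sketch of the shape described here: you own the environment and the reward function, while a managed backend owns the GPUs and the update step. The names (client.submit_step, env.run_episode, reward_fn) are invented for illustration.

```python
def train(client, policy: str, env, reward_fn, steps: int = 100,
          group_size: int = 8) -> str:
    """Sketch of a serverless RL client loop: rollouts and rewards are produced
    locally; GPU management and the policy update happen on a managed backend."""
    for _ in range(steps):
        # Roll out the current policy in *your* environment.
        trajectories = [env.run_episode(model=policy) for _ in range(group_size)]
        rewards = [reward_fn(t) for t in trajectories]
        # Ship the batch back: "these are the trajectories, these are the
        # rewards, now update my model." The backend returns the new checkpoint.
        policy = client.submit_step(
            base_model=policy,
            trajectories=trajectories,
            rewards=rewards,
        )
    return policy
```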

Yeah. OK. Very Tinker. It is very Tinker-like. I love the Thinking Machines launch. I think they have a really good idea. It's

also very validating. How did this take so long to appear? Like, it seems... I

don't know. Yeah, we would know. But that's... I felt this way about everything. Like,

there's so many things that should exist. Like, clearly. I just think there's, like, still

not enough people, like, smart people working in this space. Like, honestly, we need... Like,

I realize that there's, like, you know, like, a lot of people. It just feels

like there's still a lot of low-hanging fruit nobody's doing. Okay. One thing I saw

from your... post was your North Star as the RL team at CoreWeave is to

build a world where every agent learns continually from its real-world experience. So

you're touching on the hot topic of the moment, continual learning. What else do we

need to get there? I super believe that. And like, that's basically the vision where

I'm like, you know, I keep talking about these percentages, 25, 50. Like if we

get to the world where we build that, then I think it's just like the

advantages are huge. They're clear. Everyone should just deploy their agents that way. We want

to be like the team that builds the software that makes that easy to do.

So I talked to a lot of engineers at our customers and they're trying to

deploy agents and it's so easy to get the initial prototype and like something that

like kind of works well. It is so hard to get from that to something

that like you are confident is reliable enough to actually deploy in production. And when

you actually look at what those failure modes look like, it's like, oh yeah, like

we know if it gets in this situation or if it gets like these kind

of like inputs, like it behaves funnily. But then it's like, yeah, you can update

your prompt to address that. But that's not scalable, because at a certain point, it's

going to start breaking other things. You don't know what it's breaking. You really want

some way to just say, OK, look, this thing you did there, that was the

wrong thing. Just adjust this behavior when you get into this, and then otherwise carry

on. And that's what we can do with RL. And that's what we can do

with continual learning, is we don't have to have this concept of, oh, up front,

I'm trying to make the perfect model that solves everything. It's like, I'm trying to

make a model that's good enough, I can deploy it in production. And then when

these errors come in, I'm going to say, oh, you know, exactly. I mean, very

analogous to how you train a human employee. Like, oh, no, actually, that's not what

you should do in that situation. All right. Fix that and carry on. And that's

just going to make this whole process so much easier. And I think that, you

know, like I think that there is today like 10 times as much inference

that could exist than is existing right now, just purely with projects that are like

sitting in the proof of concept stage and have not been deployed. because there's like

huge bucket of those. And it's, it's all about this kind of like reliability issue

where it's like, okay, like it, it works in controlled circumstances. There's other areas where

it doesn't work. And so if we can solve this problem, there's that, that like

90% of the like inference market, like addressable market today, that's just going to like

come online because we've solved that problem. So, um, that's what we want to do.

Um, I'm super excited about it. And like, I think we have very concrete ideas

on like the specific pieces we need to make that work. Uh, and we just

have to execute against them. Do you feel like the online RL is more susceptible

to like the reward hacking? Especially as you like shorten this loop and like

you don't spend as much time like looking at the different checkpoints. I'm not that

worried about it. And the reason why is because reward hacking is quite

easy to detect once it starts happening, because once the model's found some hack, it

just starts like doing it all the time. It's like, oh yes, this worked great.

I'm just going to keep doing it. And so you like notice very quickly, whoa,

it's doing this thing. And assuming you're using, at least in part, an LLM as

judge to determine which ones are good and bad, it's so easy to just throw

in an extra term and be like, hey, that weird thing that you keep doing,

if it does that, that's bad. Give it a low reward. So we've done this

with a bunch of customers. And reward hacking does happen, but you just see it.

And you adjust your reward prompt, and it just goes away.
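
As a small, hypothetical illustration of "throw in an extra term," the sketch below just appends an explicit penalty clause to the judge rubric once a specific hack has been spotted; the rubric wording and the example hack are made up.

```python
BASE_RUBRIC = (
    "Score each trajectory from 0 to 1 on how well it completes the user's "
    "request, relative to the other trajectories in the group."
)

# Hypothetical hack observed in rollouts: the agent marks tickets resolved
# without actually answering, because that gamed the original reward.
PENALTY_CLAUSES = [
    "If the agent marks the ticket resolved without answering the user's "
    "question, give that trajectory a score near 0.",
]


def build_judge_rubric(base: str = BASE_RUBRIC,
                       penalties: list[str] = PENALTY_CLAUSES) -> str:
    """Return the judge prompt with any hard-penalty clauses appended."""
    if not penalties:
        return base
    return base + "\n\nHard penalties:\n" + "\n".join(f"- {p}" for p in penalties)
```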

What's a thing from YC that guided you through your entrepreneurship journey? And what's one thing that maybe you

find that you disagree with YC on? Oh, that's a good question. One thing that

I really identify with and I've tried to do a good job is kind of

like, you know, sort of, I think they say like, hold your problem tight and

your solution loosely, right? Where it's like, That's what you did. Yeah. Spend a lot

of time thinking about what is the problem people are trying to solve. And then

it's like, don't be too bought into like the way you're solving it today. I

think that's super important. Everyone, you know, it's very easy to get that balance wrong

if you're not thinking about it very consciously. Something I disagree with, that's a good

question. I think there's lots of things I disagree with, but I don't have it

like cached in that direction in my brain. I don't know. Like

I definitely have disagreed with lots of specific pieces of advice, but yeah, I don't

have like a great answer right now. I'll bridge it for you in case something

comes up. Sam Altman's like, you know, everything I said as president of YC was

wrong for OpenAI, right? Like, do B2B; they ended up doing B2C. You should ship products

often; they ended up being in stealth for two years. Yeah. Yeah. Actually, I

think that second one does resonate with me a lot. We have tried to ship

really quickly and just kind of like, sort of like follow the gradient of the

market. I think if I do another startup, like, and I don't know, maybe this

is just me like being beat up by the market too much. If I do

another startup, like I would like, I think at least some points I probably would

have done better to be like heads down and execute on my vision for longer.

and like kind of like go for the more ambitious thing, but that would take

longer to sort of like prove value, which is definitely not the YC way. But

I think if you have like, I don't know, a good vision and good taste,

then like that, that can like work quite well. Yeah. We'll see what that is

whenever that comes out. But thanks for your time. This is a great overview of

everything. This has been a super fun conversation. Thanks to both of you. Awesome.
