World Models & General Intuition: Khosla's largest bet since LLMs & OpenAI
By Latent Space
Summary
## Key takeaways - **Medal's 3.8B Peak Human Clips**: Medal has accumulated 3.8 billion clips of the best moments and actions in games by retroactively clipping interesting moments, much like Tesla bug reports, resulting in one of the most unique datasets of peak human behavior. [01:10], [01:21] - **Vision-Only Agents Beat Humans**: Pure imitation learning agents see only pixels, predict actions in real time, play against humans, make superhuman moves, and unstick themselves using 4 seconds of memory, inheriting the exceptional highlights from their training data. [02:38], [05:31] - **World Models Need Actions**: World models have to understand the full range of possibilities from the current state and a chosen action in order to generate the next frame, a far more complex problem than video models predicting likely sequences, and they handle partial observability like smoke. [00:06], [11:18] - **Privacy-First Action Mapping**: Medal maps actions to visuals without logging keys like WASD for privacy, using thousands of human labels for every game action; labels are convertible back to inputs in aggregate at training time, but never per person. [19:09], [19:37] - **Turned Down OpenAI's $500M**: OpenAI offered $500M for Medal's data, but Pim turned it down to build an independent world model lab, raising a $134M seed from Khosla Ventures, its largest since OpenAI. [01:55], [02:11] - **Clips Enable Imitation to RL**: Medal clips are episodic memories of a simulation, with out-of-distribution highlights and negative events; world models make every clip playable to train reward models and transition from imitation to RL. [59:34], [01:00:20]
Topics Covered
- Gaming clips yield peak human data
- Pure imitation beats RL initially
- Games simulate spatial reasoning better
- World models predict action-driven states
- World models drive 80% atom interactions
Full Transcript
In a video model you might predict the next likely sequence or the next most entertaining frame. What world models do is they actually have to understand the full range of possibilities and outcomes from the current state, and, based on the action that you take, generate the next state, the next frame. So it is a much more complex problem than traditional video models. To me it is a world that is accurately generated based on the actions that you take, as a result of what's already been generated.
Hi listeners, as you may know, I recently wrapped up the AIE Code conference in New York. While I'm traveling, I like to visit top AI startups in person to bring you interviews that you won't find on any other podcast that just does a Zoom call. General Intuition, or GI for short, is a spinout of a 10-year-old game clipping company called Medal, which has 12 million users; for comparison, Twitch only has 7 million monthly active streamers. Medal collects this data by building the best retroactive clipping software in the world. In other words, you don't need to be consciously recording. You just have Medal on in the background while you're playing, and you hit a button to clip the last 30 seconds after something interesting happens. It's very similar to how Tesla does bug reporting for self-driving, if you've ever filed one. The result is that Medal has accumulated 3.8 billion clips of the best moments and actions in games, one of the most unique and diverse datasets of peak human behavior, actively mined for the interesting moments. They were also very prescient in navigating privacy and data collection concerns by mapping actions to these visual inputs and game outcomes. As you saw in our Fei-Fei Li and Justin Johnson episode with World Labs, and with the recent departure of Yann LeCun from Meta, there's a lot of interest in world models as the next frontier after LLMs, to improve on spatial intelligence and to work on embodied robotics use cases. DeepMind has been working on this with Genie 1, 2, and 3 and SIMA 1 and 2. And this year, OpenAI seems to finally agree: they made the news by offering $500 million for Medal's video game clip data. Our guest today, Pim, turned down that money and chose to build an independent world model lab. Instead, Khosla Ventures led the $134 million seed round, which is Vinod Khosla's largest single seed bet since OpenAI. We were able to get an exclusive preview of GI's models, which unfortunately we cannot show you directly, but I can confirm they were incredibly humanlike, and we chose to include the first 11 minutes of the demo discussion. Even though I couldn't show it to you, and it may be hard to follow, I tried to call out what was noteworthy as your likely reaction if you were watching along with us. Now, enjoy the world's first look at my first look at General Intuition. So, what I'm about to show you is a completely vision-based agent that's just seeing pixels and predicting actions the exact same way a human would.
Um, and so, yeah, what I'll show you here is what this looked like four months ago. So again, this is just an agent that's receiving frames and predicting actions. You can see it has a decent sense of being able to navigate around. It tabs the scoreboard, just like gamers always tab the scoreboard. These are pure imitation learning.
>> I see. So I see slicing with the knife.
>> Yeah, exactly. So it's doing everything that humans would. In this case, here was the first interesting part that we saw: it gets stuck, and then, they have memory as well, so you can see it can get unstuck.
>> How long is the memory?
>> Uh, 4 seconds. Yeah, 4 seconds. Okay. So, that was 4 months ago. This was maybe a few weeks after that. You can see it's still doing the scoreboard thing, but there's still quite... [laughter] And these are bots, too. So, you can see that...
>> It's very human. Let's just say that.
>> Yeah. Right. So, this was really the early days of the research. You can see it does one thing and then goes for another. And then we've been scaling on data and compute, and we've also just been making the models better. And this is where we are now. So what you're seeing is, like I said, pure imitation learning. This is just a base model. There's no RL, no fine-tuning. This model sees no game states, no action sequences, etc. It's purely predicting the actions from the frames. That's it. And this is playing against real humans, just like a human would play, and it's running completely in real time. Everything here plays exactly like a human.
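To make the setup Pim describes a little more concrete, here is a minimal, illustrative sketch of a frames-in, actions-out imitation policy in PyTorch. The architecture, frame count, and action counts are placeholder assumptions for illustration, not GI's actual model or training recipe.

```python
import torch
import torch.nn as nn

class FramesToActionsPolicy(nn.Module):
    """Illustrative behavior-cloning policy: a short window of recent frames in,
    keyboard/mouse action predictions out. All sizes are placeholders."""

    def __init__(self, n_frames=16, n_actions=32):
        super().__init__()
        # Stack the last n_frames RGB frames along the channel dimension,
        # a crude stand-in for the few seconds of memory mentioned in the demo.
        self.encoder = nn.Sequential(
            nn.Conv2d(3 * n_frames, 64, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.action_head = nn.Linear(128, n_actions)   # discrete buttons/keys
        self.mouse_head = nn.Linear(128, 2)            # continuous mouse dx, dy

    def forward(self, frames):                         # (B, n_frames, 3, H, W)
        x = frames.flatten(1, 2)                       # (B, n_frames*3, H, W)
        z = self.encoder(x)
        return self.action_head(z), self.mouse_head(z)

# Pure imitation: cross-entropy on the human's button, MSE on mouse deltas.
policy = FramesToActionsPolicy()
frames = torch.randn(2, 16, 3, 128, 128)
button_logits, mouse = policy(frames)
loss = nn.functional.cross_entropy(button_logits, torch.tensor([3, 7])) \
     + nn.functional.mse_loss(mouse, torch.zeros(2, 2))
loss.backward()
```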
>> Do you give it a goal?
>> Yep.
>> It just figures out its goal, because obviously it's trained on the scene. Yes.
>> Um, and I picked a sequence where it also doesn't do well initially. So you can see, this is just a random sequence,
>> but this is... I mean, it looks like it's doing well.
>> So, um...
>> Oh. Okay. Yeah. Watch.
>> Yeah, that's pretty good.
Maybe too good.
Um, this is my favorite part. So you can see it does something here that a human would never do. Then it gets unstuck, reorients, and then, in the distance...
>> So you're saying, one, it makes a mistake that a human would never make, but it unsticks itself, and two, what we just saw is it doing superhuman things.
>> Yeah.
>> Okay. Yeah. Um, I mean, there are things it still misses, obviously. But because it is trained on the highlights, on all the exceptional things, it's inheriting those.
>> Yeah. So it's not like Move 37, where they RL'd their way into something.
>> Yeah. Replicating superhuman, or peak human; the baseline of our data set is peak human performance.
>> Yes. Yeah.
>> Um, okay. So that's the agent. So now what I'm going to show you is: we're then able to take those action predictions, and we're able to label any video on the internet using those actions. So this is just frames in, actions out. Yellow is... sorry, yellow is ground truth, purple is the model prediction, then bottom left is the compound error over the entire sequence, and this one is reset per prediction.
>> Reset meaning, every now and then you reset.
>> Yeah. So this just means it resets the baseline. So basically, a single error in the entire sequence compounds here, but it doesn't compound here, if that makes sense.
>> Yeah.
>> Um, so again, this is just seeing frames, right? It's not seeing any of the actions.
Um, and so what we did is we trained it on less realistic games and we transferred it over to a more realistic game. And then, and this is where it gets really exciting, we transferred it over to real-world video, which means that you can use any video on the internet as pre-training.
>> What is it predicting?
>> Um, it's predicting it as if you were controlling it using keyboard and mouse. So as if you were playing the sequence as a human.
>> Is there some sense of error, or...
>> Uh, so that's why you transfer to more realistic games first.
>> Yeah.
>> And then you transfer to real-world video, because you can't get a sense of ground truth from the real-world video yet.
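As a rough sketch of the evaluation Pim is narrating, the snippet below compares predicted action labels against ground truth two ways, per-step (reset after every prediction) and compounded over the whole sequence. The metric and tensor shapes are illustrative assumptions, not GI's actual scoring code.

```python
import torch

def labeling_errors(pred_actions, true_actions):
    """Compare predicted vs. ground-truth action sequences, two ways:
    - per-step error, 'reset' after every prediction
    - compounded error accumulated over the entire sequence
    Both inputs: (T, action_dim). Purely illustrative metrics."""
    per_step = (pred_actions - true_actions).abs().mean(dim=1)  # reset each step
    compounded = per_step.cumsum(dim=0)                         # one early miss keeps counting
    return per_step, compounded

T, action_dim = 120, 6                 # e.g. 120 frames, 6 action channels (placeholders)
true_actions = torch.randn(T, action_dim)
pred_actions = true_actions + 0.05 * torch.randn(T, action_dim)
per_step, compounded = labeling_errors(pred_actions, true_actions)
print(per_step[-1].item(), compounded[-1].item())
```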
>> Um, let's see. And then, so I'll show you here. This one is also... this is the same agent that I just showed you. This is playing against other AIs.
>> This one's playing against bots. Yeah.
>> Um, the previous one was against players. But with the sniper, it doesn't really matter that much, as you'll see. [laughter] So one thing that's really interesting is you notice that it behaves differently as it has different items, right?
>> That makes sense. Yeah. Intuitively.
>> Yeah.
>> I think there's also a question about egocentricity versus third person. Does it matter?
>> Um, third person I think will be very helpful if you're, for instance, trying to control multiple objects in an environment later on. Right now, I think having fully first-person perception is quite helpful.
>> Um, this one's also... this is the policy itself.
>> What do you mean, this is the policy?
>> The agent.
>> Yeah. Same constraints that I just told you about.
>> Yeah.
Like this, right, where it hides. That to me was just incredible, just from knowing, being able to...
>> ...also to hide when it's seen.
>> Exactly. Yeah. Yeah.
>> And it needs the spatial intuition to go, well, this is hiding and that's not hiding.
>> Exactly. And, right, while it was reloading. Yeah. Um, okay. So those are... that's the policy, and this is a completely general recipe, meaning we can scale this to any environment.
>> Whose work is this closest to? Okay, no, let's keep going on demos until...
>> Um, I was going to go into the research.
>> Yeah. Yeah. Um, okay. So what I'm about to show you are the world models. There are a few really interesting parts about our world models. The first is, we actually made the decision to pre-train world models from scratch, but we've also been able to fine-tune open source video models to get a better sense of physical transfer. One of the things that you'll notice here is our world models have mouse sensitivity, which is something that gamers absolutely want, right? So you can have these very rapid movements, which you couldn't do in any other world model. And this is a holdout set, so this clip was never seen at training time. You can see it has spatial memory. This is about a 20-second-ish generation. And here's what's fascinating. This is an explosion that occurs. And you can see that in the physical world, right, the camera would shake, and in the game that would never happen. So you see the world model inherits the physical-world camera shake, but the actual game never does that. That, to us, was quite fascinating, right? Also, the models that I just showed you, that we used to transfer over from video, the two of those combined will allow us to push way beyond games in terms of training. This is another interesting one. So, this is the world model. This is rapid camera motion. Again, we're literally just taking one second from here, the context and the actions, and replaying it here, right? And so what we're saying is, the skill that you see in the clips, the speed and the movement, that also pays off at training time when you're doing world models. This is my favorite example. This shows that the world model is capable of performing with partial observability. So what you're going to see is, again, you're replaying the actions from here, in here, just using one second of video context. Everything after that is completely generated. So what you're going to see is the model is going to encounter, in this case, smoke. Normally, models break down here. What you actually see is it comes out in the same place. So it's capable, even with partial observability, of still maintaining its position in the world.
Um, and then here, this is also interesting. So this is sniping. So this gives you, like, a...
>> Reaction time.
>> Uh, like the fact that it can do depth, and sequences in completely different views, right? So this is a completely different view than if you were to be outside of that view, right? And so it's able to maintain consistency...
>> ...while zooming in.
>> Yeah, exactly. And so, yeah, you can see, even while this goes out of scope, right? Watch. And then it comes back, and you'll see it's still there.
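To picture the "one second of context, then everything is generated" setup described here, below is a minimal rollout loop around a stand-in world-model interface. The `WorldModel` class, shapes, and fps value are illustrative assumptions; only the control flow (seed with real frames, then replay recorded actions through the model) mirrors what is described.

```python
import torch

class WorldModel(torch.nn.Module):
    """Stand-in interface: given recent frames and the next action, predict the next frame."""
    def __init__(self, frame_shape=(3, 128, 128)):
        super().__init__()
        self.frame_shape = frame_shape
        self.net = torch.nn.Linear(1, 1)        # placeholder parameters only

    def forward(self, context_frames, action):
        # A real model would condition on both; here we just emit a dummy frame.
        return torch.zeros(self.frame_shape)

def rollout(world_model, clip_frames, clip_actions, fps=30, context_seconds=1.0):
    """Seed the model with ~1s of real video, then replay the clip's recorded
    actions and let the model generate every subsequent frame itself."""
    n_ctx = int(fps * context_seconds)
    frames = list(clip_frames[:n_ctx])          # real context
    for action in clip_actions[n_ctx:]:         # recorded human actions
        next_frame = world_model(torch.stack(frames[-n_ctx:]), action)
        frames.append(next_frame)               # everything after context is generated
    return torch.stack(frames)

wm = WorldModel()
clip = [torch.zeros(3, 128, 128) for _ in range(90)]   # 3 seconds of dummy video
actions = [torch.zeros(6) for _ in range(90)]
video = rollout(wm, clip, actions)
print(video.shape)                                     # (90, 3, 128, 128)
```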
>> Yeah.
>> Um, and so, yeah, this is the work that Anthony has been working on.
>> I'm just wondering how much game footage you have to watch in order to find these things. [laughter]
>> We can ask Anthony. You know, I'm sure he's not going to be too excited to play these games afterwards. [laughter]
>> You're not playing. You're just watching.
>> Yeah. Yeah. Um, great. Okay. So, those were the models.
Um, let's see, these are interesting. So, we also were able to distill into really, really tiny models. So this is, for instance, a long sequence on a very, very tiny one. You can see it makes a few more stupid mistakes; it does things that are not as optimal. Um, but...
>> I haven't seen anything yet.
>> Uh, at the beginning it was running into a wall. Yeah. Exactly.
>> Um, I mean, I do that too.
>> Yeah. [laughter] Yeah.
>> It looks... I mean, it's doing pretty well.
>> Yeah. And again, all these models are running completely in real time there. So there's no...
>> Okay. So, I was thinking, your main model does real time anyway. What's the goal of distilling? Is it cost, or...
>> Uh, yeah, parameters.
>> Yeah.
>> Yeah.
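For readers unfamiliar with the distillation step mentioned here: a standard soft-label distillation loss lets a tiny student policy match a large teacher's action distribution instead of only hard labels. This is a generic textbook sketch, not GI's recipe; the temperature and sizes are placeholders.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Standard knowledge distillation: the tiny student matches the big
    teacher's softened action distribution."""
    t = temperature
    soft_teacher = F.softmax(teacher_logits / t, dim=-1)
    log_student = F.log_softmax(student_logits / t, dim=-1)
    return F.kl_div(log_student, soft_teacher, reduction="batchmean") * (t * t)

teacher_logits = torch.randn(8, 32)                   # big policy's action logits
student_logits = torch.randn(8, 32, requires_grad=True)
loss = distillation_loss(student_logits, teacher_logits)
loss.backward()
print(loss.item())
```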
>> Yeah. This is the interesting one. It peeks the corner. That's what we mean by the spatial reasoning aspect: humans actually sort of simulate the optical dynamics of their eyes and how they actually scan, right? You've seen all this.
>> Yep.
>> Um, exactly. And so, this is kind of interesting: even in the real world, with, for instance, YouTube data, you have to first solve for pose estimation. Then once you have pose estimation, maybe you do something like inverse dynamics, right? Where you're basically able to somehow label some of the actions that you're seeing. And then you still have to account for the optical dynamics of where the eyes were actually looking before the decision. So there are three levels of information loss, whereas when you're playing video games, you're actually simulating the optical dynamics with your hand, right? And I think that's why games are a better representation of spatial reasoning initially than YouTube videos, for instance.
Okay, we're in the GI offices with the CEO. Welcome.
>> Thank you.
>> Thanks for having us in your office.
>> Yeah, excited to be here.
>> If I'm in New York and you're one of the hottest raises of the year, I have to come and visit, and, uh, thanks for taking some time on the weekend.
>> Yeah.
>> So, you've raised a $133 million seed for General Intuition. Most people won't have heard of you, I guess, because GI is new, but most gamers would have heard of Medal.
>> Mhm.
>> And before that you ran Medal, probably the largest of its kind at that stage. What's your reflection on that journey, now that you're an AI founder?
>> Yeah, you started off with RuneScape.
>> Yeah. I think, um, I grew up with Tourette's. I spent most of my time as a teenager coding and playing video games, so in that sense it doesn't feel that much different. So, yeah, I started the largest private server for RuneScape, worked at Doctors Without Borders for three years, for Ebola, and then on satellite-based map generation for disaster response, which was already very AI-adjacent; I built some models back then. And then I started Medal, which became one of the largest social networks in video games. I've always been kind of AI-adjacent, you know; I'm a self-taught engineer. So for me the modeling itself always felt a little foreign. I actually had to take tons of classes over the summer and early this year to get better at it. Because it still felt like I was really, really good at the infrastructure side, and I had written our transcoders for Medal myself, so I was very familiar with CUDA and the GPU side and all the video infrastructure that we were using for this stuff. But the modeling side itself was still quite foreign. Luckily, obviously, I have really, really good co-founders, and they essentially put a bunch of coursework together for me to go complete, to get really good at understanding the fundamentals better. I think I had seen inside of the labs that had really good leadership with fundamentals at the top, and also the ones that didn't, and I think the ones that did were just much better. And so for me, yeah, I wanted to be more like that. So in that sense it was first very foreign, and now I feel pretty comfortable with everything. But yeah, I think there's a lot to be explored starting in video games, and also reverse engineering. I think the interesting thing about reverse engineering is it kind of teaches you to look at problems very differently. It's like the ultimate form of deductive reasoning, in a way. And so this is just how I think, how I operate, and so for me it's been a really, really interesting journey. You know, I don't claim to have any of the credentials or skills that some of the other guests you've had on do have, but hopefully it will make for a good time.
>> Yeah. Well, your co-founders definitely bring a lot of that ability, and you bring a lot of the, I guess, gaming expertise.
>> Geez, we'll see what I bring to the table.
>> Yeah. Just a little bit of history of Medal. Let's establish Medal for those who don't know. You have more concurrent active users on the creator side than Twitch, something like that?
>> Yeah, on the creator side, I think. And the reason is because Medal is a lot more like Instagram than it is like Twitch. So the way to think about Medal is, it's a native video recorder. Unlike something like Twitch, where you actually have to use other software to record and stream to Twitch, it's not streaming software; it's actually video recording software. And a lot of gamers love to put things like overlays on top of their clips. And as a result of that, we have sort of the largest data set of ground-truth, action-labeled video footage on the internet, by maybe one or two orders of magnitude.
>> Yeah. What's an example of an overlay? I usually think of it as a HUD.
>> Yeah. Also, controller overlays, for instance, if you're playing on console.
>> Yeah. Like a flight simulator: you get the joystick and all the things.
>> So you get the actual actions that people take inside the games, as well as the frames of the games themselves, which is a loop, right? Because essentially you perceive, then you act, and there's a state update, and then you perceive again, you act, state update, which is roughly, precisely what you use in order to train these agents.
>> Yeah, it's almost perfect training data. You were showing me in the demo, and we'll show some B-roll here, how you don't log keys; it's very important that you log actions instead. When did you figure this out?
>> Oh, um, maybe starting a year and a half ago. Yeah. And we realized that, in figuring out this side of the research, we very much never wanted to be in a position where we eroded privacy or something like that. So we never wanted to actually log, like, a W or an A or an S and a D. For researchers, the fact that we don't do that often sounds strange, like, why wouldn't you do that? But I think for us, the privacy...
>> You get the data anyway.
>> Yeah, I think a lot of the researchers hadn't quite understood yet that you can actually get away with just doing the actions. And the reason is, at training time, having the actual keys is noise anyway. Like, if there is text on the screen and you would want to, in theory, make that part of the training, then reading text from a frame is really easy. And so for us, when you hit the input, we convert it to the actual action. We had thousands of humans label every single action you can take in every single video game over the past year and a half, which is an enormous amount of action labels. So when you act, we get the actual action itself, and then at training time you can, for the general set of that game, convert back into computer inputs if you want to, but you can never do it for any individual person. And so that, for us, from a design perspective, was important. So we figured all that stuff out. Then we actually started pushing; we already had features with this as well. So for instance, gamers already love to be able to navigate their clips by the things that happened, so we have an events capture system, and then we also have the overlays, where you actually just want to overlay and render the actions on top of your clip. We developed it kind of in tandem with the feature set itself. And then obviously, when world models became a thing, it was very clear that all the data for this was precisely that sequence, and we were able to be first to market, recruit the best researchers, and start a lab.
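As a rough sketch of the privacy-preserving mapping described here: raw inputs are translated to game-semantic actions at capture time, and only the semantic label is stored. The game name, mapping table, and labels below are made-up examples of what per-game human labeling might produce, not Medal's actual schema.

```python
from typing import Optional

# Illustrative per-game mapping from raw inputs to semantic actions.
GAME_ACTION_MAP = {
    "example_shooter": {
        "w": "move_forward", "a": "strafe_left", "s": "move_back", "d": "strafe_right",
        "space": "jump", "mouse_left": "fire", "r": "reload",
    },
}

def log_action(game: str, raw_key: str, timestamp_ms: int) -> Optional[dict]:
    """Convert a raw key press into a stored event that contains no raw key."""
    action = GAME_ACTION_MAP.get(game, {}).get(raw_key)
    if action is None:
        return None                      # unmapped input: drop it, never store the key
    return {"game": game, "t": timestamp_ms, "action": action}

print(log_action("example_shooter", "w", 1234))   # stores 'move_forward', not 'w'
print(log_action("example_shooter", "p", 1240))   # None: the raw key is never persisted
```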
>> Yeah, that's incredible. One more question on Medal before we move forward. It's been 10 years.
>> Yeah.
>> I don't even know how you grow something like this. I'm just kind of curious, and it's an opportunity to ask you: what really worked, such that you became so huge? Because you're not the only one. I'm sure it's performance and everything, but...
>> A few things really worked. I think the first was, a lot of our competitors were focused on solving the social network and the recorder at the same time. Our bet was really that we could get so many people to record with us that we could bootstrap the network on top of that, and that worked. So while everyone was sort of distracted trying to bootstrap a social network, we were just focused on building a really, really good capture tool, and then we got tens of millions of people to use that, and we were able to bootstrap a network on top of the share behaviors. We already had the profile behaviors and the share behaviors, obviously, but the actual content consumption piece and the sharing piece really only came after we hit critical mass. It was actually early days during COVID when the network really accelerated. Fortnite happened, which was really important, and I think also the fact that Discord existed made it quite a different time than when other networks of this type had launched, because Discord was essentially already the connective tissue between gamers that had never really existed before. So I think that combination of things really made it. I think we also built a product where, for instance, with most video recorders you have to remember to start and stop the recorder. So you have to go into the application, then hit start, then start your game, and then maybe you'll play games for three hours, then you'll close the game, then you have to close your video application.
>> Then you have to process, like, a multi-gigabyte file.
>> Then you have to upload those somewhere. So this was a pain for people. What we did is we just run this kind of recorder, and when you hit that button, it does a retroactive video record. So all the recording initially is in memory, and then when you hit that button, it exports only that sequence to disk and syncs it to your phone. And so that became super popular. What's also interesting is that it means you're not behaving or acting differently, because it's always there and you can just export whatever happens, which is also very, very helpful for training, obviously.
>> And you were the first to do that?
>> Yeah.
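The retroactive-record idea described above amounts to a ring buffer: keep only the last N seconds of frames in memory, and write to disk only when the clip button is hit. A minimal sketch, with made-up fps and durations:

```python
from collections import deque

class RetroactiveRecorder:
    """Keep only the last `seconds` of frames in memory; write to disk only
    when the user hits the clip button. Purely illustrative of the idea."""

    def __init__(self, fps: int = 60, seconds: int = 30):
        self.buffer = deque(maxlen=fps * seconds)   # old frames fall off automatically

    def on_frame(self, frame) -> None:
        self.buffer.append(frame)                   # constant memory, no disk writes

    def on_clip_button(self) -> list:
        return list(self.buffer)                    # export just the last 30 seconds

rec = RetroactiveRecorder(fps=60, seconds=30)
for i in range(10_000):                             # simulate gameplay frames
    rec.on_frame(f"frame-{i}")
clip = rec.on_clip_button()
print(len(clip))                                    # 1800 frames = 30 s at 60 fps
```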
>> The thing you were explaining just before this is similar to how Tesla does their bug reports, right? You disengage Autopilot while you're driving, and they're like, "Well, tell us what happened."
>> Exactly. Exactly. See, you're driving; Tesla doesn't want to train on the 10 hours of you driving through a desert where nothing interesting happens. You have the clip button on the steering wheel. Something interesting happens while FSD is engaged (and I'm not sure if you can use it without FSD as well), you hit the clip button, and it basically uses that precise sequence to mark it, which is then more helpful for training because it's more unique at training time.
>> Yeah. Yeah. I mean, one thing, and we're going to get to this on the agent side, one thing that does pop up is, well, a lot of life is boring. A lot of playing games is doing the boring stuff that doesn't get clipped.
>> Yeah. Somehow using the generalized fight.
>> Yeah. [laughter]
>> Yeah. Yeah. It makes you think, right?
>> It makes you think.
>> Yeah. Yeah. It's also quite interesting, like I showed you with the models, what happens when you increase the size of the context window, and how behaviors are actually largely shaped by the size of the context window. That to me was one of the most interesting parts of the research; it made me think about our own behaviors, in a way.
>> Yeah.
>> Let's talk also about forming the team at GI. On your website, there are 12 of you, including the three co-founders.
>> Yeah.
>> Let's talk about how this team came together, because you yourself don't have that academic network, yet you managed to get these people.
>> Yeah. I started reading all the research papers. By that time, I was already pretty deep into having a decent understanding of, not world models in particular, but LLMs and transformer-based models. And so there was Genie, there was SIMA. Those two were really, really interesting. SIMA in particular was interesting because what they do is they basically take 10 games, and they have a graphic in SIMA where you can see kind of the precise actions inside of those games that they mapped. And I believe they found something like 100, which are actually actions that also exist in the real world. And what they did, I believe specifically for navigation, was a nine-one holdout set. So they trained an agent on the nine games, and then they had it play the 10th game, the holdout game, but they also trained a specialized agent just on the 10th game, and they compared how well they did. And if I recall correctly, the nine-game agent did roughly as well playing the 10th game, on navigation specifically, as the one-game agent did. And that to me was really interesting, because that's precisely the type of data that we had. And so for us, the thinking was, okay, what if we did exactly what LLMs did? LLMs were trained on predicting text tokens, on words on the internet. What if we predict action tokens on essentially what is the equivalent of the Common Crawl data set, but for interactivity?
>> Vision input, action output.
>> Correct. That's it.
>> Well, I think actually I'm going to double back a little bit to a question I had, which is, one of the reasons why I thought you would prefer keyboard and mouse over actions is that the action space is potentially unbounded, right? You can jump, walk left, walk right, but then also look up, look left, bend. It's unbounded. So it's huge, isn't it?
>> Yeah, there are benefits to the action space being small to start with. So I think we're going to start with anything that you can control using a game controller, but yeah, long term, we want to actually predict maybe action embeddings and have models sit inside a general action space, to be able to transfer out to other inputs as well.
Got it.
>> Yeah.
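To make the "bounded, controller-shaped action space" idea concrete, here is an illustrative data structure; the exact fields, button set, and ranges are assumptions for the sketch, not GI's action schema. The contrast is with an open-ended "any action" space, which the longer-term action-embedding plan mentioned above would address.

```python
from dataclasses import dataclass

@dataclass
class ControllerAction:
    """A bounded action space shaped like a game controller (illustrative).
    Every field has a fixed range, unlike an open-ended action space."""
    left_stick: tuple      # (x, y), each axis in [-1, 1]  (movement)
    right_stick: tuple     # (x, y), each axis in [-1, 1]  (camera / aim)
    buttons: dict          # finite set of named buttons -> pressed or not

    def clamp(self) -> "ControllerAction":
        clip = lambda v: max(-1.0, min(1.0, v))
        return ControllerAction(
            (clip(self.left_stick[0]), clip(self.left_stick[1])),
            (clip(self.right_stick[0]), clip(self.right_stick[1])),
            self.buttons,
        )

a = ControllerAction((0.3, -1.7), (0.0, 0.2), {"a": True, "rb": False}).clamp()
print(a.left_stick)   # (0.3, -1.0): everything stays inside the bounded space
```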
>> Okay. And then let's keep going on the research side. So, Genie?
>> Yeah.
>> And then to the co-founders.
>> Yeah. So there was the DIAMOND paper, there was Genie, and then there was SIMA. The DIAMOND paper for me was really interesting, because they had actually managed to get this world model called DIAMOND running on a consumer GPU. I believe it was a 4090, at 10 FPS, and you could play it. And they did that on something like 90 hours of data, 87 hours or something like that. That was just incredible, right? That they had something playable on that little data. So I actually cold emailed the entire group of students, and I told them, hey, I think we have this thing.
And then it was pretty interesting. So right when that happened, a lot of the labs also started understanding what we had. Multiple labs tried to bring us in in various ways, and they were part of that; they basically were seeing that happen. And I think for them, that also kind of solidified how real it was. And then when we chose to do our own thing, you know, initially we thought that we were going to have to just work on world models. So we thought, okay, the main benefit of this data set is, like Genie, world models. What we didn't realize at the time is that we have so much of this data that we can essentially do the world models in parallel and take the equivalent of the LLM bet, mostly on imitation learning, and then use the world models after that to get into the RL stage.
>> And eventually getting rid of the imitation part, or something like that?
>> I mean, ideally you get rid of the imitation, yeah, the imitation learning. But we essentially realized that we could get so far on just imitation learning. The way to look at it is, let's take the LLM analogy: we essentially have sort of the internet, or Common Crawl if you will, and every single lab is trying to simulate that in order to get similar data to train their agents. And so for us, the reason why we stayed independent and just did our own thing was, we think we essentially leapfrog every single company that's forced to either be a consumer of world models or build world models and take that foundation model bet, whereas we bet on spatiotemporal agents and can be in a place where we have a lot of customers years before any of the labs even get there. And maybe the most similar comparison is what Anthropic did with code, right? Anthropic just focused really, really hard on nailing the code use case; their models are incredible for it, and a lot of their customers use it for it. So we just want to become incredible at this spatiotemporal agent use case, and likely that starts in game simulation, and then using world models we can start expanding out to other areas.
>> So you showed me a little bit of how it does generalize out.
>> Yeah.
>> But games is still kind of the common ground.
>> Yeah. Games and simulation. I would specify it as game engines in particular. So even if you're, for instance, simulating human behavior in Omniverse, because they're trying to create better training data for factory floors, you can use it.
>> Yeah. Maybe Meta has a similar data set because of the Quest.
>> I never really asked them. I never really looked into the Meta Quest specifically. So, you need a few things. There are lots of companies that have, maybe, recorders, but you also need the public graph; otherwise you can't train on the data, right? You can't train on people's private videos that they have saved somewhere. And so I think you need the social network graph components, because these videos need to be on the internet...
>> To rank?
>> No, to train on them. Yeah, I mean, I think generally people don't want you to train on those, because these things live on your device, usually, right? And you can't train on anything that lives on your device; you actually need to go and upload it and do your thing, right? For Meta specifically, I think also the scale of VR is still pretty small. The number of environments in VR that have consumption at scale is probably in the hundreds, whereas on PC, it's probably in the tens of thousands, right? So you get a lot less diversity. The three-dimensional input space of VR is pretty interesting; we see some of this too, obviously. And so, yeah, I do suspect Meta starts using these types of things, but it's unclear to me whether they can get to a similar scale of data or diversity of environments as we can.
>> Yeah, there's a lot of challenges there.
>> Yeah. Um, okay. I want to take this in a few different directions, but let's fill out the papers first. Maybe one more to mention is GAIA.
>> Yeah.
>> Which, uh, I actually interviewed the GAIA authors, but that too seems like the particular insight that brought them over.
>> Yeah. So Anthony, who led the research on GAIA-2, is also one of the engineers who joined our team. So it's the core contributors for DIAMOND, and then Anthony, and we just had three more researchers join this week. It's been a good week. And yes, I think a lot of the approaches in GAIA-2 were heavily inspired by DIAMOND. And then Vansa, who was one of the authors of DIAMOND, was already at Wayve by the time that I emailed them, and he also realized what this was, and realized that you could scale world models to a much larger scale, and decided to make the leap as well. So I think everybody that sees the data set makes the leap, but it takes a while to wrap your head around it, because it's like, oh, it's video games, right? Intuitively it doesn't make sense. And then when you actually understand it, and you see how we've been able to transfer it to physical-world video and things like that, then it makes sense, and then everybody tends to jump. I would call it video games and call it RL, so then, yeah...
>> If I lived in San Francisco, maybe I would. Yeah. [laughter]
Just a quick note, because we actually cover all these papers on Latent Space.
>> SIMA 2 did not seem to have as much impact as SIMA 1, and I don't really know why; they did a lot more work. Genie 3 had a ton of impact, but I also felt like, because people could play with the model, it just seems an extension of all those things. I guess, any quick takes on SIMA 2 and Genie 3, which both came out this year?
>> Yeah, I'll talk about SIMA 2. The steerability of SIMA 2 was to me the most impressive part, because lining up the action sequences and the text conditioning is quite hard to do, right? And the fact that... it's also quite interesting that it means they can sort of use Gemini as part of the flywheel, right? Where you can scale this orchestrator as an independent, almost like a puppet master if you will, and then in theory Gemini could orchestrate many instances of SIMA. That to me is the most interesting part, and I tend to agree with this, where I think our models will initially be used like that: you'll have an orchestrator VLM of sorts that's kind of managing instances and instructing them. And I think SIMA showing that you can do this was fascinating. Also, they didn't just have text conditioning; they were also able to do drawings and markings of where to go. They really took an interesting end-to-end approach that I look forward to seeing a lot more of.
>> But you talk to them, right? Is there room to collaborate?
>> Yeah, I think... yeah, we're very friendly with DeepMind. We like them a lot. I just saw the team not too long ago, and, you know, we're big fans of their work.
>> The headline that kind of came out of Alex Heath's coverage of you is that you're the biggest bet that Vinod Khosla has made since OpenAI.
>> Yeah. How did that conversation start?
>> Okay. So, I know his style, and maybe I'll get slapped on the fingers for revealing this or whatever, but...
>> Forgive me if this is bad.
>> What he does is he asks you to draw, like, a 2030 picture of your company, and I think he just picked n plus 5 years, but whatever, I don't know.
>> I did the same to you.
>> Yeah. Um, he asks you to walk that back from first principles all the way to today, and he expects you to do that flawlessly, where he can challenge any assumption, any part of the vision, and he asks you questions, right? He has a very technical background, he also has a bunch of technical people on his team, and he truly backs people that have these very large visions, on that vision and on the ability to defend it alone. And that's what he did for us, and I think that's why he made that bet. I think also, through this question he gets to know a lot about how technical you are. He gets to know how well you think from first principles, because if that vision is not connected to something real, it's very easy to suss it out by asking good questions. And then he just backs you fully. I think he really gets in your corner, if it's the right fit. And yeah, they've been incredible partners. They've opened so many doors for us.
>> I had to ask the question; it's a very notable story. Obviously a lot of work went into it, but it's also worth it when it comes out the other side.
>> For sure.
>> One of the things I also wanted to... I guess I kind of asked this question out of sequence, but one of the things that excites me about talking to you is, there are a lot of people like you who are founders of businesses that along the way have accumulated a ton of data, and yours happens to be highly valuable. Before deciding to do the independent journey, you also talked to other companies about potential licensing or acquisition, and something I want is your learnings from that period. But also, one version of this is very simply: how do you value data?
>> Yeah, I don't think you can value it unless you actually model it yourself and see what the capabilities are. That's my real answer.
>> You say model it, as in train a model on it.
>> Yeah.
>> But that's obviously not doable for everyone.
>> Um, and also, I think my general advice would be: model capabilities increase, and models are also very good at labeling, generally, right? What I was afraid of when I was having some of these conversations was, okay, as the capabilities increase, you're just going to need less ground-truth data, and you can do more model-based or synthetic data generation. I would recommend, if you're going to do large data deals, just try to get a large chunk of equity in the company that you're doing the deal with, if you can. Now, a lot of them won't do this, but that to me would... or just go do the research, figure out what's actually possible. In our case, we were quite lucky in the sense that this is actually the foundation data, right? And that's not true for every data set. I think we just happened to hit a particular gold mine.
>> But you also did the groundwork; you did the action mapping thing a year and a half ago.
>> Yeah.
>> So you were grounded.
>> Yeah. That's the thing: you have to be grounded, right? And I think that's the hard part. And I think a lot of what's interesting is, you can also look for whether scaling laws already exist for your data type. For video there were some, but for these input-action-labeled sets there really weren't any. The other question is, does it go into LLMs? Does it go into world models? What type of model is it going to be used for? And I think that's an important thing to know. So, if you're having these conversations with labs about data, just make sure that you actually understand what it's going to be used for, because that's a very good way for you to make the decision yourself about whether you want to pursue that. Now, a lot of them won't tell you that, and I think in that case you generally just don't want to do it. Because, for our case, we really cared that, for instance, there weren't going to be competing products with game developers built, right? Because we didn't want to bite the hand that feeds us, and I think we are part of the games industry. So those questions I think are normal, and then we eventually decided, you know, we just have the data, we're just going to go do it ourselves, and that's when the rest happened.
>> Yeah. And you assembled the team to take advantage of that. I feel like you've aligned a lot of stars in order to make GI work.
>> Yeah.
>> And other data founders are at the beginning of that journey.
>> Yes.
>> Or founders who happen to have data, but they have a main business, right?
>> I don't know... there are two sides to this, right? It's really easy to be super naive about it, and I had a lot of people tell me initially, oh, it's not that valuable, you're just making this up. And so for me, doing the work and actually understanding it myself was a really, really big part of building the confidence to go start the company. But a lot of the time it is true that model capabilities increase so quickly that certain data you just don't need anymore. And so I think it's really important to get people to do the work, such that you can make these types of distinctions. So my recommendation would be: go build models with your data, see if you can create any sort of capabilities that aren't clearly already there, or on a path to being there, and then figure out where you go from there.
>> Yeah. I didn't want to ask this earlier, but you gave me the opportunity. When you say do the work, the coursework and all that, your co-founders gave you some homework.
>> Yeah.
>> Is this like some books? I mean, Coursera?
>> No, this was François Fleuret. He has "The Little Book of Deep Learning," and then he also has a full course that he's published on his website. I went through the entire course over the summer. I believe it's something like 30 or 40 lectures, with take-home projects and things like that. And I would recommend anybody do this. It goes through the history of deep learning, the topology; it takes you through the linear algebra, the calculus, and eventually you end up with the chain rule, and by that time you've done all of the more important concepts. It takes you through how you create neural networks using these concepts that you've learned.
>> Wow. This is super first principles.
>> This guy... and I've had the opportunity to spend some time with him as well. He is one of the most first-principles people I've met in my entire life. I'm convinced... I actually asked him, why did you make the course? He said, oh, because I thought all the other courses weren't right. And because he is so first-principles, he can only explain things from first principles: everything you see, and how he explains things, it's all from first principles, including the history of deep learning itself, which was part of the course. So he goes through everything, and by the end of it, I think I now have a pretty good intuitive understanding of how everything works. But obviously, still, I like to describe it as: I'm the guy who just got his driver's license. I can drive the car, and my co-founders are like the F1 drivers who have done this for years; they know where all the gaps are, and so I enjoy getting to learn from them. The cool thing is also that world models is just a very, very new space, and so I get to bring ideas to the table that no one thought of, not because I'm great at this, just because it's such a new space that people just haven't tried it yet.
>> Um, so...
>> Just to get a definition down: what are world models to you?
>> You know, in a video model you might predict the next likely sequence or the next most entertaining frame. What world models do is they actually have to understand the full range of possibilities and outcomes from the current state, and, based on the action that you take, generate the next state, the next frame. And so it is a much more complex problem than traditional video models. So to me, it is a world that is accurately generated based on the actions that you take, as a result of what's already been generated.
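The distinction Pim draws can be boiled down to a type signature: a video model predicts the next frame from past frames alone, while a world model also conditions on the action taken. The interfaces below are only a sketch of that distinction, with placeholder types, not a description of any particular architecture.

```python
from typing import Any, Protocol, Sequence

Frame = Any    # stand-in for an image tensor
Action = Any   # stand-in for a controller / keyboard+mouse input

class VideoModel(Protocol):
    def next_frame(self, past_frames: Sequence[Frame]) -> Frame:
        """Predicts the most likely (or most entertaining) continuation of the video."""
        ...

class WorldModel(Protocol):
    def next_frame(self, past_frames: Sequence[Frame], action: Action) -> Frame:
        """Must cover the full range of outcomes: the same history leads to
        different next states depending on the action that was taken."""
        ...
```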
>> And just to fact check: it needs to understand physics. It needs to understand, if I'm building with one type of material, how it interacts with some other type of material.
>> Yeah, I think the interactions are the most important part. I think that's a reason why world models are so fascinating. One of the things that I did when I was studying over the summer was I tried to actually build a super rudimentary PyTorch-based physics engine, which I would not recommend (writing a physics engine in PyTorch, for obvious reasons), but I wanted to be able to...
>> Because it's differentiable, so you can train through the model.
>> Yeah, exactly, you can, and then you can train. And so, you know, I had so many people asking me, why aren't you just simulating or generating this data? And I really wanted to understand from first principles why. And I think the most important thing that I figured out was that the compute complexity of simulation goes up really, really rapidly with three variables: first, the number of agents in an environment; second, their degrees of freedom, so their individual freedom; and third, the information that each action reveals. So, for instance, if you have a text action or a speech action, the environment can change so much based on whether you say "water" or "fire" that the outcomes are going to be completely different in terms of how a human would behave in that type of situation. And it goes up so quickly with those three variables that at some point you hit a point where you just want to maximally bet on either video transfer or generation of these environments using world models, because that type of stochasticity is just incredibly difficult. But it's already very, very present in a lot of the video pre-training that goes into these world models, right? And so I think for us it is more about making a maximal bet on video transfer and interacting with things that are difficult to simulate, and the steerability with text is also really interesting, than it is about betting against simulation or something like that. And so I think there's still a large market for traditional simulation engines, specifically in areas where video is really hard to get.
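In the spirit of the rudimentary differentiable physics engine he mentions, here is a toy example of why differentiability matters: with the simulator written in torch ops, gradients flow through the whole trajectory, so you can optimize inputs directly. This is our own minimal illustration, not his code; all constants are arbitrary.

```python
import torch

def step(pos, vel, dt=0.02, g=torch.tensor([0.0, -9.8])):
    """One Euler step of a point mass under gravity, written with torch ops so
    gradients flow through the whole trajectory."""
    vel = vel + g * dt
    pos = pos + vel * dt
    return pos, vel

# Because the simulator is differentiable, we can optimize an initial velocity
# so the trajectory ends near a target point, by backpropagating through 50 steps.
target = torch.tensor([5.0, 0.0])
vel0 = torch.tensor([1.0, 5.0], requires_grad=True)
opt = torch.optim.Adam([vel0], lr=0.1)

for _ in range(200):
    pos, vel = torch.tensor([0.0, 0.0]), vel0
    for _ in range(50):
        pos, vel = step(pos, vel)
    loss = ((pos - target) ** 2).sum()
    opt.zero_grad()
    loss.backward()
    opt.step()

print(pos.detach())   # should land close to the target after optimization
```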
>> Is this exactly what the big labs are also saying when you talk to them?
>> I honestly haven't talked to the big labs since we started working on these ourselves. I think people are more reserved with what they share with us.
>> Yeah, of course. With that said: how would you contrast your version of world models with Fei-Fei's and Yann LeCun's?
>> Yeah. So, I don't know exactly what Yann LeCun is doing today. My understanding is it's based on the JEPA, like V-JEPA, approach. So I'll start with Fei-Fei. I think what's really interesting about Fei-Fei's approach is that you are in some way able to reuse the splats, right, in game engines and in things that let you stay in a verifiable domain, which I think is a really interesting approach. However, my understanding is they're currently not interactive, which in my opinion is the whole point of world models, right? They're environments; they're great environments. And I think from a business perspective, they picked a really important part of the tool chain, but to me, that's not really a world model. But my guess is they'll get there, right? They'll start generating...
>> Yeah. They'd just reuse it.
>> Yeah. Exactly. Exactly. And, right, Fei-Fei is one of the founders of the entire space, so it's going to be really interesting to see what that interactive piece looks like before I can really judge their approach.
>> We actually interviewed her before, with Justin Johnson, her co-founder. He was more focused on the physics side of things in games. I do think that, basically, with the splats, if you just add more dimensions for the forces acting on them, you get physics out of the box, because these are essentially virtual atoms that then have all of that physics applied to them.
>> Yeah, I'm excited to see what that looks like when they actually release it. It's really hard for me to comment on anything. I really like the frame-based approach, because all of our video, all of our training data, is in this format.
>> Yes. Yeah. We actually asked them about this, and they were like, yeah, it's possible, but they're choosing the splat.
>> Yeah. And you can also go from splat to frames, right? I'm sure you can. It wouldn't be easy; you'd have to actually render out the environment, so it's not going to be a simple problem, but in theory it has to be something you can do if you really wanted to. It's almost like having a more ground-truth three-dimensional representation of the underlying world, right? So I think it's an interesting approach. It might be overkill, though: you're also dealing with much larger degrees of freedom in the output space, so who knows how well it scales. I like the fact that these video models also use things like autoencoders, right? You can actually have the world model predict at a much smaller
>> resolution or size.
>> Yeah, exactly. And then you can use diffusion upscaling or methods like that to enrich it. So I think world models, in that sense, allow for a much more controlled space that we know really well. I'm not suggesting their approach is wrong; this is just what we really like about ours.
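As a rough illustration of the frame-based, action-conditioned idea being described (predict the next state in a small latent space, then reconstruct or upscale it), here is a minimal PyTorch sketch. The architecture, sizes, and names are assumptions for illustration, not GI's actual models.

```python
# A minimal latent world-model sketch: encode a frame into a compact latent,
# predict the next latent conditioned on the player's action, then decode.
# In practice a diffusion upscaler could replace the simple decoder.
import torch
import torch.nn as nn

class FrameEncoder(nn.Module):
    def __init__(self, latent_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1),   # 64x64 -> 32x32
            nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1),  # 32x32 -> 16x16
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 16 * 16, latent_dim),
        )

    def forward(self, frame):             # frame: (B, 3, 64, 64)
        return self.net(frame)            # (B, latent_dim)

class ActionConditionedDynamics(nn.Module):
    """Predict the next latent state from (latent, action)."""
    def __init__(self, latent_dim=256, action_dim=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + action_dim, 512),
            nn.ReLU(),
            nn.Linear(512, latent_dim),
        )

    def forward(self, latent, action):
        return self.net(torch.cat([latent, action], dim=-1))

class FrameDecoder(nn.Module):
    def __init__(self, latent_dim=256):
        super().__init__()
        self.fc = nn.Linear(latent_dim, 64 * 16 * 16)
        self.net = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1),  # 16 -> 32
            nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1),   # 32 -> 64
        )

    def forward(self, latent):
        x = self.fc(latent).view(-1, 64, 16, 16)
        return self.net(x)                # (B, 3, 64, 64) low-res rollout frame

# One imagined step: the "world" is advanced entirely in latent space.
enc, dyn, dec = FrameEncoder(), ActionConditionedDynamics(), FrameDecoder()
frame = torch.randn(1, 3, 64, 64)         # current observation (pixels only)
action = torch.randn(1, 16)               # embedded controller action
next_latent = dyn(enc(frame), action)
predicted_next_frame = dec(next_latent)   # upscaling/diffusion would refine this
```

The point of the sketch is that the dynamics model only ever operates on the compact latent; the expensive pixel-space work happens once per step at decode time, which is where something like diffusion upscaling would slot in.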
Honestly, the podcast Yann did, I don't remember which one it was, but a long time ago, where he basically proclaimed LLMs to be a dead end, was one of the things that inspired me to do this.
>> I think that's pretty much consensus among world-model people. Basically everyone who heard that was like, stop with the LLMs and just pivot to world models. I would say the main counter-perspective: I asked this exact question to Noam Brown from OpenAI, and he was like, well, they're learning world models too, right? So there's basically...
>> Yes, so...
>> Yeah, I'm not one to proclaim LLMs are dead ends, personally. I think they're actually quite useful,
particularly as orchestrators. The way I think about it is: as humans we had a three-dimensional world, and then we invented text as, in a way, a compression method. We invented text in order to communicate with each other in a common way, in a way that compresses all of the information we perceive in three-dimensional space into a single sequence. I think that allowed the sciences to emerge; it allowed literature and so many of the parts of the world that we cherish. So I think it's a critical part of the whole picture. I also agree that it's very clear LLMs do build sort of internal, implicit world models, and so I think they'll be very helpful as things like orchestrators. The problem is the generalization. When most of the pre-training is text, or largely text sequences, text becomes the generalization backbone, and I think you want that backbone to be more spatial and temporal in nature, with text as just one part of it. And I think the other argument against LLMs is the autoregressive nature of the prediction itself: the fact that it runs the entire output through the transformer in order to predict the next token, which doesn't fit how the environment in the real world is continuous. It's always changing, and LLMs kind of just forget about that. So the fact that text doesn't necessarily generalize well to spatiotemporal context, and the autoregressive nature of the prediction using text, those are the two main arguments. I think text prediction is just one of the actions that will come out of these policies and world models. I think speech and text generation will just be
>> one of the actions that can be a part of that. I think there will just be labs coming at this problem from both sides, and everyone ends up in roughly the same place, and the same place will be whatever people think is cool, right, whatever the consumer gets.
>> Whatever is closest to AGI.
>> Yeah, and so I don't think there's a clear answer. I think it's really interesting to come at it from the world modeling side, but it's also because we have to, right? Because text is largely commoditized; we can import all the text research, which is interesting and tempting.
>> Yeah, it makes sense that you can probably recover it. It's sort of like you're taking a step back and starting your own branch of the ML research tree, but my guess is you just end up recovering all the other tech emergently.
>> Yeah. We can import a lot of that research, right? A lot of it is...
>> That's really cool on the research side. Let's talk about the stuff that GI is producing, more like the business of research and the product output. You mentioned the word customers. Who are your current customers?
>> Yeah. So we're already working with some of the largest game developers in the world.
>> Yeah.
>> We're also working with game engines directly. And so really what we're doing at the moment is replacing, essentially, the player controller inside of a game engine. Anything that you're currently handling with behavior trees or things that you're deterministically coding, we hope to replace with a single API, which is just: you stream us frames and we predict actions. And that can be inside an engine, or it can eventually even be inside the real world. Hopefully those are then also steerable. The models that you saw weren't text-steerable yet, but I think we want to get to a point where they're fully steerable.
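For a sense of what "you stream us frames and we predict actions" could look like from the engine side, here is a hypothetical client loop standing in for a game's player controller. The endpoint URL, payload shape, and the capture_frame/apply_action helpers are all made up for illustration; this is not a published GI API.

```python
# Hypothetical sketch of a frames-in / actions-out control loop.
import base64
import requests

PREDICT_URL = "https://api.example.com/v1/predict-action"   # hypothetical endpoint

def capture_frame() -> bytes:
    """Placeholder: grab the current rendered frame as PNG bytes from the engine."""
    return b""

def apply_action(action: dict) -> None:
    """Placeholder: feed the predicted controller state back into the engine."""
    print("applying", action)

def control_loop(steps: int = 10) -> None:
    for _ in range(steps):
        frame_png = capture_frame()
        payload = {"frame": base64.b64encode(frame_png).decode("ascii")}
        resp = requests.post(PREDICT_URL, json=payload, timeout=1.0)
        # Assumed response shape: {"action": {"stick_x": ..., "buttons": [...]}}
        apply_action(resp.json()["action"])

if __name__ == "__main__":
    control_loop()
```

The design point is that the engine only ever exchanges pixels for controller state, so the same integration works whether the thing on the other end is a bot in a game or, eventually, a policy driving hardware that accepts gamepad-style inputs.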
>> And steerable means, like, "I want you to do X," and it figures everything else out in the prediction?
>> Yeah, I think it's text conditioning on the generation. So yeah, the ability to... you're right, we want to get to a point where you can steer it generally, and that's why it's called General Intuition: where we can mimic the intuition of all these gamers into human-like behaviors in any situation. As I mentioned, the lab is also named after that quote from the AlphaFold story: wouldn't it be amazing if we could mimic the intuition of these gamers, who are, by the way, only amateur biologists. On the path to AlphaFold, he tried to get an AI to train on Foldit to generate a lot of data for AlphaFold. And so for us the north star, what we hope to get to one day, is being able to represent scientific problems in three-dimensional space and then have a spatiotemporal agent capable of perceiving that space, using hopefully also the text reasoning capabilities that LLMs have today in addition to the spatiotemporal capabilities, to be able to work on the other side of that problem. So that, for us, is the north star.
That's why we're trying to be hyper-focused on core workloads, the same way that Anthropic was hyper-focused on code, and use that to get into organizations and expand from there.
>> Yeah.
>> Just as a side note, since you mentioned Anthropic: any idea what they did to pull that off?
>> No. Out of any lab, I probably know Anthropic the least, to be honest. I admire them, though.
>> Yeah. Well, the current working theory is that they had a super lucky roll of the dice, [laughter] and then it compounds from there. That sounds like a nice story. I'm sure it's not that.
>> Yeah. Okay. So, why do game developers want this?
>> If you're a game developer, how well you're actually retaining players, once your game is already at scale, is decently dependent on how good your bots are. So if someone is logging in at an obscure time, let's say 3:00 a.m. in America, and your player liquidity is low, then you need really, really good bots to keep those players engaged.
>> Is this known? Is this a thing?
>> Yeah, for sure. Like Fortnite and whatever.
>> A lot of them aren't human. Yeah. And so, as a human, do I want to play against bots?
>> Usually it's not just bots; it's players mixed in with bots, because you don't want to play only against bots, but it's better to have a full game than an empty game.
>> Yeah. And so I think as long as it's part of the environment, it's okay.
>> That means you also have to grade the skill level.
>> Yeah, which we can do, because we know exactly how good people are at these games.
>> Yeah. I think for us, bots are kind of step one. What I was showing you is that we're building a general agent that can play any game in real time. But really that extends into all of simulation, right? Like in GTA V, for instance, people are genuinely role-playing real life,
>> right?
>> And so they're actually behaving in ways quite aligned with the goals they set for themselves. So you have all these examples represented in video games. You have Truck Simulator, Power Wash Simulator,
>> power wash,
>> Power Wash Simulator, where the behaviors that you'd want to be able to perceive in the real world are actually all there.
>> Yeah. It's really striking how seriously some gamers take Truck Simulator. If you haven't seen this, you should watch it. They buy the whole truck-driving setup and they're doing the job of a truck driver.
>> Yeah. Like I mentioned to you, we have more people at any given time on Medal playing with steering wheels in Truck Simulator and these types of games than Waymo has cars on the road. It's a ridiculous stat, but it's true.
>> Yeah. I mean, I used to think that to qualify for self-driving you could kind of just play a lot of GTA 5. I'm bad at it, though.
>> Yeah. Our bet is not that we can zero-shot any of these things. It's just that the next self-driving company can maybe collect 1% of the data, because, for instance, clips already self-select into negative events and adversity. So a lot of our dataset, because it's already the highlights,
>> is really precisely what a lot of these companies spend their last 20% collecting.
>> Right. And I think that's the main argument: if you're another company looking at what we're doing, the thing that people won't understand at first is that anything you're currently doing in pre-training, as long as your robot can be controlled using a game controller, we hope we can move to post-training for you. So our bet is not that we can create the next self-driving car company. It's just that the next self-driving car company hopefully only needs 1% of the data, or maybe 10% of the data, I don't know, to be able to deliver a really good product.
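A small sketch of the "clips already self-select into negative events" point: because clips carry user-supplied titles and labels, a downstream team could filter a compact, high-signal post-training set for a specific domain rather than collecting everything from scratch. The Clip schema and keywords here are illustrative assumptions, not Medal's actual data model.

```python
# Illustrative filtering of a clip catalog down to adverse driving moments.
from dataclasses import dataclass

@dataclass
class Clip:
    clip_id: str
    game: str
    title: str
    duration_s: float

NEGATIVE_KEYWORDS = ("crash", "wipeout", "fail", "near miss")

def is_negative_event(clip: Clip) -> bool:
    title = clip.title.lower()
    return any(keyword in title for keyword in NEGATIVE_KEYWORDS)

def build_post_training_set(clips: list[Clip], game: str) -> list[Clip]:
    """Keep only the rare, adverse moments for a given driving game."""
    return [c for c in clips if c.game == game and is_negative_event(c)]

clips = [
    Clip("a1", "Euro Truck Simulator 2", "Highway crash at night", 42.0),
    Clip("a2", "Euro Truck Simulator 2", "Smooth delivery", 35.0),
]
print(build_post_training_set(clips, "Euro Truck Simulator 2"))
```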
>> Yeah. The term that comes to mind a lot is active learning. I don't know if you identify with that; it got less cool for a bit, but it's on the uptrend again, and you obviously have the best dataset for the sort of high-intensity, or as you say negative, events, though it doesn't have to be negative to be valuable.
>> Yeah, for sure. I think "negative events" is just the most common term people use, because if you're Tesla you want the crashes, you want...
>> Right, right. But in gaming it's... yeah.
>> So the model that you saw obviously had really, really incredible moments, and that was because it had a large representation of people at their best.
>> Yeah.
>> And worst. Yeah.
>> Yeah. Amazing. Okay, cool. Anything else on the customer development side that you want to touch on?
>> Yeah. We're also already working with robotics and manufacturing companies, but again, the key is that the robot has to have gaming inputs. Our bet is not that we can transfer over to robots with higher degrees of freedom than a keyboard and mouse or a controller can express. It's really just that we can move the hard work of pre-training, hopefully, to post-training.
>> Yeah. It's kind of like a foundation model that's a very good basis to start from.
>> Yeah. You're going to give us frames, and likely some text.
>> Or you'll license the model to them? Because people have been wondering.
>> Yeah. Our business model is initially going to be an API, like the Anthropic API. But you also saw, for instance, some of the video labeling models that we've been able to develop. So the goal is for any company to be able to bring in their video data as well, and we can create custom versions of the policy, the agent, for you first. If that doesn't work, we're already working with a customer where we distill a model and they turn it into a product for themselves.
>> So people can engage with you at the agent and API level, or at the model level. Can you also buy data?
>> No, we don't sell data.
>> Okay, cool. So that's the business. And is there a world in which, I mean, I think this is on your landing page, you are a frontier lab for world models. Is there a world in which a more application-layer thing comes out of this, like a ChatGPT for whatever?
>> Yeah. You're going to see us launch a few things on Medal itself that are going to blow your mind as a result of this agent. I'll leave it to the imagination for now.
>> If people are intrigued, they can reach out by email.
>> Yeah. On the world modeling side, I think one thing people underestimate is that Medal is already one of the largest video consumption platforms as well; people watch millions and millions of videos a day. So world-model-based entertainment and things like that, while it's not a focus for us right now, I think on the consumer side we have the ability to move very, very quickly here and get it integrated in a way that I don't think anyone else can.
>> Yeah, you could theoretically do video gen like Sora, or what is that one, the Meta one? You could theoretically generate clips that nobody actually played, but, you know, it's a bit of a gimmick.
>> Yeah, I think for us, the games being so human-centric is a really big part of what makes us special. I actually just don't think that would work. One thing that we are really excited about, though, and I'll give you one sneak peek of what we're thinking about: what if you could literally replay any of the clips that you have inside a world model, or your friends could play them? Like, I showed you a model that already took part of your clip as context.
>> Instant replay: you enter that world.
>> But it's also how we go from imitation learning to RL, right? Because it's part of our research roadmap anyway to make every single clip on Medal playable. So who is to say that doesn't apply to the actual clips that you take?
>> Yeah. Yeah. Interesting. Can you say more about the RL potential?
>> We describe Medal as the episodic memory of humanity in simulation. When you take a clip, the way to think about it is that you get the highlight of what is maybe 3 hours of play time; you maybe get the 2 to 3 minutes that were the most out of distribution. It is genuinely your episodic memory of that play time in simulation, the things that you most want to remember and share. We want to be able to load those, and this is the work that Anthony is doing, and the reason why we built world models: every crash that you run into in Euro Truck Simulator or American Truck Simulator or a driving game, we want to be able to load, right? And again, these are ground-truth labels, so we know precisely the actions that lead up to the negative events. They're also title-labeled: when people upload them onto the platform, they say, okay, it's a crash. So we can select all these events, and if we can put them inside a world model, we can go in and train reward models that then reward based on how you perform in clips that actually contain negative events, for example. And so for us it's very much about this: we can create an LLM moment for imitation learning, but actually making every single clip on the platform playable, at billions-of-clips scale, is how we go from imitation learning to RL.
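A highly simplified sketch of the imitation-to-RL loop described here: seed a playable world model from the frames leading up to a labeled negative event, let the policy act, and score it with a reward model that penalizes reproducing that event. Every class below is a stand-in for illustration, not GI's actual stack.

```python
# Toy rollout: reward a policy for avoiding a clip's labeled negative event.
import random

class WorldModel:
    """Stand-in: steps a playable environment reconstructed from a clip."""
    def reset_from_clip(self, clip_frames):
        self.t = 0

    def step(self, action):
        self.t += 1
        return f"frame_{self.t}"          # imagined next observation

class NegativeEventClassifier:
    """Stand-in reward model: probability that the negative event
    (e.g. the labeled crash) is occurring in the rolled-out observation."""
    def prob_negative(self, observation) -> float:
        return random.random()

def rollout_reward(policy, clip_frames, horizon=60):
    world, classifier = WorldModel(), NegativeEventClassifier()
    world.reset_from_clip(clip_frames)
    obs, total_reward = clip_frames[-1], 0.0
    for _ in range(horizon):
        action = policy(obs)
        obs = world.step(action)
        # Dense negative reward for steering toward the labeled bad outcome.
        total_reward -= classifier.prob_negative(obs)
    return total_reward

print(rollout_reward(policy=lambda obs: "noop", clip_frames=["ctx_0", "ctx_1"]))
```

The key property the transcript points to is that the clip supplies both the starting context and the ground-truth label, so the reward signal comes for free once the world model makes the clip replayable.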
>> Cool. We covered a lot of it. Is there anything else you want to cover before we grapple with the long-term vision stuff?
>> Yeah. I think for us, this is a very, very ambitious long-term bet. We need the best researchers in the world who want to work on this stuff. It's really exciting not being extremely data-constrained; we get so many learnings every week that we didn't think were possible, and it makes working here a joy. The other thing is that because we have such a large data moat, we don't have to be as concerned as the LLM companies about publishing, because
>> no one would be able to replicate it anyway.
>> Exactly. No one can replicate the models, right? And so for us, we really want to bring back the original culture of open research, which is why we did the partnership with Kyutai in France.
>> Can you say more? I actually hadn't heard about that.
>> Yeah, we just announced our partnership with Kyutai in France, which is an open science lab in Paris, one of the best research labs in the world. Eric Schmidt, I believe, funded it, along with some French backers. They are essentially acting as the partner that's currently doing a lot of open research on the data. We also want to partner with universities, because we do believe this is the frontier, but it's so data-constrained that everyone has their hands
tied behind their backs right now. And so we want to help fix that. For instance, we want to work with universities to build things like negative-event prediction models for, say, trucks in India, on all the truck data where these crashes occur. We have all these things that we know we can do that we just haven't had the time to do. So if you're listening to this and you're maybe an academic institution or something and you want access to some of this data in an educational or research fashion, I think we're quite open to that, because we want to educate people. And other than that, we just want to work with the best infrastructure and research engineers on the planet as we're going into scaling runs that have thousands, tens of thousands, eventually hundreds of thousands of GPUs.
>> Yeah. Amazing. I primed you on this as the closing question ahead of time.
>> Yeah.
>> So what does GI become in 2030?
>> Yeah. In 2030 we want to be the gold standard of intelligence. Any sufficiently long sequence is fundamentally spatiotemporal, so by nailing spatiotemporal reasoning you go after the root problem of intelligence itself. What the world looks like: I group the stages of AI in three, and I credit Andrej Karpathy for teaching me this: bits to bits, then atoms to bits and bits to atoms, and then atoms to atoms. In the atoms-to-atoms stage, I want GI models to be responsible for 80% of all the atoms-to-atoms interactions driven by AI models. The reason is that we were able to unblock intelligence in robotics so quickly, because intelligence is the bottleneck, that supply chains actually converged on gaming inputs as their primary input method, and they converged on essentially simpler systems that let us do a lot more, a lot quicker. So we are essentially the 80%-of-the-market approach, and then you have lots of companies with specialized, maybe humanoid, robot OS stacks that are the other 20%. So I want to be responsible for 80% of all the atoms-to-atoms interactions driven by these models, be the gold standard for intelligence, and do maybe 100x more in simulation, because I think simulation will actually be the larger market initially. In simulation you have very few constraints, and from a safety perspective simulation is much easier, so I think a lot of the takeoff is initially in simulation. The simulation use cases, like the scientific ones I mentioned, I'm really, really excited about. So, yeah: 80% of atoms-to-atoms interactions coming downstream from these types of world foundation models, and then 100x more in simulation.
>> Yeah. It reminds me a lot of what Mark and Priscilla at the Chan Zuckerberg Initiative are doing with virtual biology, because you can do a lot in simulation.
>> Yeah, and you can do it a lot faster.
>> Amazing. Thank you for inviting us to your office, and thank you for sharing a little bit about what you're training.
>> Thank you. Yeah.