After LLMs: Spatial Intelligence and World Models — Fei-Fei Li & Justin Johnson, World Labs
By Latent Space
Summary
## Key takeaways
- **Deep Learning = Compute Scaling History**: The whole history of deep learning is the history of scaling up compute; from AlexNet's GPU jump, we've gained 1000x performance per card and can now train on thousands of GPUs, a millionfold more than early PhD days. [04:08], [04:13]
- **Marble Generates Editable 3D Worlds**: Marble is a generative model of 3D worlds from text or images, allowing interactive edits like changing a water bottle's color or removing tables, with precise camera control via Gaussian splats for real-time rendering on phones. [31:42], [34:36]
- **Spatial Intelligence Beats Language Bandwidth**: Language is a lossy, low-bandwidth channel for the rich 3D/4D world; picking up a mug requires seeing it, context, hand geometry, and affordance points—tasks hard to narrate but effortless spatially after 540M years of evolution. [44:37], [48:46]
- **Models Fit Patterns, Not Causal Physics**: Current models fit patterns like orbits but fail on force vectors or deriving F=ma; latent modeling won't yield causal laws, needing scaling, data, and possibly distilling physics engines into neural nets. [24:15], [52:47]
- **Gaussian Splats Enable Precise 3D Control**: Marble outputs Gaussian splats—tiny 3D particles with position and orientation—for efficient real-time rendering on iPhones and VR, enabling precise camera placement and recording unlike frame-by-frame video models. [34:36], [33:16]
- **Transformers Model Sets, Not Sequences**: Transformers are fundamentally set models, not sequence models; order comes only from positional embeddings, with operators like attention being permutation-equivariant, opening architectures for world models beyond 1D. [56:47], [57:16]
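The last takeaway is easy to verify directly. The sketch below is an illustration added here for reference (not from the episode or World Labs): it checks that PyTorch multi-head self-attention, with no positional embeddings, is permutation-equivariant, so shuffling the input tokens just shuffles the output.

```python
# Minimal check that self-attention without positional embeddings is
# permutation-equivariant: permuting the input tokens permutes the output
# the same way. Illustrative only.
import torch
import torch.nn as nn

torch.manual_seed(0)
attn = nn.MultiheadAttention(embed_dim=16, num_heads=4, batch_first=True)
attn.eval()

x = torch.randn(1, 5, 16)              # batch of 1, 5 tokens, dim 16
perm = torch.randperm(5)

with torch.no_grad():
    out, _ = attn(x, x, x)                                   # original order
    out_perm, _ = attn(x[:, perm], x[:, perm], x[:, perm])   # shuffled order

# The permuted input yields the permuted output.
print(torch.allclose(out[:, perm], out_perm, atol=1e-5))  # True
```

Order only enters once positional embeddings are added to the token features, which is the sense in which transformers are set models dressed up as sequence models.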
Topics Covered
- Deep Learning Equals Compute Scaling
- Academia Thrives on Wacky Ideas
- Spatial Intelligence Beyond Language
- Transformers Model Sets Not Sequences
Full Transcript
I think the whole history of deep learning is in some sense the history of scaling up compute.
>> When I graduated from grad school, I really thought the rest of my entire career would be towards solving that single problem.
>> A lot of AI as a field, as a discipline, is inspired by human intelligence. We thought we were the first people doing it. It turned out that Google was also simultaneously doing it.
>> So Marble, basically one way of looking at it, it's a system. It's a generative model of 3D worlds, right? So you can input things like text or an image or multiple images and it will generate for you a 3D world that kind of matches those inputs. So while Marble is simultaneously a world model that is building towards this vision of spatial intelligence, it was also very intentionally designed to be a thing that people could find useful today. And we're starting to see emerging use cases in gaming, in VFX, in film, where I think there's a lot of really interesting stuff that Marble can do today as a product, and then also set a foundation for the grand world models that we want to build going into the future.
Hey everyone, welcome to the Latent Space podcast. This is Alessio, founder of Kernel Labs, and I'm joined by swyx, editor of Latent Space.
>> And we are so excited to be in the studio with Fei-Fei and Justin of World Labs. Welcome.
>> We're excited, too.
>> I nearly said Marble.
>> Yeah, thanks for having us.
>> I think there's a lot of interest in world models and you've done a little bit of publicity around spatial intelligence and all that. I guess maybe one part of the story that is a rare opportunity for you to tell is how you two came together to start World Labs.
>> That's very easy, because Justin was my former student.
Yeah. So Justin came to my lab. You know, the other hat I wear is as a professor of computer science at Stanford. Justin joined my lab in... which year?
>> Uh, 2012, actually. The quarter that I joined your lab was the same quarter that AlexNet came out.
>> Yeah. Yeah. So Justin is my first...
>> Were you involved in the whole announcement drama, I guess?
>> No, no, not at all. But I was sort of watching all the ImageNet excitement around AlexNet that quarter.
>> So he was one of my very best students, and then he went on to have a very successful early career as a professor at the University of Michigan in Ann Arbor, and at Meta. And then, I think around, you know, more than two years ago for sure, both of us independently had been looking at the development of the large models and thinking about what's beyond language models, and this idea of building world models, spatial intelligence, really was natural for us. So we started talking and decided that we should just put all the eggs in one basket and focus on solving this problem, and started World Labs together.
>> Yeah, pretty much. I mean, after seeing that ImageNet era during my PhD, I had the sense that the next sort of decade of computer vision was going to be about getting AI out of the data center and out into the world. So a lot of my interests post-PhD kind of shifted into 3D vision, a little bit more into computer graphics, more into generative modeling. And I thought I was kind of drifting away from my adviser post-PhD, but then when we reunited a couple years later, it turned out she was thinking of very similar things.
>> So if you think about AlexNet, the core pieces of it were obviously ImageNet, the move to GPUs, and neural networks. How do you think about the AlexNet-equivalent model for world models? In a way, it's an idea that has been out there, right? Yann LeCun is maybe the most prominent proponent of it. What have you seen in the last two years that made you think, hey, now is the time to do this? And what are maybe the things, fundamentally, that you want to build as far as data, and maybe different types of algorithms or approaches to compute, to make world models really come to life?
>> Yeah, I think one is just that there is a lot more data and compute generally available. I think the whole history of deep learning is in some sense the history of scaling up compute. If you think about it, AlexNet required this jump from CPUs to GPUs, but even from AlexNet to today we're getting about a thousand times more performance per card than we had in AlexNet days, and now it's common to train models not just on one GPU but on hundreds or thousands or tens of thousands or even more. So the amount of compute that we can marshal today on a single model is about a millionfold more than we could have even at the start of my PhD. So I think language was one of the really interesting things that started to work quite well the last couple of years. But as we think about moving towards visual data and spatial data and world data, you just need to process a lot more. And I think that's going to be a good way to soak up this new compute that's coming online more and more.
>> Does the model of having a public challenge still work, or should it be centralized inside of a lab?
>> I think open science still is important. You know, AI, compared to the ImageNet, AlexNet time, has really evolved, right? It was such a niche computer science discipline; now it's just a civilizational technology. But I'll give you an example. Recently my Stanford lab just announced an open data set and benchmark called Behavior, which is for benchmarking robotic learning in simulated environments, and that is a very clear effort in still keeping up this open science model of doing things, especially in academia. But I think it's important to recognize the ecosystem is a mixture, right? I think a lot of the very focused work in industry, some of it is more seeing the daylight in the form of a product rather than an open challenge per se.
>> Yeah. And that's just a matter of the funding and the business model, like you have to see some ROI from it.
>> I think it's just a matter of the diversity of the ecosystem, right? Even during the so-called AlexNet, ImageNet time, there were closed models, there were proprietary models, there were open models. Or you think about iOS versus Android, right? There are different business models. I wouldn't say it's just a matter of funding per se; it's just how the market is. There are different plays.
>> Yeah, but do you feel like you could redo ImageNet today with the commercial pressure that some of these labs have? I mean, to me that's like the biggest question, right? It's like, what can you open versus what should you keep inside? Like, you know, if I put myself in your shoes, right? You raise a lot of money, you're building all of this. If you had the best data set for this, what incentives do you really have to publish it? And it feels like the people at the labs are getting more and more pulled in, the PhD programs are getting pulled earlier and earlier into these labs. So I'm curious if you think there's an issue right now with how much money has been taken in, how much pressure it puts on the more academic, open research space, or if you feel like that's not really a concern.
>> I do have concerns, less about the pressure. It's more about the resourcing, and the imbalanced resourcing of academia. This is a little bit of a different conversation from World Labs. You know, I have been, the past few years, advocating for resourcing the healthy ecosystem. As the founding director and co-director of Stanford's Institute for Human-Centered AI, Stanford HAI, I've been working with policymakers about resourcing public-sector and academic AI work, right? We worked with the first Trump administration on this bill called the National AI Research Resource, the NAIRR bill, which is scoping out a national AI compute cloud as well as a data repository. And I also think that open-source, open data sets continue to be an important part of the ecosystem. Like I said, right now in my Stanford lab we are doing the open data set, open benchmark on robotic learning called Behavior, and many of my colleagues are still doing that. I think that's part of the ecosystem.
I think what the industry is doing, what some startups are doing, running fast with models and creating products, is also a good thing. For example, when Justin was a PhD student with me, none of the computer vision programs worked that well, right? We could write beautiful papers. Justin has...
>> I mean, actually, even before grad school I wanted to do computer vision, and I reached out to a team at Google and wanted to potentially go and try to do computer vision right out of undergrad, and they told me, what are you talking about? You can't do that. Go do a PhD first and come back.
>> What was the motivation that got you?
>> Oh, I had done some computer vision research during my undergrad with, actually, Fei-Fei's PhD adviser.
>> There's a lineage.
>> There's a lineage here. So, I had done some computer vision even as an undergrad and I thought it was really cool and I wanted to keep doing it. So then I was faced with this sort of industry-versus-academia choice even coming out of undergrad that I think a lot of people in the research community are facing now. But to your question, I think the role of academia, especially in AI, has shifted quite a lot in the last decade. And it's not a bad thing. It's because the technology has grown and emerged, right? Like 5 or 10 years ago, you really could train state-of-the-art models in the lab even with just a couple of GPUs, but because that technology was so successful and scaled up so much, you can't train state-of-the-art models with a couple of GPUs anymore. And that's not a bad thing. It's a good thing. It means the technology actually worked. But that means the expectations around what we should be doing as academics shift a little bit. And it shouldn't be about trying to train the biggest model and scaling up the biggest thing. It
should be about trying wacky ideas and new ideas and crazy ideas. Um, most of which won't work. And I think there's a lot to be done there. If anything, I'm worried that too many people in academia
are hyperfocused on this notion of trying to pretend like we can train the biggest models, or treating it as almost a vocational training program to then graduate and go to a big lab and then be able to play with all the GPUs.
I think there's just so much crazy stuff you can do around new algorithms, new architectures, new systems; there's a lot you can do as one person.
>> And also, academia has a role to play in understanding the theoretical underpinnings of these large models. We still know so little about this. Or extend it to the interdisciplinary, what Justin calls wacky ideas.
There's a lot of uh basic science ideas.
There's a lot of blue-sky problems. So I agree. I don't think the problem is open versus closed, productization versus open-sourcing. I think the problem right now is that academia by itself is severely under-resourced, so that the researchers and the students do not have enough resources to try these ideas.
>> Yeah. Just for people to nerd-snipe, what's a wacky idea that comes to mind when you talk about wacky ideas?
>> Oh, I had this idea that I kept pitching to my students at Michigan, which is that I really like hardware, and I really like new kinds of hardware coming online. And in some sense the emergence of the neural networks that we use today, and transformers, are really based around matrix multiplication, because matrix multiplication fits really well with GPUs. But if we think about how GPUs are going to scale, how hardware is likely to scale in the future, I don't think the current system that we have, the GPU-like hardware design, is going to scale infinitely, and we start to see that even now: the unit of compute is not the single device anymore. It's this whole cluster of devices. So if you imagine...
>> A node.
>> Yeah, it's a whole node or a whole cluster, but the way we talk about neural networks is still as if they are a monolithic thing that could be coded on one GPU in PyTorch, but then in practice they could distribute over thousands of devices. So, just as transformers are based around matmul and matmul is sort of the primitive that works really well on GPUs, as you imagine hardware scaling out, are there other primitives that make more sense for large-scale distributed systems that we could build our neural networks on? And I think it's possible that there could be drastically different architectures that fit with the next generation of hardware that's going to come 10
or 20 years down the line, and we could start imagining that today.
>> It's really hard to make those kinds of bets, because there's also the concept of the hardware lottery, where, let's just say, Nvidia has won and we should just scale that out infinitely and write software to patch up any gaps we have in the mix, right?
>> I mean, yes and no. Like, if you look at the numbers, even going from Hopper to Blackwell, the performance per watt is about the same. They mostly make the number of transistors go up, they make the chip size go up, and they make the power usage go up. But even from Hopper to Blackwell, we're kind of already seeing a scaling limit in terms of what performance per watt we can get. So I think there is room to do something new, and I don't know exactly what it is, and I don't think you can get it done in a three-month cycle as a startup, but I think that's the kind of idea that if you sit down and sit with for a couple of years, maybe you could come up with some breakthroughs, and I think that's the kind of long-range stuff that is a perfect match for academia.
>> Coming back to a little bit of background and history, we have this research note on the scene storytelling work, or the newer image captioning, that you did with Andrej, and I just wanted to hear you guys tell that story, you know, you were sort of embarking on that for your PhD, and Fei-Fei, you having that reaction that you had.
>> Yeah. So I think that line of work started between me and Andrej, and then Justin joined, right? So Andrej started his PhD. He and I were looking at what is beyond ImageNet object recognition, and at that time the convolutional neural network had proven some power in ImageNet tasks. So ConvNets are a great way to represent images. In the meantime, I think in the language space an early sequential model called the LSTM was also being experimented with. So Andrej and I were just talking about this. It has been a long-term dream of mine; I thought it would take a hundred years to solve, which is telling the story of images. When I graduated from grad school, I really thought the rest of my entire career would be towards solving that single problem, which is: given a picture, or given a scene, tell the story in natural language. But things evolve so fast. When Andrej started, we thought maybe by combining the representation of the convolutional neural network with the sequential language model, the LSTM, we might be able to learn through training to match captions with images.
So that's when we started that line of work, and I don't remember if it was 2014 or 2015. It was CVPR 2015, the captioning paper. So it was our first paper that Andrej got to work: given an image, the image is represented with a ConvNet, the language model is the LSTM, and then we combine them and it's able to generate one sentence. That was one of the first times; it was pretty... I think I wrote it in my book. We thought we were the first people doing it. It turned out that Google at that time was also simultaneously doing it, and a reporter, it was John Markoff from The New York Times, was breaking the Google story, but he by accident heard about us and then he realized that we really independently got there at the same time. So he wrote the story of both the Google research as well as Andrej's and my research. But after that, I think Justin was already in the lab at that time.
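The ConvNet-plus-LSTM pairing Fei-Fei describes can be sketched roughly as below. This is an illustrative modern PyTorch rendering of the idea, not the 2015 model itself; the backbone, dimensions, and vocabulary size are arbitrary stand-ins.

```python
# Rough sketch of the CNN-encoder / LSTM-decoder captioning idea described
# above. Illustrative only; the backbone, sizes, and vocab are invented.
import torch
import torch.nn as nn
import torchvision.models as models

class CaptionModel(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=256, hidden_dim=512):
        super().__init__()
        cnn = models.resnet18(weights=None)                         # stand-in image encoder
        self.encoder = nn.Sequential(*list(cnn.children())[:-1])    # drop the classifier head
        self.img_proj = nn.Linear(512, embed_dim)                   # image feature -> embed space
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)                # predict the next word

    def forward(self, images, captions):
        feats = self.encoder(images).flatten(1)                     # (B, 512)
        img_token = self.img_proj(feats).unsqueeze(1)               # image acts as the first "word"
        words = self.embed(captions)                                # (B, T, embed_dim)
        seq = torch.cat([img_token, words], dim=1)                  # image feature conditions the LSTM
        hidden, _ = self.lstm(seq)
        return self.out(hidden)                                     # logits over vocab at each step

model = CaptionModel()
logits = model(torch.randn(2, 3, 224, 224), torch.randint(0, 10000, (2, 12)))
print(logits.shape)  # torch.Size([2, 13, 10000])
```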
>> Yeah. Yeah. I remember the group meeting where Andrej was presenting some of those results and explaining this new thing called LSTMs and RNNs that I had never heard of before, and I thought, wow, this is really amazing stuff, I want to work on that. So then he had the paper at CVPR 2015 on the first image captioning results. Then after that we started working together, and first we did a paper actually just on language modeling, back in 2015, ICLR 2015. Yeah, I should have stuck with language modeling; that turned out to be pretty lucrative in retrospect. But we did this language modeling paper together, me and Andrej, in 2015, where it was really cool: we trained these little RNN language models that could spit out a couple of sentences at a time, and poked at them and tried to understand what the neurons inside the network were doing.
>> You guys were doing analysis on the different, like, memory and...
>> Yeah. Yeah. It was really cool. And even at that time we had these results where you could look inside the LSTM and say, oh, this thing is reading code. So one of the data sets that we trained on for this one was the Linux source code, right, because the whole thing is open source and you could just download it. So we trained an RNN on this data set, and then as the network is trying to predict the tokens, you try to correlate the kinds of predictions that it's making with the kinds of internal structures in the RNN. And there we were able to find some correlations, like, oh, this unit in this layer of the LSTM fires when there's an open paren and then turns off when there's a closed paren, and try to do some empirical stuff like that to figure it out. So that was pretty cool, and that was sort of cutting out the CNN from this language modeling part and just looking at the language models in isolation.
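The kind of probing Justin describes, correlating a single hidden unit with being inside parentheses, can be sketched as follows. Everything here (the tiny text snippet, the untrained LSTM, the sizes) is an invented stand-in for the original char-RNN study; the point is only the mechanics of the correlation.

```python
# Sketch of correlating LSTM hidden units with an "inside parentheses" signal,
# in the spirit of the char-RNN analysis described above. Illustrative only.
import torch
import torch.nn as nn

text = "def f(x): return (x + (x * 2)) if x else (0)"
chars = sorted(set(text))
stoi = {c: i for i, c in enumerate(chars)}
ids = torch.tensor([stoi[c] for c in text]).unsqueeze(0)   # (1, T)

embed = nn.Embedding(len(chars), 16)
lstm = nn.LSTM(16, 32, batch_first=True)
with torch.no_grad():
    hidden, _ = lstm(embed(ids))                           # (1, T, 32)

# Ground-truth signal: paren depth > 0 at each character position.
depth, inside = 0, []
for c in text:
    if c == "(":
        depth += 1
    inside.append(1.0 if depth > 0 else 0.0)
    if c == ")":
        depth -= 1
inside = torch.tensor(inside)

# Correlate every hidden unit with the "inside parens" signal and report the
# strongest one. In a trained model, some unit tends to track this structure.
acts = hidden.squeeze(0)                                   # (T, 32)
x = acts - acts.mean(0)
y = inside - inside.mean()
corr = (x * y.unsqueeze(1)).sum(0) / (x.norm(dim=0) * y.norm() + 1e-8)
print("best unit:", corr.abs().argmax().item(), "corr:", corr.abs().max().item())
```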
>> But then we wanted to extend the image captioning work.
>> Yeah. I remember at that time we even had a sense of space, because we felt like captioning does not capture different parts of the image, right? So I was talking to Justin and Andrej about, can we do what we ended up calling dense captioning, which is, you know, describe the scene in greater detail, especially different parts of the scene.
>> Yeah, and so then we built this system, it was me and Andrej and Fei-Fei on a paper the following year, CVPR 2016, where we built this system that did dense captioning. So you input a single image and then it would draw boxes around all the interesting stuff in the image and then write a short snippet about each of them. It's like, oh, it's a green water bottle on the table. It's a person wearing a black shirt. And this was a really complicated
neural network, because it was built on a lot of advancements that had been made in object detection around that time, which was a major topic in computer vision for a long time. And then it was actually one joint neural network that was learning to look at individual images, because it actually had three different representations inside the network. One was the representation of the whole image, to kind of get the gestalt of what's going on. Then it would propose individual regions that it wants to focus on and represent each region independently, and then once you look at a region, you need to spit out text for each region. So that was a pretty complicated neural network architecture. This was all pre-PyTorch.
>> Right. And does it do it in one pass?
>> Yeah. Yeah. So it was a single forward pass that did all of that.
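A schematic of that single forward pass (one backbone pass, per-region features, then a short caption per region) might look like the sketch below. Module choices, shapes, and the fact that boxes are passed in rather than proposed by the network are simplifications for illustration, not the DenseCap architecture itself.

```python
# Minimal sketch of a dense-captioning style forward pass: one backbone pass,
# per-region features via RoIAlign, and an LSTM head that writes a short
# caption per region. Shapes, boxes, and modules are illustrative only.
import torch
import torch.nn as nn
from torchvision.ops import roi_align

class DenseCaptionSketch(nn.Module):
    def __init__(self, vocab_size=1000, feat_dim=64, hidden_dim=128, max_len=8):
        super().__init__()
        self.backbone = nn.Conv2d(3, feat_dim, kernel_size=3, padding=1)  # stand-in CNN
        self.region_proj = nn.Linear(feat_dim * 7 * 7, hidden_dim)
        self.lstm = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.word_head = nn.Linear(hidden_dim, vocab_size)
        self.max_len = max_len

    def forward(self, image, boxes):
        # 1) one pass over the whole image to get a feature map (the "gestalt")
        fmap = self.backbone(image)                               # (1, C, H, W)
        # 2) crop a fixed-size feature for each proposed region
        rois = roi_align(fmap, [boxes], output_size=(7, 7), spatial_scale=1.0)
        region_feats = self.region_proj(rois.flatten(1))          # (R, hidden)
        # 3) decode a short caption per region, feeding the region feature at
        #    every step (a simplification of a learned decoder)
        steps = region_feats.unsqueeze(1).repeat(1, self.max_len, 1)
        hidden, _ = self.lstm(steps)
        return self.word_head(hidden)                             # (R, max_len, vocab)

model = DenseCaptionSketch()
image = torch.randn(1, 3, 64, 64)
boxes = torch.tensor([[4.0, 4.0, 30.0, 30.0], [20.0, 10.0, 60.0, 50.0]])  # x1, y1, x2, y2
print(model(image, boxes).shape)  # torch.Size([2, 8, 1000])
```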
>> Not only was it doing it in one pass, you also optimized inference. You were doing it on a webcam, I remember. Yeah.
>> Yeah. So, I had built this crazy real-time demo where I had the network running on a server at Stanford, and then a web front end that would stream from a webcam and send the images to the server. The server would run the model and stream the predictions back. So I was just walking around the lab with this laptop that would show people this network in real time.
>> Identification and labeling as well.
Yeah, it was pretty impressive, because most of our graduate students would be satisfied if they could publish the paper, right? They package the research, put it in a paper, but Justin went a step further. He's like, I want to do this real-time web demo.
>> Well, actually, I don't know if I told you this story, but there was a conference that year in Santiago, ICCV. It was ICCV 2015. And I had a paper at that conference for something different, but I had my laptop. I was walking around the conference with my laptop showing everybody this real-time captioning demo, and the model was running on a server in California. So it was actually able to stream all the way from California down to Santiago. Well, the latency was terrible. It was like 1 FPS, but the fact that it worked at all was pretty amazing.
>> I was going to briefly quip that, you know, maybe vision and language modeling are not that different. You know, DeepSeek-OCR recently tried the crazy thing of: let's model text from pixels and just train on that, and it might be the future. I don't know. I don't know if you guys have any takes on whether language is actually necessary at all.
>> I just wrote a whole manifesto on spatial intelligence.
>> This is my segue into this. Yes,
>> I think they are different. I do think the architecture of these generative models will share a lot of sharable components, but I think the deeply 3D, 4D spatial world has a level of structure that is fundamentally different from a purely generative signal that is one-dimensional.
>> Yeah, I think there's something to be said for pixel maximalism, right? Like
there's this notion that language is this different thing, but we see language with our eyes, and our eyes are just, you know, basically pixels, right? Like we've got sort of biological pixels in the back of our eyes that are processing these things. And you know, we see text and we think of it as this discrete thing, but that really only exists in our minds. The physical manifestations of text and language in our world are, you know, physical objects that are printed on things in the world, and we see it with our eyes.
>> Well, you can also think of it as sound, but even sound you can translate into a signal.
>> Right? And then you actually lose something if you translate to these purely tokenized representations that we use in LLMs, right? You lose the font, you lose the line breaks, you lose sort of the 2D arrangement on the page. And for a lot of cases, for a lot of things, maybe that doesn't matter. But for some things it does. And I think pixels are this sort of, you know, more lossless representation of what's going on in the world, and in some ways a more general representation that more matches what we humans see as we navigate the world. So there's an efficiency argument to be made, like maybe it's not super efficient to render your text to an image and then feed that to a vision model.
>> That's exactly what DeepSeek did, right? And it kind of worked.
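As a toy version of the "model text from pixels" idea, rendering a string into an image array that a vision encoder could consume takes only a few lines; this sketch uses PIL's default bitmap font and arbitrary canvas sizes, and is of course not DeepSeek's actual pipeline.

```python
# Toy illustration of "text as pixels": render a string into an image and turn
# it into an array a vision model could ingest. Sizes and font are arbitrary.
import numpy as np
from PIL import Image, ImageDraw, ImageFont

def text_to_image_tensor(text: str, width: int = 512, height: int = 64) -> np.ndarray:
    img = Image.new("L", (width, height), color=255)      # white grayscale canvas
    draw = ImageDraw.Draw(img)
    draw.text((4, 4), text, fill=0, font=ImageFont.load_default())
    # Normalize to [0, 1]; a vision encoder would take patches of this array.
    return np.asarray(img, dtype=np.float32) / 255.0

pixels = text_to_image_tensor("Language is a lossy, low-bandwidth channel.")
print(pixels.shape)  # (64, 512)
```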
>> I think this ties into the whole world model question. One of my favorite papers that I saw this year was about using inductive bias to probe for world models. It was a Harvard paper where they fed a lot of orbital patterns into an LLM and then asked the LLM to predict the orbit of a planet around the sun, and what the model generated looked good, but then if you asked it to draw the force vectors, it would be all wacky, you know, it wouldn't actually follow physics. So how do you think about what's embedded into the data that you get? And we can talk about maybe tokenizing for 3D world models, like what are the dimensions of information. There's the visual, but how much of the underlying hidden forces, so to speak, do you need to extract out of this data, and what are some of the challenges there?
>> Yeah, I think there's different ways you could
approach that problem. One is, you could try to be explicit about it and say, oh, I want to, you know, measure all the forces and feed those as training data to your model, right? Then you could sort of run a traditional physics simulation, then know all the forces in the scene, and then use those as training data to train a model that's now going to hopefully predict those. Or you could hope that something emerges more
latently, right? You kind of train on something end to end, on a more general problem, and then hope that somewhere, something in the internals of the model must learn to model something like physics in order to make the proper predictions. And those are kind of the two big paradigms that we have more generally.
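A minimal sketch of the first, explicit paradigm could look like the following: a classical simulator (here a toy Hooke's-law spring, chosen arbitrarily) provides force labels, and a small network is trained to predict them. The latent alternative would train only on raw trajectories and hope force-like structure emerges internally.

```python
# Sketch of the "explicit" paradigm: a classical simulator supplies force
# labels, and a small network learns to predict them from state. All numbers
# here are illustrative.
import torch
import torch.nn as nn

def spring_force(x, k=2.0):
    return -k * x                      # Hooke's law: the simulator "knows" the physics

# Generate (state, force) supervision from the simulator.
states = torch.linspace(-1.0, 1.0, 256).unsqueeze(1)
forces = spring_force(states)

model = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)

for step in range(500):
    pred = model(states)
    loss = nn.functional.mse_loss(pred, forces)
    opt.zero_grad()
    loss.backward()
    opt.step()

print(f"final force-prediction loss: {loss.item():.5f}")
```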
>> But there's no indication that that latent modeling will get you to a causal law of space and dynamics, right? That's where today's deep learning and human intelligence actually start to bifurcate, because fundamentally the deep learning is still fitting patterns.
>> There you sort of get philosophical, and you say that we're trying to fit patterns too, but maybe we're trying to fit a broader array of patterns, over a longer time horizon, with a different reward function. But basically the paper you mentioned is sort of, you know, that problem: it learns to fit the specific patterns of orbits, but then it doesn't actually generalize in the way that you'd like. It doesn't have a sort of causal model of gravity, right?
>> Because even in Marble, you know, I was trying it, and it generates these beautiful sceneries, and there are arches in them. But does the model actually understand how, you know, the arch is actually bearing on the central stone, the actual physical structure of it? And the other question is, does it matter that it understands it, as long as it always renders something that would fit the physical model that we imagine?
>> If you use the word understand the way you understand, I'm pretty sure the model doesn't understand it. The model is learning from the data, learning from the pattern. Does it matter? Especially for the use cases for it.
It's a good question, right? Like for
now, I don't think it matters because it renders out what you need assuming it's perfect.
>> Yeah. I mean, it depends on the use case. If the use case is, I want to generate sort of a backdrop for virtual film or production or something like that, all you need is something that looks plausible, and in that case probably it doesn't matter. But if you're an architect and you're going to use this to design a building that you're then going to go build in the real world, then yeah, it does matter that you model the forces correctly, because you don't want the thing to break once you actually build it.
>> But even there, right, even if your model has the semantics in it, let's say, I still think the understanding of the signal, or the output, on the model's part and the understanding on the human's part are different things, but this gets again
philosophical.
>> Yeah, I mean there's this trick with understanding, right? These models are a very different kind of intelligence than human intelligence, and human intelligence is interesting because, you know, I think that I understand things because I can introspect my own thought process to some extent. And then I believe that my thought process probably works similarly to other people's, so that when I observe someone else's behavior, I infer that their internal mental state is probably similar to my own internal mental state that I've observed. And therefore I know that I understand things, so I assume that you understand something. But these models are sort of like this alien form of intelligence where they can do really interesting things. They can exhibit really interesting behavior. But whatever equivalent of internal cognition or internal self-reflection they have, if it exists at all, is totally different from what we do. So
>> it doesn't have the self-awareness.
>> Right? But what that means is that when we observe seemingly interesting or intelligent behavior out of these systems, we can't necessarily infer other things about them, because their model of the world and the way they think is so different from us.
>> So would you need two different models to do the visual one and the architectural generation? Do you think, eventually, there's not anything fundamental about the approach that you've taken to the model building, that it's more about scaling the model and its capabilities? Or is there something about being very visual that prohibits you from actually learning the physics behind it, so to speak, so that you could trust it to generate a CAD design that then is actually going to work in the real world?
>> I think this is a matter of scaling data and bettering the model. I don't think there's anything fundamental that separates these two.
>> Yeah, I would like it to be one model, but I think the big problem in deep learning in some sense is, how do you get emergent capabilities beyond your training data? Are you going to get something that understands the forces even though it wasn't trained to predict the forces, but it's going to learn them implicitly, internally? And I think a lot of what we've seen in other large models is that a lot of this emergent behavior does happen at scale. And
will that transfer to other modalities and other use cases and other tasks? Um,
I hope so, but that'll be a process that we need to play out over time and see.
>> Is there a temptation to rely on physics engines that already exist out there, where, you know, basically the gaming industry has saved you a lot of this work? Or do we have to reinvent things for some fundamental mismatch?
>> I think that's sort of like climbing the ladder of technology, right? In some sense, the reason that you want to build these things at all is because maybe traditional physics engines don't work in some situations. If a physics engine were perfect, we would have no need to build models, because the problem would have already been solved. So in some sense, the reason why we want to do this is because classical physics engines don't solve problems in the generality that we want. But that doesn't mean we need to throw them away and start everything from scratch, right? We can use traditional physics engines to generate data that we then train our models on. And then you're sort of distilling the physics engine into the weights of the neural network that you're training. I think that's a
lot of what, if you compare the work of other labs, people are speculating that, you know, Sora had a little bit of that. Genie 3 had a bit of that. And Genie is explicitly like a video game. You have controls to walk around in it. And I always think it's really funny how the things that we invent for fun actually do eventually make it into serious work.
>> Yeah. The whole AI revolution was started by graphics chips.
>> Yeah. Partially misusing the GPU, from generating a lot of triangles to generating a lot of everything else, basically.
>> Yeah.
>> We touched on Marble a little bit. I think you guys chose Marble as kind of your, sort of, coming-out-of-stealth moment, if you can call it that.
>> Yeah. Maybe we can get a concise explanation from you on what people should take away, because everyone here can try Marble, but they might not be able to link it to the differences between what your vision is versus other, I guess, generative worlds they may have seen from other labs.
>> So Marble is a glimpse into our model, right? We are a spatial intelligence model company. We believe spatial intelligence is the next frontier. And in order to make spatially intelligent models, the model has to be very powerful in terms of its ability to understand, reason, and generate worlds in a very multimodal fashion, as well as allow the level of interactivity that we eventually hope to be as complex as how humans can interact with the world. So that's the grand vision of spatial intelligence as well as the kind of world models we see. Marble is the first glimpse into that. It's the first part of that journey. It's the first-in-class model in the world that generates 3D worlds at this level of fidelity and is in the hands of the public. It's the starting point, right?
We actually wrote this tech blog. Justin spent a lot of time writing that tech blog; I don't know if you had time to browse it. Justin really broke it down into what the inputs are, the multimodal inputs of Marble, what the editability is, which allows users to be interactive with the model, and what the kinds of outputs are that we can have.
>> Yeah. So Marble, basically one way of looking at it, it's a system, it's a generative model of 3D worlds, right? So you can input things like text or an image or multiple images and it will generate for you a 3D world that kind of matches those inputs. And it's also interactive in the sense that you can interactively edit scenes: I could generate this scene and then say, I don't like the water bottle, make it blue instead; take out the table; change these microphones around; and then you can generate new worlds based on these interactive edits and export in a variety of formats. And with Marble we were actually trying to do sort of two things simultaneously, and I think we managed to pull off the balance pretty well. One is to actually build a model that goes towards the grand vision of spatial intelligence, and those models need to be able to understand lots of different kinds of inputs, need to be able to model worlds in a lot of situations, need to be able to model counterfactuals of how they could change over time. So we wanted to start to build models that have these capabilities, and Marble today does already have hints of all of these. But at the same time, we're a company. We're a business.
We were really trying not to have this be a science project, but also to build a product that would be useful to people in the real world today. So while Marble is simultaneously a world model that is building towards this vision of spatial intelligence, it was also very intentionally designed to be a thing that people could find useful today. And we're starting to see emerging use cases in gaming, in VFX, in film, where I think there's a lot of really interesting stuff that Marble can do today as a product, and then also set a foundation for the grand world models that we want to build going into
the future.
>> Yeah. I noticed one tool that was very interesting because you can record your scene inside.
>> Yes.
>> Yes. It's very important: the ability to record means very precise control of camera placement. In order to have precise camera placement, you have to have a sense of 3D space. Otherwise, you don't know how to orient your camera, right? And how to move your camera. So that is a natural consequence of this kind of model, and this is just one of the examples.
>> Yeah. I find when I play with video generative models, I'm having to learn the language of being a director, because I have to move them, like, pan, you know.
>> You cannot say pan 63 degrees to the north, right? You just don't have that control. Whereas in Marble you have precise control in terms of placing your camera.
>> Yeah, I think that's one of the first things people need to understand: you're not generating frame by frame, which is what a lot of the other models are doing.
>> Yeah.
>> You know, people understand that an LLM generates one token at a time. What are the atomic units here? There's, you know, the meshes, there's the splats, the voxels; there's a lot of pieces in a 3D world. What should be the mental model that people have of your generations?
>> Yeah, I think there's what exists today and what could exist in the future. So what exists today is that the model natively outputs splats. Gaussian splats are these, you know, each one is a tiny particle that's semi-transparent and has a position and orientation in 3D space. And the scene is built up from a large number of these Gaussian splats. Gaussian splats are really cool because you can render them in real time really efficiently. So you can render on your iPhone, render everything. And that's how we get that sort of precise camera control, because the splats can be rendered in real time on pretty much any client-side device that we want. So for a lot of the scenes that we're generating today, that kind of atomic unit is that individual
splat, but I don't think that's fundamental. I could imagine other approaches in the future that would be interesting. There are other approaches that even we've worked on at World Labs, like our recent RTFM model, that do generate frames one at a time. And there the atomic unit is generating frames one at a time as the user interacts with the system. Or you could imagine other architectures in the future where the atomic unit is a token, where that token now represents some chunk of the 3D world. And I think there's a lot of different architectures that we can experiment with here over time.
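For a concrete picture of the splat "atomic unit" being discussed, a Gaussian splat in common 3D Gaussian splatting setups carries roughly the fields below. This is a generic sketch of that public parameterization, not Marble's internal format.

```python
# Generic sketch of what one Gaussian splat carries in common 3D Gaussian
# splatting setups: a 3D position, an orientation and per-axis scale defining
# an anisotropic Gaussian, an opacity, and a color. A scene is just a large
# array of these. This is not World Labs' actual schema.
from dataclasses import dataclass
import numpy as np

@dataclass
class GaussianSplat:
    position: np.ndarray   # (3,) center in world space
    rotation: np.ndarray   # (4,) unit quaternion giving the orientation
    scale: np.ndarray      # (3,) extent along each local axis
    opacity: float         # 0..1, splats are semi-transparent
    color: np.ndarray      # (3,) RGB (real systems often store SH coefficients)

def random_scene(n: int, seed: int = 0) -> list[GaussianSplat]:
    rng = np.random.default_rng(seed)
    splats = []
    for _ in range(n):
        q = rng.normal(size=4)
        splats.append(GaussianSplat(
            position=rng.uniform(-1, 1, size=3),
            rotation=q / np.linalg.norm(q),
            scale=rng.uniform(0.01, 0.1, size=3),
            opacity=float(rng.uniform(0.2, 1.0)),
            color=rng.uniform(0, 1, size=3),
        ))
    return splats

scene = random_scene(1000)
print(len(scene), scene[0].position.shape)  # 1000 (3,)
```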
>> I do want to press on, double-click on this a little bit. My version of what Alessio was going to say was, what is the fundamental data structure of a world model? Because exactly like you said, it's either the Gaussian splat or it's the frame, or what have you. You also, in the sort of previous statements, focus a lot on the physics and the forces, which is something over time, which loosely I don't see in Marble. I presume it's not there yet. Maybe if there was a Marble 2 you would have movement. Or is there a modification to Gaussian splats that makes sense, or would it be something completely different?
>> Yeah, I think there's a couple of modifications that make sense, and there's actually a lot of interesting ways to integrate things here, which is another nice part of working in this space; there's actually been a lot of research work on this. Like when you talk about wacky ideas, there's actually been a lot of really interesting academic work on different ways to imbue physics.
>> You can also do wacky ideas in industry.
>> Yeah.
>> Right. But then, Gaussian splats are themselves little particles. There's been a lot of approaches where you basically attach physical properties to those splats and say that each one has a mass, or maybe you treat each one as being coupled with some kind of virtual spring to nearby neighbors, and now you can start to do sort of physics simulation on top of splats. So one kind of avenue for adding physics or dynamics or interaction to these things would be to predict physical properties associated with each of your splat particles and then simulate those downstream, either using classical physics or something learned. Or, you know, the beauty of working in 3D is things compose and you can inject logic in different places. So
one way is sort of like we're generating a 3D scene. We're going to predict 3D properties of everything in the scene.
Then we use a classical physics engine to simulate the interaction. Or you could do something where, as a result of a user action, the model is now going to regenerate the entire scene, in splats or some other representation. And that could potentially be a lot more general, because then you're not bound to whatever sort of physical properties you know how to model already. But that's also a lot more computationally demanding, because then you need to regenerate the whole scene in response to user actions. But I think this is a really interesting area for future work and for adding on to a potential Marble 2, as you say.
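To make the "attach physical properties to splats and simulate" idea concrete, here is a tiny mass-spring step over splat-like particle positions. The neighbor rule, stiffness, and explicit Euler integrator are placeholder assumptions for illustration, not how Marble or any shipped system does it.

```python
# Toy mass-spring dynamics over splat-like particles: each particle gets a
# mass, springs connect it to its initial nearest neighbors, and we take
# explicit Euler steps under gravity. Purely illustrative parameters.
import numpy as np

rng = np.random.default_rng(0)
pos = rng.uniform(-1, 1, size=(200, 3))        # splat centers
vel = np.zeros_like(pos)
mass = np.full(200, 0.1)

# Connect each particle to its 4 nearest neighbors with springs at rest length.
d = np.linalg.norm(pos[:, None] - pos[None, :], axis=-1)
neighbors = np.argsort(d, axis=1)[:, 1:5]
rest = np.take_along_axis(d, neighbors, axis=1)

def step(pos, vel, dt=0.01, k=50.0, gravity=np.array([0.0, -9.8, 0.0])):
    force = mass[:, None] * gravity
    offsets = pos[neighbors] - pos[:, None]              # (N, 4, 3) vectors to neighbors
    length = np.linalg.norm(offsets, axis=-1, keepdims=True) + 1e-8
    # Hooke's law along each spring, summed per particle.
    force = force + (k * (length - rest[..., None]) * offsets / length).sum(axis=1)
    vel = vel + dt * force / mass[:, None]
    return pos + dt * vel, vel

for _ in range(100):
    pos, vel = step(pos, vel)
print(pos.mean(axis=0))   # the cloud drifts downward under gravity
```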
>> Yeah. And there's opportunity for dynamics right?
>> What's the state of splat density, I guess? Like, can we render enough to have very high resolution when we zoom in? Are we limited by the amount that you can generate, the amount that we can render? How are these going to get super high fidelity, so to speak?
>> You have some limitations, but it depends on your target use case. So one of the big constraints that we have on our scenes is we wanted things to render cleanly on mobile, and we wanted things to render cleanly in VR headsets. Those devices have a lot less compute than you have in a lot of other situations. And if you want to get a splat file to render at high resolution, at like 30 to 60 fps, on an iPhone from four years ago, then you are a bit limited in the number of splats that you can handle. But if you're allowed to work on a recent device, even this year's iPhone or a recent MacBook, or if you have a local GPU, or if you don't need that 60 fps 1080p, then you can relax the constraints and get away with more splats, and that lets you get higher resolution in your scenes.
>> One use case I was expecting but didn't hear from you was embodied use cases. Are you just focusing on virtual for now?
>> If you go to the World Labs homepage, there is a particular page called Marble Labs. There we showcase different use cases, and we actually organize them into visual effects use cases, gaming use cases, as well as simulation use cases. And in that we actually show this is a technology that can help a lot in robotic training, right? This goes back to what I was talking about earlier, speaking of data starvation. Robotic training really lacks data. You know, high-fidelity real-world data is absolutely critical, but you're just not going to get a ton of that. Of course the other extreme is just purely internet video data; then
you lack a lot of the controllability that you want to train your embodied agents with. So simulation and synthetic data is actually a very important middle ground for that. I've been working in this space for many years, and one of the biggest pain points is, where do you get this synthetic, simulated data? You have to curate assets and compose these complex situations, and in robotics you want a lot of different states. You want the embodied agent to interact in the synthetic environment. Marble actually has real potential for helping to generate these synthetic, simulated worlds for embodied agent training.
>> Obviously, yeah, that's on the homepage. It'll be there. I was just trying to make the link to, as you said, you also have to build a business model. The market for robotics obviously is very huge. Maybe you don't need that, or maybe we need to build up and solve the virtual worlds first before we go to embodied, and obviously that's a stepping stone.
>> That is to be decided. I do think that...
>> Because everyone else is going straight there, right?
>> Not everyone else, but there is an excitement, I would say. But you know, I think the world is big enough to have different approaches.
>> Yeah. Approaches.
>> Yeah.
>> Yeah. I mean, and we always view this as a pretty horizontal technology that should be able to touch a lot of different industries over time. And, you
know, Marble is a little bit more focused on creative industries for now, but I think that the technology that powers it should be applicable to a lot of different things over time. And
robotics is one that, you know, is maybe going to happen sooner than later.
>> Also, design, right, is very adjacent to creative.
>> Oh, yeah. Definitely. Like, I think it's like the architecture stuff.
>> Yes. Okay. Yeah. I mean, I was joking online. I posted this video on Slack of like, oh, who wants to use Marble to plan your next kitchen remodel? It actually works great for this already. Just take two images of your kitchen, reconstruct it in Marble, and then use the editing features to see what that space would look like if you changed the countertops or changed the floors or changed the cabinets. And this is something, you know, we didn't necessarily build anything specific for this use case, but because it's a powerful horizontal technology, you kind of get these emergent use cases that just fall out of the model.
>> We have early beta users using an API key who are already building for interior design use cases.
>> I just did my garage. I should have known about this. I got to >> next time you remodel, we can be of help.
>> Well, kitchen is next, I'm sure.
>> Yeah.
>> Yeah. I'm curious about the whole spatial intelligence space. I think we should dig more into that one. How do you define it, and what are the gaps between traditional intelligence that people might think about with LLMs, when, you know, Dario says we have a data center full of Einsteins? That's traditional intelligence, it's not spatial intelligence. What is required to be spatially intelligent?
>> First of all, I don't understand that sentence, a data center full of Einsteins.
>> I just don't understand that. It's not a...
>> It's an analogy, it's an analogy.
>> Well, so a lot of AI as a field, as a discipline, is inspired by human intelligence, right? Because we are the most intelligent animal we know in the universe, for now. And if you look at human intelligence, it's very multi-intelligent, right? There is a psychologist, I think his name is Howard Gardner, who in the 1960s actually literally used "multiple intelligences" to describe human intelligence. And there is linguistic intelligence, there's spatial intelligence, there is logical intelligence, and emotional intelligence. So for me, when I think about spatial intelligence, I see it as complementary to language intelligence. So I personally would not say it's spatial versus traditional, because I don't know what "traditional" means. I do think spatial is complementary to linguistic. And how do we define spatial intelligence? It's the capability that allows you to reason, understand, move, and interact in space.
And I use this example of the deduction of the DNA structure, right? And of course I'm simplifying this story, but a lot of that had to do with the spatial reasoning about the molecules and the chemical bonds in a 3D space to eventually conjecture a double helix, and that ability that humans, or Francis Crick and Watson, had. It is very, very hard to reduce that process into pure language. And that's a pinnacle of a civilizational moment. But
every day, right, I'm here trying to grasp a mug. This whole process of seeing the mug, seeing the context where it is, seeing my own hand, the opening of my hand that geometrically would match the mug, and touching the right affordance point: all of this is deeply, deeply spatial. It's very hard; I'm trying to use language to narrate it. But on the other hand, that narrated language itself cannot get you to pick up a mug.
>> Yeah. Bandwidth constraint.
>> Yes.
>> I did some math recently on, if you just spoke all day, every day, for 24 hours a day, how many tokens do you generate? At the average speaking rate of about 150 words per minute, it roughly rounds out to about 215,000 tokens per day, and the world that you live in is so much higher bandwidth than that.
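That back-of-envelope number checks out if you treat one spoken word as roughly one token, which is the simplifying assumption here:

```python
# Rough reproduction of the speaking-rate arithmetic mentioned above, treating
# one spoken word as roughly one token (a simplifying assumption).
words_per_minute = 150
minutes_per_day = 24 * 60

words_per_day = words_per_minute * minutes_per_day
print(words_per_day)   # 216000, i.e. on the order of the ~215k tokens quoted
```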
>> Well, I think that is true. But if I think about Sir Isaac Newton, right, it's like you have things like gravity at the time that have not been formalized in language that people
inherently spatially understand that things fall, right? But then it's helpful to formalize that in some way or like you know all these different rules that we use language to like really
capture something that empirically and spatially you can also understand but it's easier to like describe in a way.
So I'm curious about the interplay of spatial and linguistic intelligence, which is like, okay, some rules are easier to write in language for the spatial intelligence to then understand, but you cannot, you know, you cannot write "put your hand like this and put it down this amount." So I'm always curious about how they leverage each other.
>> I mean if if anything like the example of of Newton like Newton only thinks to write down those laws because he's had a lot of embodied experience in the world watching baseball.
>> Exactly. And actually it's useful to distinguish between the theory building that you're mentioning versus the embodied, daily experience of being embedded in the three-dimensional world. So to me, spatial intelligence is sort of encapsulating that embodied experience of being there in 3D space, moving through it, seeing it, acting in it. And as Fei-Fei said, you can narrate those things, but it's a very lossy channel. The notion of being in the world and doing things in it is a very different modality from trying to describe it. But because we as humans are animals who have evolved interacting in space all the time, we don't even think that that's a hard thing, right? And then we sort of naturally leap to language, and then theory building, as mechanisms to abstract above that native spatial understanding. And in some sense, LLMs have just jumped all the way to those highest forms of abstracted reasoning, which is very interesting and very useful. But spatial intelligence is almost like opening up that black box again and saying maybe we've lost something by going straight to that fully abstracted form of language and reasoning and communication.
>> You know, it's funny, as a vision scientist I always find that vision is underappreciated because it's effortless for humans. You open your eyes as a baby, you start to see your world. We're somehow born with it, right? We're almost born with it. But you have to put effort into learning language, including learning how to write, how to do grammar, how to express yourself, and that makes it feel hard. Whereas something that nature spent way more time actually optimizing, which is perception and spatial intelligence, is underappreciated by humans.
>> Is there proof that we are born with it? You said almost born. So it sounds like we actually do learn after we're born.
>> When we are born, our visual acuity is less, and our perceptual ability does increase. But most humans are born with the ability to see, and most humans are born with the ability to link perception with motor movements, right? The motor movement itself takes a while to refine. And then animals are incredible, right? I was just in Africa earlier this summer. These little animals, they're born, and within minutes they have to get going, otherwise the lions will get them. And in nature, it took 540 million years to optimize perception and spatial intelligence, and language, the most generous estimation of language development is probably half a million years.
>> Wow.
>> Yeah.
>> That's longer than I would have guessed, I have to say.
>> Well, I'm being very generous.
>> Yeah.
>> Yeah. No, I was sort of going through your book, and I was realizing that one of the interesting links to something that we covered on the podcast is language model benchmarks, and how some of them actually put in all these sort of physical impossibilities that require spatial intelligence, right? Like, A is on top of B, therefore A cannot fall through B. That is obvious to us, but to a language model it could happen. I don't know, maybe it's part of the next token prediction.
>> And that's sort of what I mean about like unwrapping this abstraction, right?
Like, if your whole model of the world is just saying sequences of words one after another, it's really kind of hard to say why not.
>> It's actually unfair, right?
>> Right. But then the reason it's obvious to us is because we are internally mapping it back to some three-dimensional representation of the world that we're familiar with.
>> The question, I guess, is how hard is it, how long is it going to take us to distill, and I use the word distill, I don't know if you agree with that, from your world models into a language model? Because we do want our models to have spatial intelligence, right? And do we have to throw the language model out completely in order to do that, or...
>> No, I don't think so.
>> I think they're multimodal. I mean, even our model Marble today takes language as an input, right? So it's deeply multimodal, and I think in many use cases these models will work together. Maybe one day we'll have a universal model.
>> I mean, even if you do, there's sort of a pragmatic thing where people use language, and people want to interact with systems using language.
Even pragmatically, it's useful to build systems and build products and build models that let people talk to them. So I don't see that going away. I think there's a sort of intellectual curiosity of saying, how much could you build a model that only uses vision or only uses spatial intelligence? I don't know that that would be practically useful, but I think it'd be an interesting intellectual or academic exercise to see how far you could push that.
>> Not to bring it back to physics, but I'm curious, if you had a highly precise world model and you didn't give it any notion of our current understanding of the standard model of physics, how much of it would it be able to come up with and recreate from scratch, and what level of language understanding would it need? Because we have so many notations that we kind of created, but maybe it would come up with a very different model and still be accurate, and I wonder how much we're limited. You know how people say humanoids always need to be like humans because the world is built for humans; in a way, the way we build language constrains some of the outputs we can get from these other modalities as well. So I'm super excited to follow your work.
>> Yeah. I mean, you actually don't even need to be doing AI to answer that question. You could discover aliens and see what kind of physics they have, right? And they might have a...
>> Let's face it, we are so far the smartest animal in the universe, right?
>> So that is a really interesting question: is our knowledge of the universe and our understanding of physics constrained in some way by our own cognition, or by the path dependence of our own technological evolution? You almost want to do an experiment and say, if we were to rerun human civilization again, would we come up with the same physics in the same order? I don't think that's a very practical experiment to run.
>> You know, one experiment I wonder if people could run: we have plenty of astrophysical data now on planetary, or celestial body, movements. Just feed the data into a model and see if Newtonian law emerges.
>> My guess is it probably won't.
>> That's my guess too. The abstraction level of Newtonian law is at a different level from what these LLMs represent.
>> Yeah. So I wouldn't be surprised that, given enough celestial movement data, an LLM would actually predict pretty accurate movement trajectories. Let's say I invent a planet orbiting a star; given enough data, my model would tell you where it is on day one, where it is on day two. I wouldn't be surprised. But F = ma, or action equals reaction, that's just a whole different abstraction level. That's beyond today's LLMs.
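To make that contrast concrete, here is a small illustrative sketch (ours, not anything from the conversation or from World Labs): a purely pattern-based predictor fit to one orbit extrapolates that orbit's positions well, yet contains no representation of force or mass, which is the abstraction gap being described.

```python
# Illustrative sketch: a model can fit orbital trajectories very well
# without ever representing F = G*m1*m2/r^2.
import numpy as np

# Simulate a circular orbit: position at day t is a rotation of day t-1.
omega = 2 * np.pi / 365.0                      # angular velocity, radians/day
days = np.arange(0, 300)
orbit = np.stack([np.cos(omega * days), np.sin(omega * days)], axis=1)

# "Train" a purely pattern-based predictor: least-squares fit of a linear
# map A such that x[t+1] ~= x[t] @ A. No forces, masses, or laws involved.
X, Y = orbit[:-1], orbit[1:]
A, *_ = np.linalg.lstsq(X, Y, rcond=None)

# It extrapolates this orbit almost perfectly...
x = orbit[-1]                                  # position on day 299
for t in range(300, 365):                      # roll forward to day 364
    x = x @ A
true_day_364 = np.array([np.cos(omega * 364), np.sin(omega * 364)])
print("prediction error on day 364:", np.linalg.norm(x - true_day_364))

# ...but A is just a rotation matrix for THIS orbit. It says nothing about
# why a planet at a different radius would move slower (Kepler / F = ma),
# so it cannot generalize to orbits it has never seen.
print("learned map A:\n", A)
```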
>> Okay, what model would you need for it to not end up geocentric? Because if I'm training just on visual data, it makes sense to think the sun rotates around the earth, right? But obviously that's not the case. So how would it learn that? I'm curious about all these forces we talk about; sometimes maybe you don't need them, because as long as it looks right, it's right. But as you make the jump to trying to use these models for more high-level tasks, how much can we rely on them?
>> I think you'd need kind of a different learning paradigm, right? There's a bit of conflation
happening here, between LLMs and language and symbols on one side versus human theory building and human physics on the other. They're very different, because, unlike an LLM, the human objective function is to understand the world and thrive in your life. And the way that you do that is, sometimes you observe data, then you think about it, then you try to do something in the world and it doesn't match your expectations, and then you go and update your understanding of the world online. People do this all the time, constantly: I think my keys are downstairs, so I go downstairs and look for them, and I don't see them, and oh no, they're actually up in my bedroom. Because we're constantly interacting with the world, we're constantly having to build theories about what's happening in the world around us and then falsify or add evidence to those theories. And I think that kind of process, writ large and scaled up, is what gives us F = ma and Newtonian physics. And I think that's a little orthogonal to the modality of model that we're training, whether it's language or spatial.
>> The way I put it is, this is almost more efficient learning, because you have a hypothesis, here are the different possible worlds that are granted by my available data, and then you do experiments to eliminate the worlds that are not possible, and you resolve to the one that's right. To me, that's also how I have theory of mind: I have a few hypotheses of what you're thinking, and I try to create actions to resolve that, or check my intuition as to what you're thinking. And obviously LLMs don't do any of this.
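As a toy illustration of that eliminate-the-impossible-worlds loop (a hypothetical sketch of ours, not World Labs code): keep a set of candidate hypotheses, act in a way that distinguishes them, and discard the ones the observation falsifies.

```python
# Toy sketch of the "eliminate impossible worlds" loop described above.
# The keys example: hypotheses about where my keys are, and actions
# (looking in a place) whose observations falsify some hypotheses.

def where_are_my_keys(hypotheses, look):
    """Keep acting until only one consistent hypothesis remains."""
    candidates = set(hypotheses)
    while len(candidates) > 1:
        place = next(iter(candidates))   # pick a place to check (an "experiment")
        if look(place):                  # interact with the world, observe outcome
            return place                 # observation confirms this world
        candidates.discard(place)        # observation falsifies this world
    return candidates.pop()

# Example: the keys are actually in the bedroom.
truth = "bedroom"
answer = where_are_my_keys(
    hypotheses=["downstairs", "bedroom", "office"],
    look=lambda place: place == truth,
)
print(answer)  # -> "bedroom"
```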
>> Theory of mind possibly also breaks into emotional intelligence, which today's AI is really not touching at all, and we really, really need it. People are starting to depend on these things, probably too much, and that's a whole other topic of debate.
>> I do have to ask, because a lot of people have sent this to us: how much do we have to get rid of? Is sequence-to-sequence modeling out the window? Is attention out the window? How much are we rethinking everything?
>> I think you stick with stuff that works, right? I think attention is still there. You don't need to fix things that aren't broken. There are a lot of hard problems in the world to solve, but let's focus on one at a time. I do think it's pretty interesting to think about new architectures or new paradigms or drastically different ways to learn, but you don't need to throw away everything just because you're working on new modalities.
>> I think sequence-to-sequence, actually, in world models, I think we are going to see algorithms or architectures beyond sequence-to-sequence.
>> Oh, but here actually I think there's a little bit of technological confusion, and transformers already solved that for us, right? Transformers are actually not a model of sequences. A transformer is natively a model of sets. And that's very powerful. But a lot of transformers grew out of earlier architectures based around recurrent neural networks, and RNNs definitely do have a built-in architectural bias: they do model one-dimensional sequences. But transformers are just models of sets, and those sets could be 1D sequences, or they could be other things as well.
>> So you literally mean set theory, like...
>> Yeah, yeah. So a transformer is actually not a model of a sequence of tokens. A transformer is actually a model of a set of tokens, right? The only thing that injects order into the standard transformer architecture, the only thing that differentiates the order of the things, is the positional embedding that you give the tokens. So if you choose to give it a 1D positional embedding, that's the only mechanism the model has to know that it's a 1D sequence. But all the operators that happen inside a transformer block are either token-wise, you have an FFN, you have QKV projections, you have per-token normalization, all of those happen independently per token, or they're interactions between tokens through the attention mechanism, and that's permutation-equivariant: if I permute my tokens, then the attention operator gives a permuted output in exactly the same way. So it's actually natively an architecture over sets of tokens.
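A minimal numerical check of that claim, using a generic single-head attention sketch (an illustration of ours, not World Labs code): with no positional embedding, permuting the input tokens simply permutes the output in the same way.

```python
# Check: self-attention without positional embeddings is permutation-equivariant.
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over a set of tokens X (n, d)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    return softmax(scores) @ V

rng = np.random.default_rng(0)
n, d = 5, 8
X = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

perm = rng.permutation(n)
out_then_perm = self_attention(X, Wq, Wk, Wv)[perm]   # attend, then permute rows
perm_then_out = self_attention(X[perm], Wq, Wk, Wv)   # permute tokens, then attend

# Equivariance: the two results match, so the block has no built-in notion of
# token order -- order only enters through positional embeddings added to X.
print(np.allclose(out_then_perm, perm_then_out))      # -> True
```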
>> Literally a transform.
>> Yeah.
>> In a math term, yes.
>> I know we're out of time, but we just want to give you the floor for a call to action: people that would enjoy working at World Labs, what kind of people should apply, what research
people should be doing outside of World Labs that would be helpful to you, or anything else on your mind.
>> I do think it's a very exciting time to be looking beyond just language models and thinking about the boundless possibilities of spatial intelligence. So we are actually hungry for talent, ranging from very deep researchers thinking about the problems Justin just described, training large world models, to good engineers building systems, from training optimization to inference to product, and we're also hungry for good business and product thinkers, go-to-market and business talent. So we are hungry for talent. Especially now that we have exposed the model to the world through Marble, I think we have a great opportunity to work with an even bigger pool of talent to solve both the model problem as well as deliver the best product to the world.
>> Yeah, I think I'm also excited for people to try Marble and do a lot of cool stuff with it. I think it has a lot of really cool capabilities, a lot of really cool features that fit together really nicely.
>> In the car coming here, Justin and I were saying people have not totally discovered, okay, it's only been 24 hours, have not totally discovered some of the advanced mode of editing, right? Turn on the advanced mode and you can, like Justin said, change the color of the bottle, change your floor, change the trees.
>> Well, I actually tried to get there, but when it says create, it just makes me create a completely different world.
>> You need to click on the advanced mode.
It's a good UI.
>> We can improve on our UI. Remember to click.
>> Yeah, we need to hire people to work on the product.
>> But one thing that was clear you guys are looking for is also intellectual fearlessness, which is something that I think you hold as a principle.
>> Yeah, I mean, we are literally the first people who are trying this, both on the model side as well as on the product side.
>> Thank you so much for joining us. This was fun.
>> Yeah, thanks for having us.
>> Yeah.