Yann LeCun | Self-Supervised Learning, JEPA, World Models, and the future of AI
By Harvard CMSA
Summary
## Key takeaways

- **Current AI Nowhere Near Human Intelligence**: We're nowhere near matching human intelligence or even animal intelligence with the type of techniques that we have access to at the moment. Supervised learning and reinforcement learning are insufficient; self-supervised learning works for discrete symbols like language but not for natural signals like video. [03:00], [04:44]
- **A 4-Year-Old Has Seen an LLM's Data Volume**: A 4-year-old has seen about 10 to the 14 bytes through the optic nerve, equivalent to the 30 trillion tokens used to train a typical large language model like Llama 3. We are never going to get to human-level AI by just training on text. [11:38], [13:00]
- **JEPA Predicts Representations, Not Pixels**: Joint Embedding Predictive Architecture (JEPA) predicts a representation of the next video frame rather than pixels, eliminating unpredictable details to make prediction feasible. This mirrors how humans and science find abstract representations that ignore irrelevant details for better predictions. [29:08], [30:07]
- **Autoregressive Generation Causes Hallucinations**: Autoregressive prediction in LLMs uses the model's own predictions as input, leading to divergence or hallucination because errors compound exponentially in the token tree with no way back. Humans think abstractly first before articulating. [06:52], [21:40]
- **Inference by Optimization Enables Planning**: Inference by optimization searches for outputs minimizing an energy function measuring input-output compatibility, enabling zero-shot learning, reasoning, and planning like System 2 thinking. This contrasts with fixed-layer feedforward inference in current architectures. [15:07], [17:20]
- **Hierarchical World Models for Complex Planning**: Intelligent systems need hierarchical world models at different abstraction levels and timescales for multi-level planning, like going from NYU to Paris via subgoals such as airport then taxi. This remains completely unsolved. [41:12], [44:03]
Topics Covered
- Text Training Fails Human AI
- Inference by Optimization Beats Feedforward
- Predict Representations, Not Pixels
- Hierarchical World Models Enable Planning
Full Transcript
- Welcome everyone.
Can you hear me? I'm Mike Freedman, representing the Center of Mathematical Sciences and Applications at Harvard, and it's my great pleasure to be introducing Yann LeCun, chief scientist at Meta. We're running a conference
at CMSA on the Geometry of Machine Learning, and this is actually a lecture within that conference, but it's outside the CMSA building because we knew too many people would show up, so we were able to move it to the Science Center, where it's appropriate.
As soon as we got Yann to agree to give this talk, all the other speakers accepted immediately.
So Yann, it was the easiest conference to organize.
Yann is one of these scientists that it would anesthetize the audience if I tried to go through his awards, and also I would need a script.
So I, I'll just mention that he won the Turing award with Bengio and Hinton a few years ago.
I think of him interchangeably with the idea of convolutional neural nets.
As a mathematician I'm a geometer, you know, a topologist and geometer.
And I think that's something we share: a confidence in the geometric imagination.
And I know it's something that Yann has always tried to figure out how to weave into artificial intelligence, and it's a vein of exploration that I've greatly admired.
So we're, I think we're all very much looking forward to this talk.
So am I, and without further ado, let me turn the stage over to Yann. - Thank you so much, Mike.
Well, I have a terrible confession to make, which is, I'm not a mathematician.
I'm not really a computer scientist either.
I never actually studied computer science, so I'm not exactly sure what I am, but I'm going to talk about machine learning. I was told this was a bit of a more general audience than the one at the workshop.
So I made this a bit more of a general-audience talk: still technical, but a little lightweight on the theory, that's for sure.
And I want to talk about the future of AI, and how we can make significant progress towards more intelligent machines, beyond what they are capable of doing.
And as of right now, there is a lot of work to do.
We're nowhere near matching human intelligence or even animal intelligence with the type of techniques that we have access to at the moment.
So one big question we can ask ourselves is, do we actually need AI systems with human-level intelligence?
And the answer is probably yes, because there's a future in which each of us walks around with AI assistants helping us in our daily lives at all times, perhaps living in wearable devices like smart glasses, like the ones I'm wearing at the moment.
Actually, you guys need to smile.
Okay, you're in the picture. And, you know, those things will be sort of helping us at all times.
And we'll be their boss.
So it's kind of like we'll be running around with a team of virtual people helping us at all times.
And of course for this, we need AI systems that have intelligence that is in some ways similar to human intelligence, because that's the kind of entity that we're most familiar with interacting with.
But the technology is nowhere near where it needs to be at the, at the moment for that.
So the main issue is that current AI architectures and machine learning techniques suck compared to what we can observe in humans and animals; the type of efficiency in learning that we see in animals and humans
is just astonishing. And we're not matching this, at least at the moment, in many instances.
So, you know, early on in machine learning, the, the main technique was supervised learning, and then there was a big fashion around reinforcement learning for a while.
Now it's used a lot, of course, to fine-tune large language models, but in themselves those two techniques are really insufficient.
The, the type of learning that we observe in humans and animals is very different.
It's neither supervised nor reinforced for that matter.
It's more like self-supervised learning, something that has really revolutionized AI and machine learning over the last few years, and whose underlying principles are very similar to supervised learning, but there is no clear distinction between input and output.
I'll come back to this. This works astonishingly well for training a system to understand the structure of sequences of discrete symbols, such as language, code, and mathematics to some extent.
But the problem is that it only works for sequences of discrete symbols.
It doesn't really work for kind of natural signals.
Self-supervised learning is starting to work there, but the techniques are very different.
And that that'll be the main topic really of this talk.
There are other limitations with current AI architectures, which is that the type of inference they perform is basically feedforward propagation through a fixed number of layers of some neural net.
And that's computationally limited.
There's a lot of functions you cannot represent efficiently by just stacking a fixed number of layers of, you know, alternating linear operators and non-linear pointwise operators.
And the idea of training a system to predict the next item in a sequence works for discrete symbol sequences, but not really for anything else.
The other issue also with current architectures is that they use autoregressive prediction.
So they use their own predictions as input to make further predictions, and that leads to divergence, or hallucination as people call it.
So there, there's a lot of things that really we are missing to kind of match the type of intelligence we observe in humans and animals.
Humans and animals have mental models of the world.
Their behavior is driven by objectives, by tasks, goals if you want; they can reason and they can plan complex action sequences.
All things that chat bots and LLMs are essentially incapable of, or at least not to the level that we'd like.
So we need systems that understand the physical world, systems that have persistent memory, systems that can plan complex actions so as to fulfill an objective or accomplish a task, systems that can reason, in particular that can spend more time solving difficult problems than simple problems, and systems that are controllable and safe.
Okay, so this, let's start with this idea of world model.
We have mental models of reality that allow us to predict what's going to happen, particularly what's going to happen as a consequence of our actions.
And this is really what allows us to plan.
And the type of learning that is taking place in humans and animals in the first few months of life is a little mysterious.
So this chart was put together by my colleague and friend Emmanuel Dupoux, who's a cognitive scientist in Paris, and it indicates at what age infants learn basic concepts about the world, like object permanence, or the fact that some objects are stable when you put them on the table and are not going to fall.
That objects belong to different categories.
Babies that are, you know, five or six months old don't speak any language, but they certainly know the difference between the table and the chair, and the cat and the dog, without knowing the names for them.
And it takes about nine months for infants to learn basic notions of intuitive physics like gravity, inertia, conservation of momentum, this kind of stuff.
So if you show a six-month-old the scenario at the bottom left, where a little cart is on a platform, you push it off the platform, and it appears to float in the air, the six-month-old won't pay much attention.
A ten-month-old will be extremely surprised,
perhaps like the little girl here, because by then infants have learned that objects that are not supported are supposed to fall.
So how do, how do we get machines to learn like babies?
And we've not solved that problem.
And the reason you can tell that we've not solved that problem is that, you know, we don't have domestic robots.
We don't have self-driving cars that are completely autonomous level five.
We have them, but we cheat. Yet we have systems that can pass the bar exam, solve math problems, you know, do all kinds of stuff that is intellectually challenging for most of us.
But we still don't have robots that can do what a cat can do, or what a 10-year-old can do
the first time the 10-year-old tries.
You tell a 10-year-old for the first time, you know, clear out the dinner table and fill up the dishwasher, and a 10-year-old can do it without being trained to do it.
Basically the first time. A 17-year-old can learn to drive a car in an astonishingly short time, maybe 10 or 20 hours of practice, without causing accidents, mostly.
And we have millions of hours of training data.
We still don't have self-driving cars, except cars with lots of, you know, extra sensors like lidars, complete mapping of the environment, and all kinds of tricks.
So, you know, obviously we're missing something big.
And this is another example of what's been known as Moravec's paradox, which is that a lot of things that we consider intellectually challenging for humans, you know, playing chess, solving integrals, stuff like that, turned out to be algorithmically relatively simple.
And the same is true for producing nice sounding text or answering a question as long as you've been trained to produce the correct answer.
Yet we don't have robots that are nearly as dexterous as a primate, or even a cat.
So this may be explained by the following very simple estimate.
So a typical large language model is trained with something like 30 trillion tokens.
This is the number I got for Llama 3.
A token is like a subword unit.
So that's something like 2 times 10 to the 13 words.
Each token is three bytes.
So the total amount of data used to train a typical LLM is about 10 to 14 bytes.
It would take any of us 400,000 years, maybe half a million years, to read through that.
It's just an enormous amount of text.
Now compare this to what a human child, a 4-year-old, has seen during his or her life.
Four years of life is about 16,000 hours of wake time for a young child, which, by the way, is a small amount of video; it's about 30 minutes of YouTube uploads. And the information
getting to the visual cortex through the optic nerve is about one byte per second per fiber.
We have 2 million optic nerve fibers, each of which carries about one byte per second.
So during wake hours, it's about two megabytes per second.
Multiply this by 16,000 hours, and it's about 10 to the 14 bytes.
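To check that the two sides of this comparison really do land in the same ballpark, here is a quick back-of-the-envelope sketch in Python, using the figures quoted in the talk plus some assumed reading-speed constants (roughly 250 words per minute, 8 hours a day, about 0.75 words per token):

```python
# LLM side: ~30 trillion tokens, ~3 bytes per token (Llama 3-scale pretraining).
llm_tokens = 30e12
llm_bytes = llm_tokens * 3                      # ~1e14 bytes of text

# Assumed reading speed: 250 words/min, 8 hours/day, ~0.75 words per token.
words = llm_tokens * 0.75
reading_years = words / 250 / 60 / 8 / 365      # roughly half a million years

# Child side: ~16,000 waking hours by age 4, ~2 MB/s through the optic nerves
# (2 million fibers at roughly one byte per second each).
child_bytes = 16_000 * 3600 * 2e6               # ~1.2e14 bytes of visual input

print(f"LLM text: ~{llm_bytes:.1e} bytes, ~{reading_years:,.0f} years to read")
print(f"4-year-old's visual input: ~{child_bytes:.1e} bytes")
```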
Okay? So a 4-year-old has seen as much data as the biggest LLMs, which are trained on all the publicly available text on the internet.
Now, you might say that the visual data is much more redundant than the text on the internet, which is true, but in fact that's exactly what you want when you train a system to understand, to capture the structure and dependencies in the training data using self-supervised learning.
You need redundancy. If you don't have redundancy, you can't learn anything, you can't learn anything from completely random bit strings.
So it tells you a number of things.
The first thing it tells you is that we are never going to get to human level AI by just training on text.
It's not going to happen, despite what you might hear from some of the more optimistic-sounding CEOs of various AI companies in Silicon Valley; it's just not going to happen.
It also means that we need to make some serious progress if we want to have robots that, you know, can be useful.
There are countless companies being formed that are building humanoid robots, and you see all those videos of those robots doing impressive things.
But in fact, the secret of all that is that none of those companies has any idea how to make those robots smart enough to be useful, except in very narrow tasks for which they have to be carefully trained.
So that's an issue.
So that gives an opportunity for, you know, researchers and scientists trying to make progress in AI.
There's still a lot of work to do and it may not require, you know, hundreds of billions of investment in GPUs.
Okay, there's a second issue, which is inference.
So I mentioned the limitations of inference by forward propagation through a fixed number of operations.
A lot of things that we're doing require much more sophisticated computation than this.
And in fact, I would submit that a more powerful way to perform inference is through optimization.
So instead of a system computing its output by just propagating through a fixed number of layers in some sort of neural net and then producing an output,
the design that I think is much more preferable would be a system that extracts information from its input, produces a representation of the input if you want, but then has another big neural net, or learning machine, with a single scalar output, let's call it an energy, that would measure the degree of compatibility
or incompatibility between the input and a proposed output.
Okay? So you propose an output, and then this function, which may be a very big neural net, tells you to what extent this output is compatible with this input.
Okay, I put an image of an elephant here, and then I put the label "elephant".
And I want the scalar output of this function to be, let's say, zero.
If I put another label, table, chair, cat, whatever, I want the output to be large, larger than zero.
Okay? So a measure of incompatibility, if you want, between input and output.
So the way you perform inference with a system like this is through search.
You basically fix the input and then you search for an output that minimizes the scalar output.
The output is not computed explicitly; it's implicit in this kind of architecture.
You search for an output that minimizes the energy function.
And really, this type of inference by optimization is very classical in AI, in probabilistic inference, this kind of stuff, right?
There is a lot of really classic work in path planning,
all kinds of planning actually: shortest path between cities, shortest circuit between cities, SAT, you know, finding values of Boolean variables that satisfy a Boolean formula, logical inference. All of those things can be reduced to optimization problems, but not necessarily to forward propagation through a fixed number of layers.
So this kind of inference by optimization allows for what some people call zero-shot learning, which means producing answers or solutions to problems without being trained to produce solutions to that particular problem.
Basically just coming up with a new solution to a problem.
Okay? This is what search and optimization can do.
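As a toy illustration of what inference-by-optimization means in code, here is a minimal sketch. The quadratic energy function and the numerical gradient descent over candidate outputs are stand-in choices of my own, not the architecture described in the talk:

```python
import numpy as np

def energy(x, y):
    """Scalar 'incompatibility' between input x and candidate output y.
    Toy choice: the compatible output is sin(x); a real system would use a trained net."""
    return (y - np.sin(x)) ** 2

def infer(x, steps=200, lr=0.1):
    """Inference by optimization: search for the output y that minimizes E(x, y)."""
    y, eps = 0.0, 1e-4
    for _ in range(steps):
        grad = (energy(x, y + eps) - energy(x, y - eps)) / (2 * eps)  # numerical gradient in y
        y -= lr * grad
    return y

x = 1.3
print(infer(x), np.sin(x))  # the search converges to the output compatible with the input
```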
This is also perhaps a good model for the type of inference that takes place in humans, the psychologists' System 2.
So System 1 is, you know, decisions you're making or actions you're taking instinctively, basically without really having to think about it too much.
And System 2 is the type of actions you take, or decisions you make, deliberately, by thinking about it and maybe using your mental model of the world to predict the outcome of particular actions you might take.
Now, this is not what LLMs are doing.
LLMs take a window over a sequence of symbols, run that through some big neural net, and produce a guess as to what the next symbol is.
And then once you have produced the next symbol, you shift it into the input, and then you produce a second symbol, shift that into the input, third symbol, et cetera.
That's called autoregressive prediction.
And it's very classical; it's been around for, you know, seven decades or something, if not more.
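For concreteness, here is a minimal sketch of that loop. The `next_token_distribution` function is a placeholder of my own standing in for the trained network:

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 100  # stand-in vocabulary size (real LLMs use on the order of 100,000 tokens)

def next_token_distribution(context):
    """Placeholder for the trained model: returns p(next token | context)."""
    logits = rng.normal(size=VOCAB)     # a real model would compute these from the context
    p = np.exp(logits - logits.max())
    return p / p.sum()

def generate(prompt, n_new=20):
    tokens = list(prompt)
    for _ in range(n_new):
        p = next_token_distribution(tokens)          # fixed amount of computation per token
        tokens.append(int(rng.choice(VOCAB, p=p)))   # shift the prediction back into the input
    return tokens

print(generate([1, 2, 3]))
```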
Nothing new about that. But there is kind of a basic limitation with this type of thing, which is that there's a fixed amount of computation devoted to producing any single token.
So the only way you can entice a system of this type to spend more resources, more time, on a complicated question is to trick it into producing more tokens.
This is a trick called chain of thought, right?
You tell the system, you know, tell me all the steps of your reasoning, which may not be reasoning actually.
And as a consequence, the system is going to produce more tokens.
So it's going to spend more computation, but it's kind of a hack.
There's another issue, which is perhaps even more dire, which is that autoregressive generation
is kind of a divergent process.
You can never exactly predict what token or what word follows a particular text.
What those systems are trying to produce is basically a probability distribution over all possible tokens, of which there are typically about a hundred thousand, right?
So it's a big vector of numbers between zero and one that sum to one.
And you might, you know, always pick the token with the highest probability and just generate the sequence this way.
Or you might, you know, sample from this distribution.
Whatever you do, there is some probability that at any point the token that's generated takes you outside of the set of sequences of tokens that would be correct answers, right?
So the set of all possible sequences of tokens is a tree, okay?
Represented by this blue disc, essentially, where each leaf is kind of a terminal symbol in the tree.
They don't all have the same length, but it's a tree.
Within this tree there is a subtree which corresponds to all the correct answers to a particular prompt.
And there may be some probability that, at every token you produce, the token takes you outside of the correct subtree.
Because it's a tree, there's no way to come back, right?
You're out, you're out. So
if you make the hypothesis, which of course is most likely wrong, that this probability is the same regardless of where you are in the sequence, and that the errors are independent, then the probability that a sequence of n symbols is correct decreases exponentially, like one minus the error rate to the power n, where n is the number of tokens.
Okay? This is one way that LLMs hallucinate.
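Under that admittedly crude independence assumption, the compounding is easy to see numerically:

```python
# Probability that an n-token answer stays inside the "correct" subtree,
# assuming an independent per-token error rate e (a deliberately crude model).
for e in (0.01, 0.02, 0.05):
    for n in (100, 1000):
        print(f"e={e:.2f}, n={n}: P(correct) ~ {(1 - e) ** n:.2e}")
```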
So, you know, this is not really fixable without some major redesign of how those systems produce their answers.
We don't produce answers by just blurting one word after another.
We think about the answer we're going to produce; we have an abstract thought that represents this answer, and then we turn it into text.
Okay? But that's kind of a second step, if you want.
There is an advantage, though, to LLMs, which is that they're very easy to train and the training scales very well.
So this is a representation of a GPT-style architecture where, basically, secretly, a large language model is actually trained to reproduce its input on its output.
You give it a sequence of symbols and you train it to just reproduce the sequence of symbols on its output, but it cannot just learn the identity function, because the connectivity is designed in such a way that to produce one particular symbol, this green one for example, it cannot look at the corresponding symbol on the input.
It has to compute it, or predict it, only from the symbols to the left of it.
So in effect it's trained to produce the next symbol in the sequence.
But it does this in parallel over very long sequences.
It can do this very efficiently.
So those GPT architectures scale; this is why people are using them instead of alternatives at the moment. But it's very limited, okay?
What we really want is perhaps to emulate this ability that humans and animals have to have a mental model of the world, a world model, okay?
What is a world model? A world model is: given a representation of the current state of the world, which you may have estimated by observing the world and then representing its state, let's call it s_x, and given an action that
you imagine taking, can you predict a representation of the next state of the world that will result from taking this action?
And the way you can train a world model is very simple.
You give it, you know, a bunch of observations; you run them through the encoder and the predictor; you give it an action that you know is taking place; and then you feed it the next state of the world, basically.
I mean, it's not the state, it's an observation.
You run it through the same encoder as you ran the previous observation through, and that produces a representation for the new state of the world.
And then you minimize the prediction error: the difference between the representation of the next state of the world and the prediction obtained from the previous state of the world, which was obtained from perception.
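Here is a minimal PyTorch sketch of that training step, with small MLPs of my own standing in for the encoder and the action-conditioned predictor; the dimensions and the stop-gradient on the target are illustrative choices, not the actual architecture:

```python
import torch
import torch.nn as nn

OBS_DIM, ACT_DIM, REP_DIM = 64, 4, 32  # toy dimensions

encoder = nn.Sequential(nn.Linear(OBS_DIM, REP_DIM), nn.ReLU(), nn.Linear(REP_DIM, REP_DIM))
predictor = nn.Sequential(nn.Linear(REP_DIM + ACT_DIM, REP_DIM), nn.ReLU(), nn.Linear(REP_DIM, REP_DIM))
opt = torch.optim.Adam(list(encoder.parameters()) + list(predictor.parameters()), lr=1e-3)

def world_model_step(obs_t, action_t, obs_t1):
    s_t = encoder(obs_t)                                   # representation of the current observation
    s_t1_pred = predictor(torch.cat([s_t, action_t], -1))  # predicted next representation, given the action
    with torch.no_grad():
        s_t1_target = encoder(obs_t1)                      # representation of the next observation; a real
                                                           # system also needs an anti-collapse term (see later)
    loss = ((s_t1_pred - s_t1_target) ** 2).mean()         # prediction error in representation space
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# One step on random stand-in data:
print(world_model_step(torch.randn(8, OBS_DIM), torch.randn(8, ACT_DIM), torch.randn(8, OBS_DIM)))
```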
So one idea which is very natural, which a lot of people have been working on, including me, for many years, is to use the same idea as LLMs, which is to train a generative model to predict what's going to happen next in a video, right?
So take a full video and corrupt it by masking the second half of it, let's say.
Okay? And so this encoder sees only the first half of the video; it produces a representation of it and then runs this through some sort of predictor-decoder that, given the action that you know is taking place in the video, produces the rest of the video at the pixel level, okay?
So it just predicts all the details, everything that is supposed to take place in the video.
And that's essentially an impossible task.
It's an impossible task because there are many things that can plausibly happen in a video, that may happen in a non-deterministic way, that you cannot predict.
And so if you train a neural net to make a single prediction for what's going to happen next, the best thing it can do is predict some sort of average or aggregate of all the possible futures, right?
And in fact, that's exactly what happens.
So this is from an old paper, almost 10 years ago now. We trained some big neural net, big for the time, on four frames and then trained it to predict the next two frames.
And the predictions are really blurry.
You see the same thing here. Those are kind of stylized videos of cars seen from the top on a highway.
The central car here is fixed, and the system is trying to predict what the cars around it are going to do.
And you get those blurry predictions because it doesn't know if a car is going to accelerate or brake.
And so it predicts the average.
Now, using various techniques with latent variables, you can feed your neural net with latent variables that you either sample from a distribution or optimize in some way.
And you can correct this flaw to some extent, and at least for simple videos like those highway ones, produce videos that are crisp.
And depending on the value of the latent variable, you predict multiple different futures, right?
So with a latent variable, you can sort of parameterize all the potential plausible futures that could happen in the video.
Unfortunately, it doesn't really work for natural videos.
If I take a video of this room, point the camera at this side, rotate slowly, stop here, and ask the system to predict the rest of the video, it will predict that we are in a room, that the chairs are red, and that there's probably a wall on the side.
But there's no way it can predict the texture of the floor or the texture of the wall, and it cannot possibly predict what all of you look like and where you sit, right?
I mean, that information is just not predictable.
So either a system like that has to make up some plausible instantiation of what may happen, or predict an aggregate of everything that could happen.
But basically, the problem of predicting, at the pixel level, what goes on in natural signals, particularly video, is basically impossible.
So you say, okay, well, we can do like an LLM; LLMs don't actually predict a single token.
They predict a distribution over tokens. Okay?
So what that means is that we would need to parameterize a distribution over a high-dimensional continuous space, like the space of all possible video frames.
And that's just mathematically intractable.
The best way we can represent distributions is the way physicists do it:
you write down an energy function, and then you take e to the minus the energy function and normalize. Most of the time that normalization term, which is a big integral, is intractable, at least for interesting distributions.
So here's a proposal.
The proposal is just don't predict at the pixel level, predict at the representation level, okay?
So we're going to build this architecture, which I call JEPA; that means Joint Embedding Predictive Architecture.
And basically, instead of predicting all the pixels, we're going to predict a representation of the pixels.
Okay? So we're going to run the video through an encoder, and the partially masked video through an encoder, maybe the same encoder, and simultaneously with the encoders train a predictor to minimize the prediction error.
But the prediction is going to take place in representation space.
In representation space,
it's an abstract representation.
It may not contain all the details about the world that are just not predictable.
We might eliminate all the details that are not predictable, making the prediction task much simpler.
So that's the comparison between those two architectures.
These are generative architectures: predict all the details of the variables you want to predict. And this is the joint embedding predictive architecture:
find a representation within which you can make predictions, and that representation will not contain all the details.
Now, if you think about this, this is how we apprehend the world.
We find representations that allow us to make predictions.
We don't represent the world in all of its details.
The entire purpose of science even is to find those representations so we can make predictions, representations that ignore the details.
So in fact, if I want to predict the trajectory of planets viewed from the earth, right?
They seem kind of complicated, because sometimes they go forward, sometimes they go back, et cetera, right?
And there's some periodicity, and, you know, people in antiquity kind of figured out how to predict this.
But it was kind of complicated until the appropriate representation for the problem was figured out, which is that the earth and the other planets rotate around the sun, and they have elliptical orbits.
And then it all becomes simpler and you can predict everything.
And to predict the trajectory of any planet like Jupiter, where is Jupiter going to be a hundred years from now?
You don't need to know all the details about Jupiter.
As a matter of fact, you only need to know six numbers, three positions and three velocities, and that's it.
So the question of finding appropriate abstract representations that eliminate all the details in such a way that they allow us to make predictions is really fundamental to science and to intelligence in general.
I would argue. In fact, to expand a little bit on this: in principle, I could describe everything that is taking place in this room at the moment in terms of quantum field theory.
I would have to measure the wave function of all the quantum fields in this room, which of course is an impossible task.
And then I would have to have some, you know, super gigantic, powerful quantum computer that would allow me to make the prediction, assuming there is not too much interaction with the rest of the universe, which of course is not the case.
So it would be an impossible task. So what do we do?
We invent abstractions, okay?
We have particles, we have atoms on top of that, molecules on top of this; in the living world we have proteins, organelles, cells, organisms, individuals, societies, ecosystems, right?
So we have this whole hierarchy of representations, and each level allows us to make kind of bigger, bolder, longer-term predictions, while eliminating a lot of details about the level below.
In physics, there are actually two systematic ways of doing this.
One of them is called renormalization.
Renormalization, or, you know, renormalization group theory, is a way of representing the state of a group of sites or particles or spins or whatever you want in sort of an abstract way, if you want,
so as to not have to deal with the details of the actual state.
And similarly, in physics, there's a notion of entropy, right?
I can make predictions about the properties of a box full of gas, PV equals nRT, right?
If I compress the gas, the temperature is going to go up, you know, things like that.
But I've ignored the positions and velocities of each of the individual molecules in the gas.
And we call this entropy, right?
We even have a name for the information we leave behind when we go one level up in the hierarchy.
What's interesting about this hierarchy is that every level in the hierarchy is a different field of science.
So perhaps a field of science, for natural science at least, is actually defined by the abstraction level at which we choose to make predictions.
Okay? So much for the philosophy, okay?
So if we're able to train a system to have a mental model of the world, right,
that allows it to predict what's going to happen, perhaps as a consequence of its actions,
How can we use this as the basis of an intelligent system?
So I wrote this sort of vision paper three years ago, that I put online for comments, about where I see AI research going over the next 10 years.
This was before the ChatGPT craze, but I haven't changed my mind about this.
And here's an example of how this could be implemented.
So this is an intelligent AI agent.
It's observing the world through a perception system that gives it an idea of the current state of the world that it can currently perceive.
Of course there is a lot that the agent probably knows about the world that is not currently perceivable.
Like, you know, we know the state of our house to some extent, and things like this.
We have, you know, a complete idea of the state of the world,
which is stored in our memory; we don't currently perceive it.
So we might want to combine what we perceive about the world with the content of a memory, and feed this to our world model.
And the world model is going to take an imagined sequence of actions that we imagine taking, and is going to predict the resulting state of the world, or the sequence of states that the world is going to go through, as a consequence of the actions that we imagine taking.
Now what we can do is feed this predicted state to a task objective.
That's an energy function that measures to what extent a particular task has been accomplished,
a goal has been reached. So this guy will produce a scalar:
zero if the task has been accomplished, and a larger positive number if it's not, potentially indicating some distance to the objective. But we might have other cost functions, other objectives, that are guardrails, which would, you know, prevent the system from taking actions that would not be safe, right?
So if I have a domestic robot and I ask it to, you know, get me coffee, it goes to the coffee machine, and there is someone standing in front of the coffee machine; I don't want the robot to just, you know, slash that person to pieces to get access to the coffee machine.
So, you know, obviously we need to kind of hardwire some guardrail objectives into that robot, and that robot would not be able to escape those guardrails, because the way it operates, the way it produces an output, is that it searches for an action sequence which, according to its internal world model,
would actually satisfy those objectives, the task objective and the guardrails.
And it can't escape that; this is by construction, okay?
So if you put a guardrail in it, it has no choice but to satisfy it.
And so this is an example of the inference by optimization I was telling you about before.
Really, what this describes is planning; it's an example of classical planning as it is used in robotics.
Now, if we have a world model that can make predictions to a certain horizon, we can probably apply it multiple times in an autoregressive fashion and feed it with a sequence of actions, one at each step.
And perhaps this world model is just a mechanical model of a robot arm or something.
So it's a very simple thing, a differential equation for example, that we can apply multiple times.
And in fact, this is a classical way, in optimal control, of planning a sequence of actions.
You have a model of the system you're controlling, generally a set of handwritten equations, but in our case we're going to learn it.
And then a cost function that characterizes whether a task has been accomplished.
And then you plan, by optimization, a sequence of controls or actions that will minimize this cost, subject to maybe some constraints.
In classical optimal control this is called MPC, model predictive control.
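Here is a minimal sketch of that planning loop with a toy differentiable world model. The 2-D point dynamics, the quadratic task cost, and the obstacle guardrail are placeholders of my own, and the search is plain gradient descent over the action sequence:

```python
import torch

def world_model(state, action):
    """Toy dynamics standing in for a learned world model: next state = state + action."""
    return state + action

goal = torch.tensor([5.0, 3.0])
obstacle = torch.tensor([2.5, 1.5])

def task_cost(state):                  # zero when the goal is reached, larger otherwise
    return ((state - goal) ** 2).sum()

def guardrail_cost(state):             # penalize coming within 1 unit of the "unsafe" region
    return 10.0 * torch.relu(1.0 - (state - obstacle).norm())

T = 10
actions = torch.zeros(T, 2, requires_grad=True)   # the action sequence we search over
opt = torch.optim.Adam([actions], lr=0.1)

for _ in range(300):                   # inference by optimization, in action space
    state, cost = torch.zeros(2), torch.tensor(0.0)
    for t in range(T):                 # roll the world model forward through the imagined actions
        state = world_model(state, actions[t])
        cost = cost + guardrail_cost(state)
    cost = cost + task_cost(state)     # task objective evaluated on the predicted final state
    opt.zero_grad(); cost.backward(); opt.step()

print("planned final state:", actions.detach().cumsum(0)[-1])  # should approach the goal
```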
But what's complicated about this here is that the world model may be really complicated, maybe some big neural net trained from lots of data.
The input may be video, the actions may be complicated.
There may be some discrete and non-continuous behavior, and the cost function in the space of those actions may be extremely irregular, and maybe non-continuous.
Even if all those modules are mostly differentiable, there could be kind of non-continuous things.
So for example, if I want to go from where I am standing now to the other side of the desk here, I can choose to go around this side or around that side.
And that's a discrete choice, which will result in two completely different costs for getting to the other side.
Okay? It's going to be more costly if I go this side than if I go that side.
Yet the difference in action that I can take to choose between those two is very small.
I can change my initial action from, you know, taking a step in this direction to taking a step in that direction.
That could be a very small change, yet it's going to result in a discontinuous change in cost.
So those functions are going to be very complicated.
Since we're supposed to talk about geometry: this is going to pose some major issues in optimization here that are not really solved, that a lot of optimization people have been thinking about for many decades, actually, in the context of optimal control.
But here it's even more complicated, given the fact that those world models might end up being very large neural nets.
The world is not entirely predictable.
So the way you handle non-determinism is through latent variables.
Neural nets are deterministic functions, but you can feed them with latent variables that are sampled from a distribution, or maybe inferred another way, which basically parameterize the set of plausible predictions.
And then the planning problem becomes even more complicated,
because you don't know the value of the latent variables, and you have to plan in the context of uncertainty, essentially.
Ultimately what we want to do is build a model like this.
And I should tell you right now, nobody has done this, okay?
But build a model that is hierarchical in the way that I was describing earlier, in such a way that we can use it to do hierarchical planning.
What does that mean? If I'm sitting in my office at NYU and I decided I want to be in Paris tomorrow, I cannot possibly plan my entire trajectory from New York to Paris in terms of elementary actions I can take, which in the case of humans are millisecond by millisecond muscle controls.
I have to plan on a much higher level, right?
Which would be: okay, I'm in New York, the best way to go to Paris is to go to the airport and catch a plane. That requires a mental model of, you know, what it means to go to the airport and to catch a plane, and what a plane can do, and things like that.
But it's a very abstract model at a very abstract level, where the actions are very high-level things, like taking a taxi to go to the airport, or getting an airplane ticket and jumping on a plane.
But I have a subgoal now, which is getting to the airport, and maybe my cost function, my new cost function, is not my distance to Paris anymore, but my distance to the airport.
Okay? So now it's a shorter-horizon objective.
I want to go to the airport, and I'm in New York,
so I can just go down to the street and hail a taxi.
Now my subgoal is getting down to the street.
How do I get to the street? I'm sitting in my office; I need to go to the elevator, push the button, get into the elevator, walk out of the building.
How do I go to the elevator?
I need to stand up from my chair, pick up my bag, open the door, shut my door, avoid all the obstacles, you know, say bye-bye to my students, blah, blah, blah.
There's a point in this hierarchy where I have all the information I need, and I may not actually need to plan explicitly; I can just take the action, 'cause I'm kind of used to doing it.
So I can revert to System 1, which is sort of reactive action, okay?
So this hierarchical planning requires hierarchical world models that work at different timescales and different levels of abstraction.
There is the level of abstraction of, you know, going to the street and catching a taxi.
And there is the higher level of abstraction of going to the airport and catching a plane.
How do we train a system like this to learn the appropriate levels of abstraction?
And then, once we have it, how do we use it to plan hierarchically,
the way I just described? This is completely unsolved.
If you are, I don't know, starting a PhD in AI, this is a good problem to start thinking about,
'cause it's completely unsolved; it's wide open.
So put this whole thing together and you arrive at what some have called cognitive architectures, which is kind of a way to put all of those modules together.
Perception; memory, which is kind of like the hippocampus in the mammalian brain;
the world model, which is probably in the prefrontal cortex in humans; all kinds of cost functions, some of which are really intrinsic costs that were kind of hardwired into us by evolution, but many of them are costs that we define ourselves,
basically goals and things like this.
And then a way to sort of search for action sequences which, according to the world model, will produce the outcome we want.
Okay? So we have kind of an overall architecture for the AI system.
How are we going to train those world models from observation using self-supervised learning?
So the idea of those joint embedding architectures goes back a long time, to the early nineties in fact. It was a type of model we used to call Siamese networks, and they've kind of evolved over the last few years to some extent.
And basically we have this sort of architecture with, you know, two encoders, which may or may not be the same; a predictor, which may be conditioned on an action and may depend on latent variables to account for the non-determinism of the world;
and then some prediction-error cost function, and maybe some other cost functions that drive the system to learn appropriate representations.
Okay. The way to conceptualize how we want to train a system of this type is that a system of this type basically produces a scalar output which, as I said before, can be interpreted as an energy that measures the incompatibility between the input and the output, between x and y.
And what we need is a way to train our system in such a way that it produces low energy for training samples, pairs of x and y that we observe, and higher energies for pairs that we do not observe.
And that's where it becomes complicated.
So let's imagine that we have two scalar variables here, x and y, and we have training samples, which are those black dots.
And what I want is my learning machine to learn an energy function that takes low values near the training samples and higher values outside.
So basically, you know, some sort of landscape. It could be high-dimensional, because that depends on the dimension of x and y, or at least the representations of x and y.
So how do I do this?
How do I train a parameterized function that produces a scalar output to give me low output for things I trained it on, but higher output for things I didn't train it on?
There are two methods, and basically a big issue there is to prevent collapse.
So if I just train a system like this to just minimize the prediction error, I just show it pairs of x and y and minimize the prediction error, it will collapse: basically it will ignore x and y, it will produce representations s_x and s_y that are constant, and then the prediction problem becomes trivial.
And so the prediction error is zero; it's going to be zero for everything.
Okay? Not a good way to capture the dependency between x and y.
So I need to have a way of making sure that the energy is large for things that the system is not trained on.
And the advantage of representing a dependency between variables as an implicit function of this type is that I can represent dependencies between x and y that are not functions, okay?
There's no function that maps x to y here, because there can be multiple y's for a given x.
And so the energy function is basically an implicit function that represents the dependency between the two.
Okay? So this energy function can collapse if I merely train it to minimize the energy of those training samples, which are those blue beads.
Let's say this is x and this is y; I might end up with an energy surface that is completely flat.
So the way to prevent this from happening: there are two methods that I know about.
One is contrastive methods.
So you generate those green points, which are outside the manifold of the data if you want, and you push the energy up.
So you change the parameters of your neural net so that the energy goes up, so that it is high for those green dots.
And the big question is how you generate those green dots.
And then there's another big question, which is that if the dimension of the space within which you do this is high, and the learning machine is fairly flexible, then the number of those contrastive points you're going to have to generate is going to grow exponentially with the dimension.
And that's not a good idea. It doesn't scale very well.
I used to be a big fan of those methods, I contributed to inventing them, but I became very pessimistic about them.
What I prefer is regularized methods.
So those are methods that basically have a regularizing term that tries to minimize the volume of space that can take low energy, so that when you push down the energy of certain parts of the space, the training samples, the rest has to go up, because there is only a limited amount of low-energy volume to go around.
So those are the two categories of methods, and I became kind of more of a fan of the second category; I'll come back to this in a second.
So you can sometimes turn energy-based models into probabilistic models by using a Gibbs distribution: take the exponential of minus the energy and normalize, and you get a properly normalized conditional distribution of y given x.
The problem is that most of the time, for any reasonable energy function, the denominator is
intractable. And so you don't need to deal with this; just deal with the energy directly.
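In formulas, this is the standard Gibbs construction (with β an inverse-temperature constant):

$$
P(y \mid x) \;=\; \frac{e^{-\beta E(x,y)}}{\int e^{-\beta E(x,y')}\,dy'}
$$

The integral in the denominator, the partition function, is the intractable normalization term, which is why it is preferable to work with the energy E(x, y) directly.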
I made a list of various methods that people have proposed over the decades and how they can be interpreted in this framework, in terms of whether they're contrastive or regularized.
I'm not going to go through the list, but it's interesting to go through that exercise.
So how are we going to use this self-supervised learning, let's say contrastive to start with, to train a system, for example, to represent images so that we can do image recognition?
So the process is that we're going to train this joint embedding architecture in some way, and then once it's trained, we chop off the predictor.
We just use the encoder to produce a representation.
And then we train a very simple classifier on top of it using supervised learning to do, for example, image recognition or depth estimation or something like this.
Contrastive methods are very simple. They consist in showing pairs of images that are basically different versions of the same content, using distortion or corruption of some kind,
and then training the system to produce the representation of the original image from the distorted or corrupted one.
And then you have to have contrastive samples, which are pairs of images that you know are different.
And then you push the predicted representation and the actual representation away from each other.
Okay? So you have some loss function that is going to pull those two guys together and push those two apart.
And this kind of works, but it never produces representations that fill spaces of more than about 200 dimensions when you train them on things like ImageNet.
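A minimal sketch of such a contrastive loss (a margin-based variant of my own; the margin value and the random stand-in representations are illustrative):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z_pred, z_pos, z_neg, margin=1.0):
    """Pull the predicted representation toward the representation of the matching
    (distorted) image, and push it at least `margin` away from a non-matching one."""
    pull = F.mse_loss(z_pred, z_pos)                               # attract the positive pair
    push = F.relu(margin - (z_pred - z_neg).norm(dim=-1)).mean()   # repel the negative pair
    return pull + push

# Stand-in representations for a batch of 8 images, 128-dimensional:
z_pred, z_pos, z_neg = (torch.randn(8, 128) for _ in range(3))
print(contrastive_loss(z_pred, z_pos, z_neg))
```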
There's another type of method called distillation, and those have been considerably more successful.
Those are sort of regularized methods also, although the main issue is that we don't really understand why they work, although there is some theoretical work.
So basically, you take an input, you transform or corrupt it, you get a different version of it.
You run one version through an encoder to produce a representation, run the other through an encoder with the same architecture but slightly different weights, then run the first through a predictor, and then minimize the prediction error.
But you don't back-propagate gradients through this second encoder, because you're not going to train
the weights of this encoder through gradient descent. The weights of this encoder are going to be essentially the weights of that encoder, except that
this weight vector is going to be a running average of the weight vectors of that encoder over time.
Okay? Take the past several values of the weight vector and average them, and that gives you the weights of this encoder.
Basically, the weights of this encoder cannot move as quickly as the weights of that encoder.
That one gets gradients back-propagated and updates its weights, and then it updates the weights of this one, which is kind of a low-pass-filtered version of the previous weights.
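Here is a minimal sketch of that running-average (exponential moving average) update, assuming two PyTorch modules with identical architectures; the decay value is an illustrative choice:

```python
import torch

@torch.no_grad()
def ema_update(online_encoder, target_encoder, decay=0.99):
    """The target encoder's weights follow a running (exponential moving) average of the
    online encoder's weights, so they change more slowly; no gradients flow into it."""
    for p_online, p_target in zip(online_encoder.parameters(), target_encoder.parameters()):
        p_target.mul_(decay).add_(p_online, alpha=1.0 - decay)

# Typical use inside a training loop (sketch):
#   loss = prediction_error(predictor(online_encoder(view1)), target_encoder(view2).detach())
#   loss.backward(); optimizer.step()      # updates the online encoder and predictor only
#   ema_update(online_encoder, target_encoder)
```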
Somehow this works; somehow this doesn't collapse. Why?
I don't know; it's kind of mysterious.
The idea came from intuitions from reinforcement learning for some reason, but there are a bunch of methods.
This one came from DeepMind; those four came from my colleagues at FAIR.
And there is some theoretical work, also from some of my colleagues at FAIR and at Stanford, that attempts to explain why this does not collapse every time.
And if you make the hypothesis that the encoder and the predictor are linear, then you can show that there are fixed points of the gradient dynamics that are not collapsed.
But that's the best explanation we have for why this works.
And it's not entirely satisfactory.
Also, it's very weird, because there is no objective function that you can monitor that goes down as you train, because you're not actually minimizing anything, even though you're doing gradient descent.
It's very strange, but it works really well.
And so there's a technique, a particular instance of this technique, called DINO.
It was made by French people at FAIR Paris,
so they pronounce it "dee-no"; on this side of the pond, "dy-no".
But it works really well.
It produces really good results when you train it on, you know, distorted versions of ImageNet and whatnot.
And when you scale it up, with a very large network and a lot of training data,
and this paper is only a few months old,
you can show that the performance of those self-supervised learning systems surpasses, or at least matches, the performance of purely supervised systems, but with considerably less data.
Okay? So this is the first time, and it is only a few months old, where it's very clear that for image understanding, self-supervised learning now surpasses the best supervised learning methods.
So if you have money to spend, you're better off spending it on scientists to fine-tune self-supervised learning methods and collect unsupervised, unlabeled data, rather than spending it on people to label your data.
And it wasn't clear until, you know, March or April. This DINO model is really kind of amazing.
It can produce generic representations of images that can be used for all kinds of applications, not just object recognition, but all kinds of stuff in medical imaging, in biological image analysis, in astrophysics, in remote sensing, in all kinds of domains.
And it basically produces state-of-the-art performance when you train a head on top of the representation for a wide variety of visual tasks, whether very semantic or low-level.
Okay. But can we use those representations to train a world model so that we can do planning, as I was explaining earlier?
And the answer is yes. So this is work led by Lerrel Pinto, who is a colleague of mine at NYU, a roboticist, and myself, with two of our students, Gaoyue Zhou and Hengkai Pan.
And what we did here is take the DINO encoder, so feed images to the DINO encoder, and then train a predictor on top of it which is action-conditioned.
So you have a view of the world and an action that the robot is taking.
Can you predict a representation, the DINO representation, of the next view of the world that results from taking this action?
And then can you use it for planning a trajectory so as to arrive at a goal and fulfill the task? And the answer is you can do this in certain cases, and the performance is better than sort of previous systems that people have worked on.
DreamerV3 is a system from DeepMind, and this is, you know, model predictive control, essentially.
So start with an initial state, run it through the DINO encoder, then run your world model with a hypothesized sequence of actions.
Measure the distance to an encoded target image, and then, through optimization, figure out the sequence of actions that will minimize this distance, and then, you know, take the first few actions in the actual environment, and then perhaps replan.
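A minimal sketch of that receding-horizon loop with a simple random-shooting planner; here `encoder`, `predictor`, and `env` are placeholders of my own for the trained DINO encoder, the action-conditioned predictor, and the robot environment, and the predictor is assumed to take a batch of representations plus an action:

```python
import torch

def plan(encoder, predictor, obs, goal_image, horizon=5, n_samples=256, act_dim=4):
    """Sample candidate action sequences, roll the world model forward in representation
    space, and keep the sequence whose predicted final state is closest to the goal."""
    with torch.no_grad():
        s = encoder(obs).repeat(n_samples, 1)                   # assumes a (1, D) representation
        s_goal = encoder(goal_image)
        candidates = torch.randn(n_samples, horizon, act_dim)   # random-shooting proposals
        for t in range(horizon):
            s = predictor(s, candidates[:, t])                  # predicted next representation
        dist = ((s - s_goal) ** 2).sum(dim=-1)                  # distance to the encoded target image
        return candidates[dist.argmin()]

# Receding-horizon control (sketch): execute the first planned action, observe, replan.
# while not done:
#     actions = plan(encoder, predictor, obs, goal_image)
#     obs = env.step(actions[0])
```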
And this works really well. Let me skip ahead a little bit and show you a demo of that system for a particular task here.
So the predictor has been trained relatively generically, but these are the target configurations.
This is the initial state, and these are the actions that are planned by the system so as to move those blue chips as close as possible to those configurations, in something like 25 actions.
Okay? It's limited to 25 actions.
And this works pretty well.
The dynamics of the environment here are pretty complicated, because those blue chips kind of interact with each other and everything, and the same technique works pretty well for a wide variety of different environments.
Let's see.
Okay, we're going to have to watch this video again because it doesn't want to switch to the next slide.
So this is far from perfect in various respects, but it's kind of a good example of learning a task zero-shot.
You don't need to train the system to accomplish a task.
If it has a good world model,
it will accomplish the task by planning.
No need for training, for reinforcement learning, for learning a policy or anything.
Purely planning. A similar project was led by Amir Bar, who was until recently a postdoc with me at FAIR and is now a research scientist at FAIR.
There are some demos you can look at here.
And this is for navigation.
So he took videos from mobile robots where you get a view from the robot, and the robot moves, it translates and rotates, and you know the transformation because you have odometry from the wheels, and you get a different view.
Can you predict the next view of the world, at the representation level, from the previous view and the displacement, the transformation matrix basically?
And if you can do that, can you use it to plan?
So, you know, can you tell a robot here, like, go to the blue trash can?
It's actually very far in the back, but it sees the blue trash can and it can sort of, you know, plan a sequence of actions to go to the blue trash can.
And this works pretty well.
This paper actually won a best paper award mention at the last CVPR conference.
It's pretty cool. I mean, I think there are, you know, a lot of situations now that we're going to be able to handle with those world models, with slightly more generic ways of training them, perhaps like the ones I'm going to tell you about just now.
So this is work on I-JEPA, V-JEPA, and V-JEPA 2, video JEPA 2, which is more recent, where it's again one of those distillation-type models where you have two encoders and they share weights with this exponential-moving-average trick.
And you train the system to predict a representation of a full image from a representation of a partially masked image, using an encoder.
And what we show with this experiment is that this system, which is not trained by reconstruction, purely joint embedding, you know, trains really quickly and produces really good performance, much better than an alternative project done by our colleagues at FAIR.
That one is MAE, masked autoencoder, and it is trained by reconstruction to predict pixels, right?
So you take an image, you corrupt it by removing some patches, and you train a gigantic system to reconstruct the full image.
This basically was not a big success.
The representations you learn from this are not that great, and it takes a long time.
Also, more recently, there's a version of this that works on video.
So you take a video, you corrupt it by masking a whole bunch of areas within the video,
and then you train, again, the system to predict the representation of the full video from the representation of the partially masked one, through a predictor.
Perhaps the variable that is fed here is the location of the regions that are masked.
And what you get at the end is a good representation of videos that you can use for classifying actions, things like that, et cetera.
But what's interesting about this is that it can learn some level of common sense. If you show it a video where something impossible occurs, like this ball is, you know, thrown in the air and all of a sudden disappears, and you apply this video JEPA system with a sliding window over the video,
the prediction error will shoot through the roof when this occurs, because it knows it's impossible.
And so that's interesting, because those are kind of the first models that we have that have learned a little bit of common sense, if you want, or intuitive physics, completely unsupervised.
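A minimal sketch of that sliding-window surprise measure; `encoder` and `predictor` below are placeholders standing in for the trained video JEPA components, and the window size is an illustrative choice:

```python
import torch

def surprise_curve(frames, encoder, predictor, window=4):
    """Slide a window over the video; at each step, compare the predicted representation
    of the next frame with its actual representation. Spikes mark 'surprising' events."""
    errors = []
    with torch.no_grad():
        reps = [encoder(f) for f in frames]
        for t in range(window, len(frames)):
            pred = predictor(torch.stack(reps[t - window:t]))        # predict from recent context
            errors.append(((pred - reps[t]) ** 2).mean().item())     # representation-space error
    return errors

# A large jump in the returned curve at some time step flags a physically implausible
# event (for example, an object vanishing mid-flight), as described above.
```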
So we have a long paper that describes a whole bunch of experiments about this, which I don't have time to go through.
And there's a new version of this, called V-JEPA 2, which is more recent, and you can see some examples there.
And there are two phases:
one where we just train on video, and the other where we train a predictor, which is action-conditioned, that we can use to plan action sequences for robots.
And let me show you a short video of that.
So this is an unfamiliar environment the system has not been trained on, and, you know, it doesn't know anything a priori; there's no calibration of the camera or whatever.
And it's pretty robust to the particular anatomy of the robot and the position of the camera.
And it basically plans a sequence of actions so as to reach a particular goal, which in this case is putting this cup down on the table.
So let me skip ahead a little bit, and not bore you with tables of results for V-JEPA 2.
One technique that we are working on now, and we have some results about this in small cases, is how to prevent those systems from collapsing using regularized methods.
And one trick is to basically have an estimate of the information content, the quantity of information, coming out of the encoder.
If you can maximize the information that comes out of the encoder, you'll prevent the system from collapsing.
And imagine that you pass a bunch of samples through the encoder.
So each row in this matrix is a different sample, and each column is a different variable of the representation coming out of the encoder.
You have two ways to maximize the information contained in this matrix.
One is you can make sure that all the rows of this matrix are different.
So basically every sample has a different representation.
They don't all collapse to having the same representation.
And this corresponds to contrastive methods, or sample-contrastive methods.
And then the alternative is to make sure that all the columns of this matrix are different.
In other words, every variable in the representation carries different information.
Okay, one way to do this in the first case is to compute the Gram matrix of this matrix, basically the product of this matrix by its transpose, and make sure that the Gram matrix is close to the identity, so that all the samples are different, or orthogonal.
And this one is the converse.
You compute the transpose of this matrix times itself, which is the covariance matrix, and try to make that covariance matrix close to the identity.
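A minimal sketch of those two regularizers on a batch of representations (simplified versions of my own, in the spirit of this family of methods rather than any exact published loss):

```python
import torch
import torch.nn.functional as F

def sample_reg(Z):
    """Rows (samples) should all differ: push the N x N Gram matrix toward the identity."""
    Zn = F.normalize(Z, dim=1)
    gram = Zn @ Zn.T                                   # N x N similarities between samples
    return ((gram - torch.eye(Z.shape[0])) ** 2).mean()

def dimension_reg(Z):
    """Columns (variables) should carry distinct information: push the D x D covariance
    matrix toward the identity."""
    Zc = Z - Z.mean(dim=0)
    cov = (Zc.T @ Zc) / (Z.shape[0] - 1)               # D x D covariance across the batch
    return ((cov - torch.eye(Z.shape[1])) ** 2).mean()

Z = torch.randn(256, 32)                               # a batch of 256 representations, 32 dimensions
print(sample_reg(Z), dimension_reg(Z))
```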
So this is a way of, you know, approximately maximizing the information content in a representation. But it's approximate, because we would like to maximize information content, and we don't have any lower bound on information content.
We only have upper bounds, for very deep reasons:
the fact that we can't model all the possible dependencies between variables.
All the estimates of information content that we have are upper bounds, overestimations.
And so it's a bit of an issue there, but this technique, which works by matching the covariance matrix with the identity, works really well.
Let me skip ahead to the last slide essentially.
Okay, so I have a bunch of recommendations here.
Essentially: abandon generative models in favor of those joint embedding predictive architectures that don't predict in the input space but predict in representation space.
Predicting in input space works only if you have discrete symbols, but in the real, physical world, you have to learn representations.
Use the energy-based framework to really understand how this works.
Probabilistic modeling basically leads to intractability and is unnecessary. Abandon contrastive methods in favor of those regularized methods I was telling you about.
And I wouldn't say abandon reinforcement learning, but at least minimize the use of reinforcement learning,
'cause reinforcement learning is extremely inefficient, requires many trials, and so you have to use it as a last resort.
So when I say all of this, and these are all the pillars, the most popular concepts in machine learning at the moment,
it doesn't make me very popular, particularly the first one; basically I have to walk around with bodyguards in Silicon Valley.
I'm joking. So basically, if you are interested in sort of getting AI to the next level, to human-level AI possibly, or maybe cat level, don't work on LLMs; work on JEPA.
Thank you very much.