Yann LeCun "Mathematical Obstacles on the Way to Human-Level AI"
By Joint Mathematics Meetings
Summary
## Key takeaways

- **LLMs are doomed, not the path to human-level AI**: Autoregressive Large Language Models (LLMs) are fundamentally flawed due to their error-prone, exponentially divergent prediction process. This inherent limitation means they are not a viable path towards achieving human-level artificial intelligence. [11:56]
- **AI struggles with the physical world, unlike cats**: Current AI systems, despite passing exams and solving complex math problems, fail to grasp the physical world as effectively as a cat. This is a significant obstacle, as understanding intuitive physics, causality, and object permanence is crucial for advanced AI. [12:48], [12:52]
- **Sensory data dwarfs text for true AI understanding**: The sheer volume of information available through human senses (vision, touch, audition) vastly exceeds that found in all human text. To achieve human-level AI, systems must learn from observing the world, not just from text data. [16:11], [16:39]
- **Inference by optimization is key for advanced AI**: Instead of fixed-layer computations in LLMs, AI should adopt inference by optimization. This involves searching for solutions that minimize a cost function, similar to model predictive control, enabling more complex reasoning and planning. [20:29], [20:34]
- **JEPA architectures outperform generative models for AI**: Joint Embedding Predictive Architectures (JEPA) are superior to generative models for AI development. JEPA focuses on learning abstract representations by predicting in a representation space, filtering out unpredictable details, unlike generative models which often produce blurry predictions. [34:01], [34:47]
- **AI needs world models for planning and common sense**: To achieve advanced machine intelligence (AMI), systems require world models learned from sensory input. These models enable intuitive physics, common sense reasoning, and hierarchical planning, allowing AI to predict consequences and act effectively. [18:42], [26:30]
Topics Covered
- AI can pass the bar but can't empty a dishwasher.
- A child sees more data than all internet text.
- The future of AI is objective-driven, not generative.
- Why we must abandon generative AI.
- Autoregressive LLMs are doomed. Here is why.
Full Transcript
good
afternoon, welcome everybody. I'm Bryna Kra,
and I'm president of the AMS, and it's my
pleasure to welcome all of you to this
AMS Josiah Willard Gibbs lecture. These
lectures were established in
1923 to show the public some of the
aspects of mathematics and its
application. They are named in honor of
Gibbs, a mathematical physicist
who made deep theoretical contributions
in physics, chemistry, and mathematics,
something all of us do, right? Yeah. Gibbs
is one of the founders of statistical
mechanics, he coined the term, and of
vector calculus. His influence was so
profound that he was even featured
on a US stamp in 2005.
The list of previous Gibbs
lecturers is a veritable who's
who in broad mathematics, including G. H.
Hardy, John von Neumann, and Albert
Einstein so today it's my pleasure to
add another one to this list and
introduce Yann LeCun for the Josiah Gibbs
lecture. Yann is the Jacob T. Schwartz
professor of computer science, data
science, neuroscience, and electrical and
computer engineering, competing with
Gibbs in terms of fields, at the Courant
Institute of Mathematical Sciences at
New York University, and he is the chief AI
scientist at Meta. He is well known for
his work in numerous areas, especially in
computer vision and deep learning, and
he's one of the developers of things you
may be using all the time: the DjVu
format that is probably somewhere on your
computer. He's won numerous awards; I
won't list them all, because then there
would be no time for his lecture, but I
will just note that he won the Turing
Award recently, in 2018, and he's a member
of numerous academies, including the National
Academy of Sciences, the National Academy
of Engineering, and the French Academy of
Sciences. Without further ado, let me
introduce Yann
[Applause]
All right, now that Bryna has listed all
the prominent speakers of the Gibbs
lecture, I'm intimidated,
and I don't believe I'll kind
of fill the shoes of those
names. But let me talk about AI:
obviously everybody is talking about AI
and particularly about obstacles towards
human level AI so a lot of people in the
AI research and development community
are perceiving the idea that perhaps we
have a shot within the next decade or so
of building machines that have a
blueprint that might um eventually reach
kind of human level intelligence uh the
estimates about how long it's going to
take vary by huge amounts u the most
optimistic people say we've we're
already there some people who are
raising lots of funds are claiming it's
going to happen next year but I don't
believe so myself um but I think we have
a good shot and and so I'm going to tell
you where I think uh research in AI
should go and what are the obstacles and
some of them are really mathematical
obstacles um okay
so why would we need to build AI systems
with human level intelligence and it's
because uh you know in the near future
we're all going to be walking around
with AI assistants um helping us in our
daily lives that that we're going to be
able to interact with uh through various
smart devices including smart glasses
and things like that through voice and
through various other ways of uh
interacting with them um so we have
smart glasses with cameras and displays
in them etc currently you can have smart
glasses without displays but soon um the
displays will will exist right now they
exist they're just too expensive to be
commercialized uh this is the Orion uh
demonstration built by our colleagues at
at Meta um so the the future is is
coming and the vision is that you know
all of us will be basically walking
around with AI assistants uh all our
lives it's like you know all of us will
be kind of like a uh you know high level
uh CEO or politician or something
running around with a staff of smart
virtual people working for us that's
kind of the a possible
picture um but the problem is we don't
know how to build this yet and and
really
um the current state of machine learning
is that it
sucks i mean in terms of learning
abilities compared to humans and animals
um it it really it's really very
inefficient in terms of the number of
samples or trials that machines have to
go through before they can reach a
particular level of performance um so in
the past the dominant paradigm of
machine learning was supervised learning
right so supervised learning is you give
an input to the system you wait for it
to produce the output then you tell it
what output you wanted and if the output
you wanted is different from the one
that the system produces the system
adjusts its internal parameters to get
the the output closer to the one you
want okay it's just learning an input
output uh uh function uh reinforcement
learning you don't tell the system what
the correct answer is to just tell it
whether it's good or bad whether the
answer it produced was good or bad and
the main issue with this is that it
requires the system to basically produce
multiple outputs and ask you is this
good is this bad is this better um and
that's even less efficient um so it only
works for games basically or for things
that you can simulate really quickly on
a computer um so one thing that has
revolutionized AI in the last few years
is called self-supervised learning and
it works really great it's really really
revolutionized
um AI um but it's still very limited so
self-supervised learning is the basis of
large language models and chat bots and
things like this and I'm going to tell
you in a minute how it
works um but really animals and humans
can learn new task extremely quickly and
they can understand how the world works
um they can reason and plan they have
common sense um and and the behavior is
really driven by objective it's not just
kind of predicting the next word in a
text okay so how how does those um those
chatbot and LLMs work and I only have
two slides on it and then I'm not going
to talk about it at all. Okay, so autoregressive
large language models:
they're trained to predict the next word
in a sequence, or the next symbol in a
sequence of symbols. They can be
words, they can be DNA,
music, protein, whatever discrete
symbols okay so you take a sequence of
symbols you feed it to a large neural
net and the architecture of the neural
net is such that the system is
trained to basically just reproduce its
input on its output; this is called an
autoencoder. Okay, so you just take the
input and then tell the system I just
want you to reproduce your input on your
output but the architecture of the
system is such that to produce one
particular variable the system can only
look at the variables that are to the
left of it in the sequence it cannot
look at the variable that it needs to
predict okay so basically what you're
training it to do by doing this you're
training it to predict the next symbol
in a sequence okay but you do this in
parallel over a large sequence
um so you measure some sort of diver
divergence between the input sequence
you feed it and the output sequence it
produces and you minimize that uh
divergence measure through
gradient basically gradient based
optimization with respect to all the
parameters inside of the uh predictor
function which is some gigantic neural
net which may have tens or hundreds of
billions of parameters okay this is
really high dimension okay once you've
trained that system uh when you take a
sequence and you run it through it the
system is going to predict the next
symbol um okay so let's say the the
window over which it looks at at symbols
here is three in reality in an LLM it
can be several hundred thousand but
let's say three uh so you feed three
words to that system and it produces the
next uh the next word now of course it
cannot predict exactly the next word so
what it produces is a probability
distribution over all the possible words
in a
dictionary and typically in LLM we're
not we don't actually train it to
produce word we train it to produce
tokens which are like subword units and
a typical number of possible tokens
would be on the order of
100,000 okay so now when you use the
system you feed it a sequence of words
called a prompt you have the system
predict the next word and then you take
that and you shift it into the input so
now you can ask the system what is the
second next word have you produce it
shift it into the input produce a third
word shift it into the input so that's
basically autoregressive prediction,
a very old concept, obviously, in signal
processing and statistics.
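A minimal sketch of this decoding loop, assuming a hypothetical `model` callable that maps a window of tokens to a probability vector over the vocabulary (an illustration of the generic procedure, not any particular LLM's API):

```python
import numpy as np

def generate(model, prompt_tokens, n_new, context_len=3, seed=0):
    """Autoregressive decoding: predict a distribution over the next token,
    sample from it, shift the sample into the input window, and repeat."""
    rng = np.random.default_rng(seed)
    tokens = list(prompt_tokens)
    for _ in range(n_new):
        window = tokens[-context_len:]     # the last `context_len` tokens (hundreds of thousands in a real LLM)
        probs = model(window)              # probability vector over the vocabulary (~100,000 tokens)
        next_token = rng.choice(len(probs), p=probs)
        tokens.append(int(next_token))     # shift the prediction into the input
    return tokens
```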
and it works really well it works
amazingly well if you make those uh
neural nets really large you train them
with very large input windows on tens of
trillions of tokens data sets that are
have tens of trillions of tokens
sequences with tens of trillions of
tokens it works amazingly well those
systems seem to discover a lot of
underlying structure about language or
about whatever sequence of symbols
you're training on. But there's a major
issue with autoregressive prediction,
and, you know, mathematicians in
the room here would probably do a much
better job than me at kind of writing
proofs about this, but autoregressive
prediction is kind of a divergent
process right if you imagine that you
have those those symbols are discrete so
every time you produce a symbol you have
multiple choices maybe 100 thousand
choices and you can think of the
sequence of all possible tokens
as some gigantic tree with a branching
factor of 100,000 okay within this
gigantic tree there's a small sub tree
that correspond to all answers that
could be qualified as correct okay all
continuations that you would think are
qualified as correct so if the prompt is
a question you know the answer would be
the the produced text would contain the
answer
now that sub tree is a tiny subset of
the the gigantic tree of possible
sequence of symbols and the problem is
if you assume which of course is false
that uh there is some sort of
probability of error every time you
produce a symbol and you assume those
errors are
independent, and that probability is e,
then the probability that a sequence
of n symbols is correct is (1 − e)^n.
Even if e is really small, this decays
exponentially with n, and it's not
fixable within the context of autoregressive
prediction. So my prediction
is that autoregressive LLMs are
doomed; a few years from now nobody in
their right mind will use them.
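Written out, under the speaker's (admittedly false) assumption of independent per-token errors with probability e:

```latex
P(\text{answer of } n \text{ tokens stays correct}) = (1 - e)^{n} \approx e^{-en},
\qquad \text{e.g. } e = 10^{-3},\; n = 10^{4} \;\Rightarrow\; (1 - e)^{n} \approx 4.5 \times 10^{-5}.
```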
And that's why, you know, you've
heard about LLM hallucination and things
like that, you know, sometimes they
produce nonsense, and it's essentially
because of this autoregressive
prediction so the question is what
should we replace this by and and you
know are there other types of limitation
now so I think we're missing something
really big in terms of like a new
concept of how to build AI systems we're
never going to get to human level AI by
just training large language models on
bigger data sets it's just not going to
happen and I'll give you another reason
why in a minute but never mind humans
never by never mind you know trying to
reproduce mathematicians or scientists
um we can't even reproduce what a cat
can do a cat has an amazing
understanding of the physical world and
I always say cat it could be a rat um
and we have no idea how to get an AI
system to work as well as as a cat in in
terms of understanding the physical
physical
world um house cats can plan really
complex actions they have causal models
of the world they know what the
consequences of their actions will be um
and humans are amazing a 10-year-old can
clear up the dinner table and fill up
the dishwasher without actually learning
the task you ask the 10-year-old to do
it the 10-year-old will do it the first
the first time it's called zero shot
learning um because the 10-year-old has
good mental model of the world and sort
of knows you know how objects behave
when you manipulate them and how they're
supposed to behave a 17-year-old can
learn to drive a car in 20 hours of
practice
and autonomous driving companies have
hundreds of thousands of training uh
training data of people driving cars
around we still don't have self-driving
cars at least not level five
self-driving cars unless we cheat way
more which is fine um so we have AI
systems they can pass the the bar exam
they can do math problems they can prove
theorems but where is my level five
self-driving car where is my domestic
robot we still can't build systems that
deal with the real world uh the physical
world turns out is much more complicated
than language
and that's called Moravec's paradox.
Right, there is this idea that
tasks that are complicated for humans
like computing an integral um solving a
differential equation um you know
playing chess or go planning a path uh
through a bunch of cities those are kind
of hard tasks for humans it turns out
computers are much better than us at
this um like they're so much better than
us at playing chess and go that really
humanity sucks um and what what that
tells you is that when people refer to
human intelligence as general
intelligence that's complete nonsense we
do not have general intelligence at all
we're extremely
specialized um okay so we're not going
to get to human level AI by just
training on text um and and there is a
interesting calculation you can make so
a typical LLM, a modern one, is trained on
something like 2×10^13
tokens, 20
trillion, and each token is about
three bytes, so that would be 6×10^13
bytes; let's round this up to 10^14
bytes.
Okay, that would take a few hundred thousand
years for any of us to read through this
that basically constitutes the entirety
of all the text available on the
internet
publicly uh so I mean that seems like an
incredible amount of training data but
now take a human child: a 4-year-old has
been awake a total of 16,000
hours, which by the way corresponds to
30 minutes of YouTube uploads, not
that much data. We have two million optic
nerve fibers, 1 million per eye, going
to the visual cortex; each optic nerve
fiber carries about one byte per second,
roughly, maybe a little less, but who
cares. So do the calculation, and
that's about 10^14 bytes in four
years. There's just enormously more
information in the physical world in
sensory information that we get from
vision and touch and
audition than there is in all the texts
ever produced by all humans
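A quick sanity check of the two order-of-magnitude figures quoted above (the per-fiber rate of about one byte per second is the speaker's rough estimate):

```python
# Text: ~2e13 tokens at ~3 bytes per token
text_bytes = 2e13 * 3                    # ≈ 6e13, rounded up to ~1e14 bytes

# Vision: 2e6 optic-nerve fibers x ~1 byte/s x 16,000 waking hours
vision_bytes = 2e6 * 1 * 16_000 * 3600   # ≈ 1.15e14 bytes in four years

print(f"text ≈ {text_bytes:.1e} bytes, vision ≈ {vision_bytes:.1e} bytes")
# Both land around 1e14 bytes, but the child gets there in four years,
# and the sensory stream keeps going.
```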
so again we're never going to get to
human level AI unless we can get system
to learn how the world works by
observing the world there's just way
more information there than there is in
text and uh psychologists have studied
this you know babies kind of learning
various things about about the real
world mostly by observation in the first
few months because babies really can't
uh act in the world you know beyond
their own limbs in the first three or
four months and so they learn a huge
amount of background knowledge about the
world mostly just by observation and
it's a form of self-supervised learning
that I think we absolutely have to
reproduce if we want AI systems to reach
animal level or human level intelligence
so babies learn notions like object
permanence the fact that if an object is
hidden behind another one it still
exists um like uh stability like u uh
you know natural object categories
without knowing the name of it of them
and then things like intuitive physics
gravity inertia conservation momentum
you know this kind of stuff uh babies
learn this around the age of nine months
so if you show a scenario to a six-month
uh six months old of an object that
appears to float in the air like the the
the little scenario at the bottom left
um six months old babies won't be
particularly surprised but a 10-month
old baby will look at it with big eyes
like the little girl here and be really
surprised because by then they've
learned that objects that are not
supported are supposed to fall okay and
that just happened by observation a
little bit by interaction at that at
that age okay so to get to human level
AI we're going to need and we call this
we don't call this AGI at Meta U because
again human intelligence is not general
so we call this AMI advanced machine
intelligence we pronounce it AMI which
means friend in
French um and so we need systems that
learn world models mental models of the
world from observation from sensory
input so they can learn intuitive
physics and common sense and things like
that uh systems that have persistent
memory systems that can plan complex
action sequences systems that can reason
and systems that are controllable and
safe uh by design not by fine-tuning
like like current AI systems and the
only way I can think of building a
system like this is to completely change
the type of inference that is performed
by those systems so the current type of
inference that LLMs do or or neural nets
of various types is that you put an
input you run through a fixed number of
layers in the neural net and you produce
an output okay an LLM does this for
every token there's a fixed amount of
computation spent to produce each
token so the trick to get an LLM to
spend more time thinking about something
is to trick it into producing more
tokens; that's called chain of thought
um and this is heralded as a huge
progress in AI over the last few years
um so it's very limiting the type of
functions you can compute by running
signals through a fixed number of layers
in a neural
net let's say a neural net of a
reasonable size is
limited right because most tasks that
you want to solve require many steps of
computation you cannot reduce them to
just a few
steps um you know many computational
tasks are intrinsically serial
sequential not not parallel so you may
have to spend more time thinking about
more complex functions than answering
simple simple questions so the the
better way of performing inference would
be inference by optimization so
basically you have an observation you
run it maybe through a few layers of a
neural net and then you have a cost
function which itself is a neural net
that produces a scalar output and what
is going to measure is the degree of
compatibility or incompatibility between
the input and a hypothesized
output okay so now the the inference
problem becomes one of optimization
searching for an output that minimizes
its objective function. I call this
objective-driven AI, but really it's not
a new concept at all like most
probabilistic inference systems perform
inference using optimization uh I know
there is quite a few people in the room
who have worked on optimal control so
planning and optimal control motion
model predictive control for example
produces output through optimization
i'll come back to that okay so this idea
is really not new but we've forgotten
about it and I think we have to go back
to it we have to build systems whose
architecture is such that they can
perform inference by
optimization the output is not an output
it's a latent variable that you optimize
with respect
to um and this is really classical in uh
traditional AI if you want uh the idea
that you search for a solution among a
space of possible solution solutions
that's very traditional it's just um
kind of
forgotten um the the type of
u task that can be solved this way would
be somewhat equivalent to what
psychologists call system two. So in
human behavior there are two types of producing
an action. One is called system one:
this is the kind of task that you do
kind of subconsciously, you
can take the action without even
thinking about it. And then system two is
when you have to devote your
entire conscious uh uh uh mind if you
want to the task and then plan uh uh
think about it for a while and then plan
a a sequence of actions like if you are
building something for example and
you're not used to that task you you
will use system two when you're proving
a theorem you're certainly using system
two
so what is the best way to sort of uh
formally represent what this process of
inference by optimization is and it
corresponds to basically um this idea of
energy based models so an energy based
model is a model that u computes a
scalar uh number that measures the
degree of incompatibility between an
input X and a candidate output Y and
performs inference by minimizing this
energy with respect to Y. Okay, I'm going
to call this energy function
F. Why F and not E, like energy? Because it's F
like free energy. We're getting
closer to the Gibbs kind of thing here.

So that's the inference process.
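A bare-bones sketch of what inference by optimization looks like, assuming some differentiable scalar `energy(x, y)` network (hypothetical; plain gradient descent on y stands in for whatever optimizer one would actually use):

```python
import torch

def infer(energy, x, y_dim, steps=200, lr=0.1):
    """Inference by optimization: search for the y that minimizes the scalar
    energy F(x, y) for a fixed observation x, by gradient descent on y."""
    y = torch.zeros(y_dim, requires_grad=True)   # candidate output, treated as a variable to optimize
    opt = torch.optim.SGD([y], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        f = energy(x, y)                         # scalar incompatibility between x and y
        f.backward()
        opt.step()
    return y.detach()
```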
now modeling a dependency between two
variables through a scalar energy
function of this type is much more
general than just learning a function
from X to Y and the reason is that there
might be multiple Ys that are compatible
with X for example if the problem you're
trying to solve here is translating from
English to French there's many ways to
translate a particular English sentence
into French that are all
pretty good, so they should all have low
energy okay to indicate that those two
things are compatible for the
translation task um but there's no like
single uh output um that is
correct so basically I'm talking about
implicit functions here right represent
a dependency between variables through
an implicit function not an explicit one
i mean a very simple concept
surprisingly difficult to grasp for a
certain type of computer scientist
um okay so
um how how can we use those energy based
models in the context of an intelligent
system that might be able to plan
actions okay and this is kind of a
diagram a block diagram perhaps of the
internal structure of this uh energy
energy function scalar energy function
so in this diagram round shapes
represent
variables either observed or latent um
modules that are like flat on one end
and rounded at the other end represent
deterministic functions let's say a
neural net that produces a single
output um and uh rectangles represent
objective functions basically scalar
output the the output is implicit here
but scalar scaled valued functions um
that you know takes a low value when the
input is acceptable and larger values
when it's not so here you can have two
uh type of objectives one that measures
to what extent the system accomplishes a
task that you want it to accomplish and
and another set of objectives that maybe
u guard rails so things that prevent the
system from
doing stupid things or dangerous things
or self-destructive or dangerous for
humans around okay so uh observe the
state of the world run you through a
perception module that produces a
representation of the current state of
the world now of course you don't have a
complete perception of the state of the
world so you might want to combine this
with the content of a memory that
contains your idea of the rest of the
state of the world that you might have
in your in your memory combine those two
things and feed them to a world model
and what the world model is supposed to
do is predict the outcome of taking a
particular sequence of actions okay so
the action sequence is in the the yellow
uh variable box
and the world model
predicts a representation of the world
that will result from taking the
sequence of actions now feed this
predicted representation to your
objectives and then through since all
those modules are differentiables
they're all neural nets um you can back
propagate gradients through uh the
action sequence and by gradient descent
or something of that type gradient based
optimization find an action sequence
that minimizes the objectives and that's
just
planning okay so that's a a process by
which a system would be able to perform
inference through optimization but it
needs to have this kind of mental model
of the world to be able to predict what
the consequences of its actions are
going to be now this is a very classical
view in optimal control the idea that
you have some sort of model of the world
or the system you want to control um and
you feed it a sequence of actions and
you can sort of predict you know you you
have I don't know a rocket you want to
shoot to the the space station and what
you have is a dynamical model of the of
the rocket and you know you can
hypothesize a sequence of controls and
then predict whether a rocket is going
to end up and you can have a cost
function that measures to what extent
the rocket is near or far from the space
station uh and then by optimization
figure out a sequence of controls that
will get the rocket to the space station
Okay, very classical. It's called model
predictive control; it's used everywhere
in optimal control, robotics, and it's
been used for rocket trajectory planning
since the 60s.
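A toy version of this planning-by-optimization loop, assuming a differentiable `world_model(state, action)` and a scalar `cost(state)` (both hypothetical stand-ins, not the systems described later in the talk): the action sequence itself is the variable being optimized.

```python
import torch

def plan(world_model, cost, s0, horizon=10, action_dim=2, steps=100, lr=0.05):
    """Model-predictive-control-style planning: roll the world model forward
    through a candidate action sequence, score the predicted states, and
    back-propagate gradients into the actions themselves."""
    actions = torch.zeros(horizon, action_dim, requires_grad=True)
    opt = torch.optim.Adam([actions], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        s, total_cost = s0, 0.0
        for a in actions:                  # predict the consequence of each action in turn
            s = world_model(s, a)
            total_cost = total_cost + cost(s)
        total_cost.backward()              # gradients flow through the world model into the actions
        opt.step()
    return actions.detach()
```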
um now of course um the world is not
entirely deterministic so um your world
model may require latent variables which
are variables that you don't know the
value of nobody is telling you what
values they take and they could take a
number of different values maybe they
can they're drawn from a distribution
and they might produce multiple
predictions so planning under
uncertainty with a world model that has
latent variables that basically
represent everything you don't know
about the world or that you know would
allow you to predict would be a good
thing but that's a not a solved
problem and what we actually want to do
is hierarchical planning
so all of us do this animals can do this
no AI system today can learn how to do
hierarchical planning we can get them to
do hierarchical planning by building
everything by hand but um no system
really knows how to do hierarchical
planning so let's say I'm sitting in my
office at NYU and I decide to go to
Paris
i'm not going to be able to plan my
entire trip from my office at at NYU to
Paris in terms of millisecond by
millisecond muscle control which is kind
of the lowest level actions I can take
okay I can't do it because first of all
it's too long of a sequence second of
all I don't even have the information
right i don't I don't have uh full
knowledge of you know whether the
traffic light is going to be red or or
green so do I need to you know plan to
to stop or cross the street
but at a high level I can have a high
level prediction mental of my mental
model that says if I want to go to Paris
I need to go to the airport and catch a
plane okay now I have a sub goal going
to the airport how do I go to the
airport i'm in New York so I can go down
in the street and hail a taxi how do I
go down in the street well I'm sitting
at my desk so I need to uh stand up and
go to the elevator push the button and
then walk out the wind the the building
how do I go to the elevator i need to
stand up from my chair pick up my bag
open the door walk to the elevator
avoiding all the obstacles on the way
okay and at some level uh when you go
down the hierarchy you're going to get
to a level where you're going to be able
to plan your millisecond by millisecond
muscle control because you have all the
information in front of you right so
standing up and opening the door I can I
can plan uh ahead of time so
this problem of learning world models
learning hierarchical world models
learning abstract representations of the
world allow us to make those predictions
at at you know long range or short range
so that we can do planning nobody has
any idea or of how precisely how to do
this how to make this work um so if we
assemble all of those pieces that I told
you about we we end up with something
called cognitive architecture for AMI
um you know consisting of a world model
various objective functions an actor
that is the the the system that
optimizes the action to minimize the the
cost, a short-term memory (in the brain this is
the hippocampus), a perception module
that's the entire back of the brain um a
configurator or let's forget about this
I I wrote a long paper about this
two and a half years ago, which I
put on OpenReview (it's not on arXiv),
where I explain where I think AI
research should go if we want to kind of
make progress in that direction u this
was before the recent you know
excitement about LLM although LLMs were
already around, but I never believed
they were the answer to human-level AI.
okay so um how do we get systems to
learn mental models of the world from
sensory input like video
Can we use this idea of autoregressive
prediction, you know,
similar to what I explained before that
LLMs were were using uh to train a
generative architecture to predict
what's going to happen in a video
predict the next few frames in a video
for example and the answer is no it
doesn't work I've tried to do to work on
this for 20 years complete
failure it doesn't work it works for
discrete symbols for predicting discrete
symbols because handling uncertainty in
prediction is simple you produce a
vector of probabilities a bunch of
numbers between 0 and one that sum to
one and that's how you handle
uncertainty the problem now is how do
you
predict a video frame in a high
dimensional continuous
space where we don't know how to
represent you know probability density
functions in any kind of meaningful way
in uh in things like that we can
represent them as an energy function
that we then normalize uh physicists
have been doing this okay but it's
intractable most of the time for most
forms of energy functions uh if you take
e to the minus the energy function and
normalize it the normalization constant
is
intractable um
so so this idea of using generative
models for training a system to predict
videos doesn't work a lot of people are
working on it at the moment but what
they are interested in is not learning
world models it's actually generating
videos if you want to generate videos of
course you should do this generating
cute videos but if you want your system
to really understand the underlying
physics of the world that's that's a
losing proposition and the reason is if
you train a system to make a single
prediction okay which is what generative
models do um what you get are blurry
predictions
essentially u because the system can
only predict the average of all the
possible futures that may happen um so
my solution to this is something called
JEPA that that stands for joint
embedding predictive architecture and
that's what it looks
like okay you may not immediately spot
the difference with the generative
architecture so let me make that more
obvious on the left generative
architectures
the function that you're minimizing
during training is basically a
prediction error right so predict
uh Y observe X observe Y during training
and just train a system to predict Y
okay this is like supervised learning
except Y is a is part of X if it's a
sequence okay so
self-supervised um this works for
discrete Y doesn't work for continuous
high dimensional
Y on the right you have the joint
embedding predictive architecture so now
both X and Y are run through encoders
and what the encoders do is that they
compute an abstract representations
representation of both X and Y okay the
encoders might be different um and then
you make the prediction you perform the
prediction in that representation space
now this is a much easier problem to
solve in many ways because there are
many details in the world that are
completely
unpredictable and what a JPA
architecture will do is basically find
an abstract
representation of the of the world so
that all the stuff that cannot be
predicted is eliminated from that
representation think of the encoder
function as some sort of function with
invariance so that uh the the the
variability of the input y uh that
correspond to things you just cannot
predict are eliminated in the
representation space right so for
example if I if my video is a video of
this room and I point the camera at the
left side I turn slowly and I stop the
camera here and I ask the system predict
what's going to happen next in the video
it can certainly predict that there's
going to be people sitting in seats etc
it cannot possibly predict where
everybody is sitting what everybody
looks like what the texture of the
ground is um or the or or or on the
walls okay there are many things that
you just cannot predict you just don't
have the information and so instead of
uh spending a huge amount of resources
attempting to predict things that you
don't have enough information for just
eliminate it from the prediction process
by learning a representation where
those details are
eliminated.
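A bare-bones sketch of the objective being described, with hypothetical linear encoders and predictor standing in for the real networks (not the actual Meta models): the prediction error is paid in representation space rather than in input space.

```python
import torch
import torch.nn as nn

class JEPA(nn.Module):
    def __init__(self, x_dim, y_dim, rep_dim):
        super().__init__()
        self.enc_x = nn.Linear(x_dim, rep_dim)    # encoder for the observed context x
        self.enc_y = nn.Linear(y_dim, rep_dim)    # encoder for the target y (may differ from enc_x)
        self.pred = nn.Linear(rep_dim, rep_dim)   # predictor operating in representation space

    def forward(self, x, y):
        sx, sy = self.enc_x(x), self.enc_y(y)
        # The error is measured between representations, not raw inputs, so
        # unpredictable details of y can simply be absent from sy.
        return ((self.pred(sx) - sy) ** 2).mean()

# Caveat from the talk: minimizing this alone can collapse (constant sx and sy);
# it needs a regularizer such as the VICReg terms sketched further below.
```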
So there are technical difficulties with this. Okay, so the
conclusion to this is if what I'm
claiming is right we're much better off
using the JEPA architectures than
generative architectures we should
abandon generative architectures
altogether everybody is talking about
genai and I'm telling
people abandon generative
AI that makes me super popular i can
tell you
um particularly among my colleagues who
spend a huge amount of
effort actually building genai systems
in fact their whole organization is
called genai um okay so I mean there's
different flavors of those things with
latent variables etc but I'm I'm I'm not
going to go into those uh those details
um but there is an issue which is how
you train those things okay so basically
training a system like this to learn
dependencies consists in learning an
energy function in such a way that the
energy function takes low
values on your training samples okay so
at a point XY where you have data the
energy should be low but then the energy
should be higher everywhere else so
imagine XY kind of lie on some manifold
um you want the energy function to be
let's say zero on the manifold and then
to kind of gradually increase as you
move away from the
manifold um and the problem with this is
that I only know of two ways of training
systems like this um if this energy
function is is parameterized in such a
way that allows it it allows it to take
a lot of different shapes uh you can
have an issue which is that it can
collapse if you just make sure that the
energy is low around the training
samples and you don't do anything else
you might end up with an energy function
that's completely flat that's called a
collapse so there are two methods to
prevent collapse one is you generate
contrastive samples those blinking green
points and you push their energy up okay
so you figure out some loss function
which given a set of training samples
and a set of contrastive samples are
outside the data manifold uh is going to
push down the energy of the observed
samples and push out the energy of the
contrastive samples okay those are
contrasting methods and the problem with
them is that they don't work really they
don't work very well in high dimension
because the number of contrastive
samples you need to generate goes
exponentially with the dimension of the
space so there's an alternative that you
could call regularized methods and what
those methods are based on is
essentially uh coming up with some
regularizer function that if you
minimize it will minimize the volume of
space that can take low energy that has
low energy that sounds a bit mysterious
of how you do this but in fact there is
a lot of things that have
uh done this in the context of uh
applied mathematics for example in
sparse coding that's exactly what sparse
coding does when you specify a latent
variable you uh you basically minimize
the volume of space that can take low
energy reconstruction energy okay so
those two methods contrasting methods
and regularized
methods um there's different types of
architectures that can collapse or not
i'm going to Okay this is the Gibs
lecture so I have to mention Gibbs um
this is how you turn an energy function
into a probability distribution you use
the Gibbs-Boltzmann distribution. Okay, so
take the exponential of minus the energy,
multiplied by some constant which is akin
to an inverse temperature, and then
normalize by the integral of this over
the entire domain, and what you get is a
properly normalized
conditional probability
distribution of y given x.
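In symbols, using the talk's free energy F and an inverse temperature β:

```latex
p(y \mid x) = \frac{e^{-\beta F(x, y)}}{\int e^{-\beta F(x, y')} \, dy'},
\qquad
-\log p(y \mid x) = \beta F(x, y) + \log \int e^{-\beta F(x, y')} \, dy',
```

where the second term is the log of the partition function, which is what makes this training objective intractable in general.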
And if you really are insisting on doing
probabilistic modeling, the way you
should train your energy-based model is
by basically minimizing the negative log
of that conditional probability over
your training set the problem is that
that's intractable because the the
partition function the normalizing term
is generally completely intractable so
you would have to use approximations
like you know variational approximations
or Monte Carlo approximation and I tell
you a good chunk of the machine learning
community has been
um spending a lot of effort trying to do
this you know getting inspiration from
physics and from statistics and and
various other things um I've I've made a
a chart i don't expect expect you to
read any any of it here but this is sort
of various classical methods as to
whether they are regularized or or
contrastive um so those methods whether
contrastive or regularized have been
extremely successful to basically
pre-train a vision system to learn
representations of images in a
self-supervised manner um and the idea
for this goes back to the early 90s a
paper of mine uh from 1993 and a couple
more in the mid 2000 uh with some of my
students uh at the time uh there's more
recent papers from from Google and a lot
of people have been working on
contrastive methods. You may have heard of
a model called CLIP, which is produced by
OpenAI for learning visual features
using text supervision; it's also a
contrastive uh method but again it
doesn't scale very well with the
dimension so I prefer regularized
methods and the question is how do you
make this work um one way to make this
work is
to prevent the system from
collapsing. So what does a collapse
produce here? A collapse will consist in
minimizing only the prediction error,
D(sy, s̃y),
and doing nothing else, and what the
system can happily do is completely
ignore x and y, produce
constant sx and sy, and then your
prediction problem is trivial: your
prediction error is zero all the time.
Okay, but you have a collapsed model that
doesn't do anything for you so one way
to prevent this from happening and
that's basically a regularization term
is attempting to maximize the
information content coming out of the
encoder, or the two encoders. So
have some estimate of information
content, I(sx) and I(sy), put a minus
sign in front, and minimize
that now that's a challenge um because
we don't know how to maximize
information content we know how to
minimize it because because we have
upper bounds on information we don't
have lower bounds on
information and so what we're going to
do is come up with some approximation
making some assumptions about
information
content knowing that it's actually an
upper bound on information content and
we're going to push it up and cross our
fingers so that the actual information
content actually
follows okay and it works it's not well
justified but it's better justified than
everybody anything else that people have
done so
um so it would be nice if we could come
up with uh lower bounds on information
content but frankly I don't think it's
possible
uh because there may be complicated
dependencies that you know you you don't
know the nature of and um and so it
doesn't work so the basic idea of you
know how do you kind of put a number in
a sort of differentiable objective
function on information content and the
basic idea is
um make the representations coming out
of your encoder fill the space okay and
that idea has been proposed by multiple
people almost simultaneously in
different contexts um and there are
basically two ways to do this so the the
contrastive methods are should be called
really sample contrastive right so take
a matrix of vectors coming out of your
encoder for a number of
samples contrasting methods attempt to
make the vectors coming out of the
encoder all
different okay so imagine they're all on
the surface of a sphere because you
normalize them you're basically pushing
all those vectors away from each other
so they fill up the
space it doesn't work too well i mean
you need basically a lot of rows for
this to work um to do something useful
if you have a small number of rows then
it's very easy to have random vectors
you know pointing orthogonal directions
so you need a lot of rows for this to
work. So the converse is dimension-contrastive
methods,
um where you take the columns of that
matrix and you try to make the
columns different from each other maybe
orthogonal to each other and that only
works if you have a small number of rows
relative to the dimension because
otherwise it's too easy you have a small
number of high dimensional vectors that
need to be orthogonal i mean take them
randomly they'll almost be orthogonal
okay so you have kind of a duality
between those two things in fact we have
a paper on the fact that those two
things are dual of each other um but I
prefer the second one because uh they
can deal with
highdimensional representation space
whereas the first one really
can't. So what we've come up with is
something we call VICReg, which means
variance-invariance-covariance
regularization, and basically the idea
there is you take the sx representation
and you have a cost function that
insists that, over a batch of samples, the
variance of each variable is at least
one, using a hinge
loss, and then a second cost that insists
that the off-diagonal terms of the
covariance matrix, so basically this
matrix
transposed and multiplied by itself,
that the off-diagonal terms of the
covariance matrix are as close to zero as
possible. So basically you are trying to
decorrelate the individual variables,
make the columns of that representation
matrix
orthogonal.
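A compact sketch of the variance and covariance terms just described (simplified; the published VICReg loss also includes the invariance/prediction term and weighting coefficients):

```python
import torch

def vicreg_regularizer(s, eps=1e-4):
    """s: a (batch, dim) matrix of representations from an encoder.
    Hinge the per-dimension standard deviation toward at least 1, and push the
    off-diagonal entries of the covariance matrix toward zero (decorrelation)."""
    s = s - s.mean(dim=0)                                 # center each column
    std = torch.sqrt(s.var(dim=0) + eps)
    variance_loss = torch.relu(1.0 - std).mean()          # hinge: keep each variable's variance up
    cov = (s.T @ s) / (s.shape[0] - 1)                    # (dim, dim) covariance matrix
    off_diag = cov - torch.diag(torch.diag(cov))
    covariance_loss = (off_diag ** 2).sum() / s.shape[1]  # penalize off-diagonal correlations
    return variance_loss + covariance_loss
```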
Other people have had kind of similar ideas at Berkeley, and
some of my colleagues at NYU, with a
method called MMCR.
um so um and we have some theorems that
prove that in certain conditions when
you apply this criterion to decorrelate
uh after going through some nonlinear
function what you're actually doing is
making uh the the variables pair-wise
independent not just uncorrelated which
is uh interesting but still you know
this is all kind of a little uh
wishy-washy so a lot of challenges I
think for astute mathematicians
Um now I'm going to skip this about
latent variables because I don't have
time i'm just going to show you the last
slide about this because again that has
to do with Gibbs um the fact that if you
want to minimize the information content
of of a variable which you need to do if
you have a latent variable, a good way to
do this is to make it noisy so instead
of inferring a single value for this
variable you infer a distribution over
this variable you maximize the entropy
of that distribution and the best way to
do this is by writing down what's called
a variational free energy and
minimizing that, and, you know, Gibbs
had something to to say about
that. Okay, this I'm also going to
skip for lack of time. You can
actually use this VICReg technique
and apply it to PDEs, not solving them, but
like, for example, finding the
coefficients of a particular PDE by just
looking at windows on the solution. So
basically take the space-time solution
of a PDE, take two windows, and train a
system to produce a representation of
those two windows
using this VICReg criterion, by basically
forcing the two representations to be
identical regardless of the pair of
windows, and the only thing that the
system can extract which is common
between different windows in the
solution of a PDE are basically the
coefficients of the PDE, the representation of
the equation of the differential
equation itself and when you I don't
have time to explain but when you apply
this to various situation it actually
works if you want more details about
this you might want to talk to uh Rand
Bisrio who is sitting somewhere
here where are
you right here okay uh he's one of the
main authors of that paper um and he can
give you some details uh the the bottom
line is that when you use this VICReg
stuff to learn the coefficients of a PDE,
you actually get a better prediction
than if you train it in supervised mode,
which is kind of interesting. Okay,
there's another set of methods,
alternative to VICReg, called
distillation-based methods, and we use
them because they work well, but I don't
like them because they're
even less theoretically justified than
the VICReg information
maximization techniques, and I'm not
going to go into the details of of how
they work um but you're minimizing a
function that is not actually being
minimized by gradient descent it's it's
a mess there are some theoretical papers
on this i I listed one at the bottom
here um but uh only in the case of kind
of linear encoders and predictors and uh
really it's not such a satisfying method
but it works really well and a lot of
people have been using it for uh
learning features of images in a
self-supervised manner. There is a
technique called JEPA which I won't have
time to describe in detail but it it
does a really good job at kind of
learning representations of images that
you can use then for a subsequent task
uh that would be supervised but without
requiring too many labelled samples um
And then there is a video version of
this called video JEPA. So you take a
video, you mask a big chunk of it, at
all times for example, and then you train
some gigantic neural net, a JEPA
architecture, to predict the internal
representation of the full video from
the representation of the partially
masked one and what you get in the end
is a system that does a really good job
at representing videos you can use that
representation as input to a system that
can classify the action taking place in
the video and and doing things of that
type and I'm not going to bore you with
tables of results but it works really
well and one really interesting thing
about this technique this is a paper
that u uh we just finished and we're
we're submitting um when you test those
systems and you measure the prediction
error they're they're doing uh on a
video if you show them a video that is
physically impossible like an object
disappears or changes shape
spontaneously, it tells you,
that can't happen, my prediction error
goes through the roof. And so this
system has kind of learned a
very basic form of common sense, a little
bit like, you know, the babies I was
talking about earlier. I mean, that's
really a surprising result, because the
system is really not trained to
predict, it's just to kind of fill in
missing parts um and we've been using
u kind of self-supervised encoders and
predictors to do planning so I'm I'm
coming to this idea of world model so
let's say you have a picture of a state
of the world and the system can control
a robot
arm and u you
want the system to basically act in such
a way that the final state of the world
looks like u you know a particular
target um so let's say you have a bunch
of blue chips on the table and you want
to kind of move a robot arm so that at
the end the blue chips are all within a
a nice little square as represented here
in the middle um and so um train um an
encoder so we use dynov2 which is a
pre-trained encoder and then train a
world model to predict what is going to
happen at the representation level when
you take a particular action can you
predict the resulting effect uh on the
on that board with blue chips um and
then once you have that world model
which you can train from you know random
data with random actions and random blue
chips can you use it to plan a sequence
of actions so as to arrive at a
particular result. And I'm going to
cut to the
chase. So, I mean, there's various
problems that we've applied this to and
it works pretty well for planning
several things but here is the result
for the the blue chips okay so what you
see here is a video uh you don't see the
actions of the robot arm but it's
actually taking actions and what you see
at the top is what's happening in the
world and what you see at the bottom is
what the system predicts is going to
happen in its own internal world model
we trained a separate decoder to kind of
produce an image of what the internal
thinking of the system is let me play
this again
um so at the bottom here you see kind of
the uh configuration kind of progressing
as the robot kind of pushes things
around and then the end state is not
exactly a square but it's pretty close
and this is a very complex dynamical
system where the chips are interacting
with each other you you could not you
know really kind of model this uh
sufficiently accurately to kind of do
planning in uh with a hand constructed
model really uh we've kind of similar
work uh for planning and navigation in
in real world i'm going to skip this
because I'm running out of time so my uh
recommendations um abandon generative
models in favor of the joint embedding
architectures abandon probabilistic
models in favor of energy based models
abandon contrastive methods in favor of
regularized methods uh abandon
reinforcement learning i've been saying
this for a decade now if ever of model
predictive control and planning
um and if you are interested in human
level AI just don't work on
Labs in fact if you are a PhD student in
AI you should absolutely not work on Lab
because you're putting yourself in
competition with enormous teams with
tens of thousands of GPUs at their
disposals you're not going to be able to
contribute anything. Okay, problems to
solve: large-scale world models, how do you
train them on multimodal inputs, planning
algorithms um I think there's a lot of
uh expertise perhaps in the room here in
optimal control in sort of various ways
to perform optimization uh when you
perform use gradient based method for
planning uh you get hit by all kinds of
local minima issues and non
differentiability and all that kind of
stuff, so maybe methods like ADMM.
JEPA with latent variables, and
planning with uh nondeterministic
environment uh regularizing latent
variables uh uh hierarchical planning uh
etc and so you know what are the
mathematical foundations of energy based
learning when we're stepping out of
probabilistic learning we're getting
into the
unknown and characterizing what is
appropriate to do there is not clear uh
learning cost modules I didn't talk
about this but that's also an issue uh
planning with inaccurate world model and
how you adjust the world models
um okay so maybe if we solve all those
problems within the next decade or or
half decade we're going to be on a good
path towards building systems that are
truly intelligent that are capable of
planning and reasoning um and I think
the only way for this to work is if the
platforms are open source i've been a
big advocate of open source AI and I I
really believe in this um but if we
succeed maybe um AI will be sort of a
big amplifier of human intelligence that
can be only good thank you very much
[Applause]
[Music]