Yann LeCun "Mathematical Obstacles on the Way to Human-Level AI"
By Joint Mathematics Meetings
Summary
## Key takeaways

- **LLMs are doomed, not the path to human-level AI**: Autoregressive Large Language Models (LLMs) are fundamentally flawed due to their error-prone, exponentially divergent prediction process. This inherent limitation means they are not a viable path towards achieving human-level artificial intelligence. [11:56]
- **AI struggles with the physical world, unlike cats**: Current AI systems, despite passing exams and solving complex math problems, fail to grasp the physical world as effectively as a cat. This is a significant obstacle, as understanding intuitive physics, causality, and object permanence is crucial for advanced AI. [12:48], [12:52]
- **Sensory data dwarfs text for true AI understanding**: The sheer volume of information available through human senses (vision, touch, audition) vastly exceeds that found in all human text. To achieve human-level AI, systems must learn from observing the world, not just from text data. [16:11], [16:39]
- **Inference by optimization is key for advanced AI**: Instead of fixed-layer computations in LLMs, AI should adopt inference by optimization. This involves searching for solutions that minimize a cost function, similar to model predictive control, enabling more complex reasoning and planning. [20:29], [20:34]
- **JEPA architectures outperform generative models for AI**: Joint Embedding Predictive Architectures (JEPA) are superior to generative models for AI development. JEPA focuses on learning abstract representations by predicting in a representation space, filtering out unpredictable details, unlike generative models which often produce blurry predictions. [34:01], [34:47]
- **AI needs world models for planning and common sense**: To achieve advanced machine intelligence (AMI), systems require world models learned from sensory input. These models enable intuitive physics, common sense reasoning, and hierarchical planning, allowing AI to predict consequences and act effectively. [18:42], [26:30]
Topics Covered
- AI can pass the bar but can't empty a dishwasher.
- A child sees more data than all internet text.
- The future of AI is objective-driven, not generative.
- Why we must abandon generative AI.
- Autoregressive LLMs are doomed. Here is why.
Full Transcript
good
afternoon, welcome everybody. I'm Bryna Kra,
and I'm president of the AMS, and it's my
pleasure to welcome all of you to this
AMS Josiah Willard Gibbs lecture. These
lectures were established in
1923 to show the public some of the
aspects of mathematics and its
application. They are named in honor of
Gibbs, a mathematical physicist
who made deep theoretical contributions
in physics, chemistry, and mathematics,
something all of us do, right? Yeah. Gibbs
is one of the founders of statistical
mechanics, he coined the term, and of
vector calculus. His influence was so
profound that he was even featured
on a US stamp in 2005.
The list of previous Gibbs
lecturers is a veritable who's
who in broad mathematics, including G. H.
Hardy, John von Neumann, and Albert
Einstein so today it's my pleasure to
add another one to this list and
introduce Yann LeCun for the Josiah Gibbs
lecture. Yann is the Jacob T. Schwartz
professor of computer science, data
science, neuroscience, and electrical and
computer engineering, competing with
Gibbs in terms of fields, at the Courant
Institute of Mathematical Sciences at
New York University, and he is the chief AI
scientist at Meta. He is well known for
his work in numerous areas, especially in
computer vision and deep learning, and
he's one of the developers of things you
may be using all the time: the DjVu
format that is probably somewhere on your
computer. He's won numerous awards; I
won't list them all, because then there
would be no time for his lecture, but I
will just note that he won the Turing
Award recently, in 2018, and he's a member
of numerous academies, including the National
Academy of Sciences, the National Academy
of Engineering, and the French Academy of
Sciences. Without further ado, let me
introduce Yann
[Applause]
All right, now that Bryna has listed all
the prominent speakers of the Gibbs
lecture, I'm intimidated,
and I don't believe I'll kind
of fill the shoes of those
names. But let me talk about AI:
obviously everybody is talking about AI
and particularly about obstacles towards
human level AI so a lot of people in the
AI research and development community
are perceiving the idea that perhaps we
have a shot within the next decade or so
of building machines that have a
blueprint that might um eventually reach
kind of human level intelligence uh the
estimates about how long it's going to
take vary by huge amounts u the most
optimistic people say we've we're
already there some people who are
raising lots of funds are claiming it's
going to happen next year but I don't
believe so myself um but I think we have
a good shot and and so I'm going to tell
you where I think uh research in AI
should go and what are the obstacles and
some of them are really mathematical
obstacles um okay
so why would we need to build AI systems
with human level intelligence and it's
because uh you know in the near future
we're all going to be walking around
with AI assistants um helping us in our
daily lives that that we're going to be
able to interact with uh through various
smart devices including smart glasses
and things like that through voice and
through various other ways of uh
interacting with them um so we have
smart glasses with cameras and displays
in them etc currently you can have smart
glasses without displays but soon um the
displays will will exist right now they
exist they're just too expensive to be
commercialized uh this is the Orion uh
demonstration built by our colleagues at
at Meta um so the the future is is
coming and the vision is that you know
all of us will be basically walking
around with AI assistants uh all our
lives it's like you know all of us will
be kind of like a uh you know high level
uh CEO or politician or something
running around with a staff of smart
virtual people working for us that's
kind of the a possible
picture um but the problem is we don't
know how to build this yet and and
really
um the current state of machine learning
is that it
sucks i mean in terms of learning
abilities compared to humans and animals
um it it really it's really very
inefficient in terms of the number of
samples or trials that machines have to
go through before they can reach a
particular level of performance um so in
the past the dominant paradigm of
machine learning was supervised learning
right so supervised learning is you give
an input to the system you wait for it
to produce the output then you tell it
what output you wanted and if the output
you wanted is different from the one
that the system produces the system
adjusts its internal parameters to get
the the output closer to the one you
want okay it's just learning an input
output uh uh function uh reinforcement
learning you don't tell the system what
the correct answer is to just tell it
whether it's good or bad whether the
answer it produced was good or bad and
the main issue with this is that it
requires the system to basically produce
multiple outputs and ask you is this
good is this bad is this better um and
that's even less efficient um so it only
works for games basically or for things
that you can simulate really quickly on
a computer um so one thing that has
revolutionized AI in the last few years
is called self-supervised learning and
it works really great it's really really
revolutionized
um AI um but it's still very limited so
self-supervised learning is the basis of
large language models and chat bots and
things like this and I'm going to tell
you in a minute how it
works um but really animals and humans
can learn new task extremely quickly and
they can understand how the world works
um they can reason and plan they have
common sense um and and the behavior is
really driven by objective it's not just
kind of predicting the next word in a
text okay so how how does those um those
chatbot and LLMs work and I only have
two slides on it and then I'm not going
to talk about it at all. Okay, so autoregressive
large language models:
they're trained to predict the next word
in a sequence, or the next symbol in a
sequence of symbols. They can be
words, they can be DNA,
music, protein, whatever discrete
symbols okay so you take a sequence of
symbols you feed it to a large neural
net and the architecture of the neural
net is such that the system is
trained to basically just reproduce its
input on its output; this is called an
autoencoder. Okay, so you just take the
input and then tell the system I just
want you to reproduce your input on your
output but the architecture of the
system is such that to produce one
particular variable the system can only
look at the variables that are to the
left of it in the sequence it cannot
look at the variable that it needs to
predict okay so basically what you're
training it to do by doing this you're
training it to predict the next symbol
in a sequence okay but you do this in
parallel over a large sequence
um so you measure some sort of diver
divergence between the input sequence
you feed it and the output sequence it
produces and you minimize that uh
divergence measure through
gradient basically gradient based
optimization with respect to all the
parameters inside of the uh predictor
function which is some gigantic neural
net which may have tens or hundreds of
billions of parameters okay this is
really high dimension okay once you've
trained that system uh when you take a
sequence and you run it through it the
system is going to predict the next
symbol um okay so let's say the the
window over which it looks at at symbols
here is three in reality in an LLM it
can be several hundred thousand but
let's say three uh so you feed three
words to that system and it produces the
next uh the next word now of course it
cannot predict exactly the next word so
what it produces is a probability
distribution over all the possible words
in a
dictionary and typically in LLM we're
not we don't actually train it to
produce word we train it to produce
tokens which are like subword units and
a typical number of possible tokens
would be on the order of
100,000 okay so now when you use the
system you feed it a sequence of words
called a prompt you have the system
predict the next word and then you take
that and you shift it into the input so
now you can ask the system what is the
second next word have you produce it
shift it into the input produce a third
word shift it into the input so that's
basically autoregressive prediction,
a very old concept, obviously, in signal
processing and statistics.
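A minimal sketch of this decoding loop, assuming a hypothetical `model` callable that maps a window of tokens to a probability vector over the vocabulary (an illustration of the generic procedure, not any particular LLM's API):

```python
import numpy as np

def generate(model, prompt_tokens, n_new, context_len=3, seed=0):
    """Autoregressive decoding: predict a distribution over the next token,
    sample from it, shift the sample into the input window, and repeat."""
    rng = np.random.default_rng(seed)
    tokens = list(prompt_tokens)
    for _ in range(n_new):
        window = tokens[-context_len:]     # the last `context_len` tokens (hundreds of thousands in a real LLM)
        probs = model(window)              # probability vector over the vocabulary (~100,000 tokens)
        next_token = rng.choice(len(probs), p=probs)
        tokens.append(int(next_token))     # shift the prediction into the input
    return tokens
```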
and it works really well it works
amazingly well if you make those uh
neural nets really large you train them
with very large input windows on tens of
trillions of tokens data sets that are
have tens of trillions of tokens
sequences with tens of trillions of
tokens it works amazingly well those
systems seem to discover a lot of
underlying structure about language or
about whatever sequence of symbols
you're training on. But there's a major
issue with autoregressive prediction,
and, you know, mathematicians in
the room here would probably do a much
better job than me at kind of writing
proofs about this, but autoregressive
prediction is kind of a divergent
process right if you imagine that you
have those those symbols are discrete so
every time you produce a symbol you have
multiple choices maybe 100 thousand
choices and you can think of the
sequence of all possible tokens
as some gigantic tree with a branching
factor of 100,000 okay within this
gigantic tree there's a small sub tree
that correspond to all answers that
could be qualified as correct okay all
continuations that you would think are
qualified as correct so if the prompt is
a question you know the answer would be
the the produced text would contain the
answer
now that sub tree is a tiny subset of
the the gigantic tree of possible
sequence of symbols and the problem is
if you assume which of course is false
that uh there is some sort of
probability of error every time you
produce a symbol and you assume those
errors are
independent, and that probability is e,
then the probability that a sequence
of n symbols is correct is (1 − e)^n.
Even if e is really small, this decays
exponentially with n, and it's not
fixable within the context of autoregressive
prediction. So my prediction
is that autoregressive LLMs are
doomed; a few years from now nobody in
their right mind will use them.
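Written out, under the speaker's (admittedly false) assumption of independent per-token errors with probability e:

```latex
P(\text{answer of } n \text{ tokens stays correct}) = (1 - e)^{n} \approx e^{-en},
\qquad \text{e.g. } e = 10^{-3},\; n = 10^{4} \;\Rightarrow\; (1 - e)^{n} \approx 4.5 \times 10^{-5}.
```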
And that's why, you know, you've
heard about LLM hallucination and things
like that, you know, sometimes they
produce nonsense, and it's essentially
because of this autoregressive
prediction so the question is what
should we replace this by and and you
know are there other types of limitation
now so I think we're missing something
really big in terms of like a new
concept of how to build AI systems we're
never going to get to human level AI by
just training large language models on
bigger data sets it's just not going to
happen and I'll give you another reason
why in a minute but never mind humans
never by never mind you know trying to
reproduce mathematicians or scientists
um we can't even reproduce what a cat
can do a cat has an amazing
understanding of the physical world and
I always say cat it could be a rat um
and we have no idea how to get an AI
system to work as well as as a cat in in
terms of understanding the physical
physical
world um house cats can plan really
complex actions they have causal models
of the world they know what the
consequences of their actions will be um
and humans are amazing a 10-year-old can
clear up the dinner table and fill up
the dishwasher without actually learning
the task you ask the 10-year-old to do
it the 10-year-old will do it the first
the first time it's called zero shot
learning um because the 10-year-old has
good mental model of the world and sort
of knows you know how objects behave
when you manipulate them and how they're
supposed to behave a 17-year-old can
learn to drive a car in 20 hours of
practice
and autonomous driving companies have
hundreds of thousands of training uh
training data of people driving cars
around we still don't have self-driving
cars at least not level five
self-driving cars unless we cheat way
more which is fine um so we have AI
systems they can pass the the bar exam
they can do math problems they can prove
theorems but where is my level five
self-driving car where is my domestic
robot we still can't build systems that
deal with the real world uh the physical
world turns out is much more complicated
than language
and that's called Moravec's paradox.
Right, there is this idea that
tasks that are complicated for humans
like computing an integral um solving a
differential equation um you know
playing chess or go planning a path uh
through a bunch of cities those are kind
of hard tasks for humans it turns out
computers are much better than us at
this um like they're so much better than
us at playing chess and go that really
humanity sucks um and what what that
tells you is that when people refer to
human intelligence as general
intelligence that's complete nonsense we
do not have general intelligence at all
we're extremely
specialized um okay so we're not going
to get to human level AI by just
training on text um and and there is a
interesting calculation you can make so
a typical LLM, a modern one, is trained on
something like 2×10^13
tokens, 20
trillion, and each token is about
three bytes, so that would be 6×10^13
bytes; let's round this up to 10^14
bytes.
Okay, that would take a few hundred thousand
years for any of us to read through this
that basically constitutes the entirety
of all the text available on the
internet
publicly uh so I mean that seems like an
incredible amount of training data but
now take a human child: a 4-year-old has
been awake a total of 16,000
hours, which by the way corresponds to
30 minutes of YouTube uploads, not
that much data. We have two million optic
nerve fibers, 1 million per eye, going
to the visual cortex; each optic nerve
fiber carries about one byte per second,
roughly, maybe a little less, but who
cares. So do the calculation, and
that's about 10^14 bytes in four
years. There's just enormously more
information in the physical world in
sensory information that we get from
vision and touch and
audition than there is in all the texts
ever produced by all humans
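A quick sanity check of the two order-of-magnitude figures quoted above (the per-fiber rate of about one byte per second is the speaker's rough estimate):

```python
# Text: ~2e13 tokens at ~3 bytes per token
text_bytes = 2e13 * 3                    # ≈ 6e13, rounded up to ~1e14 bytes

# Vision: 2e6 optic-nerve fibers x ~1 byte/s x 16,000 waking hours
vision_bytes = 2e6 * 1 * 16_000 * 3600   # ≈ 1.15e14 bytes in four years

print(f"text ≈ {text_bytes:.1e} bytes, vision ≈ {vision_bytes:.1e} bytes")
# Both land around 1e14 bytes, but the child gets there in four years,
# and the sensory stream keeps going.
```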
so again we're never going to get to
human level AI unless we can get system
to learn how the world works by
observing the world there's just way
more information there than there is in
text and uh psychologists have studied
this you know babies kind of learning
various things about about the real
world mostly by observation in the first
few months because babies really can't
uh act in the world you know beyond
their own limbs in the first three or
four months and so they learn a huge
amount of background knowledge about the
world mostly just by observation and
it's a form of self-supervised learning
that I think we absolutely have to
reproduce if we want AI systems to reach
animal level or human level intelligence
so babies learn notions like object
permanence the fact that if an object is
hidden behind another one it still
exists um like uh stability like u uh
you know natural object categories
without knowing the name of it of them
and then things like intuitive physics
gravity inertia conservation momentum
you know this kind of stuff uh babies
learn this around the age of nine months
so if you show a scenario to a six-month
uh six months old of an object that
appears to float in the air like the the
the little scenario at the bottom left
um six months old babies won't be
particularly surprised but a 10-month
old baby will look at it with big eyes
like the little girl here and be really
surprised because by then they've
learned that objects that are not
supported are supposed to fall okay and
that just happened by observation a
little bit by interaction at that at
that age okay so to get to human level
AI we're going to need and we call this
we don't call this AGI at Meta U because
again human intelligence is not general
so we call this AMI advanced machine
intelligence we pronounce it AMI which
means friend in
French um and so we need systems that
learn world models mental models of the
world from observation from sensory
input so they can learn intuitive
physics and common sense and things like
that uh systems that have persistent
memory systems that can plan complex
action sequences systems that can reason
and systems that are controllable and
safe uh by design not by fine-tuning
like like current AI systems and the
only way I can think of building a
system like this is to completely change
the type of inference that is performed
by those systems so the current type of
inference that LLMs do or or neural nets
of various types is that you put an
input you run through a fixed number of
layers in the neural net and you produce
an output okay an LLM does this for
every token there's a fixed amount of
computation spent to produce each
token so the trick to get an LLM to
spend more time thinking about something
is to trick it into producing more
tokens; that's called chain of thought
um and this is heralded as a huge
progress in AI over the last few years
um so it's very limiting the type of
functions you can compute by running
signals through a fixed number of layers
in a neural
net let's say a neural net of a
reasonable size is
limited right because most tasks that
you want to solve require many steps of
computation you cannot reduce them to
just a few
steps um you know many computational
tasks are intrinsically serial
sequential not not parallel so you may
have to spend more time thinking about
more complex functions than answering
simple simple questions so the the
better way of performing inference would
be inference by optimization so
basically you have an observation you
run it maybe through a few layers of a
neural net and then you have a cost
function which itself is a neural net
that produces a scalar output and what
is going to measure is the degree of
compatibility or incompatibility between
the input and a hypothesized
output okay so now the the inference
problem becomes one of optimization
searching for an output that minimizes
its objective function. I call this
objective-driven AI, but really it's not
a new concept at all like most
probabilistic inference systems perform
inference using optimization uh I know
there is quite a few people in the room
who have worked on optimal control so
planning and optimal control motion
model predictive control for example
produces output through optimization
i'll come back to that okay so this idea
is really not new but we've forgotten
about it and I think we have to go back
to it we have to build systems whose
architecture is such that they can
perform inference by
optimization the output is not an output
it's a latent variable that you optimize
with respect
to um and this is really classical in uh
traditional AI if you want uh the idea
that you search for a solution among a
space of possible solution solutions
that's very traditional it's just um
kind of
forgotten um the the type of
u task that can be solved this way would
be somewhat equivalent to what
psychologists call system two. So in
human behavior there are two types of producing
an action. One is called system one:
this is the kind of task that you do
kind of subconsciously, you
can take the action without even
thinking about it. And then system two is
when you have to devote your
entire conscious uh uh uh mind if you
want to the task and then plan uh uh
think about it for a while and then plan
a a sequence of actions like if you are
building something for example and
you're not used to that task you you
will use system two when you're proving
a theorem you're certainly using system
two
so what is the best way to sort of uh
formally represent what this process of
inference by optimization is and it
corresponds to basically um this idea of
energy based models so an energy based
model is a model that u computes a
scalar uh number that measures the
degree of incompatibility between an
input X and a candidate output Y and
performs inference by minimizing this
energy with respect to Y. Okay, I'm going
to call this energy function
F. Why F and not E, like energy? Because it's F
like free energy. We're getting
closer to the Gibbs kind of thing here.

So that's the inference process.
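A bare-bones sketch of what inference by optimization looks like, assuming some differentiable scalar `energy(x, y)` network (hypothetical; plain gradient descent on y stands in for whatever optimizer one would actually use):

```python
import torch

def infer(energy, x, y_dim, steps=200, lr=0.1):
    """Inference by optimization: search for the y that minimizes the scalar
    energy F(x, y) for a fixed observation x, by gradient descent on y."""
    y = torch.zeros(y_dim, requires_grad=True)   # candidate output, treated as a variable to optimize
    opt = torch.optim.SGD([y], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        f = energy(x, y)                         # scalar incompatibility between x and y
        f.backward()
        opt.step()
    return y.detach()
```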
now modeling a dependency between two
variables through a scalar energy
function of this type is much more
general than just learning a function
from X to Y and the reason is that there
might be multiple Ys that are compatible
with X for example if the problem you're
trying to solve here is translating from
English to French there's many ways to
translate a particular English sentence
into French that are all
pretty good, so they should all have low
energy okay to indicate that those two
things are compatible for the
translation task um but there's no like
single uh output um that is
correct so basically I'm talking about
implicit functions here right represent
a dependency between variables through
an implicit function not an explicit one
i mean a very simple concept
surprisingly difficult to grasp for a
certain type of computer scientist
um okay so
um how how can we use those energy based
models in the context of an intelligent
system that might be able to plan
actions okay and this is kind of a
diagram a block diagram perhaps of the
internal structure of this uh energy
energy function scalar energy function
so in this diagram round shapes
represent
variables either observed or latent um
modules that are like flat on one end
and rounded at the other end represent
deterministic functions let's say a
neural net that produces a single
output um and uh rectangles represent
objective functions basically scalar
output the the output is implicit here
but scalar scaled valued functions um
that you know takes a low value when the
input is acceptable and larger values
when it's not so here you can have two
uh type of objectives one that measures
to what extent the system accomplishes a
task that you want it to accomplish and
and another set of objectives that maybe
u guard rails so things that prevent the
system from
doing stupid things or dangerous things
or self-destructive or dangerous for
humans around okay so uh observe the
state of the world run you through a
perception module that produces a
representation of the current state of
the world now of course you don't have a
complete perception of the state of the
world so you might want to combine this
with the content of a memory that
contains your idea of the rest of the
state of the world that you might have
in your in your memory combine those two
things and feed them to a world model
and what the world model is supposed to
do is predict the outcome of taking a
particular sequence of actions okay so
the action sequence is in the the yellow
uh variable box
and the world model
predicts a representation of the world
that will result from taking the
sequence of actions now feed this
predicted representation to your
objectives and then through since all
those modules are differentiables
they're all neural nets um you can back
propagate gradients through uh the
action sequence and by gradient descent
or something of that type gradient based
optimization find an action sequence
that minimizes the objectives and that's
just
planning okay so that's a a process by
which a system would be able to perform
inference through optimization but it
needs to have this kind of mental model
of the world to be able to predict what
the consequences of its actions are
going to be now this is a very classical
view in optimal control the idea that
you have some sort of model of the world
or the system you want to control um and
you feed it a sequence of actions and
you can sort of predict you know you you
have I don't know a rocket you want to
shoot to the the space station and what
you have is a dynamical model of the of
the rocket and you know you can
hypothesize a sequence of controls and
then predict whether a rocket is going
to end up and you can have a cost
function that measures to what extent
the rocket is near or far from the space
station uh and then by optimization
figure out a sequence of controls that
will get the rocket to the space station
Okay, very classical. It's called model
predictive control; it's used everywhere
in optimal control, robotics, and it's
been used for rocket trajectory planning
since the 60s.
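A toy version of this planning-by-optimization loop, assuming a differentiable `world_model(state, action)` and a scalar `cost(state)` (both hypothetical stand-ins, not the systems described later in the talk): the action sequence itself is the variable being optimized.

```python
import torch

def plan(world_model, cost, s0, horizon=10, action_dim=2, steps=100, lr=0.05):
    """Model-predictive-control-style planning: roll the world model forward
    through a candidate action sequence, score the predicted states, and
    back-propagate gradients into the actions themselves."""
    actions = torch.zeros(horizon, action_dim, requires_grad=True)
    opt = torch.optim.Adam([actions], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        s, total_cost = s0, 0.0
        for a in actions:                  # predict the consequence of each action in turn
            s = world_model(s, a)
            total_cost = total_cost + cost(s)
        total_cost.backward()              # gradients flow through the world model into the actions
        opt.step()
    return actions.detach()
```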
um now of course um the world is not
entirely deterministic so um your world
model may require latent variables which
are variables that you don't know the
value of nobody is telling you what
values they take and they could take a
number of different values maybe they
can they're drawn from a distribution
and they might produce multiple
predictions so planning under
uncertainty with a world model that has
latent variables that basically
represent everything you don't know
about the world or that you know would
allow you to predict would be a good
thing but that's a not a solved
problem and what we actually want to do
is hierarchical planning
so all of us do this animals can do this
no AI system today can learn how to do
hierarchical planning we can get them to
do hierarchical planning by building
everything by hand but um no system
really knows how to do hierarchical
planning so let's say I'm sitting in my
office at NYU and I decide to go to
Paris
i'm not going to be able to plan my
entire trip from my office at at NYU to
Paris in terms of millisecond by
millisecond muscle control which is kind
of the lowest level actions I can take
okay I can't do it because first of all
it's too long of a sequence second of
all I don't even have the information
right i don't I don't have uh full
knowledge of you know whether the
traffic light is going to be red or or
green so do I need to you know plan to
to stop or cross the street
but at a high level I can have a high
level prediction mental of my mental
model that says if I want to go to Paris
I need to go to the airport and catch a
plane okay now I have a sub goal going
to the airport how do I go to the
airport i'm in New York so I can go down
in the street and hail a taxi how do I
go down in the street well I'm sitting
at my desk so I need to uh stand up and
go to the elevator push the button and
then walk out the wind the the building
how do I go to the elevator i need to
stand up from my chair pick up my bag
open the door walk to the elevator
avoiding all the obstacles on the way
okay and at some level uh when you go
down the hierarchy you're going to get
to a level where you're going to be able
to plan your millisecond by millisecond
muscle control because you have all the
information in front of you right so
standing up and opening the door I can I
can plan uh ahead of time so
this problem of learning world models
learning hierarchical world models
learning abstract representations of the
world allow us to make those predictions
at at you know long range or short range
so that we can do planning nobody has
any idea or of how precisely how to do
this how to make this work um so if we
assemble all of those pieces that I told
you about we we end up with something
called cognitive architecture for AMI
um you know consisting of a world model
various objective functions an actor
that is the the the system that
optimizes the action to minimize the the
cost, a short-term memory (in the brain this is
the hippocampus), a perception module
that's the entire back of the brain um a
configurator or let's forget about this
I I wrote a long paper about this
two and a half years ago, which I
put on OpenReview (it's not on arXiv),
where I explain where I think AI
research should go if we want to kind of
make progress in that direction u this
was before the recent you know
excitement about LLM although LLMs were
already around, but I never believed
they were the answer to human-level AI.
okay so um how do we get systems to
learn mental models of the world from
sensory input like video
Can we use this idea of autoregressive
prediction, you know,
similar to what I explained before that
LLMs were were using uh to train a
generative architecture to predict
what's going to happen in a video
predict the next few frames in a video
for example and the answer is no it
doesn't work I've tried to do to work on
this for 20 years complete
failure it doesn't work it works for
discrete symbols for predicting discrete
symbols because handling uncertainty in
prediction is simple you produce a
vector of probabilities a bunch of
numbers between 0 and one that sum to
one and that's how you handle
uncertainty the problem now is how do
you
predict a video frame in a high
dimensional continuous
space where we don't know how to
represent you know probability density
functions in any kind of meaningful way
in uh in things like that we can
represent them as an energy function
that we then normalize uh physicists
have been doing this okay but it's
intractable most of the time for most
forms of energy functions uh if you take
e to the minus the energy function and
normalize it the normalization constant
is
intractable um
so so this idea of using generative
models for training a system to predict
videos doesn't work a lot of people are
working on it at the moment but what
they are interested in is not learning
world models it's actually generating
videos if you want to generate videos of
course you should do this generating
cute videos but if you want your system
to really understand the underlying
physics of the world that's that's a
losing proposition and the reason is if
you train a system to make a single
prediction okay which is what generative
models do um what you get are blurry
predictions
essentially u because the system can
only predict the average of all the
possible futures that may happen um so
my solution to this is something called
JEPA that that stands for joint
embedding predictive architecture and
that's what it looks
like okay you may not immediately spot
the difference with the generative
architecture so let me make that more
obvious on the left generative
architectures
the function that you're minimizing
during training is basically a
prediction error right so predict
uh Y observe X observe Y during training
and just train a system to predict Y
okay this is like supervised learning
except Y is a is part of X if it's a
sequence okay so
self-supervised um this works for
discrete Y doesn't work for continuous
high dimensional
Y on the right you have the joint
embedding predictive architecture so now
both X and Y are run through encoders
and what the encoders do is that they
compute an abstract representations
representation of both X and Y okay the
encoders might be different um and then
you make the prediction you perform the
prediction in that representation space
now this is a much easier problem to
solve in many ways because there are
many details in the world that are
completely
unpredictable and what a JPA
architecture will do is basically find
an abstract
representation of the of the world so
that all the stuff that cannot be
predicted is eliminated from that
representation think of the encoder
function as some sort of function with
invariance so that uh the the the
variability of the input y uh that
correspond to things you just cannot
predict are eliminated in the
representation space right so for
example if I if my video is a video of
this room and I point the camera at the
left side I turn slowly and I stop the
camera here and I ask the system predict
what's going to happen next in the video
it can certainly predict that there's
going to be people sitting in seats etc
it cannot possibly predict where
everybody is sitting what everybody
looks like what the texture of the
ground is um or the or or or on the
walls okay there are many things that
you just cannot predict you just don't
have the information and so instead of
uh spending a huge amount of resources
attempting to predict things that you
don't have enough information for just
eliminate it from the prediction process
by learning a representation where
those details are
eliminated.
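A bare-bones sketch of the objective being described, with hypothetical linear encoders and predictor standing in for the real networks (not the actual Meta models): the prediction error is paid in representation space rather than in input space.

```python
import torch
import torch.nn as nn

class JEPA(nn.Module):
    def __init__(self, x_dim, y_dim, rep_dim):
        super().__init__()
        self.enc_x = nn.Linear(x_dim, rep_dim)    # encoder for the observed context x
        self.enc_y = nn.Linear(y_dim, rep_dim)    # encoder for the target y (may differ from enc_x)
        self.pred = nn.Linear(rep_dim, rep_dim)   # predictor operating in representation space

    def forward(self, x, y):
        sx, sy = self.enc_x(x), self.enc_y(y)
        # The error is measured between representations, not raw inputs, so
        # unpredictable details of y can simply be absent from sy.
        return ((self.pred(sx) - sy) ** 2).mean()

# Caveat from the talk: minimizing this alone can collapse (constant sx and sy);
# it needs a regularizer such as the VICReg terms sketched further below.
```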
So there are technical difficulties with this. Okay, so the
conclusion to this is if what I'm
claiming is right we're much better off
using the JEPA architectures than
generative architectures we should
abandon generative architectures
altogether everybody is talking about
genai and I'm telling
people abandon generative
AI that makes me super popular i can
tell you
um particularly among my colleagues who
spend a huge amount of
effort actually building genai systems
in fact their whole organization is
called genai um okay so I mean there's
different flavors of those things with
latent variables etc but I'm I'm I'm not
going to go into those uh those details
um but there is an issue which is how
you train those things okay so basically
training a system like this to learn
dependencies consists in learning an
energy function in such a way that the
energy function takes low
values on your training samples okay so
at a point XY where you have data the
energy should be low but then the energy
should be higher everywhere else so
imagine XY kind of lie on some manifold
um you want the energy function to be
let's say zero on the manifold and then
to kind of gradually increase as you
move away from the
manifold um and the problem with this is
that I only know of two ways of training
systems like this um if this energy
function is is parameterized in such a
way that allows it it allows it to take
a lot of different shapes uh you can
have an issue which is that it can
collapse if you just make sure that the
energy is low around the training
samples and you don't do anything else
you might end up with an energy function
that's completely flat that's called a
collapse so there are two methods to
prevent collapse one is you generate
contrastive samples those blinking green
points and you push their energy up okay
so you figure out some loss function
which given a set of training samples
and a set of contrastive samples are
outside the data manifold uh is going to
push down the energy of the observed
samples and push out the energy of the
contrastive samples okay those are
contrasting methods and the problem with
them is that they don't work really they
don't work very well in high dimension
because the number of contrastive
samples you need to generate goes
exponentially with the dimension of the
space so there's an alternative that you
could call regularized methods and what
those methods are based on is
essentially uh coming up with some
regularizer function that if you
minimize it will minimize the volume of
space that can take low energy that has
low energy that sounds a bit mysterious
of how you do this but in fact there is
a lot of things that have
uh done this in the context of uh
applied mathematics for example in
sparse coding that's exactly what sparse
coding does when you specify a latent
variable you uh you basically minimize
the volume of space that can take low
energy reconstruction energy okay so
those two methods contrasting methods
and regularized
methods um there's different types of
architectures that can collapse or not
i'm going to Okay this is the Gibs
lecture so I have to mention Gibbs um
this is how you turn an energy function
into a probability distribution you use
the Gibbs-Boltzmann distribution. Okay, so
take the exponential of minus the energy,
multiplied by some constant which is akin
to an inverse temperature, and then
normalize by the integral of this over
the entire domain, and what you get is a
properly normalized
conditional probability
distribution of y given x.
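In symbols, using the talk's free energy F and an inverse temperature β:

```latex
p(y \mid x) = \frac{e^{-\beta F(x, y)}}{\int e^{-\beta F(x, y')} \, dy'},
\qquad
-\log p(y \mid x) = \beta F(x, y) + \log \int e^{-\beta F(x, y')} \, dy',
```

where the second term is the log of the partition function, which is what makes this training objective intractable in general.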
And if you really are insisting on doing
probabilistic modeling, the way you
should train your energy-based model is
by basically minimizing the negative log
of that conditional probability over
your training set the problem is that
that's intractable because the the
partition function the normalizing term
is generally completely intractable so
you would have to use approximations
like you know variational approximations
or Monte Carlo approximation and I tell
you a good chunk of the machine learning
community has been
um spending a lot of effort trying to do
this you know getting inspiration from
physics and from statistics and and
various other things um I've I've made a
a chart i don't expect expect you to
read any any of it here but this is sort
of various classical methods as to
whether they are regularized or or
contrastive um so those methods whether
contrastive or regularized have been
extremely successful to basically
pre-train a vision system to learn
representations of images in a
self-supervised manner um and the idea
for this goes back to the early 90s a
paper of mine uh from 1993 and a couple
more in the mid 2000 uh with some of my
students uh at the time uh there's more
recent papers from from Google and a lot
of people have been working on
contrastive methods. You may have heard of
a model called CLIP, which is produced by
OpenAI for learning visual features
using text supervision; it's also a
contrastive uh method but again it
doesn't scale very well with the
dimension so I prefer regularized
methods and the question is how do you
make this work um one way to make this
work is
to prevent the system from
collapsing. So what does a collapse
produce here? A collapse will consist in
minimizing only the prediction error,
D(sy, s̃y),
and doing nothing else, and what the
system can happily do is completely
ignore x and y, produce
constant sx and sy, and then your
prediction problem is trivial: your
prediction error is zero all the time.
Okay, but you have a collapsed model that
doesn't do anything for you so one way
to prevent this from happening and
that's basically a regularization term
is attempting to maximize the
information content coming out of the
encoder, or the two encoders. So
have some estimate of information
content, I(sx) and I(sy), put a minus
sign in front, and minimize
that now that's a challenge um because
we don't know how to maximize
information content we know how to
minimize it because because we have
upper bounds on information we don't
have lower bounds on
information and so what we're going to
do is come up with some approximation
making some assumptions about
information
content knowing that it's actually an
upper bound on information content and
we're going to push it up and cross our
fingers so that the actual information
content actually
follows okay and it works it's not well
justified but it's better justified than
everybody anything else that people have
done so
um so it would be nice if we could come
up with uh lower bounds on information
content but frankly I don't think it's
possible
uh because there may be complicated
dependencies that you know you you don't
know the nature of and um and so it
doesn't work so the basic idea of you
know how do you kind of put a number in
a sort of differentiable objective
function on information content and the
basic idea is
um make the representations coming out
of your encoder fill the space okay and
that idea has been proposed by multiple
people almost simultaneously in
different contexts um and there are
basically two ways to do this so the the
contrastive methods are should be called
really sample contrastive right so take
a matrix of vectors coming out of your
encoder for a number of
samples contrasting methods attempt to
make the vectors coming out of the
encoder all
different okay so imagine they're all on
the surface of a sphere because you
normalize them you're basically pushing
all those vectors away from each other
so they fill up the
space it doesn't work too well i mean
you need basically a lot of rows for
this to work um to do something useful
if you have a small number of rows then
it's very easy to have random vectors
you know pointing orthogonal directions
so you need a lot of rows for this to
work. So the converse is dimension-contrastive
methods,
um where you take the columns of that
matrix and you try to make the
columns different from each other maybe
orthogonal to each other and that only
works if you have a small number of rows
relative to the dimension because
otherwise it's too easy you have a small
number of high dimensional vectors that
need to be orthogonal i mean take them
randomly they'll almost be orthogonal
okay so you have kind of a duality
between those two things in fact we have
a paper on the fact that those two
things are dual of each other um but I
prefer the second one because uh they
can deal with
highdimensional representation space
whereas the first one really
can't. So what we've come up with is
something we call VICReg, which means
variance-invariance-covariance
regularization, and basically the idea
there is you take the sx representation
and you have a cost function that
insists that, over a batch of samples, the
variance of each variable is at least
one, using a hinge
loss, and then a second cost that insists
that the off-diagonal terms of the
covariance matrix, so basically this
matrix
transposed and multiplied by itself,
that the off-diagonal terms of the
covariance matrix are as close to zero as
possible. So basically you are trying to
decorrelate the individual variables,
make the columns of that representation
matrix
orthogonal.
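A compact sketch of the variance and covariance terms just described (simplified; the published VICReg loss also includes the invariance/prediction term and weighting coefficients):

```python
import torch

def vicreg_regularizer(s, eps=1e-4):
    """s: a (batch, dim) matrix of representations from an encoder.
    Hinge the per-dimension standard deviation toward at least 1, and push the
    off-diagonal entries of the covariance matrix toward zero (decorrelation)."""
    s = s - s.mean(dim=0)                                 # center each column
    std = torch.sqrt(s.var(dim=0) + eps)
    variance_loss = torch.relu(1.0 - std).mean()          # hinge: keep each variable's variance up
    cov = (s.T @ s) / (s.shape[0] - 1)                    # (dim, dim) covariance matrix
    off_diag = cov - torch.diag(torch.diag(cov))
    covariance_loss = (off_diag ** 2).sum() / s.shape[1]  # penalize off-diagonal correlations
    return variance_loss + covariance_loss
```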
Other people have had kind of similar ideas at Berkeley, and
some of my colleagues at NYU, with a
method called MMCR.
um so um and we have some theorems that
prove that in certain conditions when
you apply this criterion to decorrelate
uh after going through some nonlinear
function what you're actually doing is
making uh the the variables pair-wise
independent not just uncorrelated which
is uh interesting but still you know
this is all kind of a little uh
wishy-washy so a lot of challenges I
think for astute mathematicians
Um now I'm going to skip this about
latent variables because I don't have
time i'm just going to show you the last
slide about this because again that has
to do with Gibbs um the fact that if you
want to minimize the information content
of of a variable which you need to do if
you have a latent variable, a good way to
do this is to make it noisy so instead
of inferring a single value for this
variable you infer a distribution over
this variable you maximize the entropy
of that distribution and the best way to
do this is by writing down what's called
a variational free energy and
minimizing that, and, you know, Gibbs
had something to to say about
that. Okay, this I'm also going to
skip for lack of time. You can
actually use this VICReg technique
and apply it to PDEs, not solving them, but
like, for example, finding the
coefficients of a particular PDE by just
looking at windows on the solution. So
basically take the space-time solution
of a PDE, take two windows, and train a
system to produce a representation of
those two windows
using this VICReg criterion, by basically
forcing the two representations to be
identical regardless of the pair of
windows, and the only thing that the
system can extract which is common
between different windows in the
solution of a PDE are basically the
coefficients of the PDE, the representation of
the equation of the differential
equation itself and when you I don't
have time to explain but when you apply
this to various situation it actually
works if you want more details about
this you might want to talk to uh Rand
Bisrio who is sitting somewhere
here where are
you right here okay uh he's one of the
main authors of that paper um and he can
give you some details uh the the bottom
line is that when you use this VICReg
stuff to learn the coefficients of a PDE,
you actually get a better prediction
than if you train it in supervised mode,
which is kind of interesting. Okay,
there's another set of methods,
alternative to VICReg, called
distillation-based methods, and we use
them because they work well, but I don't
like them because they're
even less theoretically justified than
the VICReg information
maximization techniques, and I'm not
going to go into the details of of how
they work um but you're minimizing a
function that is not actually being
minimized by gradient descent it's it's
a mess there are some theoretical papers
on this i I listed one at the bottom
here um but uh only in the case of kind
of linear encoders and predictors and uh
really it's not such a satisfying method
but it works really well and a lot of
people have been using it for uh
learning features of images in a
self-supervised manner. There is a
technique called JEPA which I won't have
time to describe in detail but it it
does a really good job at kind of
learning representations of images that
you can use then for a subsequent task
uh that would be supervised but without
requiring too many labelled samples um
And then there is a video version of
this called video JEPA. So you take a
video, you mask a big chunk of it, at
all times for example, and then you train
some gigantic neural net, a JEPA
architecture, to predict the internal
representation of the full video from
the representation of the partially
masked one and what you get in the end
is a system that does a really good job
at representing videos you can use that
representation as input to a system that
can classify the action taking place in
the video and and doing things of that
type and I'm not going to bore you with
tables of results but it works really
well and one really interesting thing
about this technique this is a paper
that u uh we just finished and we're
we're submitting um when you test those
systems and you measure the prediction
error they're they're doing uh on a
video if you show them a video that is
physically impossible like an object
disappears or changes shape
spontaneously, it tells you,
that can't happen, my prediction error
goes through the roof. And so this
system has kind of learned a
very basic form of common sense, a little
bit like, you know, the babies I was
talking about earlier. I mean, that's
really a surprising result, because the
system is really not trained to
predict, it's just to kind of fill in
missing parts um and we've been using
u kind of self-supervised encoders and
predictors to do planning so I'm I'm
coming to this idea of world model so
let's say you have a picture of a state
of the world and the system can control
a robot
arm and u you
want the system to basically act in such
a way that the final state of the world
looks like u you know a particular
target um so let's say you have a bunch
of blue chips on the table and you want
to kind of move a robot arm so that at
the end the blue chips are all within a
a nice little square as represented here
in the middle um and so um train um an
encoder so we use dynov2 which is a
pre-trained encoder and then train a
world model to predict what is going to
happen at the representation level when
you take a particular action can you
predict the resulting effect uh on the
on that board with blue chips um and
then once you have that world model
which you can train from you know random
data with random actions and random blue
chips can you use it to plan a sequence
of actions so as to arrive at a
particular result. And I'm going to
cut to the
chase. So, I mean, there's various
problems that we've applied this to and
it works pretty well for planning
several things but here is the result
for the the blue chips okay so what you
see here is a video uh you don't see the
actions of the robot arm but it's
actually taking actions and what you see
at the top is what's happening in the
world and what you see at the bottom is
what the system predicts is going to
happen in its own internal world model
we trained a separate decoder to kind of
produce an image of what the internal
thinking of the system is let me play
this again
um so at the bottom here you see kind of
the uh configuration kind of progressing
as the robot kind of pushes things
around and then the end state is not
exactly a square but it's pretty close
and this is a very complex dynamical
system where the chips are interacting
with each other you you could not you
know really kind of model this uh
sufficiently accurately to kind of do
planning in uh with a hand constructed
model really uh we've kind of similar
work uh for planning and navigation in
in real world i'm going to skip this
because I'm running out of time so my uh
recommendations um abandon generative
models in favor of the joint embedding
architectures abandon probabilistic
models in favor of energy based models
abandon contrastive methods in favor of
regularized methods uh abandon
reinforcement learning i've been saying
this for a decade now if ever of model
predictive control and planning
um and if you are interested in human
level AI just don't work on
Labs in fact if you are a PhD student in
AI you should absolutely not work on Lab
because you're putting yourself in
competition with enormous teams with
tens of thousands of GPUs at their
disposals you're not going to be able to
contribute anything. Okay, problems to
solve: large-scale world models, how do you
train them on multimodal inputs, planning
algorithms um I think there's a lot of
uh expertise perhaps in the room here in
optimal control in sort of various ways
to perform optimization uh when you
perform use gradient based method for
planning uh you get hit by all kinds of
local minima issues and non
differentiability and all that kind of
stuff, so maybe methods like ADMM.
JEPA with latent variables, and
planning with uh nondeterministic
environment uh regularizing latent
variables uh uh hierarchical planning uh
etc and so you know what are the
mathematical foundations of energy based
learning when we're stepping out of
probabilistic learning we're getting
into the
unknown and characterizing what is
appropriate to do there is not clear uh
learning cost modules I didn't talk
about this but that's also an issue uh
planning with inaccurate world model and
how you adjust the world models
um okay so maybe if we solve all those
problems within the next decade or or
half decade we're going to be on a good
path towards building systems that are
truly intelligent that are capable of
planning and reasoning um and I think
the only way for this to work is if the
platforms are open source i've been a
big advocate of open source AI and I I
really believe in this um but if we
succeed maybe um AI will be sort of a
big amplifier of human intelligence that
can be only good thank you very much
[Applause]
[Music]