Stanford AI Club: Jeff Dean on Important AI Trends
By Stanford AI Club
Summary
## Key takeaways
- **Needed a Million Times More Compute**: Jeff Dean's 1990 senior thesis on parallel neural net training with 32 processors fell short because neural nets required roughly a million times more processing power to work well, not just 32x. [04:48]
- **DistBelief Enabled Huge Nets**: In 2012, the DistBelief software allowed training neural networks 50-100x larger than before using asynchronous updates across 200 replicas; despite being mathematically incorrect, the scheme worked. [06:10]
- **Cat Neuron Emerged Unsupervised**: Training on 10 million random YouTube frames with an unsupervised reconstruction objective produced neurons detecting cats, human faces, and pedestrians without labels, yielding a 70% relative ImageNet improvement. [08:00]
- **Word Vectors Capture Directions**: Word embeddings trained on raw text place related words nearby (cat, puma, tiger), with meaningful directions such as king - man + woman ≈ queen for gender shifts. [10:13]
- **Transformers, 10x Fewer Parameters**: The Transformer architecture achieved the same language model loss with 10x fewer parameters and 10-100x less compute than prior methods by attending to all past states instead of compressing them into a single vector. [18:40]
- **TPUv1 15-30x Faster Inference**: Google's TPUv1 for inference was 15-30x faster than contemporary CPUs/GPUs and 30-80x more energy efficient, designed for low-precision matrix operations in neural nets. [14:20]
Topics Covered
- Scale trumps 32 processors by millionfold
- Cats emerge unsupervised in billion-parameter nets
- TPUs deliver 80x efficiency leap
- Transformers slash compute 100x
- Gemini earns IMO gold
Full Transcript
Uh, for a quick intro on Jeff: Jeff joined Google in 1999 as its 30th employee, where he built some of the most foundational infrastructure that powers the modern internet, including MapReduce, Bigtable, and Spanner. Jeff went on to found Google Brain in 2011, where he developed and released TensorFlow, one of the world's most popular deep learning frameworks. Jeff now serves as the chief scientist of Google DeepMind and Google Research, where he leads the Gemini team. Jeff, it's an honor to have you here today. Take it away.
>> Fantastic. Okay, so what I thought I would do today is talk to you about important trends in AI: a whole bunch of developments that have happened mostly over the past 15 years or so, and how those have fit together well into building the modern, capable models that we have today. Um, this is presenting the work of many, many people at Google, and some of it is also from elsewhere. Sometimes I'm just the messenger, sometimes a collaborator and developer of some of these things.
Um, so first a few observations. I think in the last decade or so, machine learning has completely changed our expectations of what we think is possible with computers. Like, 10 years ago you could not get very natural speech recognition and conversations with your computer. Computers weren't really very good at image recognition or understanding what's in visual form. They didn't really understand language all that well. But what has happened is we've discovered that a particular paradigm of deep-learning-based methods, neural networks, and increasing scale has delivered really good results as we've scaled things up. And along the way we've developed new and interesting algorithmic and model architecture improvements that have also provided massive improvements, and these often combine well, so even bigger models with even better algorithms tend to work even better. Uh, the other thing that's had a significant effect on the whole computing industry is that the kinds of computations we want to run, and the hardware on which we want to run them, have dramatically changed. Like, 15 years ago, mostly you cared about how fast your CPU was, maybe how many cores it had, whether it could run Microsoft Word and Chrome, or traditional hand-coded computations, quickly. Now you care whether it can run interesting machine learning computations, with all kinds of different constraints.
Okay, so: a rapid-fire, whirlwind tour of 15 years of machine learning advances. How did today's models come to be? It's going to be like one or two slides per advance. There's often an arXiv link or a paper link where you can go learn more, but I'm going to try to give you just the highest-level essence of why each idea was important and what it helps us with.
Okay, but I'm going to even go back more than that. I'll go back like 50 years.
Neural nets. It turns out these are a relatively old idea, and this notion of artificial neurons, where we have weights on the edges and we can learn to recognize certain kinds of patterns, actually turns out to be really important. Combined with that, backpropagation as a way to learn the weights on the edges turns out to be a really key thing, because then you can do end-to-end learning on the entire network from some error signal you have. And so this was kind of the state of affairs when I first learned about neural networks in 1990, my senior year of college. And I got really excited. I'm like, "Oh, this is such a great abstraction. It's going to be awesome, and we could build really great pattern recognition things and solve all kinds of problems." So I got really, really excited and I said, "Oh, I'm going to do a senior thesis on parallel training of neurons." And so what I ended up doing was, well, let's just try to use the 32-processor machine in the department instead of a single machine, and we'll be able to build really impressive neural networks. So it's going to be really great. And I essentially implemented two different things that we now call data parallel and model parallel training of neural nets on this funky hypercube-based machine, and then looked at how that scaled as you added more processors. Um, so it turns out I was completely wrong. You needed like a million times as much processing power to make really good neural nets, not 32 times. But it was a fun exercise. I really enjoyed writing this thesis. And then I went off and decided to do other things in grad school, but this always kind of had a little inkling in the back of my mind: this could be an important abstraction.
Um, so in 2012 I actually bumped into Andrew Ng in a microkitchen at Google. I'm like, "Oh, hi Andrew. How are you? What are you doing here?" And he's like, "Oh, well, I'm starting to spend a day a week at Google and I haven't really figured out what I'm doing here yet, but my students at Stanford are starting to get good results with neural nets on various kinds of speech problems." I'm like, "Oh, that's cool. We should train really big neural networks." So that was kind of the genesis of the Google Brain project: how do we scale up large training of neural networks using lots and lots of computation? And at that time we didn't actually have accelerators in our data centers; we had lots and lots of CPUs with lots of cores. So we ended up building this software abstraction that we called DistBelief, in part because people didn't believe it was going to work. Uh, but this ended up supporting both model parallelism and also data parallelism.
And in fact we did this kind of funky asynchronous training of multiple replicas of the model, on the right-hand side, where before every step with a batch of data, one of the replicas would download the current set of parameters, crunch away on one batch of training with those, and compute a gradient update. That's the delta-w there, and it sends it to the parameter servers, which would then add the delta-w into the current parameters they were hosting. Now, this is all completely mathematically wrong, because at the same time all the other model replicas were also computing gradients and asynchronously adding them into this shared parameter state. So that made a lot of people kind of nervous, because it's not actually what you're really supposed to do, but it turned out it worked. So that was nice. And we had systems where we'd have 200 replicas of the model all churning away asynchronously and updating parameters.
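As a rough illustration of the asynchronous scheme described above, here is a toy sketch using threads and a linear model on synthetic data; the `ParameterServer` class and its methods are illustrative names, not DistBelief's actual API.

```python
# Toy sketch of DistBelief-style asynchronous SGD (illustrative, not the real system).
import threading
import numpy as np

class ParameterServer:
    """Holds the shared parameters; replicas fetch them and send back gradient deltas."""
    def __init__(self, dim):
        self.w = np.zeros(dim)
        self.lock = threading.Lock()

    def fetch(self):
        with self.lock:
            return self.w.copy()

    def apply_delta(self, delta_w):
        # Gradients may be stale: other replicas have updated w since this one fetched it.
        with self.lock:
            self.w += delta_w

def replica_loop(ps, data, labels, steps=100, lr=0.01, batch=32):
    rng = np.random.default_rng()
    for _ in range(steps):
        w = ps.fetch()                          # download the current parameters
        idx = rng.integers(0, len(data), batch)
        x, y = data[idx], labels[idx]
        grad = 2 * x.T @ (x @ w - y) / batch    # gradient of mean squared error for this batch
        ps.apply_delta(-lr * grad)              # send delta-w back to the parameter server

# Synthetic regression problem shared by all replicas.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
true_w = rng.normal(size=10)
Y = X @ true_w

ps = ParameterServer(dim=10)
workers = [threading.Thread(target=replica_loop, args=(ps, X, Y)) for _ in range(4)]
for t in workers: t.start()
for t in workers: t.join()
print("error vs true weights:", np.linalg.norm(ps.w - true_w))
```

Even though the replicas overwrite each other's view of the parameters, the shared weights still converge on this toy problem, which is the surprising property the talk describes.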
Um, and it seemed to work reasonably well. And we also had model parallelism, where we could divide very large models across many, many computers. So this system enabled us in 2012 to train a neural network 50 to 100x larger than anyone had ever trained before. They look really small now, but at that time we were like, oh, this is great. And so one of the first things we used this system for was what's known as the cat paper, where we took 10 million random frames from random YouTube videos and just used an unsupervised objective function to learn a representation that we could then use to reconstruct the raw pixels of each frame. The learning objective was trying to minimize the error in the reconstruction of the frame given the input frame. So you don't need any labels, and in fact the system never saw any label data for the unsupervised portion. But what we found was that at the top of this model, you'd end up with neurons that were sensitive to whether the image contained different kinds of high-level concepts, even though it had never been taught, you know, what a cat was. There was a neuron where the strongest stimulus you could give it was something like that. And so it had sort of come up with the concept of a cat just by being exposed to that data. There were also other neurons for things like human faces or the backs of pedestrians. Um, and perhaps more importantly, we got very large increases in the state of the art on the less commonly contested ImageNet 22,000-category benchmark. The one most people competed in, and the one you usually hear about, is the 1,000-category one. We were like, well, let's do the 22,000-category one. And so we actually got something like a 70% relative improvement in the state of the art on that. And what we were also able to show is that when we did unsupervised pre-training, we got a pretty significant increase in accuracy.
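A minimal sketch of the unsupervised reconstruction idea, using a tiny linear autoencoder on random data rather than the actual cat-paper model: the only training signal is how well the input can be reconstructed, so no labels appear anywhere.

```python
# Tiny linear autoencoder trained with a reconstruction objective (no labels).
import numpy as np

rng = np.random.default_rng(0)
frames = rng.random((256, 64))          # stand-in for flattened video frames

d, h = frames.shape[1], 16
W_enc = rng.normal(scale=0.1, size=(d, h))
W_dec = rng.normal(scale=0.1, size=(h, d))

lr = 0.01
for step in range(200):
    code = frames @ W_enc               # learned representation
    recon = code @ W_dec                # reconstruction of the raw pixels
    err = recon - frames                # reconstruction error drives all learning
    loss = np.mean(err ** 2)
    # Descent directions (proportional to the MSE gradients) for both weight matrices.
    g_dec = code.T @ err / len(frames)
    g_enc = frames.T @ (err @ W_dec.T) / len(frames)
    W_dec -= lr * g_dec
    W_enc -= lr * g_enc

print("final reconstruction loss:", loss)
```

In the real system the representation at the top of a much deeper network is what ended up containing the "cat neuron" and similar concept detectors.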
Um, we also started to think about language, and about how we could have nice distributed representations of words. So rather than representing words as discrete things, we wanted to have a neural-net-like representation for every word, and then be able to learn those representations, so that you end up with these high-dimensional vectors that represent each word or phrase in the system. And when you do that, we had a few different objectives for how you train this. One way you can do it is to take a word in a sequence of words and use its representation to try to predict the other nearby words; then you can get an error signal and backpropagate into the representations for all the words. And if you do this, and you have a lot of training data, which is just raw text, then what you find is that nearby words in the high-dimensional space after training are all quite related. So cat and puma and tiger are all nearby. But also, interestingly, we found that directions are kind of meaningful. So if you subtract these vectors, you end up going in the same direction to change, for example, the gender of a word, regardless of whether you start at king or you start at man. And there are other directions for things like past tenses of verbs and future tenses of verbs. So that was kind of interesting.
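Here is a small sketch of the "directions are meaningful" observation. The vectors below are random placeholders standing in for embeddings a real system (e.g. word2vec) would learn from raw text; with real trained embeddings, the analogy query tends to return "queen".

```python
# Word-vector analogy: king - man + woman, answered by nearest-neighbor search.
import numpy as np

emb = {w: np.random.default_rng(i).normal(size=300)
       for i, w in enumerate(["king", "man", "woman", "queen", "cat", "puma", "tiger"])}

def nearest(query, vocab, exclude=()):
    # Cosine similarity of the query against every word in the vocabulary.
    best, best_sim = None, -1.0
    for word, vec in vocab.items():
        if word in exclude:
            continue
        sim = query @ vec / (np.linalg.norm(query) * np.linalg.norm(vec))
        if sim > best_sim:
            best, best_sim = word, sim
    return best

# The (woman - man) difference approximates a "gender direction"; adding it to "king"
# should land near "queen" in a trained embedding space.
analogy = emb["king"] - emb["man"] + emb["woman"]
print(nearest(analogy, emb, exclude={"king", "man", "woman"}))
```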
Um, then my colleagues Ilya, Oriol, and Quoc worked on using LSTMs, these recurrent long short-term memory models, for a particularly nice problem abstraction where you have one sequence and you're going to use that to predict a different sequence. And it turns out this has all kinds of uses in the world. One use that they focused on in the paper was translation. You have an English sentence, say, and you're going to try to predict the French sentence, and you have a bunch of training data where you know the correct French translation of an English sentence. So you end up using that as a supervised learning objective to learn good representations in the recurrent model in order to do this translation task. And if you see enough English-French sentence pairs and use this sequence-to-sequence learning objective, then you end up with a quite high-quality translation system. Turns out you can use this for all kinds of other things as well, but I will not talk about that.
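As a small illustration of how a translation pair turns into a supervised sequence-to-sequence example, here is a sketch of the usual teacher-forcing setup; the tokens and special symbols are made up for this example and are not from the paper.

```python
# One (English, French) pair becomes encoder input, decoder input, and decoder targets.
def make_example(src_tokens, tgt_tokens):
    encoder_input = src_tokens                      # the model reads the English sentence
    decoder_input = ["<start>"] + tgt_tokens        # at each step the decoder sees the gold prefix
    decoder_target = tgt_tokens + ["<end>"]         # ...and is trained to predict the next French token
    return encoder_input, decoder_input, decoder_target

enc_in, dec_in, dec_tgt = make_example(
    ["the", "cat", "sat"], ["le", "chat", "s'est", "assis"])
for step, want in enumerate(dec_tgt):
    print(f"step {step}: decoder has seen {dec_in[:step + 1]} -> should predict {want!r}")
```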
So, one of the other things we started to realize, as we were getting more and more success using neural nets for all kinds of interesting things in speech recognition and vision and language, was... well, actually, I did a bit of a back-of-the-envelope calculation. We had just produced a really high-quality speech recognition model that we hadn't rolled out, but we could see that it had a much lower error rate than the current production speech recognition system at Google, which at that time ran in our data centers. And so I said, "Oh, well, if speech recognition gets a lot better, people are going to want to use it more." So what if 100 million people want to start talking to their phones for three minutes a day? Just random numbers pulled out of my head. And it turned out that if we wanted to run this high-quality model on CPUs, which is what we had in the data centers at that time, we would need to double the number of computers Google had just to roll out this improved speech recognition feature. So I said, well, we really should think about specialized hardware, because there are all kinds of nice properties of neural net computations that we could take advantage of by building specialized hardware in particular.
Neural nets are very tolerant of very low precision computations, so you don't need 32-bit floating-point numbers or anything like that. And all the neural nets that we'd been looking at at the time were just different compositions of essentially dense linear algebra operations: matrix multiplies, vector dot products, and so on. So if you can build specialized hardware that is really, really good at reduced-precision linear algebra, then all of a sudden you can have something that's much more efficient. And we started to work with a team of chip designers and board designers, and this is a paper we ended up publishing a few years later, but in 2015 we rolled out TPUv1, the Tensor Processing Unit, which was really designed to accelerate inference, into our data centers. We were able to do a bunch of nice empirical comparisons and show that it was 15 to 30 times faster than CPUs and GPUs at the time, and 30 to 80 times more energy efficient. This is now the most cited paper in ISCA's 50-year history, which I'm excited about.
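To make the "reduced-precision linear algebra" point concrete, here is a generic sketch of an int8 quantized matrix multiply with per-tensor scales; this is a standard illustration of low-precision arithmetic under simple assumptions, not TPUv1's actual number formats or datapath.

```python
# Quantize activations and weights to int8, multiply in integers, then rescale to float.
import numpy as np

def quantize(x, num_bits=8):
    max_int = 2 ** (num_bits - 1) - 1                 # 127 for int8
    scale = max(np.max(np.abs(x)) / max_int, 1e-12)   # per-tensor scale factor
    q = np.clip(np.round(x / scale), -max_int, max_int).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
activations = rng.normal(size=(4, 256)).astype(np.float32)
weights = rng.normal(size=(256, 128)).astype(np.float32)

qa, sa = quantize(activations)
qw, sw = quantize(weights)
# Accumulate in int32, as low-precision matrix units typically do, then rescale.
int_result = qa.astype(np.int32) @ qw.astype(np.int32)
approx = int_result * (sa * sw)

exact = activations @ weights
print("max relative error:", np.max(np.abs(approx - exact)) / np.max(np.abs(exact)))
```

The point of the sketch is the tolerance the talk mentions: the approximate result is close enough for neural net workloads even though every operand carries only 8 bits.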
Then, working with that same set of people, we realized we also wanted to look at the training problem, because inference is a nice, relatively small-scale problem where, at that time, you could have a single PCIe card you plug into a computer and run a whole bunch of models on that. But training is a much larger-scale problem. And so we started to design what are essentially machine learning supercomputers, around the idea of having low precision, a high-speed custom network, and a compiler that could map high-level computations onto the actual hardware. And it ended up with a whole sequence of TPU designs that are progressively faster and faster and larger and larger. Our most recent one, where we've changed our naming scheme so it's no longer what you might expect, is called Ironwood. The pod sizes for this system are 9,216 chips, all connected in a 3D torus, with quite a lot of bandwidth and capacity. And if you compare that to TPUv2, which was our first ML supercomputing pod, it's about 3,600 times the peak performance per pod compared to the first one, which to be fair was only 256 chips instead of 9,216, but still, every individual chip is also much faster. And it's also about 30 times as energy efficient as TPUv2. Now, some of that comes from scaling of process nodes and so on, but some of it just comes from looking at energy consumption in all kinds of ways and building really energy-efficient systems.
Uh, another thing that's happened is that open source tools have really enabled the whole community. So we developed TensorFlow as a successor to our internal DistBelief system, which we'd used for hundreds or thousands of kinds of models; we fixed a bunch of things in it that we didn't like, and decided to open source it when we first started building it. A bunch of people were working, a little bit later, on a system called Torch that used a language called Lua, which didn't get very popular because most people don't want to program in, or did not know, Lua. But then they built a version called PyTorch that was Python-based, which really had a lot of success. And another team at Google has been building a system called JAX that has this nice functional way of expressing machine learning computations. Those have really enabled the whole community in lots of ways: many different kinds of applied ML efforts are using some of those frameworks, researchers are using them, and so on.
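For a flavor of the functional style mentioned for JAX, here is a minimal example: the loss is a pure function, and `jax.grad` / `jax.jit` are transformations that return new functions. The toy data and model are made up for illustration.

```python
# Functional-style gradient computation in JAX.
import jax
import jax.numpy as jnp

def loss(w, x, y):
    pred = x @ w
    return jnp.mean((pred - y) ** 2)

grad_fn = jax.jit(jax.grad(loss))       # compiled gradient of the loss with respect to w

key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (32, 8))
w = jnp.zeros(8)
y = jnp.ones(32)
w = w - 0.1 * grad_fn(w, x, y)          # one SGD step
print(loss(w, x, y))
```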
In 2017, several of my colleagues worked on this attention-based mechanism, building on some earlier work on attention, but coming up with this really nice architecture that is now at the core of most of the exciting language-based models you're seeing today. Their observation was really that, unlike an LSTM, where you have a word and you consume that word by updating your internal state and then go on to the next word, let's not try to force all that state into a vector that we update every step. Instead, let's save all those states we go through, and then be able to attend to all of them whenever we're trying to do something based on the context of the past. And that's really at the core of the "Attention Is All You Need" title. And what they were able to show, in this figure from the paper, was that you could get much higher accuracy with 10 to 100x less compute and, in this case, 10 times smaller models. So this is the number of parameters, on a log scale, for a language model to get down to a particular level of loss, and 10 times fewer parameters in a Transformer-based model would get you there. And in other data in the paper they showed 10 to 100x less compute.
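The core operation being described is attention over all of the saved per-token states. Here is a minimal single-head, unmasked sketch of scaled dot-product attention; a real Transformer adds learned projections, multiple heads, causal masking, and so on.

```python
# Minimal scaled dot-product attention: every position mixes information from all positions.
import numpy as np

def attention(queries, keys, values):
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)                      # similarity of each query to every saved state
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)     # softmax over all positions
    return weights @ values                                     # weighted mix of all saved states

rng = np.random.default_rng(0)
seq_len, d_model = 6, 16
states = rng.normal(size=(seq_len, d_model))    # the per-token states we kept instead of compressing them
out = attention(states, states, states)         # self-attention: each token attends to every token
print(out.shape)                                # (6, 16)
```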
Uh, another super important development has been language modeling at scale with self-supervised data. There's lots and lots of text in the world, and self-supervised learning on this text can give you an almost infinite number of training examples where the right answer is known, because you have some word that you've removed from the view of the model, and then you're trying to predict that word. And there are a couple of different flavors. One is autoregressive, where you get to look to the left and try to predict what the next word is given all the words you've seen before. So "Stanford ___", and the true word is "University". You make a guess for this word; if you get it right, great, and if you get it wrong, you can use that as an error signal to do backpropagation through your entire model. And, looking at that first blank, it's not necessarily obvious it's going to be "University", right? It could be "Stanford is a beautiful campus" or something. So all the effort you put into doing this kind of thing makes it so the model is able to take advantage of all this context and make better and better predictions. There's another objective you can use where you get to look at a whole bunch more context, both to the left and the right, and you just try to guess the missing words. If you've ever played Mad Libs, it's a bit like that: "the Stanford ___ club ___ together ___ and computer ___ enthusiasts." Some of those you can probably guess; some are harder to guess. But that's really the key for doing self-supervised learning on text, which is at the heart of modern language models.
It turns out you can also apply these Transformer-based models to computer vision, and so another set of my colleagues worked on how to do that. In this table, the boldfaced entries are the best result for a particular row, and these two columns were theirs, in varying sizes of configuration. Roughly, with 4 to 20 times less compute, you could get to the best results. So again, algorithmic improvements make a big difference here, because now all of a sudden you can train something much bigger, or use less compute to get the same accuracy.
Um, so I and a few other people really started to encourage some of our colleagues, and gather a small group of people, to work on much sparser models, because we felt like in a normal neural network you have the entire model activated for every example, or every token you're trying to predict, and that just seems very wasteful. It'd be much better to have a very, very large model, have different parts of it be good at different kinds of things, and then, when you call upon the expertise that's needed in the model, only activate a very small portion of the overall model. So maybe 1 to 5% of the total parameters in the model are used in any given prediction. And again, we were able to see that this was a major improvement in compute to reach a given level of accuracy. That's this line here, showing about an 8x reduction in training cost for the same accuracy. Or you could choose to spend that by just training a much better model with the same compute budget. And then we've continued to do a whole bunch of research on sparse models, because we think this is quite an important thing. And indeed, most of the models you hear about today, like Gemini models for example, are sparse models.
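A minimal sketch of the sparse routing idea: a small gating network scores every expert, but only the top-k experts actually run for a given token. The sizes and the simple top-k gate are toy choices for illustration, not the routing used in any particular Google model.

```python
# Toy mixture-of-experts layer with top-k gating.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, k = 32, 16, 2
experts = [rng.normal(scale=0.1, size=(d_model, d_model)) for _ in range(n_experts)]
gate = rng.normal(scale=0.1, size=(d_model, n_experts))

def moe_layer(token):
    logits = token @ gate
    top = np.argsort(logits)[-k:]                     # indices of the k highest-scoring experts
    weights = np.exp(logits[top])
    weights = weights / weights.sum()                 # renormalized gate weights over the chosen experts
    # Only k of the n_experts weight matrices are touched for this token.
    return sum(w * (token @ experts[i]) for w, i in zip(weights, top))

token = rng.normal(size=d_model)
out = moe_layer(token)
print(out.shape, f"-> {k}/{n_experts} experts active (~{100 * k / n_experts:.0f}% of expert parameters)")
```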
Um, in order to support more interesting, kind of weird, sparse models, we started to build compute abstractions that would let us map interesting ML models onto the hardware, where you didn't have to think as much about where particular pieces of the computation were located. So Pathways was the system we built that was really designed to be quite scalable, to simplify running these really large-scale training computations in particular. And, well, one thing: if each of these is one of these TPU pods, there's a super-high-speed network between the chips in the pod. But then sometimes you want a job that will span multiple pods, and so the orange lines are the local data center network in the same building, which you can use to communicate between adjacent pods. Then maybe you have multiple buildings on the same campus, with some network between the buildings; that's the purple line. And you can even run computations where you're using multiple metro areas, with a long-distance high-speed link to communicate between them. One of the things Pathways does is orchestrate all this computation, so that you, as an ML researcher, don't have to think about which network link to use; it chooses the best thing at the best time, and it deals with failures, with what happens if one of these chips or one of these pods goes down, things like that. And one of the things it provides as an abstraction is a layer underneath JAX that is a Pathways runtime system, so we can make a single Python process look like a JAX programming environment that, instead of having four devices, has 10,000 devices. And you can use all the normal JAX machinery to express, okay, I'd like to run this computation on all these devices.
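Pathways itself is internal infrastructure, but the "one Python program, many devices" feel it gives JAX can be illustrated at small scale with plain JAX's `pmap`; this is only a sketch of the programming model, not Pathways.

```python
# One Python process, one function, replicated across all locally visible devices.
import jax
import jax.numpy as jnp

n = jax.local_device_count()                  # e.g. 1 on a laptop, 4 or 8 on a TPU/GPU host
x = jnp.arange(n * 4.0).reshape(n, 4)         # one shard of data per device

# The same function runs on every device; with a Pathways-style runtime underneath,
# the "devices" could number in the thousands while this Python code stays the same.
y = jax.pmap(lambda shard: (shard ** 2).sum())(x)
print(y)                                      # one partial result per device
```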
Um, so another set of my colleagues worked on how we can use better prompting of the model to elicit better answers. One of their observations was: in this case, we're giving the model one example of a problem, and then asking it to solve a different but similar problem. If you just show the model the example problem and it is told to simply give the answer, like "the answer is nine", then it doesn't do as well as if you give the model some guidance that it's supposed to show its work, and demonstrate that on the first problem. Then it will actually go ahead and show its work for the actual problem you're trying to get it to solve. And, you know, one way of thinking about this is that because the model gets to do more computation for every token it emits, in some sense it's able to use more compute to arrive at the answer. But it's also helpful for it to be able to reason through problems step by step, rather than trying to just internally come up with the right answer. And this paper showed that you got pretty significant increases in accuracy on GSM8K, which is a middle-school math benchmark with problems kind of like these, if you use this chain-of-thought prompting versus standard prompting. Now remember, this was three years ago, right? And we were really excited that we'd gotten 15% correct on eighth-grade math problems of the form "Shawn has five toys, and for Christmas he got two more." So we've made a lot of progress on math in the last few years.
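For concreteness, here is a sketch of the difference between the two prompting styles described above. The worked example and wording are illustrative reconstructions, not the exact prompts from the paper.

```python
# Standard vs. chain-of-thought few-shot prompting (illustrative prompt text).
question = "Shawn has five toys. For Christmas he got two more. How many toys does he have now?"

standard_prompt = (
    "Q: Roger has 3 tennis balls. He buys 2 more cans of 3 balls each. How many balls does he have?\n"
    "A: The answer is 9.\n\n"
    f"Q: {question}\nA:"
)

chain_of_thought_prompt = (
    "Q: Roger has 3 tennis balls. He buys 2 more cans of 3 balls each. How many balls does he have?\n"
    "A: Roger started with 3 balls. 2 cans of 3 balls is 6 balls. 3 + 6 = 9. The answer is 9.\n\n"
    f"Q: {question}\nA:"
)

# With the second prompt the model is shown how to "show its work", so it emits intermediate
# reasoning tokens (and spends more compute) before committing to a final answer.
print(chain_of_thought_prompt)
```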
Uh, another important technique turns out to be one I worked on with Geoff Hinton and Oriol Vinyals, called distillation. The idea here is that when we're doing this sort of next-word prediction, if you're doing self-supervised learning on "you perform the concerto for ___", the correct answer in the text you're training on is "violin". But it turns out, if you already have a really good neural network, you can use that as a teacher, and the teacher will give you a distribution of likely words for that missing word. And so you can use this distribution to give the student model much more information when it gets something wrong, because it's very likely the word is violin or piano or trumpet, but it's extremely unlikely it's airplane. And that rich signal actually makes it much easier for the model to learn quickly. In particular, what we showed in this paper, on a speech dataset where we're trying to correctly predict the sound in a frame of audio, was this: the baseline, using 100% of the training set, could get 58.9% accuracy on the test frames, but if you only use 3% of the training data, you get only 44% test frame accuracy, a huge drop. But if you use these soft targets and a distillation process, with 3% of the training data you can get 57% accuracy. And this is why it's such a super important technique: you can train a really, really large model, and then use distillation with a much smaller model, using the distillation targets to get a really high-quality small model that approximates quite closely the performance of the large model.
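A minimal sketch of the soft-target loss described here, using made-up logits over a four-word vocabulary; real distillation setups vary in details like the temperature and how the soft and hard losses are mixed.

```python
# Distillation: train the student against the teacher's full distribution over next words.
import numpy as np

def softmax(logits, temperature=1.0):
    z = logits / temperature
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

vocab = ["violin", "piano", "trumpet", "airplane"]
teacher_logits = np.array([4.0, 2.5, 2.0, -3.0])      # a good teacher: violin likely, airplane not
student_logits = np.array([0.5, 0.2, 0.1, 0.4])       # an untrained student

T = 2.0                                                # temperature softens the teacher distribution
p_teacher = softmax(teacher_logits, T)
p_student = softmax(student_logits, T)

# Cross-entropy against the soft targets: wrong-but-plausible words (piano, trumpet)
# are penalized far less than implausible ones (airplane).
distill_loss = -np.sum(p_teacher * np.log(p_student))
hard_loss = -np.log(softmax(student_logits)[vocab.index("violin")])   # usual one-hot loss
loss = 0.5 * distill_loss + 0.5 * hard_loss
print(dict(zip(vocab, np.round(p_teacher, 3))), round(float(loss), 3))
```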
Okay. And then, in the 2020s I guess I should say, people have been doing a lot more reinforcement learning for post-training. So once you've already trained a model on these self-supervised objectives and so on, you then want to encourage the right kinds of behavior from your model. You want to do that in terms of things like the style of the responses: do you want it to be polite? You can give it reinforcement learning feedback, or give it examples of being polite, and do training on that to coax the polite kind of answers out of the model and suppress the less polite ones. For safety properties, you might want the model to just not engage with people on certain kinds of topics. But you can also enhance the capabilities of the model by showing it how to tackle much more complex problems, and these signals can come from many different sources. One is reinforcement learning from human feedback: you can use human feedback on the outputs of the model, where a human says "yeah, that's a good answer; no, that's a bad answer; yes, that was a good answer," and using lots of those signals you can get the model to approximate the kinds of behaviors your human reward signal favors. RL from machine feedback is where you use feedback from another model, often called a reward model, where you prompt the reward model to judge whether it likes answer A or B better, and use that as an RL signal. But probably one of the most important things is RL in verifiable domains like math or coding. Here you can generate some sort of solution to a mathematical problem, let's say it's a proof, and because you have a verifiable domain, you can run a more traditional proof checker against the proof the model has generated. The proof checker can say, yes, that's a correct proof, or no, that's incorrect, and in particular it's wrong in step 73 or something, and that can give positive reward to the model when it reasons correctly. You can also do this for coding, where you give reward for code that compiles, and even more reward for code that compiles and passes the unit tests you have for some coding problem. You just have a whole slew of problems you ask the model to try to solve, and it gets rewards when it solves them. And so this enables the model to really explore the space of potential solutions, and over time it gets better and better at exploring that space.
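A sketch of what a "verifiable reward" for code generation could look like: partial reward for compiling, full reward only if the unit tests pass. The task, tests, and reward values here are all made up for illustration; production systems would sandbox execution and score far more carefully.

```python
# Toy verifiable reward: compile the candidate, run it against unit tests, return a score.
def reward_for_code(candidate_source, test_cases):
    try:
        code = compile(candidate_source, "<candidate>", "exec")
    except SyntaxError:
        return 0.0                       # does not even compile
    namespace = {}
    exec(code, namespace)
    solve = namespace.get("solve")
    if solve is None:
        return 0.1                       # compiles, but doesn't define the required function
    try:
        passed = sum(1 for args, expected in test_cases if solve(*args) == expected)
    except Exception:
        return 0.1                       # compiles but crashes on the tests
    return 0.1 + 0.9 * passed / len(test_cases)

tests = [((2, 3), 5), ((10, -4), 6)]
print(reward_for_code("def solve(a, b):\n    return a + b\n", tests))   # 1.0: passes everything
print(reward_for_code("def solve(a, b):\n    return a - b\n", tests))   # 0.1: compiles, fails the tests
```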
Okay, so there have been all kinds of innovations at many different levels, many of which I just talked about. But I think it's important to realize that everything from the hardware to the software abstractions, the model architectures, the training algorithms, all these things have come together and really contributed. And I'm way behind time, so I'm going to speed up.
Okay. So we've been working on Gemini models at Google, which combine a lot of these ideas into, we think, pretty interesting models. Our goal with the Gemini effort is really to train the world's best multimodal models, use them all across Google, and make them available to external people as well. And just this week we released our 3.0 Pro model. We wanted it to be multimodal from the start, to take all kinds of different modalities as input and also produce lots of modalities as output. We've been adding more modalities; this is from the original tech report, and we've since added the ability to produce video and other kinds of things, like audio. We believe in having a really large context length, so that the model can look at lots of pieces of input and reason about them, or summarize them, or refer back to them. That's been pretty important. You know, 2.0 built on a lot of these kinds of ideas and was a quite capable model. 2.5 was also quite a good model. And then, just to show you how far the mathematical reasoning has come, we used a variant of the 2.5 Pro model to compete in the International Mathematical Olympiad this year, and also last year, but this year it was a pure language-model-based system, and we solved five of the six IMO problems correctly, which gets you a gold medal. And there's a nice quote from the IMO president.
president. Uh so the way the IMO works is there's two days of competition. Each day you get three problems. The third of the problems on each day. So problems three
and problem six are the hardest. And uh
this was problem three which we did get correct. We didn't get problem six
correct. We didn't get problem six correct. Uh but we got all the other
correct. Uh but we got all the other ones correct. And so this is the problem
ones correct. And so this is the problem statement. This is the input to our
statement. This is the input to our model.
Um and this is the kind of output the model is able to produce uh which kind of goes on. I think the
judges like the elegance of our solution which is nice uh it goes on for a little while and you know therefore we have proved QED.
Um, so I think it's pretty good to sit back and appreciate how far the mathematical reasoning capabilities of these models have come since 2022 when we were trying to solve, you know, John
has four rabbits and, you know, got two more. Uh, how many rabbits does he have
more. Uh, how many rabbits does he have now?
Um, and then earlier this week we released our Gemini 3 models. I'm really excited about it, as you can see. It performs quite well on a bunch of different benchmarks. There are way too many benchmarks in the world, but they're a good way to assess how good your model is relative to other ones, especially benchmarks that are maybe more interesting or haven't leaked onto the internet quite as much. We're number one in LMArena, which is a good way of assessing things in a non-benchmark-based way: you allow a user to see two random, anonymous language model responses to a prompt they give, the user says "I prefer A" or "I prefer B", and over time you get an aggregated score from that, because you can see whether your model is generally preferred over other models. One of the things that's really happened is we had a huge leap in web-dev-style coding versus our earlier model. I'm going to skip this... well, I'll show you that.
So this is an example of, you know, the word "Gemini" skateboarding, or the word "Gemini" surfing. It's actually generating code for animating all these kinds of things. Here it drew a beautiful landscape; here it is as a forest. I like that one. So you can give very high-level instructions to these models and have them write code, and it doesn't always work, but when it works it's kind of this nice magical feeling. Here's another good example. Someone had a whole bunch of recipes in various forms, some in Korean, some in English. And they basically just said, "Okay, I'm going to scan them all in. I'm going to take photos of them. Great, there we go, they're all in there. Translate and transcribe them. Awesome." Okay, and there they're all transcribed. And then the next step is: let's see if we can create a bilingual website using these recipes. There we go: we've now done this and generated some nice imagery for it. And there you go, now there's your website with your recipes. So that's kind of nice. It combines a whole bunch of capabilities of these models to end up with something that might be kind of useful. Users generally seem to be enjoying this. Yeah, I mean, there are lots of quotes on the web. We also launched a much better image generation model today.
Um, so that's been kind of exciting. People seem to really like it. It can do pretty crazy things. So you can give it, for example, "turn this blueprint into a 3D image of what the house would look like." Or take the original "Attention Is All You Need" figure and say, "please annotate it with all the important aspects of what happens in each different spot." Um, Mustafa is one of the people who worked most on the Nano Banana work. One of the things that's interesting about it is that it actually reasons in intermediate imagery, and you can see this in the thoughts if you use AI Studio. So the question is, you know, "tell me which bucket the ball lands in; use images to solve it step by step." And this is what the model does. It sort of does what you might think: first the ball rolls down there, then, oh yeah, it's going to roll the other way onto ramp three, then it's going to roll onto ramp five, and then it's going to be in B. Um, it's kind of cool. I mean, that's kind of how you would mentally do it. It's pretty good at infographic-y things, so it can annotate old historical figures and tell you things. I posted this image of the solar system, you know, as an example: "show me a chart of the solar system; annotate each planet with one interesting fact." So that's the image we got. Turns out if you do that, people are really sad, especially people my age or a little bit younger. So, okay: edit this to add Pluto, and add a humorous comment. You know, the former planet got demoted to dwarf-planet status and feels grumpy about it. Perfect. You're so back.
Okay. So, in conclusion, I hope you've seen, in your own use of these models and also in what I've presented, that these models are really becoming quite powerful for all kinds of different things. Further research and innovation is going to continue this trend. It's going to have a dramatic effect on a bunch of different areas: in particular, healthcare, education, scientific research, media creation, which we just saw, and misinformation, things like that. And it potentially makes really deep expertise available to many more people, right? Like, if you think about the coding examples, there are many people who haven't been trained in how to write code, and with some computer assistance their vision can help them generate interesting websites for recipes or whatever. Done well, I think our AI-assisted future is bright, but I'm not completely oblivious: areas like misinformation are a potential concern. Actually, John Hennessy and Dave Patterson and I and a few other co-authors worked on a paper last year that touched on all those different areas. We interviewed domain experts in all of them, asked them what their opinions were, and looked at how we can make sure we get all the amazing benefits in the world for healthcare and education and scientific research, but also what we can do to minimize the potential downsides from misinformation or other kinds of things. So that's what I've got.