Stanford CS336 Language Modeling from Scratch | Spring 2025 | Lecture 9: Scaling laws 1
By Stanford Online
Summary
## Key takeaways
- **Scale small models to predict big ones**: The whole point of scaling laws is to build simple predictive laws for language model behavior by training small models, learning from them, and extrapolating to bigger models so you can nail it in one go when building the big model. [02:21], [02:53]
- **1993 Bell Labs first scaling law**: The first scaling law paper, from NeurIPS 1993 by Bell Labs researchers, proposed predicting model test error as irreducible error plus a polynomially decaying term, training small models and fitting curves to predict larger-model behavior. [05:32], [06:52]
- **Data scaling follows a power law**: Plotting log dataset size against log test loss gives a linear relationship, implying polynomial decay in error, as seen empirically across domains and explained theoretically by nonparametric rates with slope -1/d, where d is the intrinsic dimension. [14:47], [19:56]
- **Transformers beat LSTMs by a constant factor**: Scaling laws show transformers have a constant-factor compute-efficiency advantage over LSTMs no matter the scale, with LSTMs roughly 15 times less efficient in log-compute vs. loss plots. [27:44], [28:27]
- **Chinchilla: 20 tokens per parameter**: The Chinchilla analysis, using isoFLOP contours and minimum-envelope methods, finds that optimal training uses about 20 tokens per parameter for fixed FLOPs, correcting Kaplan's earlier underestimate caused by cosine-schedule issues. [52:23], [52:31]
- **MuP stabilizes LR across scales**: MuP reparameterizes models by scaling initialization variance and layer outputs based on width so the optimal learning rate is stable and transfers directly from small to large scales without retuning. [40:43], [41:04]
Topics Covered
- Scale Small Models to Frontier Labs
- Scaling Laws Born 1993 Bell Labs
- Power Laws from Nonparametric Dimension
- Transformers Beat LSTMs at Scale
- Chinchilla 20:1 Ratio Shifts Inference
Full Transcript
I'm going to talk a little bit about scaling laws. Originally, I think we were going to talk about inference, but I'll take a few minutes to start on scaling laws and then we'll figure out where to go from there.

Okay. So, to begin with, I want you to put yourself into the following scenario. You have a very rich friend, and he or she has given you, let's say, 100,000 H100s for a month, and you have to build the best open-source LM that you can. That's a somewhat hard task, and we've given you some of the tools you need to make progress on it: you can put together your infra team and your systems people and build a distributed training framework, in the assignment after that you'll put together a great pre-training dataset, and you already know all about architectures and so on. So you have all the pieces, and we can turn the crank and run the big model.

In the first couple of lectures we talked about all the other decisions you might make along this journey: what's the architecture, what are the hyperparameters, how are you going to do all these things? In some ways the answer I gave you in those early lectures was just to pick what other people have done, just follow Llama or whatever other models. But in a way that's a very boring answer, because it doesn't let you push the frontier. If you're in a big frontier lab and you're going to build the best model, you don't want to just copy other people, you want to innovate. So how do we innovate and arrive at these optimized solutions in the first place? That's going to be the point of scaling laws.

What we want to do is build simple predictive laws for the behavior of language models. Scaling laws are this whole idea of being able to take small models, scale them up, and use that to improve your engineering. One way of thinking about this: the old, unpleasant way of doing deep learning is to train a bunch of big models and tune your hyperparameters so that your big models are good. That costs tons and tons of compute; you can't really do it easily. The new optimism, if you've been following these developments on scaling, is that we're going to train a bunch of small models, learn a lot of things from those small models, and then extrapolate them back up to bigger models. We take our smallest models, at the left side of this compute scale, learn a lot about what to do, and then nail it in one go when we build the big model.
The first place I want to start is the history and background of scaling laws. I want to contextualize this because when people talk about scaling laws, it's often done in very messianic, AGI-flavored terms: scaling laws just tell you that these amazing things are log-linear forever and we will achieve superintelligence or something. But I think scaling laws are actually much more grounded and have a lot of interesting history. So I'm going to start there, to try to convince you that scaling laws aren't necessarily just fitting lines on log plots, although that is a very big part of what we're going to do. And then, in fairly easy steps, I'm going to try to convince you that, at least for data, scaling laws are a very natural thing to think about and expect.

As a person brought up in statistical machine learning, my starting point is going to be statistical machine learning. What are scaling laws? In some ways, scaling laws are telling us that as we increase the amount of data, or change the model size, we expect certain behaviors out of the model. If you go back to something like machine learning 101, and if you remember your VC dimensions and Rademacher complexities and so on, in some ways that's the theory version of exactly this. On the top I have a generalization bound for the excess risk of learning among a finite set of k hypotheses, and we see that it should scale as one over the square root of the sample size m. In some ways that's a theoretical version of a scaling law, where we're making predictions about how fast our error should decay as a function of the sample size. On the bottom we might have something a little more exotic: if we're doing generative modeling and our generative model is a really flexible nonparametric class, what we might do instead is fit some sort of smooth density, and in that case the prediction is that the L2 error of estimating the density is upper bounded by something polynomial, n to the minus beta over two beta plus one. This is what some people call nonparametric rates. So theorists have been thinking for a very long time about how sample size, especially, should relate to error. This is a very classic problem that people have thought about in machine learning theory. But these are upper bounds, not actual realized loss values. Scaling laws are, in some sense, the leap from thinking about the theoretical side of how data and model size should relate to performance to the empirical side of saying: actually, our bounds are bad, but maybe we can fit these things empirically.
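To make the two slide formulas concrete, here is a hedged reconstruction of the kind of bounds being referenced (my notation; the exact constants and regularity conditions live on the slide):

```latex
% Finite hypothesis class: with k hypotheses and m i.i.d. samples, the excess
% risk of empirical risk minimization satisfies, with high probability,
\text{excess risk} \;\lesssim\; \sqrt{\frac{\log k}{m}} .

% Nonparametric density estimation: for a \beta-smooth density estimated from
% n samples, the L_2 error decays polynomially,
\| \hat{p}_n - p \|_{2} \;\lesssim\; n^{-\beta/(2\beta+1)} .
```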
And this is a fun trivia fact, or arguably a trivia fact: what is the first scaling law paper? Not many papers cite this one, but I think probably the right answer is a paper from NeurIPS 1993, from Bell Labs. You might recognize some of these names: these are theorists, some of the people who have done really classic work in machine learning theory, like Vapnik and Corinna Cortes and others. I've taken excerpts, because I was reading this paper while preparing this lecture, and it struck me how ahead of its time this paper was in many ways. It says that training classifiers on large databases is very computationally demanding, and we need to figure out which ones are good before actually training them, so we're going to propose a new predictive method that predicts how good a model is going to be without actually training the whole thing. That sounds a lot like scaling laws, and you'll see this later. They have a functional form that's basically: the test error of a model is expressible as some irreducible error plus a polynomially decaying term. And you're like, huh, that looks a lot like a modern scaling law. They even do the thing where they train a bunch of small models, fit their curves, and show that they can accurately predict the behavior of the model further out. So, as with many things, scaling laws were partially thought about at Bell Labs way back when.
And of course there are others who have thought about related ideas in scaling, not just the scaling laws themselves but really the modern mindset of thinking about scaling. There's another paper that often gets mentioned in the history of scaling laws, by Banko and Brill, which studied how the performance of a certain kind of NLP system scales with the amount of data. They have what looks very much like a modern scaling law: log-scale data on the x-axis, performance on the y-axis. They're basically arguing that you can get really dramatic performance improvements just by scaling up data, that it's very predictable, and that maybe we should consider the trade-off between spending time and money on algorithm development versus just collecting more data. And you're like, huh, that sounds a lot like what a lot of this pre-training work is thinking about.

And then finally, one of the things people have thought about, recently and in the past, is: is this thing really predictable, and what are the right functional forms? As early as 2012, people were asking whether these things are actually predictable: are power laws, with this or that exponent, really the right functional forms for predicting the behavior of models? All of this, just to remind you, is thinking about the behavior of the model, its capabilities, on the y-axis, as a function of the amount of data you have on the x-axis. That's the relationship that has been classically studied, what you might call data scaling, in all these cases.

If you're interested in the earliest large-scale modern neural scaling law paper, that would probably be Hestness et al. in 2017; I believe they were at Baidu when they did this work. They showed, for a range of tasks, machine translation, speech, and I think some vision tasks, that error rates fall as a power law. They even have this nice plot that I really like to refer to when people discuss scaling laws: your expectation should be that there are three different regions in the behavior of a model. Initially you start out at best-guess performance, you then enter a region where you're predictably scaling the model, that's the power-law region, and then there's another asymptotic region where you're approaching the irreducible error of your model class.

I'll also highlight that in the last few years there's been a lot of talk of new phenomena, things like emergent capabilities, or scaling compute being a new thing, or systems being really important. But had you been reading Hestness et al. in 2018 carefully, you would have seen essentially all of these things. They say it's really hard to do predictions with a scaling law when models are at random performance, because suddenly you can leave the random region. They talk about computational limits: if we can scale, it means scaling by compute is really important. And they even say things like maybe we should do quantization, because if we have predictable scaling, then we should be willing to pay for model accuracy with compute. These are all very modern ideas that a lot of the early scaling law papers understood fairly intuitively, because once you see these plots, you see that with predictable resource investment you get predictable capability improvements. So that's, in some sense, the core context, not quite history, that has really shaped scaling laws.

All right, any questions so far on the context? This is mainly just data scaling, but I wanted to make sure we go over it carefully.
Yes, it seems pretty natural to expect scaling; I was wondering, are there cases where there isn't scaling?

Yeah. So the question was: it's natural, or maybe arguably natural, to expect scaling; are there cases where we don't get scaling, or where we get different kinds of scaling? One way of thinking about this is that if you're measuring training loss, or held-out versions of the training loss, then scaling is very natural: all of classical statistical theory says things should converge, and when they converge they will eventually get better, at least in some very asymptotic sense. But we do see non-scaling behavior. There was a really interesting competition a few years back called the inverse scaling prize, where they were looking for things that scale inversely as models get better. A lot of these are very niche things; for example, models tend to copy better as they get stronger, so if you want to suppress copying behavior, that becomes really hard for really strong models. But one thing that ties a lot of this together is that if you go really far out of distribution, where the behavior is not well specified by the data, then you can get all sorts of behaviors: no scaling at all, or inverse scaling, or what have you. So in some sense you can think of this as an extension of the classic deep learning robustness problems. Cool. Okay.
So now I'm going to talk about the scaling behaviors of LLMs, essentially going through several kinds of empirical results. I'm going to walk you through data scaling in particular, with some examples, just to convince you that this is a very natural object to expect, and then we'll talk about model size, which is a different kind of thing.

Scaling laws, I think, are fairly well established, and they seem to appear very often, in many variables. You see scaling in compute on the x-axis; these are all taken from Kaplan's scaling law paper, which I'll refer to extensively in this lecture, so the x-axis here is log compute and the y-axis is log test loss. On the right you see similar kinds of scaling both for dataset size, the amount of data, and for parameters. One subtlety I'll mention as I talk through this: when we scale things like dataset size or parameters, we're always assuming that the other variable, in this case the model size if you're scaling dataset size, is much, much bigger, so you can't saturate it. Obviously, if you have way more data than parameters, eventually you're going to asymptote out, so in all of these plots we're trying to avoid the asymptotic regime. These laws also hold in pretty non-standard settings: they hold for downstream tasks, and they hold out of distribution, which is what's being shown here from the Kaplan paper. So in some ways power-law relationships seem to appear more often than we might initially expect, especially for these OOD or other variables.

I want to talk through data scaling laws first, because I think they're the most intuitive; at the very least, the theory for them is fairly clear. To be precise, when I say something like data scaling, what I mean is just some simple formula that maps the dataset size, which I'm going to refer to as n, to our excess error, where excess error is the error beyond the irreducible regime. If you recall the figure I referred to in Hestness et al., what we expect are monotonic, logistic-looking curves, and our interest is primarily going to be in the power-law region up to the irreducible-error region. Of course it's very interesting to also ask what happens in the small-data region as we leave random guessing, but that's much, much harder to reason about. Whereas for this right tail, I can hopefully convince you that power-law scaling is actually a very natural thing to expect.
Okay. So the first empirical observation that we have, and this is the thing I'm going to convince you is natural, is that when we plot dataset size on the x-axis and test loss on the y-axis, on a log-log plot model performance is linear. You might call this scale-free, or you might call it a power law; these are more physics-oriented terms. This was established by many people, but you might refer to Kaplan to see many examples of it. As the previous question brought up, we kind of expect error to be monotone: we train on more data, the error goes down, fairly obvious. The part that is less obvious is the precise functional form of this scaling. When I say it's a power law, I mean it's linear in log-log space. And what is the implication of that? If something is linear in log-log space, that means there's a polynomial relationship between your x-axis and your y-axis.
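Just to spell out that equivalence in symbols (my notation, not the slide's):

```latex
\log L(n) \;=\; -\alpha \log n + c
\quad\Longleftrightarrow\quad
L(n) \;=\; C\, n^{-\alpha}, \qquad C = e^{c}.
```

So a straight line of slope -α on a log-log plot is exactly power-law (polynomial) decay in n.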
So why is polynomial decay natural? I'm going to walk you through two examples, and both of them are going to result in some fairly natural polynomial decay. I'm going to start with the simplest possible example; this is just stats 101 rather than machine learning 101. What I want to do is estimate the mean of a dataset. Estimating the mean is a task of estimating a parameter, and I can ask what the scaling law is: what's the error of my mean-estimation task as a function of data? I can write that down. My input comes from a Gaussian, and the task is to estimate the average; I've written those out in the blue box above. What's the error? By very standard arguments, the average is also distributed as a Gaussian, with variance sigma squared over n, so my estimation error, the expected squared error of my estimate, is sigma squared over n. If you look at this, it's polynomial in n. And just to really drive the point home, take the log of both sides: log of the error on the left, and you get exactly log of error equals negative log n plus 2 log sigma. So this is exactly the kind of thing we expect, and we'd expect a slope of (negative) one if we were to fit a scaling law for mean estimation.
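Here is a minimal simulation of exactly this setup (my own toy code, not from the lecture): estimate the mean of Gaussian samples at several values of n and fit the log-log slope, which should come out close to -1, with an intercept near 2 log sigma.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, n_trials = 2.0, 1000
ns = np.array([100, 300, 1000, 3000, 10000])

# Expected squared error of the sample mean at each n, averaged over trials.
errs = []
for n in ns:
    x = rng.normal(0.0, sigma, size=(n_trials, n))
    errs.append(np.mean(x.mean(axis=1) ** 2))  # true mean is 0

# Fit log(error) = slope * log(n) + intercept on the log-log plot.
slope, intercept = np.polyfit(np.log(ns), np.log(errs), 1)
print(f"fitted slope     = {slope:.3f}  (theory: -1)")
print(f"fitted intercept = {intercept:.3f}  (theory: {2 * np.log(sigma):.3f})")
```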
So now, equipped with this new knowledge, you might say: all right, I'm going to go around and look at what the rates are for estimating different things, and that will tell me what I should expect for data scaling. You might expect one over n, or one over the square root of n for agnostic learning, and so on and so forth. So we should expect to see some pretty nice round numbers for the slope on these log plots, something like 1 or 0.5. What do we actually find empirically when we look across these papers? Just to call them out: in Hestness et al., for machine translation we see 0.13, for speech we see 0.3, and for language modeling we see an exponent of 0.095. Those are all much, much slower than the one over n or one over square root of n rates you might expect when you're just fitting simple functions. So why might this be? Okay, this will be the last math slide of this lecture, and then we can go to just fitting lines on log-log plots for the rest of the time, but this will hopefully drive home why we might see these particular slopes. We know that neural nets aren't just estimating the mean, and they're not even fitting a linear regression; they can fit arbitrary functions.
So let's turn that into an example and work through it. My input is x1 through xn; I have n samples, and I'm going to place them uniformly in the 2D unit box. I want to estimate some arbitrary regression function y = f(x), and I'll assume f is smooth and so on; if you really want to be precise, there are some regularity conditions here. A simple approach to estimating a regression function f is just to cut the 2D space up into small boxes, and within each box measure the average of the y values; a very simple nonparametric regressor is to just cut the space up and estimate within each cell. Now, informally, if I pick square root of n boxes, each box is going to get about square root of n samples, and my error is going to be one over the square root of n. If you follow this logic through in more dimensions, you'll see that in d dimensions the error goes as n to the minus one over d, and so my overall scaling, if I were to take log-log plots of the whole thing, is a slope of negative one over d.

So why did I walk you through this example? Because if you have flexible function classes, what people call nonparametric function classes, you expect dimension dependence, and therefore the slope of the scaling law actually moves much more slowly. In some sense, the slope is telling you almost precisely the intrinsic dimensionality, or the ease of learning, of this task. And people have argued this more formally, or more literally: there have been several theory/empirical papers arguing that the reason we get these exotic or non-standard rates of learning is that they're closely connected to the intrinsic dimensionality of the data. The plots of these predictions, the dashed lines, and these purple circles are somewhat close, although you don't want to read too much into this, because estimating intrinsic dimension is an extremely difficult problem, about as difficult as modeling the data overall. Okay.
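One way to make the box-counting argument slightly more precise (my sketch; the lecture keeps it informal):

```latex
% Partition [0,1]^d into boxes of side h, i.e. h^{-d} boxes, so each box holds
% about n h^{d} samples.
%   variance of a per-box average:                       \approx 1 / (n h^{d})
%   squared bias from f varying inside a box (smooth f): \approx h^{2}
% Balancing the two terms gives h \approx n^{-1/(d+2)}, and hence
%   \text{excess error} \;\approx\; n^{-2/(d+2)} .
% In 2D this recovers the n^{-1/2} from the sqrt(n)-boxes argument above, and
% in general the exponent shrinks like 1/d: higher intrinsic dimension means a
% flatter data-scaling slope, which is the point being made here.
```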
I guess this is related to the point you made at the end, but how do you generate data that has an underlying intrinsic dimension at all, from a simulation perspective?

Yeah. So if you just want to generate such data, that's actually not too hard: you could write down a function that takes in, say, five variables, and as long as all five of those variables don't cancel each other out, that's a five-dimensional surface; you can add a little bit of noise and you're good to go. The difficulty here is that they're actually doing things like training on CIFAR and then trying to estimate the intrinsic dimensionality of CIFAR, and that's a much harder task. Okay.
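A tiny sketch of the "write down a function of five variables" idea (purely illustrative; the dimensions, the random embedding, and the noise levels are my choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_intrinsic, d_ambient = 10_000, 5, 50

# Latent coordinates: the data really only has 5 degrees of freedom.
z = rng.random((n, d_intrinsic))

# Embed the 5-dimensional surface into a 50-dimensional ambient space with a
# random linear map, then add a little observation noise.
A = rng.standard_normal((d_intrinsic, d_ambient))
x = z @ A + 0.01 * rng.standard_normal((n, d_ambient))

# A smooth target that depends on all five latent variables (none cancel out).
y = np.sin(2 * np.pi * z).sum(axis=1) + 0.1 * rng.standard_normal(n)

# (x, y) looks 50-dimensional, but its intrinsic dimension is 5, which is what
# should govern the slope of a data scaling law fit on this regression task.
```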
Data scaling laws are also quite useful in practice. I was approaching this from a "let me explain scaling laws to you" perspective, but you can actually use scaling laws to do many interesting things: you can make engineering decisions of various kinds using data scaling laws, and people do in fact do this. For example, you might ask how dataset composition affects performance, not just dataset size. If you vary the test set, Kaplan et al. have a really nice figure showing that data composition only affects the offset, not the slope. What that would mean is that if you want to pick a really good dataset, you don't necessarily have to train your models at a huge scale; you can scale them down and do your data selection experiments on much smaller models. And as we mix different kinds of data, we might expect certain kinds of shapes, and you can use regression and other techniques to try to figure out, for example, optimal data mixing using scaling laws; people have written several papers on this topic. Although, as with all data selection research, a lot of this seems fairly tricky to execute reliably.

There are other interesting questions you might ask. There's a lot of discussion these days about whether we're running out of data on the internet, and once you start asking those questions, the other interesting and important question is: can we just keep training on the same data we have, and what are the diminishing returns of that? There's interesting work extending scaling laws to multi-epoch training, basically arguing that there's an effective sample size, and that after about four epochs you have rapidly diminishing returns as you repeat more and more data. By modifying the usual scaling law, you can get a version with an amount of effective data and unique tokens that diminishes as you increase the amount of repetition.
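As a loose illustration of the "effective data" idea, here is a toy saturating model I made up for intuition; it is not the parameterization from the multi-epoch scaling paper, just a sketch of repeated passes being worth less and less:

```python
import numpy as np

def effective_tokens(unique_tokens: float, epochs: int, half_life: float = 4.0) -> float:
    """Toy model: each additional pass over the data is worth less than the last,
    with the value of a repeat decaying geometrically. `half_life` controls how
    quickly repeats stop helping. Purely illustrative, not a fitted law."""
    decay = 0.5 ** (1.0 / half_life)
    # Geometric series over `epochs` passes of the unique data.
    return unique_tokens * (1 - decay ** epochs) / (1 - decay)

for e in (1, 2, 4, 8, 16):
    print(f"{e:2d} epochs -> {effective_tokens(100e9, e) / 1e9:6.0f}B effective tokens")
```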
Finally, I think one interesting combination of these two ideas is data selection in the large-data regime. Imagine you're going to be training on trillions and trillions of tokens right now. What would be better: to repeat high-quality sources, like Wikipedia and perhaps your secret pirated books, ten times, or to include new data? The fact that you can either repeat data or include more data means you now have multiple axes on which to optimize your data mixture. There's also been some interesting data scaling work, this one from CMU folks, on essentially trading off repeating data versus picking lower-quality data that's new. All of this is really a natural extension of what I already taught you: if you assume there's a predictive power-law relationship, and that this relationship holds on a per-mixture basis, then you can fit these scaling law extrapolations and get an estimate of how good your data is going to be at scale.

So that's the starting point, which is data scaling. Hopefully I've convinced you at this point, both empirically and conceptually, that it's natural to have log-log linear relationships between data and error. This relationship seems to hold very robustly across domains and across different kinds of models, and you can have a fairly clean theoretical understanding of what's happening. And once you have this, you can use it for all sorts of purposes, like picking optimal data mixtures, or whatever else. Okay.
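To make the fit-and-extrapolate workflow concrete, here is a minimal sketch (the numbers are made up; in practice each (dataset size, loss) pair comes from one of your small-scale runs, and for simplicity this ignores the irreducible-error term):

```python
import numpy as np

# Hypothetical (dataset size in tokens, held-out loss) pairs from small runs.
n_tokens = np.array([1e8, 3e8, 1e9, 3e9, 1e10])
losses   = np.array([4.10, 3.72, 3.38, 3.07, 2.79])

# Fit log(loss) = slope * log(n) + intercept, i.e. loss ~ C * n**slope.
slope, intercept = np.polyfit(np.log(n_tokens), np.log(losses), 1)
C = np.exp(intercept)
print(f"loss ~ {C:.1f} * n^({slope:.3f})")

# Extrapolate to much larger data budgets, which is the whole point.
for n in (1e11, 1e12):
    print(f"predicted loss at {n:.0e} tokens: {C * n ** slope:.2f}")
```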
Yes? How was the model size picked for the data scaling plots?

Yeah. So, as I was saying back on this earlier slide, when we think about data scaling, the model is always picked to be really, really large, so the data is not saturating your model; you want to avoid being in the irreducible-error regime. The model is always picked to be large enough that you're in the power-law region whenever you're only varying data.

So is it just one model size for all of the points, one really big model, or is each point a different-sized model?

Yeah, for this plot in particular, it's one big model size. When you're looking at, for example, compute scaling on this axis, then data and model scale jointly at some pre-ordained ratio. Cool. Any other questions? Good. Okay. Excellent.
All right. So now I think we get to move from data scaling to, in my opinion, slightly more mysterious kinds of scaling: we're going to talk about model scaling next. I think this is a more practical, engineering-oriented set of questions that we're now going to try to answer. You're in charge of building and shipping a really large language model, and there are a lot of interesting ideas out there: you could train the latest state space model, you could train a transformer, you could use Adam, you could use SGD. People invent all sorts of new tricks; which ones are worth scaling up and which ones are not? You could also take your limited compute resources and spend them on different things: you can train models for longer, or you can train bigger models; for a given FLOP budget you can trade between the two. You could also go and collect more data versus getting more GPUs. There are a lot of different things you can do, and scaling laws give you a pretty simple procedure to answer all these questions. So I'll go through the classic Kaplan scaling law paper. If you're interested in these topics, I encourage you to read it: it's a gold mine of all these kinds of observations. Some of it is old, but I think it's still unmatched in the thoroughness of all the things it studied, in a fairly nice unified setting.
Architecture-wise, you might start by asking about transformers versus LSTMs: which one's better? The brute-force way would be to scale LSTMs up to GPT-3 level and then figure out whether they're good or not. The scaling law way is much simpler: you basically train a bunch of LSTMs and transformers across many different compute levels, and then you see what happens as you scale them up. And I think the trends here are fairly clear: no matter how many layers you have on your LSTMs, there's a pretty big constant-factor gap between transformers and LSTMs, and remember this is on a log scale. So this is saying something like, and I don't know what the exact numbers are, but imagine it's 15 times less efficient: no matter where you are on this plot, the LSTM is, let's say, 15 times less compute-efficient than a transformer. So there's a constant-factor compute penalty to using LSTMs, at least in this plot.

You could zoom out and say, well, there are a lot more architectures; which ones are really good and worth doing? Some of the classic papers, this one is by Tay and others at Google, have done exactly this kind of scaling work: they took a bunch of architectures, shown on the right here, and basically scaled them up. The x-axis is the amount of compute, the red line is each architecture, and the green line is the transformer baseline, and they ask whether any of these alternative architectures can match or out-scale the transformer. What do they end up with? Actually, the only things that seem to really strongly and reliably beat the transformer are gated linear units and mixture of experts. And wouldn't you know it, that's exactly the kind of stuff people are doing today. So this is the scaling law version of that same idea: how would you have come to the conclusion that we should be doing switch transformers and GLUs, and not, for example, the Performer? The scaling law provides some clear evidence for why you might want to do that.
Optimizer choice, I think, follows a similar pattern. This one's from Hestness et al.: they compare SGD and Adam, and they find, very similar to before, a constant-factor gap, in this case in dataset size, but of course that translates to compute, in the effectiveness of Adam versus SGD. RHN here is recurrent highway networks; you can ignore the details. The point is how you would do this analysis, rather than the specific results shown here.

In the beginning I also said something about depth versus width, what the aspect ratios should be; that was one of the hyperparameter topics we talked about, and we see a similar kind of analysis, but in scaling law form, from Kaplan. This one is intriguing, to me at least, because we might think that deeper models get dramatically better, that there's clear separation between the numbers of layers, but we see, at least here, that there's actually a lot of slop: one layer is really bad, but a lot of the other layer choices remain pretty stable. Hopefully this is reminiscent of the slide I showed back in the architecture lecture, where I said the aspect ratio, the ratio of width to depth, roughly something like 4 to 16 or so, was a pretty natural number, but there's a really wide basin in which you're approximately optimal, and the scaling law analysis also backs that up.

One important subtlety that I do want to point out, and this one bites people every now and then, is that not all parameters are equal. Often you want to do parameter scaling analyses, but if you count embedding parameters as part of your model, you get a pretty different scaling law: you get this kind of weird-looking thing that slightly bends over here, whereas if you only consider the non-embedding parameters, you see the much cleaner result I showed you before. So embedding-layer parameters don't really behave the same, and they don't show the same kind of log-linear scaling as the non-embedding parameters when you account for them. There's related work saying not all parameters are the same, for example recent papers on scaling mixtures of experts, where they're also trying to figure out what it means to be a parameter when parameters are so sparsely activated; in those papers they try to derive things like an equivalent number of dense parameters, in order to normalize the parameter count.
I showed you this plot earlier, during hyperparameter selection, but hopefully now you see the full context, not just the original hyperparameter-choice question. In many cases, and let me go back to here, what we'll often see are scaling law curves that look like the following: the slopes of the curves remain very similar, they're non-crossing, and there are constant-factor offsets between the curves. Whenever this is true, what you can do is take a slice at a particular level of compute, or a particular set of hyperparameters, analyze the hyperparameter trade-offs very carefully, and be reasonably safe in scaling that up. When you go to Kaplan's paper, you'll see exactly these kinds of analyses being done. Especially the center one, the aspect-ratio plot, is definitely worth looking at: they're not just scaling models up and down, they're actually taking different slices, so different-sized models, 50 million, 270 million, 1.5 billion, and looking at how the aspect ratio changes the loss. And they see that the shape of the curve, not just the scaling slopes, remains similar. This means you can pick an aspect ratio between 10 and 100, and anything in between will work fine at all of these different scales.

This is, I think, important to think about. Initially, when you're trained in deep learning model training, you think about hyperparameter tuning, but you want to be scale-aware in how you're tuning your hyperparameters, and that's a really big difference in mindset between the scaling-law-style approach and what you may have been trained to do, or naturally think about, which is just tuning these models at a small scale. The same is being done for the feed-forward dimension ratio and for the attention head dimension: you vary various aspects of scale and try to see whether the minima remain similar.
Okay. Another important thing: next lecture, actually maybe not next lecture but the one after, I'm going to talk about practical case studies of how people have scaled up models, and we'll actually see that batch size and learning rate are two really tricky things you have to deal with carefully when you scale models up. When you scale models up, you may have to think about the optimal learning rate being different across model scales, and if you're doing that, then maybe the optimal batch size will end up varying as well, because those two are often coupled. So we need to think about the right way of scaling batch size, and how batch size interacts with scale and also with learning rates. I'll talk about those for the next couple of slides.

So, batch size. Hopefully you remember from the systems lecture that it has diminishing returns past a certain point. Up until a certain point, when the batch size is smaller than the noise scale, we're on the left-hand side here, increasing the batch size is almost equivalent to taking more gradient steps; roughly speaking, if I double my batch size, it's as good as taking two gradient steps. That's a really, really good place to be, because you get the systems benefit of being able to parallelize across the batch, while having the optimization efficiency of taking two steps. But past a certain point you get ineffective scaling, where your noise scale and your batch size are about the same, and the additional samples in your batch aren't reducing useful noise anymore: you're getting dominated by the curvature, the bias term, so to speak, of your optimization landscape.
One really useful thing to think about, a useful analysis object, is this notion of a critical batch size. You can think of the critical batch size as the threshold point where we go from near-perfect scaling to strong diminishing returns. You can analyze this in theory, and the OpenAI papers on critical batch sizes do this, but you can also analyze it empirically, and this is another thing that's been studied in the scaling law kind of way: you can estimate the point at which progress slows, so you can empirically estimate where the critical batch size trade-off point is. You can also train bigger and better models, and one really interesting thing is that as you try to improve the loss, so you're moving toward the left here, making the loss better and better, your critical batch size ends up getting bigger: the smaller the loss target, the bigger the batch size you can use. One of the things this leads to is that, for example, if you look at the Llama 3 training report, you'll see that they increase the batch size after a certain point, or more generally increase the batch size as they train, because as your loss target gets smaller, your batch size can in turn get bigger.

So as we increase both compute and model size, what's the right thing to do? Once again we can do a scaling analysis; this is from Kaplan. You can try to figure out, as we increase the amount of compute, what the optimal batch size is, and what we see is that as we increase compute, we can actually have reasonable parallelism: the number of total steps can stay about the same, at least within this compute range, while the batches get bigger and bigger; and if you instead fix the batch size, of course the number of steps goes up and up. So this is good news, hopefully, for data-parallel processing.

That's the batch size story. The things you should maybe remember, because I think critical batch size is kind of a messy concept, are: first, there's a diminishing-returns point, the critical batch size; and second, it does seem to follow a pretty predictable scaling, often as a function of your target loss. Given that, you can figure out the right trade-offs to make between systems efficiency and your optimization progress.
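For reference, the OpenAI large-batch analysis alluded to here defines a gradient noise scale that predicts roughly where this diminishing-returns point sits; a simplified version, written from memory and so worth double-checking against the paper, looks like:

```latex
% Simplified gradient noise scale, with per-example gradient covariance \Sigma
% and true gradient G:
B_{\mathrm{simple}} \;=\; \frac{\operatorname{tr}(\Sigma)}{\lVert G \rVert^{2}} .
% Batches well below this size behave like taking proportionally more steps;
% batches well above it mostly average out noise that is no longer the
% bottleneck, which is the diminishing-returns regime described above. As the
% loss falls, the gradient norm shrinks, so this quantity (and hence the useful
% batch size) tends to grow over training.
```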
As I said before, the other aspect of this is that you've got your batch size, and then you've got your learning rate, and those two are fairly closely linked with each other. I'm going to talk about muP at much more extensive length in the next part of the scaling lecture, but this is a really important broader idea. You could do one of two things, and I think this figure lets me talk about both of them. Let's look at the left plot first, what's labeled "standard practice". When you train a transformer, what you're basically going to see is something like this left plot: the optimal learning rate is going to be at different points, and the wider the model, as you increase your model size and your MLPs get wider and wider, the smaller the optimal learning rate. As you make your model smaller and smaller, your losses of course go up, because your model is less expressive, but the optimal learning rate also goes up. Often people quote a rule of thumb that one over the width is the right rate at which to scale the learning rate. More advanced people will actually take these curves, find the minima, and then fit a scaling law to the optimal learning rate. There we can see that this is a predictable decay in learning rate, and maybe we can fit a scaling law; I'll talk about this more in the next set of lectures.
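A minimal sketch of the "fit a scaling law to the optimal learning rate" approach (the widths and learning rates below are made up; in practice each pair would come from a sweep at that width):

```python
import numpy as np

# Hypothetical (hidden width, best learning rate found by a sweep) pairs.
widths   = np.array([256, 512, 1024, 2048])
best_lrs = np.array([3e-3, 1.6e-3, 7e-4, 3.5e-4])

# Fit log(lr*) = slope * log(width) + intercept.
slope, intercept = np.polyfit(np.log(widths), np.log(best_lrs), 1)
print(f"lr* ~ {np.exp(intercept):.3g} * width^({slope:.2f})")  # slope near -1 matches the 1/width rule of thumb

# Extrapolate to the width you actually plan to train at.
target_width = 8192
print(f"predicted lr* at width {target_width}: {np.exp(intercept) * target_width ** slope:.2e}")
```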
But an alternative that I think many people have started to adopt, and that I think is a really interesting thing to think about, is that you can actually reparameterize the model. In particular, you can do things like scale the initialization, or scale the learning rates of different layers, based on the width; you can scale the variance of the initialization based on the width of the model; and you can multiply the outputs in the forward pass of different layers of the model. If you do this in a way that depends on the width of the model, you end up with a parameterization of the model whose optimal learning rate is supposed to be more stable, or, at least in the original paper, exactly stable across scale. So you tune your learning rate once, and you don't have to do anything else: that optimum transfers directly. Actually, you tune it here, on the smallest model, and it transfers directly to the very largest scale. This is the idea called muP; the original paper I'm showing you here is the muP one, and there have been other variants. Meta, with the release of Llama 4, claims to have invented something called MetaP, which I'm not quite sure what it is yet. So you can see that a lot of labs are thinking about this, because if you're going to have to rely on predicting what the optimal learning rate is, then you have to do all sorts of tricky scaling law fits, and maybe that's very unstable. But if you can reparameterize your model, then maybe you don't have to do any retuning at all. Of course, that's way more optimistic than what happens in practice, but hopefully this gives you a sense of why this is really cool and really interesting: scale-aware initializations.
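To make "reparameterize based on width" concrete, here is a simplified sketch in the spirit of muP. This is my illustrative approximation of the idea, not the exact scaling rules from the muP paper (which you should consult before using anything like this for real); the layer sizes, base width, and base learning rate are arbitrary.

```python
import torch
import torch.nn as nn

def make_width_aware_mlp(width: int, base_width: int = 256, base_lr: float = 1e-2):
    """Width-aware parameterization sketch: init variance shrinks with fan-in,
    the readout is downscaled by the width multiplier, and the learning rate of
    matrix-like layers shrinks as the model gets wider, so that an LR tuned at
    base_width can (approximately) transfer to larger widths."""
    mult = width / base_width
    w_in, w_hidden, w_out = nn.Linear(128, width), nn.Linear(width, width), nn.Linear(width, 10)

    # Initialization: variance ~ 1 / fan_in for every weight matrix.
    for layer in (w_in, w_hidden, w_out):
        nn.init.normal_(layer.weight, std=layer.in_features ** -0.5)
        nn.init.zeros_(layer.bias)

    class Net(nn.Module):
        def __init__(self):
            super().__init__()
            self.w_in, self.w_hidden, self.w_out = w_in, w_hidden, w_out

        def forward(self, x):
            h = torch.relu(self.w_in(x))
            h = torch.relu(self.w_hidden(h))
            # Downscale the readout so its output magnitude stays comparable
            # as the model gets wider.
            return self.w_out(h) / mult

    # Per-layer learning rates: the width x width (and readout) weights get a
    # learning rate that shrinks with the width multiplier.
    param_groups = [
        {"params": w_in.parameters(),     "lr": base_lr},
        {"params": w_hidden.parameters(), "lr": base_lr / mult},
        {"params": w_out.parameters(),    "lr": base_lr / mult},
    ]
    return Net(), torch.optim.Adam(param_groups)

model, opt = make_width_aware_mlp(width=1024)
```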
Cool. Any questions up until this point? I feel like I've gone through a whole bunch of scaling, architecture, and hyperparameter material, so maybe I'll stop for a moment here in case anyone has questions.

Yeah, I didn't really get the intuition behind why, if we want a lower loss target, we want to increase the batch size.

Yeah. So, what you want to do is this: the smaller the loss target, the more sensitive things are. In the same way that you're going to be lowering your learning rate, you also want to increase your batch size in order to denoise; the more sensitive the target you have, the more precise your gradients potentially have to be. One way of thinking about it is that as you're cooling down and your learning rate is going down, maybe your batch size should increase as well, because the learning rate and the batch size affect each other inversely.

Is this batch size thing only true for NLP, or also for computer vision?

I'm not sure. There is a related OpenAI scaling paper for multimodal models, but I don't remember what it says about critical batch size for those.

Yes? The noise scale? Yeah, the noise scale, at least in this figure, if that's what you're asking about, comes from a kind of theoretical analysis: it's basically about the gradient noise you expect from random sampling within the batch. So it's not a precisely, empirically measured quantity here.
Okay. One thing I'll caution, and I think this is a big caution for a lot of scaling law work, is that scaling laws are very nicely behaved for log losses. We train on next-token-prediction cross-entropies, and when your scaling law targets are those cross-entropies, it's very easy and works very well. But if you're trying to do downstream tasks, if you're trying to scale directly on benchmarks, the behavior is much less predictable. Here on the left side, this is from YK's paper comparing lots of different hyperparameters and architectures. You see that the number of parameters, which in this case is a surrogate for compute, and the negative log perplexity are very nicely linearly correlated, and what this is basically saying is that it doesn't matter what your depth or width or precise hyperparameter settings are; the only thing that really matters is your total compute expenditure. That's a very simple and nice story. But then you take these models, and this was back in 2023, so people were still doing SuperGLUE accuracy, and you ask what the downstream performance of these models is, and now we don't see a very nice linear relationship anymore; we see this totally different picture where certain models are much better than others and certain architectures are better than others. So you might not expect exactly this kind of scaling property downstream. And we've seen variants of this story play out in many different places. If you've followed the literature on state space models, that's one example: state space models show really nice, predictable scaling like the plots on the left, but for certain capabilities, like in-context learning or QA, people have shown that these models may do less well. So it's important not to take this perplexity scaling as the same thing as downstream scaling, and you want to be a little bit cautious whenever you're doing these kinds of analyses.
Okay. So maybe this is not surprising to some of you, but hopefully it's surprising and convincing: if we want to make lots of engineering decisions, like hyperparameter choices and architecture decisions, we can do a lot of that before training the big model. We can train models at small scale, across several orders of magnitude of compute, and then use that to try to predict the behavior of larger models. The scaling-law-based design procedure is pretty simple. You train a few smaller models, and these smaller models should span a couple orders of magnitude of compute. You establish a scaling law of some kind, that is, you check that, at least on the models you trained, there's a clear log-log linear relationship. And then, based on this prediction, you can set optimal hyperparameters. In many cases, in fact, these scaling laws won't really vary too much; their slopes will actually be the same. In which case the corollary is that you can just train a few smaller models, and the results of those small models will transfer surprisingly well to the larger models, in many of these cases but not all of them, learning rate being an important exception, for example. Okay, so that's how you do things like hyperparameter selection and architecture selection.
Now I want to talk about one very important use of scaling laws, one that's had an outsized influence on how we pick model sizes and how we think about data efficiency. Back in the earlier days of scaling these models up, there was a core question you needed to ask: do we need more data, or do we need bigger models? Around 2021 to 2023, data was far more abundant than compute, so we didn't need to worry about total data limitations; the one limiting resource was compute, the total number of FLOPs in your training budget. You can spend that resource in many different ways: you can train on lots of data with a small model, or you can train one giant model on very little data. Both of those extremes seem very wasteful. If you have a tiny model, pumping in tons and tons of data doesn't seem useful, and in reverse, a giant model trained on something like ten tokens doesn't seem very useful either.
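To make the trade-off concrete, using the common rule of thumb that training a dense transformer costs roughly six FLOPs per parameter per token (an approximation, not something specific to any one paper):

```latex
C \;\approx\; 6\,N\,D
\qquad\Longrightarrow\qquad
D \;\approx\; \frac{C}{6N}
```

So for a fixed budget C, doubling the parameter count N roughly halves the number of training tokens D you can afford.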
So this was a core question for many people, and several authors simultaneously proposed joint data-model scaling laws to try to answer it. What are those? I have been talking about scaling laws in essentially one variable up until this point. That one variable has varied: it has sometimes been parameters, or data, or compute. But we haven't looked at joint scaling. Data-model scaling laws look like this: the two equations here are functionally equivalent to first order, and both describe the trade-off between the amount of data and the size of the model. The top one, from Rosenfeld, basically says there is one part of the error that decays polynomially in data, another part that decays polynomially in model size, and then an irreducible error term that cannot be removed even if you scale both the data and the model to infinity. Kaplan does essentially the same thing, except they effectively model just the reducible error, so there is no constant term. This functional form might seem kind of arbitrary, since I don't think there's any top-down reason why it has to be correct, but it provides surprisingly good fits to the joint error you see in data and model size.
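For reference, the Chinchilla-style way of writing this family of functional forms (the Rosenfeld and Kaplan papers use slightly different notation, and, as noted, Kaplan drops the constant) is:

```latex
L(N, D) \;\approx\; E \;+\; \frac{A}{N^{\alpha}} \;+\; \frac{B}{D^{\beta}}
```

where N is the number of parameters, D the number of training tokens, E the irreducible error, and the two power-law terms are the model-limited and data-limited parts of the loss.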
This is from, I believe, Rosenfeld: they show a nice 3D plot where one axis is the amount of data, another is the size of the model, and the loss is on the y-axis. The surface being fit is their functional form, and the dots are their runs. It might be a little hard to see from the back, but the surface fits the dots almost exactly.
And despite the fact that this functional form is kind of ad hoc, pulled out of a hat, it is surprisingly accurate. This result is from Rosenfeld as well: they basically say, I'm only going to train on the small half, models that are small and datasets that are small, on the bottom left, and I'm going to extrapolate to models that are both large and trained on more data. How good is that joint extrapolation? Quite good: if you look at the error, the real values are on the x-axis, the predicted errors are on the y-axis, and they're almost exactly right on both ImageNet and WikiText, so this seems pretty good.
And so, for a fixed compute budget, what can we do? If we go back to Kaplan, for example, we see similar things being done: joint scaling of compute and data. In this case parameters are on the x-axis, the colors represent compute, and there's a third axis of data that is implicitly varied in order to vary the total amount of compute. As you move along these curves, the parameters are varied while the compute is held constant, so the amount of data varies accordingly.
So Chinchilla, which I think many of you have hopefully heard of, is probably the reference for solving this problem. Both Rosenfeld and Kaplan came up with this kind of joint scaling functional form, and both of them noticed that it was possible to use these forms to optimize the trade-off between compute and data in various ways. But for various reasons it is hard to fit these functional forms precisely, and details like the shape of the learning rate schedule matter, so Kaplan ended up with an estimate that was quite far off from what was later validated to be optimal. The Chinchilla paper, by a group of DeepMind authors, was an attempt to really empirically nail down the right trade-off between the number of tokens and the model size, assuming your goal is to get the best model for the smallest number of training FLOPs.
They have three different approaches, approach one, two, and three, for fitting different curves and making scaling predictions. The blue dots are the models they trained, and the lines predict different optimal parameter sizes for different FLOP budgets. Hopefully most of you know the Chinchilla ratio: it's something like 20 tokens per parameter, and it comes from exactly this. If you take each of these points and multiply the parameter count by 20, you get roughly the token count, and if you then multiply the parameters by that token count (times a constant factor of about six), you get the training FLOPs.
One of the reasons for the difference between the Kaplan results, which estimated one set of token-to-parameter ratios, and the Chinchilla ones is learning rate schedules. We train models with cosine learning rates, which look something like this: the learning rate warms up, comes back down, and cools all the way to a minimum learning rate at the end. One thing about cosine learning rates that trips everyone up is that you can't truncate them early. For a cosine schedule you have to go all the way to the end, through the full cool-down phase, to get a valid model. If I truncate a run in the middle, that is not the same as training a model from scratch with a cosine schedule whose endpoint is at that middle point.
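A minimal sketch of the point (warmup omitted, and the schedule shape is generic, not the exact one used in either paper): the learning rate at step t of a truncated long schedule is not the learning rate you would have used had the schedule been planned for that shorter horizon.

```python
import math

def cosine_lr(step, total_steps, lr_max=3e-4, lr_min=3e-5):
    """Cosine decay from lr_max to lr_min over total_steps (warmup omitted)."""
    progress = min(step / total_steps, 1.0)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

# Suppose we planned a 100k-step run but stop (truncate) at step 50k.
step = 50_000
lr_truncated_long_run = cosine_lr(step, total_steps=100_000)  # mid-schedule
lr_matched_schedule   = cosine_lr(step, total_steps=50_000)   # fully cooled down

print(lr_truncated_long_run)  # ~1.65e-4: the schedule never cooled down
print(lr_matched_schedule)    # 3e-5: reached the minimum learning rate
```

So intermediate checkpoints of one long cosine run systematically understate what a properly scheduled shorter run would have achieved, which biases any fit that treats those checkpoints as if they were compute-optimal models.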
This was one of the contributing factors, though there were others as well, leading to the Kaplan estimates being pretty far off from the later, improved estimates provided by the Chinchilla paper.
So what do the Chinchilla authors actually do? They have three different methods for estimating the optimal trade-off between tokens and model size, and each method provides different scaling coefficients: one for the model size and one for the data size. Somewhat surprisingly, methods one and two both give 0.5 for these. Method three gives slightly different estimates, off by about 0.03, but we'll talk about that a little later. Kaplan et al., you can see, is way off from any of the three estimates. So we'll go over each of these methods; each of them makes sense, and they make somewhat different assumptions about scaling, but they end up with very similar estimates in the end.
Method one in Chinchilla is to take the minimum over training curves. What does that mean? You basically overlay all of the different training curves you have. On the x-axis is FLOPs, on the y-axis is the training loss, and there are models trained at many different sizes. Each of these sizes is trained with a different number of tokens, so each run reaches a different total FLOP count as training proceeds. Now what I do is look at the lower envelope: the set of points, or checkpoints, that prove to be optimal under any given compute budget. I can take those models, ask what their actual parameter counts were, and you can see that total compute on the x-axis against the number of parameters, as well as the corresponding token counts, forms a relatively nice scaling law. That's the minimum envelope method: it's basically saying that the minimum training loss, optimized over all model sizes, should itself be compute-optimal. To call back to some earlier papers: if you look at the Kaplan paper and other scaling laws, you see exactly this already being done. Different models are trained with different parameter counts at different compute scales, we take the minimum across them, and we've already seen that the minimum forms a scaling law. So this builds on the observation that the minimum across many training curves should form a power law in compute. Under that assumption you can get fairly nice fits, and this gives one estimate, quite consistent with the others, of 0.5.
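Here is a minimal sketch of the envelope construction, using a synthetic Chinchilla-style loss surface as a stand-in for real training curves; the coefficients are roughly the ones reported in the paper's parametric fit, and every checkpoint is treated as if it were a fully cooled-down model, which is exactly the assumption the cosine-schedule caveat complicates:

```python
import numpy as np

# Synthetic stand-in for many training curves: a Chinchilla-style loss surface
# L(N, D) = E + A/N^alpha + B/D^beta. A checkpoint of an N-parameter run at
# compute C is treated as a model trained on D = C / (6N) tokens.
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def loss(params, flops):
    tokens = flops / (6 * params)
    return E + A / params**alpha + B / tokens**beta

model_sizes = np.logspace(7, 10, 12)    # 10M to 10B parameters
budgets = np.logspace(17, 22, 40)       # compute budgets (training FLOPs)

# Lower envelope: at each budget, which model size has the lowest loss there?
best_n = np.array([model_sizes[np.argmin([loss(n, c) for n in model_sizes])]
                   for c in budgets])

# The envelope should follow a power law N_opt ~ C^a; fit it in log-log space.
a, _ = np.polyfit(np.log(budgets), np.log(best_n), deg=1)
print(f"fitted exponent a ≈ {a:.2f}")   # roughly 0.45 here; the paper's methods report ~0.46-0.50
```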
Now the next one: if you were to pick a single canonical way to do the Chinchilla analysis, this would probably be it, and in some ways I think it's the most conceptually straightforward one, the isoFLOP analysis. To do the isoFLOP analysis, you pick a bunch of compute scales; each of these colors is a different amount of compute. For each compute scale, I can train models with fewer parameters on more data, or more parameters on less data. So I sweep over model sizes for each of these FLOP budgets, and then I look at the minimum of each of these curves. I can either pick the minimum point explicitly, nonparametrically, or I can fit a quadratic to each curve and take the minimum of the quadratic. In either case the argument is fairly simple: this minimum should itself follow a predictable scaling law, and from it I can extract the optimal number of parameters per FLOP budget. Those are the minimum points across all of these curves. I can also extract the optimal number of tokens per FLOP budget, which I can read off by dividing the FLOP budget by the number of parameters (up to the constant factor of six). So I get both simultaneously, and once again this gives very clean results that are consistent with method one. Comparing with before: one method says that for the eventual Chinchilla model's compute budget you want about 63 billion parameters, the other says about 67 billion. The two estimates are quite close.
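And here is a minimal sketch of the isoFLOP procedure itself, with the same synthetic surface standing in for the real sweeps (in reality every point on an isoFLOP curve is a separate training run):

```python
import numpy as np

# Same synthetic Chinchilla-style surface as before, standing in for real runs.
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28
def loss(params, flops):
    return E + A / params**alpha + B / (flops / (6 * params))**beta

budgets = np.logspace(18, 22, 9)        # isoFLOP compute scales
opt_params = []
for c in budgets:
    n_grid = np.logspace(7.5, 10, 60)   # model sizes swept at this budget
    losses = loss(n_grid, c)
    # Fit a local quadratic in log(N) around the empirical minimum and take
    # its vertex as the isoFLOP optimum for this budget.
    i = int(np.argmin(losses))
    lo, hi = max(i - 6, 0), min(i + 7, len(n_grid))
    q2, q1, _ = np.polyfit(np.log(n_grid[lo:hi]), losses[lo:hi], deg=2)
    opt_params.append(np.exp(-q1 / (2 * q2)))
opt_params = np.array(opt_params)

# The per-budget optima should themselves follow power laws in compute.
a, _ = np.polyfit(np.log(budgets), np.log(opt_params), deg=1)
opt_tokens = budgets / (6 * opt_params)
b, _ = np.polyfit(np.log(budgets), np.log(opt_tokens), deg=1)
print(f"N_opt ~ C^{a:.2f}, D_opt ~ C^{b:.2f}")
```

The exact exponents here just reflect the synthetic coefficients; the point is the mechanics of reading the optima off the isoFLOP minima.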
Okay, the last one is honestly just a little bit messier, and it goes back to that Rosenfeld-style approach. If you have a functional form like the one from Rosenfeld, a very natural instinct is to say: I'm just going to train a bunch of models, varying both the data size and the model size, and do curve fitting. I'll fit this functional form to whatever I get out of my models. So you train a bunch of models and fit that 3D surface, and we know from Rosenfeld that it's reasonable, to some extent, to do so. You've got all these dots, which are the models; the fitted curve is the heat map you see on the left; and then you can back out what the implied isoFLOP curves should look like from the dashed lines. But if you look at this, hopefully you see that the curve fits here are just not quite as good as the fits in the other plots. And if you look at the coefficients, Chinchilla's method three gives noticeably different estimates of the model size and total token count than the others.
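To sketch what this kind of direct parametric fit looks like in code (synthetic data, and a plain least-squares curve fit rather than the paper's Huber-loss-in-log-space setup, which is precisely the kind of fitting detail that turns out to matter below):

```python
import numpy as np
from scipy.optimize import curve_fit

# Synthetic final losses for a grid of runs, drawn from a known surface + noise.
rng = np.random.default_rng(0)
E0, A0, B0, a0, b0 = 1.7, 400.0, 900.0, 0.33, 0.29
N = 10 ** rng.uniform(7.0, 9.5, 200)    # parameters for each run
D = 10 ** rng.uniform(9.0, 11.5, 200)   # training tokens for each run
L = E0 + A0 / N**a0 + B0 / D**b0 + rng.normal(0.0, 0.01, 200)

def surface(x, E, A, B, alpha, beta):
    n, d = x
    return E + A / n**alpha + B / d**beta

popt, _ = curve_fit(surface, (N, D), L, p0=[2.0, 300.0, 300.0, 0.3, 0.3],
                    maxfev=20000)
E_hat, A_hat, B_hat, alpha_hat, beta_hat = popt

# Sanity check: residuals of a good least-squares fit should be roughly zero-mean.
residuals = L - surface((N, D), *popt)
print(f"alpha ≈ {alpha_hat:.2f}, beta ≈ {beta_hat:.2f}, "
      f"implied a = beta/(alpha+beta) ≈ {beta_hat/(alpha_hat+beta_hat):.2f}, "
      f"mean residual ≈ {residuals.mean():.1e}")
```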
This was actually a mystery to me for a long time. Some of my students would ask why method three is so different, and I'd say, I don't know, maybe scaling laws are just sometimes noisy. I don't know how many of you know this, but here's a fun piece of trivia. Last year some folks at Epoch AI, and I don't know what motivated them to do this, were curious enough about this result that they went and tried to replicate method three. It was very difficult to replicate because you don't have the original data for all of these training runs, so they went to the extreme of taking the published plots and using a forensic tool to extract the values of the points from them, and based on that they could actually reproduce the original result. The funny thing is, they showed that the curve fitting was the bad part: the data and the approach were good, but the curve wasn't fit quite right. The original fit had problematic residuals. If you're familiar with regression, you know your residuals should be roughly zero-mean, because otherwise you could improve the predictions just by shifting them.
Their residuals were not zero-mean, and when Epoch refit the curve properly, the optimal estimate almost exactly matched methods one and two. So this is one of those funny cases where the original authors had both the idea and the data right, but because of a minor issue in curve fitting they got the estimate wrong, and the replication actually made it more correct than before. Usually replications disprove things, but in this case the replication showed that the original result was right all along, which I think is a pretty cool outcome.
Okay, the final thing I want to talk about with this set of Chinchilla results: we've been discussing training-optimal scaling, where you have a fixed FLOP budget and want the best possible model. But the story has really shifted since Chinchilla and the Kaplan paper were written. Back then LLMs were not really a product yet, so the name of the game was that everyone wanted the biggest, flashiest, most intelligent model, and they didn't care much about the inference cost of actually deploying these systems. Nowadays what we really care about is inference cost, because these systems are actual products: they generate revenue, and there's a cost associated with that revenue. And so we've seen the tokens-per-parameter ratio steadily grow over time. GPT-3 was around two tokens per parameter, Chinchilla moved us to 20, and for a while people played around with roughly that, but very quickly people realized that what we actually care about is really good intelligence at really small parameter counts, so the number of tokens per parameter has been scaled up very rapidly. I think I saw yesterday that the most recent Qwen models, for example, were trained on something like 30 trillion tokens. People are really pushing the limits on the tokens-to-parameter ratio, because you would much rather pay the upfront training cost than the ongoing operating cost of running inference on a really big, expensive model.
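A back-of-the-envelope illustration of why the ratio keeps growing, with all numbers made up and inference FLOPs per generated token approximated as 2N:

```python
# Rough total-cost comparison: train once (~6*N*D FLOPs), then serve
# (~2*N FLOPs per generated token). All numbers are made up for illustration,
# and this ignores that the two models won't have identical quality.
def total_flops(params, train_tokens, inference_tokens):
    return 6 * params * train_tokens + 2 * params * inference_tokens

serving = 1e13   # hypothetical lifetime number of tokens served

chinchilla_style  = total_flops(70e9, 1.4e12, serving)  # ~20 tokens/parameter
overtrained_small = total_flops(8e9, 15e12, serving)    # ~1900 tokens/parameter

print(f"70B, 20 tok/param:   {chinchilla_style:.2e} total FLOPs")
print(f"8B, ~1900 tok/param: {overtrained_small:.2e} total FLOPs")
```

Once the serving volume is large enough, the extra training compute from heavy overtraining is dwarfed by the inference savings of the smaller model, which is the economic pressure behind the growing tokens-per-parameter ratios.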
Cool. The last thing, which is kind of a fun aside to end on, is that these results are pretty robust and easy to replicate. A few years back one of my students, Ishaan, was really interested in pushing diffusion models for text forward. One of the things we had to deal with is that this is a whole new kind of model: we didn't know what the optimal token-to-parameter ratio was, and we didn't know whether the thing even reliably scales; it's a totally different kind of generative model. What do we do? Well, it turns out that if you just follow the same playbook, running isoFLOP analyses the way you would for autoregressive models, you get almost exactly the Chinchilla result without too much effort. You do the same kind of analysis on diffusion models and you see very similar kinds of curves, even though it's a pretty different generative model entirely. And if you plot the minimum across these, you see very predictable scaling for both, separated by a constant offset. I don't bring this up because I particularly want to push diffusion models, but as a fairly random case study to say that these scaling laws don't need to be cherry-picked examples; they seem to emerge pretty naturally as you work on new models or new settings.
Okay, so to put this last part together: log-linearity isn't just about one-dimensional things like data; it extends to model parameters and to total compute, and that lets us make all sorts of hyperparameter and other decisions. That's the first part. These laws also let us make really smart resource trade-offs, between bigger models and more data, and we saw that in the Chinchilla analysis; it's kind of remarkable how cleanly things like the isoFLOP analysis turn out. All right, that's all I've got for basic scaling laws. We did a recap of Kaplan as well as Chinchilla today, and hopefully you're now on board with the idea of data scaling and model scaling, and with using scaling laws to optimize all the aspects of your model without actually going all the way to the large-scale training runs. Thanks, and I'll see you all Thursday.