Stanford CME295 Transformers & LLMs | Autumn 2025 | Lecture 4 - LLM Training
By Stanford Online
Summary
## Key takeaways

- **LLM Training: Pre-training vs. Fine-tuning**: LLMs are trained in two stages: pre-training on vast, general data to understand language, followed by fine-tuning to adapt the model to specific tasks or desired behaviors like being a helpful assistant. [09:16], [01:02:42]
- **Compute and Data Scale for LLMs**: Training LLMs requires immense computational resources and massive datasets, with models like Llama 3 trained on trillions of tokens, and the compute cost often reaching millions of dollars. [10:37], [13:04]
- **FlashAttention: GPU Memory Optimization**: FlashAttention optimizes GPU performance by minimizing data transfers between high-bandwidth memory (HBM) and on-chip SRAM, processing data in smaller blocks to reduce latency. [38:31], [40:07]
- **Quantization Reduces Model Size and Increases Speed**: Quantization reduces the precision of model weights, significantly decreasing memory footprint and increasing computational speed, which is crucial for deploying large models. [52:35], [55:59]
- **LoRA for Parameter-Efficient Fine-Tuning**: LoRA (Low-Rank Adaptation) fine-tunes LLMs by training only a small number of additional weights, decomposing the update into low-rank matrices, which drastically reduces computational cost and memory usage. [01:37:53], [01:38:37]
- **QLoRA: Quantized LoRA for Further Efficiency**: QLoRA combines LoRA with quantization, freezing the base model weights in a quantized format (like NF4) and fine-tuning only the LoRA adapters, leading to significant memory savings. [01:44:23], [01:45:35]
Topics Covered
- AI shifted from single-task models to general pre-trained models.
- Optimal model size is 20x smaller than its training data.
- How doing more computation can actually save time and memory.
- Why a raw LLM is just a fancy autocomplete.
- Fine-tune massive models by only training tiny adapter matrices.
Full Transcript
Cool.
Hello everyone and welcome to lecture 4
of CME 295. So today is Friday, October
the 17th, which means that the midterm
is one week away. So before we start,
I'm just going to go over some logistics
to make sure you know we're all aligned
on what to expect.
So the midterm will take place next
week, same time. Instead of an hour and
50 minutes, it will be an hour and 30
minutes. So it's 3:30 to 5 in this
classroom.
So it's like business as usual. Um in
terms of topics
the midterm will be about lectures 1,
two and three which we had and this one
which is lecture four.
So just to give you like an overview of
what you can expect in the midterm.
There's going to be uh some multiple-choice
questions along with some free form
questions but they're mainly going to be
about the things that we've seen in
class. So if you watch the recordings or
attend the lectures and you know just go
through the slides and know the
important formulas I think yeah you'll
be you'll be fine.
So I know that you may have questions
until uh next week. So that's why after
this lecture with Shervin we will be
holding office hours. So feel free to
you know come to us and ask us any
questions. And uh of course we'll be
fully available between now and next
week. So in case you have any questions,
feel free to uh ping us on Ed. Um and
yeah, we'll make sure to respond.
Um cool. Uh I also know that a number of
you are auditing this class. So in case
you're still interested to take the
midterm for some reason uh maybe uh
because you have an upcoming interview
uh just uh tell us so that we can just
expect the number of uh like copies to
print. So we'll be printing this on
Monday. So just let us know over the
weekend in case you're interested.
Cool. So that's for the midterm. And
then the second piece of news is the
final. So, we said we were working on
the dates. So, we finally finalized the
dates, which did not change. So, it's
Wednesday, December the 10th. Okay. So,
a little bit late, 7:00 p.m. to 8:30
p.m. Uh, so it's a slot that we have.
Uh, the location is different from this
one. So, it's in this room.
And the final will only cover the second
part of the class which is basically
lectures five to 9.
Any questions on this? Yeah.
Oh yeah, good question. So is it closed
notes? Yes. Yeah. Yes.
So question is what is the format of the
multiple choice? So you'll have uh so we
did not finish writing the exam, but
it's going to be something like you have
a question and you let's say you have
like three four possible answers and
then you just choose the the one that's
that's correct. Something like this. And
you'll also have some free form uh like
you'll have to just like answer in your
own words. Yeah.
>> Yeah. Uh thanks. So question is are we
allowed to take anything? So it's closed
book. So like
like yeah nothing just a pen.
Yeah.
Uh question is no calculator you will
not need calculator
but speaking of the cheat sheet. So I'm
not sure if we mentioned I think we did.
So there is a cheat sheet for this one
which we cannot bring to the exam but
you can use for your uh just for uh you
know just your studying. Uh that's on
the website class website.
I'd suggest looking at it.
Cool. So, super clear for everyone.
Very cool. Well, okay.
As always, we'll be starting the class
just recapping what we saw in the
previous lecture. Um so if you remember
we
basically
studied a new kind of architecture which
was called the mixture of experts uh
which is such that if you have an input
what you want is to not necessarily
activate all the parameters and so you
are in a setting where you have multiple
experts uh and in the forward pass you
only activate some of them so that's a
sparse MOE. You also have the dense MOE
which basically weights the outputs as a
function of the output of the gate. So
we saw that this architecture was used
in LLM
and it was mainly used to be able to
scale these LLMs without incurring an
expensive cost at inference time because
you don't want to activate all the
parameters.
The second thing that we saw was uh just
defining what an LLM was and in
particular how you could
decide on what the next token prediction
is. So we saw three methods. First one
was we called uh greedy decoding which
was always taking the highest probable
token.
The second method we saw was beam search
where we kept track of the k most
probable sequences.
And then the third one was sampling.
So we're not doing a most probable we're
not keeping track of the highest
probable sequences. What we do is we
sample the next token with respect to the
distribution that we get as output. And
then we saw there's this hyperparameter
that's called temperature that allows
you to tweak how spiky you want your
distribution to be versus not.
And we also saw some inference
optimization techniques which are used
in practice to avoid having uh like a
big cost at decoding time. So I'm not
going to just mention everything but I
would say KV cache for instance is a is
an important method. So yeah just
recommend just knowing what it is along
with the other ones.
And with that we're going to start
lecture four and actually I was really
looking forward to today because lecture
one we saw what self attention was what
a transformer was. Second lecture, we
saw
some of the tricks that people use today
and some of the variations from the
transformer. We introduced what an LLM
was last lecture and this lecture we're
finally going to see how these LLMs are
trained. So today we're going to focus
on LLM training.
And the first thing that I'm going to
say is if you've been in the ML field
for let's say more than a few years now
uh you may have noticed that
traditionally
if you had a task what you would do is
train a model specifically for that
task.
So let's suppose like 10 years ago,
let's suppose we had a task which was
around detecting spam. You would train a
model specifically to detect spams. So
you would train on the training set,
eval on the validation set and then test
on the test set. If you had another use
case that suppose sentiment extraction,
you would train a model specifically for
that
and so on and so forth.
But one could argue that these tasks,
they're not completely disjoint.
They're all involving just understanding
the text. So one could argue we could
find a way to somehow leverage the
knowledge that we acquired during
training for let's say one task
and reuse that for another task.
So this method has a name. It's been
around for some time. It's called
transfer learning.
So the goal of transfer learning is to
not always start from scratch. If you
have a new task, it's to start with some
pre-trained model. And we're going to
see what pre-train is and then tune it
for your task instead of starting from
scratch.
Well, it's basically the paradigm on
which LLMs are trained. So the idea here
is that all these tasks, they involve
understanding language. So, what we're
going to do is have what we call a
pre-training stage, which involves
training your LLM on vast amounts of
data to just understand what language,
what code is
and then have a second stage of quote
unquote tuning.
And we're going to see a little bit what
that tuning is. But in that second
stage, we're going to take our
pre-trained model and somehow find a way
to tune the weights to adapt to a
specific task.
So as an example here uh we would
pre-train a huge model and then suppose
for spam detection we would somehow tune
it for that uh sentiment instruction
same we would tune it for that and so
on.
So this is just to take the example from
before. And the idea here is in order to
obtain these models, we're not going to
start from scratch.
Cool. So okay. So now we're going to see
what pre-training is. So pre-training is
by far the most expensive both in terms
of compute cost, you know, everything
part of the training.
So what it does is taking a huge amount
of data and training your LLM to just
predict the next token.
And here by data what I mean is
basically everything you can find. So it
can be uh you know text in English, can
be text in other languages, it can be
even codes,
can be code in different languages, can
be basically the whole internet.
We're going to see some of the data sets
that people use for that. But you can
think of this as just training your
model to try to predict anything that's
written.
And as I mentioned, the objective here
is to predict the next token. So if you
remember, our LLM is a text-to-text model
and most likely a decoder only model in
more than 90% of the cases. So what it
does is it takes some input text and it
tries to always predict the next token
in an iterative basis.
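To make the objective concrete, here is a minimal sketch of the next-token prediction loss on toy tensors (random stand-ins, not any real model): the prediction at position t is scored against the true token at position t + 1.

```python
import torch
import torch.nn.functional as F

# Toy illustration of the pre-training objective: the model outputs a
# distribution over the vocabulary at every position, and position t's
# prediction is compared with the actual token at position t+1.
vocab_size, seq_len, batch = 100, 8, 2
logits = torch.randn(batch, seq_len, vocab_size)         # stand-in for model output
tokens = torch.randint(0, vocab_size, (batch, seq_len))  # stand-in for token ids

# Shift by one: predictions at positions 0..T-2 vs. tokens at 1..T-1.
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),
    tokens[:, 1:].reshape(-1),
)
print(loss.item())
```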
So in terms of the data sets that are
used, you will see the term common crawl
a lot on papers, it's basically a data
set composed of anything you can find on
the internet. So I think they have
something like three billion pages per
month. So if you go on their website,
they have a huge archive. So there's
a bunch of other websites as well that
you can find in there. So for instance
the Wikipedia articles any like social
media as well like Reddit I know there
are a lot of Reddit conversations in
those in those data sets you have a lot
of code and of course you have a bunch
of places for that you have GitHub you
have stack overflow all these like
forums that talk about code so all of
this is meant for your model to just
understand the structure of the language
and code
and in terms of size so it's measured in
terms of token number of tokens and one
order of magnitude that I want you to
remember is is on the order of hundreds
of billions or even trillions or even
tens of trillions of tokens.
So I'll give you an example. So GPT3 was
trained on 300 billion tokens and for
instance Llama 3 which was I believe
published last year was trained on 15
trillion tokens.
So these are huge data sets.
So before we go further, I want to
introduce two notations and I think one
of them I introduced I introduced it
last lecture.
The reason why I want to talk to you
about these notations is they are used
everywhere to talk about how much
compute
uh some model needs. So the first
notation is FLOPs
which stands for floating-point operations
and what it is is it's a unit of
compute. So the higher the flops the
more operations are involved because by
definition flops is the number of
operations that involve floatingoint
numbers. So floatingoint numbers you can
think of them as just like numbers with
decimal points.
So in terms of order of magnitude
training an LLM is on the order of
10^25 FLOPs
and the way you obtain flops. So usually
it's like a complicated formula but in
your mind you can think of it as
something that is a function of the size
of your data. So the number of tokens
that you train it on and the number of
parameters of your model.
So there's not like a universal formula
because it also is a function of the
architecture. So you can think of for
instance MoE-based LLMs as requiring let's
say less compute because only some parts
are activated compared to let's say
dense LLMs.
But you can just think of it as it's a
function of the number of tokens and
parameters. It's like O of the product
between the two more or less.
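To make that concrete, a common back-of-the-envelope rule (an approximation, not the exact accounting any lab uses) is that training a dense transformer costs roughly 6 × parameters × tokens FLOPs. Plugging in the GPT-3 numbers quoted above:

```python
# Rough training-compute estimate with the "~6 * params * tokens" rule
# of thumb for dense transformers (an approximation only).
params = 175e9   # GPT-3 parameter count
tokens = 300e9   # GPT-3 training tokens
flops = 6 * params * tokens
print(f"{flops:.2e} FLOPs")   # ~3.15e+23
```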
And then there's a second notation that
I want to introduce which is also FLOPS
but it's different. So here FLOPS stands
for floating-point operations per
second.
So it's a measure of compute speeds. So
it's basically how fast can your
hardware
execute these operations
and so you also have like some order of
magnitudes here. uh but if you're into
uh let's say GPUs, you will see that in
the description of GPUs, they always
indicate flops and we will see that in a
second.
But I just want to call out that FLOPS
here is usually all caps.
Although you may see some papers that
use one for the other,
which is confusing. So I just recommend
uh just contextualizing
this notation with respect to the
sentence that it is in because sometimes
people actually switch the two but this
is the common notation.
So far so good.
Cool.
Okay. So now we know that we have a
pre-training step. We know it involves a
lot of compute. We know it involves a
lot of data. We know that our model is
large. So what people did was trying to
see how the performance evolves as a
function of model size and training
size.
And there's this one paper called
Scaling Laws for Neural Language Models
that was published in 2020 that
performed a bunch of experiments by
varying these parameters. And what they
found was the more compute you have, the
better your model learns about
predicting the next token. Same for data
set uh size. So the more the bigger your
training set, the better it is. And the
bigger your model, the better it is.
So for some time, I think between 2019
and 2024,
you were seeing models that were larger
and larger, just people just building
things that were bigger and bigger
because according to these experiments,
um the performance was just getting
better.
So something else that they noticed was
bigger models tend to be more what they
call sample efficient.
So what that means is for an equal
amount of tokens that is processed
you will have a better performance with
a bigger model compared to a smaller
one.
But then you can wonder you know um we
don't have unlimited compute you know
compute is expensive it you know it has
a lot of drawbacks. So uh you have a
fixed compute and people also try to
answer the question given a certain
amount of compute.
How can you fix your training set size
and your model size in a way that's more
optimal?
Cuz um here uh you need to decide how
big is your model. So what they did is
they fixed a unit of compute which is
the color of these curves
and they tried training models of
different sizes with different training
set size. And what they saw was that
there was always a sweet spot here
which followed some kind of
relationship.
And in particular, this is a table that
summarizes quote unquote the optimal set
of number of parameters and training set
size, which is sometimes called the
Chinchilla law.
And what they realized was if you have
an amount of training set size that's
about 20 times
the model size then you're spending your
compute in quote unquote like an optimal
way. And in particular,
GPT3 for instance,
I think it was like 175
billion parameters if I remember
correctly, but it's only trained on 300
billion tokens. So this one for instance
is according to this really undertrained
quote unquote.
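As a quick, illustrative application of that ~20 tokens-per-parameter heuristic (a rule of thumb, not an exact law):

```python
# Chinchilla-style rule of thumb: train on ~20 tokens per parameter.
def compute_optimal_tokens(params):
    return 20 * params

print(compute_optimal_tokens(175e9) / 1e12)   # a GPT-3-sized model: ~3.5T tokens
# GPT-3 was trained on ~0.3T tokens, which is why it looks "undertrained"
# by this criterion.
```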
I think there's a question. Yeah.
Um so yeah the question is do they fix
the neural architecture? So I think um
by now everyone agrees that LLMs are
transformerbased decoder only models. So
everyone uses the same model.
Yeah. So you can assume that when I say
LLM here it basically means decoder only
transformer based models.
Yeah. Question is architecture change
does not play a big role. So that's what
they say actually in their paper. They
say the thing that changes the most is
the amount of tokens on which you train
and the size of your model.
Cool. Any other questions?
Yeah.
Oh yeah, good question. So question is,
is there some kind of transfer learning
between different versions of models?
Um, so for a lot of these models,
they're actually closed source, so they
don't exactly reveal these things. But,
uh, I guess it's an interesting
question.
um one that I cannot answer in a general
way. So maybe I think it's the best
answer I can give you. Um but in any
case uh when you look at um some of
these papers they always state how much
it costs to train this and it's always
in the order of you know millions. It's
always an expensive step regardless.
Cool. Um just uh speaking of that um so
pre-training has a lot of challenges.
One of them is cost. So, uh when I say
millions of dollars, it's a minimum. I
think it can even cost tens of millions
of dollars or sometimes hundreds of
millions of dollars. It takes a lot of
time
and um people have been mindful of the
impact on the environment. So,
they've also been including the
ecological cost.
So the other uh challenge is that the
pre-training
step is on data that is up to the time
at which you pre-train your model on. So
what that means is that the knowledge
that you acquire from training on this
data set can only go up until the date
at which you cut your data set.
So this date is called the knowledge
cutoff date. And so what that means is
your base model, your pre-trained base
model does not know has no way to know
by itself knowledge that occurred after
this date.
And speaking of that, a lot of papers
they've tried to edit knowledge, inject
knowledge. It's always tricky because uh
there's not a clean way to um you know
change the weights in a way that does
not penalize
some parts. So I guess what people want
to do is inject knowledge but not
regress in some other domains. And this
is a very hard problem. And of course
you know these models they try to
predict the next token and uh there's
this question of uh what if it just
generates something that it has seen at
training time. So what we call
plagiarism so there's always a risk. So
these are all the challenges I just want
to illustrate when I said the knowledge
cut off dates. So if you go on let's say
the OpenAI website or Google websites to
look at the model cards you will always
see so I'm not sure if you can see from
here but um there is always a line on
knowledge cutoff dates which tells you
on when the pre-training of this model
was done. So here for instance GPT-5 was
released a few weeks ago and here it
says the knowledge cutoff date is
September 30th. So you can guess that
they've done their pre-training around
that stage.
Cool.
Any questions on the first part?
Everyone good?
Perfect. So in this first part, we've
seen that pre-training was a crucial
step of the LLM training process and
we've seen all these big numbers
and one could wonder well how can you
train such a big model on such a big
amount of data like how do people do
that?
So this is what we're going to see here.
So just what I had mentioned so LLMs you
can think of them as decoder only
transformer-based models. So in order to
train your model you need that
you need a lot of data but then if you
look at your architecture
you see that a lot of the operations
involve matrix multiplications.
And I guess I have a question for you.
What is the kind of hardware that loves
matrix multiplications?
GPUs. Yes. So you also need GPUs.
Actually more than one. Yeah. You had a
question.
Oh. Uh question is GPUs for inference.
So this one we're going to focus on
training. But um requirements for GPUs
they differ a little bit between
training and inference. But in this
part, we're solely focused on training.
And speaking of GPUs, uh I guess uh it's
not GPUs everywhere because for
instance, Google, they've developed
their own hardware that's called TPUs.
Uh but any non Google
uh Google based models, they've most
likely been trained on GPUs.
Cool. So in order to train your model,
what do you do? So first of all you have
your LLM which is now so this is this
model but uh now we're representing with
a box just for simplicity you initialize
it uh it's like um you know lot of
parameters so you can think of uh the
scale as being somewhere around like
billions to hundreds of billions of
parameters. a huge model.
And what are the steps involved to train
a model? Well, what you're trying to do
is to tune the weights so that the model
can learn how to generate the next
token.
So, you have one step called the forward
pass where you have a bunch of data that
you're trying to pass through the
network. And um while we do that I just
want to call out things that are
important to note that we need to
somehow save in memory.
So when you do this forward pass you
have something that's called activations
which are basically the values at each
layer that are needed in order to
compute the loss.
So the loss tells you how off you are
compared to uh the label that you want
to train this on. And so the amount of
memory that you will use here is
dependent on a lot of things. It's
dependent on the model size which
impacts the number of activations. It's
dependent on how big your batch of data
is for training and it's dependent on
how large your context length is because
if you remember uh here we have O of n
squared complexity because of this self
attention operation where n is the
sequence length. So you have all these
parameters that come into play.
So once you do the forward pass, let's
suppose you compute the loss, you know
how off you are compared to your label.
Now the next step is to somehow tweak
the weights in a way that minimizes the
loss. So how do you do that? There's
another pass called the backward pass.
So what this pass does is quantify
the direction where the loss is going to
be minimized.
It's called a gradient. You take the
gradient of the loss with respect to
each parameter.
Well gradients they also need to be
saved somewhere in memory.
And then you have finally the weight
update
which is where you know where the
direction at which your loss is going to
be minimized. So you apply that update
to your weights and you typically use
optimizers out there like have you heard
of the Adam optimizer? Yeah. So the Adam
optimizer is just a fancy version that has
some additional quantities
uh which keep track of uh which are
basically a function of the gradient. So
you have the first moment and the second
moment, which are basically moving
averages of the gradient and of the
squared gradient,
and all these quantities. So the first
moment the second moment you also need
to somehow save them somewhere in
memory.
So it's a lot of things to save.
Well,
okay, breaking news. Memory is not
unlimited. Memory is limited. And so
here what we have in front of us is the
description of a GPU.
Uh I think so. Yeah, H100, which is a
very good GPU. And you will see that in
that description there's a line on GPU
memory. So GPU memory is your amount of
memory per GPU. It's uh 80 gig for this
one. It's quite large. So it's in on the
order of tens of gigabytes.
So you need to store all these things in
80 GB
which is not a lot.
So
what are we doing? What will you be
doing?
So I guess the idea is to leverage not
one but several GPUs in order to somehow
distribute the load across GPUs. And in
order to do that you have several
methods
which we will see in a second.
So the first set of methods is called
data parallelism
also known as DP.
So what this set of methods does is it
distributes
data across GPUs
so that this forward pass and backward
pass they can all be done kind of
independently.
And so the idea here is to divide the
batch of data across devices.
And then um in order to do that of
course you need to have a copy of the
model per device
because of course you need to compute
the activations you you need to compute
all these things. Um but when you do
that you're able to reduce the memory
that is linked to the batch size.
So that's called data parallelism. Yes.
uh question is how about the gradient
updates? Well, it's a great question.
So, how what do you do when you have
independent computations here and there?
Well, the gradient is just the average
of the gradients uh for this for this
thing. So, you have some communication
in between the GPUs that basically
aggregate the gradient for the
updates.
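Here is a small CPU-only illustration of why that averaging works (toy tensors, no real multi-GPU communication): the average of the per-shard gradients equals the full-batch gradient.

```python
import torch

# Simulate data parallelism on CPU: two "devices" each compute the
# gradient on their half of the batch; averaging those gradients
# matches the gradient computed on the full batch.
torch.manual_seed(0)
w = torch.randn(4, requires_grad=True)
x, y = torch.randn(8, 4), torch.randn(8)   # full batch of 8 examples

def grad_on(xb, yb):
    loss = ((xb @ w - yb) ** 2).mean()
    return torch.autograd.grad(loss, w)[0]

g_full = grad_on(x, y)
g_avg = (grad_on(x[:4], y[:4]) + grad_on(x[4:], y[4:])) / 2   # the "all-reduce" mean
print(torch.allclose(g_full, g_avg))   # True
```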
So, I have a question for you.
Is
this the answer to everything? Like if
we just scale up like this for I don't
know lot of GPUs is it is it is is it
great always great or do we have like
cons?
Oh yeah uh great point. So yeah you have
to fit one model so yeah that's great
point. So the second point that I will
add is you have an additional cost which
is called communication cost because you
need to somehow communicate between your
GPUs in order to aggregate some
quantities.
So your training is going to be slower.
It's good you you can scale up the
memory. Of course you need to uh fit a
model on a on a device and we will see
what how we can do to do that but you
will be incurring those communic
communication costs so it's not all you
know great
so speaking of the memory and the fact
that we want to I guess be able to at
least store a model per um per uh
device. So people have realized that
there's actually a lot of duplication
and there's been a paper on wanting to
deduplicate this duplicated information
and this method is called ZeRO,
the Zero Redundancy Optimizer,
and the idea is that on each GPU
you know you store the same parameters,
you store the same gradients, you store
the same optimizer states.
So the idea here is how about we shard
we partition those quantities across
GPUs. So the first variation is around
sharding the optimizer
states. So meaning we partition those
states across the GPUs. So this reduces
the memory by a lot. We can also
partition the gradients
and we can also partition the
parameters.
So here we have no redundant
information. Things are just
partitioned. Well, the problem is you're
going to have even more communication
costs, but at least it allows uh for us
to decrease the memory load on each GPU.
So this is ZeRO. So there's ZeRO-1, ZeRO-2, ZeRO-3.
And I guess the variation that you will
choose will be a function of how
sensitive you are to I guess training
time and how big is your model
and whether this will be an actual
problem or are you just fine with just
storing everything.
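To give a feel for the savings, here is a rough, purely illustrative per-GPU memory estimate for each ZeRO stage, using the common accounting of 2 bytes/parameter for fp16 weights, 2 for fp16 gradients, and 12 for the fp32 master weights plus Adam moments (actual numbers depend heavily on the setup):

```python
# Illustrative per-GPU memory (in GB) for weights + gradients + Adam
# states under the different ZeRO stages, for a hypothetical 7B model
# trained in mixed precision across 8 GPUs.
def per_gpu_gb(params=7e9, n_gpus=8, stage=0):
    weights = 2 * params            # fp16 weights
    grads   = 2 * params            # fp16 gradients
    optim   = 12 * params           # fp32 master weights + Adam moments
    if stage >= 1: optim   /= n_gpus   # ZeRO-1: shard optimizer states
    if stage >= 2: grads   /= n_gpus   # ZeRO-2: also shard gradients
    if stage >= 3: weights /= n_gpus   # ZeRO-3: also shard parameters
    return (weights + grads + optim) / 1e9

for s in range(4):
    print(f"ZeRO-{s}: {per_gpu_gb(stage=s):.1f} GB per GPU")
# Roughly: ZeRO-0: 112.0, ZeRO-1: 38.5, ZeRO-2: 26.2, ZeRO-3: 14.0
```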
So that's one set of methods. So this
set of methods is again data
parallelism.
So it's basically you having independent
sets of data that are handled by
different GPUs.
Well, you have another set of methods
that's called model parallelism.
So model parallelism tries to
parallelize
the operation even within one batch.
So there's a bunch of methods. I don't
want to sound too like a catalog. So
we're not go through them all by one by
one, but I will just call out a few that
are worth noting.
So if you remember last lecture we
talked about MOE based LLM and how
sequences were being sent to different
experts.
Well there is a way to distribute that
across GPUs via this expert parallelism
techniques
which is uh having let's say one expert
on a device another one on another
device.
So that's one thing worth noting.
Another one I will say so tensor
parallelism is uh when you have big
matrix multiplications
to somehow cut that in a way that
decreases the uh memory required for
that.
Okay. And maybe the last one I will say
is pipeline parallelism.
It's when
you consider a forward pass as involving
several layers. So you're going to say
that one GPU is going to only be
responsible for let's say layers 1 2 3
and then another one for layers four,
five, six, and so on and so forth
um so you also have that kind of
parallelism but anyways there's a bunch
of techniques and the ones that I
mentioned they fall in the bucket of
model parallelism
make sense
No need to know the details on there,
but I think just like knowing that there
are several methods and just a rough
idea, I think is a is a good good thing
to have in mind.
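As one tiny illustration of the tensor parallelism idea (toy tensors standing in for two hypothetical devices): if you split a weight matrix column-wise, each shard computes its slice of the output, and concatenating the slices recovers the full matrix multiplication.

```python
import torch

# Column-wise tensor parallelism in miniature: each "device" holds half
# of the weight matrix's columns and produces half of the output.
torch.manual_seed(0)
x = torch.randn(4, 16)            # activations
W = torch.randn(16, 32)           # full weight matrix
W0, W1 = W[:, :16], W[:, 16:]     # column shards for "device 0" and "device 1"

out_full = x @ W
out_sharded = torch.cat([x @ W0, x @ W1], dim=-1)   # gather the partial outputs
print(torch.allclose(out_full, out_sharded))        # True
```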
Cool.
So, what did we do? So, we realized that
during the training process, we had to
save a lot of things in memory. So what
we saw was techniques that reduce
the burden of having memory per GPU. So
we are trying to distribute that across
GPUs. So we saw data parallelism and
then the zero method that has some extra
optimizations and we saw model
parallelism as well.
So now we're going to see another
technique that leverages the structure
of the GPU. And you may have heard of
this technique is called flash
attention.
It was actually developed here at
Stanford uh in 2022.
And in order for me to talk to you about
this technique, I want to tell you more
about what GPU is composed of.
So if you look under the hood, well GPU
is very complicated and I'm for sure not
uh I don't know everything either, but
what I know is that we have two kinds of
memories in a GPU. So you have one kind
of memory that's big but relatively slow
that's the HBM
and then another kind of memory that is
fast but much much smaller which is on
chip next to the where the compute
happens that's called the SRAM
so you have HBM and SRAM. HBM has
something around uh you know tens of
gigabytes so it's like the GPU memory
that you saw in the description.
SRAM is much smaller. It's like
something around like several like you
know tens of megabytes let's say. So
it's much smaller but then it is like 10
times faster. So this one is uh a few
terabytes per second let's say and the
SRAM is uh tens of terabytes per second.
So it's like a noticeable difference in
speed.
So what we want is to somehow leverage
the strength of these kinds of memories
in order to speed up the attention
computation in a in an exact way. So
what do I mean by exact way? So what I
mean is we're not making any
approximations to the computation.
What we're doing is we're just
leveraging the strength of these
components and sending the computation
in a in a clever way.
So
if you remember the self attention
computation is done with this very
important uh formula. So it's softmax of
queries and the keys over some scaling
factor times v.
So this allows queries to interact with
everyone else.
Um so in matrix form you can think of
queries as being uh as having the number
of rows equal to the sequence length and
then columns to being the the dimension
of the query and then you same for key
and value. So you have this big matrix
multiplications.
So if you do it, if you do this
computation the standard, the vanilla
way, what you would do is store them in
the big but slow memory component of the
GPU.
So you would store it in the HBM.
So here is what you would do if you were
to not do any optimization. So you would
take those matrices from the big but
slow
HBM,
perform the computation
and then write it back to the HBM
and then you would read that result
again from the HBM, compute the softmax
and then write it back to the HBM
and then you would again load this plus
the value matrix multiply them and then
write them to the HBM.
See there's like a lot of read and write
to the HBM. So it's a lot of uh data
transfer
which actually becomes the bottleneck.
So a GPU is very very fast but then you
spend a lot of time just loading your
matrices from the memory.
The reason why you do that is because of
the softmax operation. So do you
remember what a softmax does? So it
normalizes the quantities so that they
sum to one but it's row dependent
meaning that each row needs to sum up to
one.
So in a sense you need that computation
to happen first before you do your
softmax. Like uh if you just like look
at it like that you you would think yeah
you you need to do the whole thing
first.
Well turns out that you don't need to do
everything
at once.
And this is the core idea behind flash
attention.
So what flash attention does is it tries
to minimize the amount of read and write
from and to the HBM
and instead takes small blocks and it's
called tiling. The method is called
tiling. It takes small blocks that it
sends to the SRAM so that it gets
computed from end to end before being
sent back to the HBM.
Does that make sense? So the idea is
let's send small matrices into the SRAM
so that it does the whole you know full
end to end computation and just send it
back to the HBM because we want to
minimize the amount of read and write
from the HBM.
So here's how how you would do it. So
you remember the softmax
uh computation with the query and the
key and then the value. Well, what you
would do is to cut your matrices
and then proceed step by step.
But then there's a cool trick that I
want to talk to you about which is that
you don't need to compute the whole
matrix inside a softmax
in order to achieve the whole softmax
computation
cuz if you think about it let's suppose
you have a whole matrix and then you
have like different let's say columns or
like submatrices S1 to Sn, well the softmax of
this huge matrix
is equal to this matrix where the
softmax is taken with respect to each of
these submatrices
up to some scaling factor.
So this is the core trick
and if you want to be convinced of it
just look at the softmax formula it's
like exponential of something over some
quantity which is shared across the row.
So this scaling factor will just
fix this with respect to that.
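Here is a small numeric check of that trick on toy numbers (it ignores the running-max bookkeeping that the real kernel also does for numerical stability):

```python
import torch

# Blockwise softmax: the softmax of a full row can be rebuilt from the
# softmaxes of its blocks, each rescaled by (block normalizer / global
# normalizer).
torch.manual_seed(0)
s = torch.randn(10)          # one row of scores Q K^T / sqrt(d_k)
s1, s2 = s[:5], s[5:]        # two blocks of that row

full = torch.softmax(s, dim=0)

z1, z2 = s1.exp().sum(), s2.exp().sum()   # per-block normalizers
z = z1 + z2                               # global normalizer
blockwise = torch.cat([
    torch.softmax(s1, dim=0) * (z1 / z),  # rescale block 1
    torch.softmax(s2, dim=0) * (z2 / z),  # rescale block 2
])
print(torch.allclose(full, blockwise))    # True
```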
So with this in mind, what we will do is
take each respective slices of these
matrices,
do the whole computation and then
populate the corresponding
uh entry in the output matrix.
So we will do that between let's say the
first slice of the query and the first
slice of the keys and the values and
then we will repeat for the other slice
until the end
and then we will repeat for the other
queries as well until the end.
So what the paper explains is how this
scaling factor is being computed. So
this one is some formula that I did not
put on the slide. So it's not necessary
for you to memorize the formula. It's
just the idea
and the idea is exactly this trick.
So once you do that,
you basically end up with only one read
from the HBM
and these like tiled quantities are
stored in the SRAM
and then they're read from the SRAM
which is very fast and then computed and
then back to the SRAM and then at the
end in order to accumulate the results
they're being sent back to the HBM.
So, just to make sure we're clear. So,
in green is basically when it's read from
the SRAM and then in blue is from the
HBM. You have a question? Yeah.
>> Yeah. The question is do you take the
whole row or a portion of it? So you can
take a portion of it but just for
illustrative purposes. Here we take the
like just this is just for illustrative
purposes. You can think of your your
matrix as being completely uh you know
like a grid and then uh you just like
multiply accordingly. But yeah this is
just for illustrative purposes. Yeah.
Yep.
Yeah. So the question is uh are you
computing alpha and all these quantities
on the fly or do you have some
estimation? So all this is exact and
they're computed in an iterative basis.
So the way it works is when you
populate the output you will keep on
having some extra quantity that will
adjust for that.
Think of it as just some formula that
works.
So yeah highly recommend looking at the
paper. They actually make that quite
explicit. So the paper is FlashAttention:
Fast and Memory-Efficient Exact Attention
with IO-Awareness. All
of the links are in the the slides. But
yeah, highly recommend just looking at
the exact formula in case you want to be
convinced.
Cool. But the idea makes sense overall.
Any questions on this?
Cool.
Well, this was flash attention.
But there's actually another idea from
the paper. So this was only the first
idea which was around making the
attention computation
faster.
The second idea is given that the
attention computation is faster.
Now let's let's try to be smarter about
the backwards pass
because when you compute in the
backwards pass when you compute the
gradient of the loss with respect to a
parameter
the chain rule will surface some
activations that you need to have in
memory.
Well, here given that computing these
activations is very fast,
one idea is to just not store
activations from the forward pass or at
least not store everything
but instead in the backwards pass
compute these activations again.
So it's called recomputation.
So when you do the forward pass, you
compute an activation to be able to
compute the loss, but then instead of
saving it to reuse it during the
gradient update, you will just discard
it and then recompute the activation
during the backwards pass with this very
uh fast technique.
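In PyTorch, this recomputation idea is usually exposed as gradient checkpointing; a minimal sketch (assuming a reasonably recent PyTorch version and a hypothetical small block) looks like this:

```python
import torch
from torch.utils.checkpoint import checkpoint

# The block's intermediate activations are not kept after the forward
# pass; they are recomputed during the backward pass instead.
block = torch.nn.Sequential(
    torch.nn.Linear(512, 2048), torch.nn.GELU(), torch.nn.Linear(2048, 512)
)
x = torch.randn(4, 512, requires_grad=True)

y = checkpoint(block, x, use_reentrant=False)   # recompute on backward
y.sum().backward()
```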
And when you do that, it's actually
quite remarkable. So you do more
operations
Uh so gigaflops is uh you know it's just
derived from FLOPs. So it's the
number of operations you're doing. So
with flash attention you're doing more
operations because you're actually
recomputing things.
But then so you also see fewer reads and
writes from the HBM. So here for
instance it was 40.3 in the standard way
and then 4. So it's like almost a 10x
reduction. But then you see that the
runtime is also smaller.
So this is very remarkable because
usually when you recompute things
you are saving memory but at the expense
of
uh runtime you're just taking longer but
here you're not only taking less amount
of time you're also saving memory. So
you're basically having everything it's
the best of all worlds.
So this is what flash attention uh
mentioned. So you will see that there
are some uh I guess variants of flash
attention. So FlashAttention-2, FlashAttention-3 and I
would say that these are more adaptation
of these methods to the current infra
because each new GPU comes with new um
pros and cons, strength and weaknesses
and I guess there's always a way to make
these optimizations better.
Uh but yeah, I would say flash attention
is quite a common trick and I think it's
a very good thing to know.
So does that make sense?
Cool. I take that as a yes.
Okay, cool. Okay, so last thing. No, we
have a few things um that I want to talk
to you about. So you know, you have
your LLM with a bunch of weights. These
weights, they're all floating points. So
you can think of them as, you know, just
being numbers with a bunch of numbers
after the decimal. One natural question
you can ask yourself is, do you really
need to know
that much like precision after the
decimal point to be able to do a good
job? I guess put another way can you
just like cut your precision in some way
to save on memory but then keep the same
performance.
So this is the idea behind quantization.
So quantization is the process of
converting the precision of a number
from let's say one uh setting to another
to in order to better understand that I
think it's important to know how
floating points sorry floating point uh
numbers how they are encoded.
So in practice
they're just a bunch of bits
and here you have some bits that are
responsible for let's say the exponent
some they're responsible for the mantisa
which is basically how granular
your number is and then one that is
about the sign.
So you have a bunch of representations
of floats. So the most common ones are
in this table. So you have uh single
precision, half precision,
floating point 64, and brain float 16,
each having different granularities for
these three dimensions.
So if we take the first two rows which
are the two most common uh
representations that you will see out
there,
floating point 16 is only represented on
16 bits.
compared to 32. So it takes basically
half the memory if if you want to think
about it this way. Uh but then it's less
precise, meaning it's less granular. So
I guess one idea that we could have is
to somehow decrease the granularity of
these weights and these numbers
hoping that it will not impact
performance too much.
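As a quick illustration of the memory side of that trade-off on a toy tensor (values are unchanged up to rounding):

```python
import torch

# The same tensor stored in fp16 takes half the bytes of fp32.
w32 = torch.randn(1000, 1000)             # fp32 by default
w16 = w32.half()                          # cast to fp16

print(w32.element_size() * w32.numel())   # 4,000,000 bytes
print(w16.element_size() * w16.numel())   # 2,000,000 bytes
```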
So another thing that I want to uh
mention is you know back to that uh GPU
description
you see that there is also a bunch of
information around the compute speed. So
I'm not sure if you can see very well on
the can you see very well on the slide.
So I I'll read it. Uh the compute speed
as a function of which kind of
representation you're using. So here if
you're using FP64 which is this super
granular way of representing numbers you
only have 34 teraflops of compute speed
but then if you're using let's say FP32
which is half of it you can uh kind of
double
your compute speed and so on you have
all these like other um numbers as well.
So the idea is you can save on memory,
you can also go faster.
So with this in mind, I just want to
touch on one last technique before
giving it to Shervin, which is mixed
precision training.
So the idea behind mixed precision
training is to leverage different
granularities of float representations
such that it will not hurt the
performance too much
but allow you to save on memory and do
things faster.
So the idea behind mixed precision
training is you have your models or you
have your model,
you keep your weights
in high precision, so FP32
and all the operations that you do in
the forward and backwards pass, you're
going to do it in a lower precision. So
in this case FP16
but then the weight updates will be done
still in FP32.
So the authors of the paper when they
when they did that basically what they
realized was the performance was not
degraded too much but then you had a lot
of savings on memory and then it was
running faster.
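A minimal sketch of such a loop with PyTorch's automatic mixed precision utilities, assuming a CUDA device and using a hypothetical tiny model and dummy data, could look like this:

```python
import torch

# Mixed precision training sketch: weights stay in fp32, most forward
# and backward ops run in fp16 under autocast, and the GradScaler
# protects small fp16 gradients from underflowing.
model = torch.nn.Linear(512, 512).cuda()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()

for _ in range(10):
    x = torch.randn(8, 512, device="cuda")
    opt.zero_grad()
    with torch.cuda.amp.autocast():        # lower-precision forward pass
        loss = model(x).pow(2).mean()
    scaler.scale(loss).backward()          # scaled backward pass
    scaler.step(opt)                       # fp32 weight update
    scaler.update()
```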
So now you may wonder well why are you
keeping the weights at a high
precision but not the activations and so
on. Um so I can offer you just a bit of
intuition.
So whenever you perform a forward pass
you're performing that on a set of data
but the set of data can be noisy in
itself.
So maybe all the decimal points after so
all the numbers after the decimal points
may not be as useful.
So you can think of this update as being
more like you know in which direction
should the weights go to go to a more
optimal state. So it doesn't require
them to be super precise,
but then it's much more important to
keep your weights, so the weights of the
model in high precision
in order to not accumulate
errors due to quantization.
That's one way of just think about it.
But the long story short is if you
reduce the granularity of some numbers
then you will have a lot lot of benefits
and not a lot of disadvantages in terms
of the performance.
Cool.
Yeah.
That's a great question. So the question
is do you apply this uh technique to all
weights of all layers or just to some of
them? So there's been a lot of papers on
that. Uh and the answer is there's some
variations to it. Um and so the answer
is not necessarily. So I do have some
pointers happy to share share them with
you. But um some parts may be more
important than others.
So this strategy is a little bit uh like
a high level idea but there are always
some variations uh from setup to setup.
Yep.
So is your question that people rely on
the like relationship between compute
and sorry not compute parameters and
token from the Chinchilla paper. Uh but
it may be different from their setup
which may introduce something is is
basically your question.
So the question is uh is there also an
optimal precision to use? So I would say
people use different things there's not
like a set way where everyone does
something but um I would say that these
uh scaling law like this Chinchilla law
is actually something that some authors
they try to reproduce for their own
model.
So actually uh for instance the Llama 3
paper there's a whole part around trying
to uh have some relationship between uh
you know given a fixed amount of compute
what is the optimal
uh you know number of tokens and then
number of parameters that is uh unique
to their setup.
So what people do is they do this thing
themselves on a small amount of training
uh training set and on smaller models so
that it doesn't cost them too much. They
try to see what the relationship is for
their setup and then they extrapolate
and I believe that's how Llama 3 did it. So I
think for the Llama 3 405 billion parameter
model I think they had the
whole section just justifying that they
came up to this number with some
experiments that they run. So to your
question um I think it highly depends on
the kind of model that someone is
training and they'll probably
have this analysis I guess with respect
to their own setup.
Does that answer your question?
Cool. Cool.
Okay. Yeah.
Yeah. Yeah. Yeah. The question is uh is
there something you can do about the
range? So there are several types of
quantization. We'll not go into details
here but I can give you like a zero
point quantization and absmax quantization
which are different techniques that play
with these ranges. Uh there's also like
a quantization technique that Shervin will
cover that also talks about that. Um so
yeah stay tuned for this but um I think
0 point quantization and absmax are are
ones that maybe you can look into for
this.
Cool.
So with that I'm going to give it to
Shervin.
Thank you Afin. So now we have seen uh
how pre-training worked. We're going to
see together the next stages of
training. Um so as we saw with Afin
pre-training was a way to um build the
intuition to the model about how
language was constructed and about uh
the main characteristics of what is
contained in the training data that is
general enough. So this is why we had
corpora that span huge amounts of text
that are typically of the size of what
you find on the internet. So typically
very large and raw and conveying um
aspects of language.
So now you might want to ask yourself
how can we make such knowledge actually
helpful. So this is something that we're
going to discuss
just in a second. But before we discuss
about what could motivate further
training, I want to come back to some
very simple example
that will
um surface what could be such a need. So
let's take our f favorite example about
our teddy bear and let's say we have a
very um practical use case. So, we have
thoroughly loved our teddy bear. Uh, but
now it has become a bit dirty. So, we
need to wash it. And you might want to
ask your favorite LLM um if you can put
it in the washer. So, with the
description that Afin mentioned of LLM
pre-training, I wanted to ask you all
what was your opinion about what could
the output to this be?
So, any guesses?
Yep.
Yeah.
Yeah. Excellent. Yeah. So the the answer
the guess was maybe another question. So
I think that is a great guess that could
definitely be it. So here in the example
we put some other sentence that could be
likely but the the gist is the same. it
has been trained on next token
prediction not being an assistant or
someone that is helpful to you. So this
is why uh it will try to mimic what
could be the pattern here and what could
be potential likely next words. So for
example, the example we put here is
something that relates to uh like how
the teddy bear is composed of um and uh
because it's in the same domain as kind
of teddy bear and then when you talk
about washer then maybe it has been
trained on data that includes materials
of teddy bears. So that could be it. Um
and I have a question with you all. Um,
are you happy with it?
No. Yeah, exactly. So, this is why we're
going to see what we can do with the
model to make it helpful to you.
And this stage is called fine-tuning. So
we're going to see um that it enables us
to uh like in the case of general like
the general case of LLMs to make the LLM
a helpful assistant but also if you have
a specific use case you can also use
this technique to tune the general
representation of the language towards
your task of interest. Uh but we're
going to focus first on what is done for
LLMs in general.
Uh so first I want to define some terms.
So model fine-tuning is commonly known
as SFT. SFT is uh short for supervised
fine-tuning. So the supervised part
suggests that we need some labels and
it's actually the case. So it's pairs of
inputs and outputs that we provide the
model to be trained on. And this is why
uh it's called supervised. And then
finetuning is uh the term that refers to
uh refining the weights that have been
already trained. So you you start from
the pre-trained weights and you further
train the model on additional data to
get a fine-tuned model. Um and um and
then one interesting note that I will
tell you all is that the objective
function even though it's the same will
differ a bit from the pre-training task.
So in the pre-training task you start
with the BOS token beginning of sentence
and you feed uh all your corpora of data
trying to predict the next token. But
here since you have uh this supervised
fine-tuning setup where you want the
model to do something useful in cases of
interest. So let's say when I ask the
model about washing my teddy bear. So
washing my teddy bear is actually an
input that I give to the model. It's
it's fixed. I don't want the model to
parrot what I say. I want the model to
be a helpful assistant to what I
condition it on. So in this setup, the
input is not a location where you would
predict the next token. You would not do
teacher forcing on it, but rather start
from this input and then predict the
next token onwards. And then you're
going to tune what the optimal um like
distribution of next token can be based
on this input.
So any questions on the overall idea of
SFT before we dive in a bit deeper?
Yep.
So the question is we don't put a loss
over the input. That's right. So you
start from the input and you start
predicting the next token and then the
loss calculation starts from there.
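A common way to implement that masking (a sketch on toy tensors, assuming the usual convention of setting prompt positions to the label -100, which PyTorch's cross-entropy ignores by default):

```python
import torch
import torch.nn.functional as F

# Only the response tokens contribute to the SFT loss; the instruction
# tokens are masked out of the label sequence.
vocab = 100
tokens = torch.randint(0, vocab, (1, 12))   # [instruction | response]
prompt_len = 5
labels = tokens.clone()
labels[:, :prompt_len] = -100               # mask the instruction part

logits = torch.randn(1, 12, vocab)          # stand-in for model output
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab),
    labels[:, 1:].reshape(-1),              # -100 positions are ignored
)
```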
Yeah,
everything good.
Okay.
So uh what I want uh what I was just
telling you um is that SFT can be done
to tune your own uh task of interest but
also it's something that is being done
at uh the one of the main stages of
training LLMs as you use it every day
and then this transition from uh the
model being a good representation of
language to a useful assistant is a
subcategory of SFT called instruction
tuning. So instruction
tuning comes from the fact that we want
the model to answer instructions.
So now I'm going to like show with the
graph what I was just mentioning
regarding predicting the next word. Uh
okay. So actually just in a bit first
we're going to look at the data. So um a
with a we saw that the composition of
the data for pre-training was basically
the whole internet where you had uh
sources that were carrying knowledge
such as Wikipedia um and in general like
masses of text of English that um that
tells that tells the model about how
English and other languages work as well
as coding and here it's slightly
different for instruction tuning. We're
going to look at data that presents to
the model how it can most helpfully
respond to instructions. So you can
divide this in several categories. So
here we present some categories that
could be helpful to users of LLM. For
example, uh story writing, you know,
you're interested in writing a story.
you give it a set of instructions and
then you have a ground truth that is
attached to it. Other um examples
include poem creation, list generation,
explanation and and many more that are
part of the data mixture of SFT.
So basically you uh run all these
supervised fine-tuning
um training with the input that is given
and then you train uh the model on
predicting the output
and I'm going now to show uh what it
looks like. So let's say your
instruction is here. So you ask your
model to do something. Um, so it could
be formulated in a different way, but
just to keep it short, I just said do X
and then the yellow part is where you
start fitting your objective function
on.
Um, does this make sense?
Great.
So as I was uh mentioning the data
mixture in includes um instruction
um geared data. So for example you had
all of these uh kinds of tasks that uh
users might ask for at inference time.
So it's what I put under the category
assistant dialogues.
Um, and these days, uh, since you have
all these, uh, large models that already
exist out there, and let's say you want
to generate a new one, you don't
necessarily have to start from scratch.
So, all of these that I mentioned were
originally human written. So you had all
these instructions that people gathered
and you had uh expert linguists that had
a set of instructions
on how to write the best answer that was
fluent, helpful and um everything that
was geared towards maximizing your
happiness as a user. But these days you
can use these already trained LLMs to
generate such data. So typically uh when
you have a large model that that is very
performant you can feed it such
instructions, sample some generated
outputs and then have a human or some
other LLM review the quality. So it's
something that speeds up a bit uh the
curation of such data sets. So I just
wanted to emphasize on the fact that
it's not only human generated these days
but you can have some assistance
and besides all of the categories that
you care about. So typically
u math or um you know how you write a
proof, how things follow each other or
code uh like high quality code bases.
You also have other aspects that um that
are very important when you when you
release a product like this to the wider
population that come under the umbrella
of safety. So uh you want your model to
be helpful but also harmless. So you
also have uh in the data mixture often
times a subset that enforces the model's
behavior not to uh repeat some of the
harmful content it might have seen on
its pre-training corpus. So it will
include techniques that might um reject
some um some some user prompts. So you
might have tested you know on your own
the limits of an LLM today. If you
submit queries that might lead to
something that that is considered bad it
would just say okay sorry I cannot
answer this and this sorry I cannot
answer this is you know could be done
via regex but very rarely because it's
not scalable and then people actually
embed this property as part of the
model. So um you have techniques that
like rejects user queries that are seen
as harmful or uh there is some other um
phenomenon such as hedging
that nuance the um the output of the
model in order to not make blanket
statement as well. So you might have
data sets that includes such um flavors
and you can have many more um depending
on the task of interest that you're
dealing with. But what I'm listing here
is what usually the LLM out there that
aim at being general assistants gather.
They tend to be like general tasks um
that there are being trained on.
Any questions on the data?
Yep.
Yep. So the question is the story
writing example doesn't specify what
kind of poetry uh the story should be
about. So it might be ambiguous. How can
it generalize? So that that is a great
point. So we rely on the model's uh
knowledge that it has accumulated at
pre-training time to be able to
generalize beyond the examples that it
has been fed. So let's say now you have
um a story about uh you know poetry and
from pre-training it knows about all
kinds of poetry that exist out there. So
you could imagine after fitting such an
example at supervised fine-tuning time
that you could give more details to this
prompt and then the model has the
ability to generalize the generation of
the story with respect to those other
attributes. So you can have I think like
the example that you mentioned is a
great um example of the
magical ability of LLMs to adapt to
natural language. uh and all of that has
to do with the kind of distribution it
has seen in the past and what we teach
it. So uh I think like the concept
itself of writing stories is the key
learning to the model that it can
generalize.
Does that make sense? Yeah. Okay.
Great. I think there was another question here.
Yep. So the question is about the term alignment: is this post-training what is called being aligned? You are ahead of me; in a few slides we're going to talk about it. Yeah.
Okay. Great questions. Any other questions?
Awesome.
And then, just like Afshine did for the pre-training part, I'm going to give some orders of magnitude. The same GPT-3 and Llama 3 papers that Afshine quoted some numbers from don't actually give their statistics in terms of token counts for the instruction tuning side, but in number of examples. You have about 13k examples used for GPT-3 and about 10 million for Llama 3. So let's do some quick estimation: say each example is about a thousand tokens. When you multiply that by the number of examples, you see that the size of the data sets used for SFT is several orders of magnitude lower than the one used for pre-training. The mental model to have here is that pre-training is a lot of data, where you want to learn general characteristics of language, whereas SFT is exactly what someone was mentioning just now regarding alignment: aligning the goal of the model to be suited for your tasks. So typically very high quality data sets, and much smaller in terms of number of examples, lower by orders of magnitude.
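As a rough illustration of these orders of magnitude, here is a quick back-of-the-envelope calculation in Python; the ~1,000 tokens per example follows the estimate above, while the pre-training corpus size is only an order-of-magnitude placeholder (trillions of tokens), not a figure taken from the papers.

```python
# Back-of-the-envelope: SFT data size vs. pre-training data size.
# Tokens-per-example and the pre-training corpus size are rough assumptions.
TOKENS_PER_EXAMPLE = 1_000
PRETRAINING_TOKENS = 1e13            # order of magnitude: trillions of tokens

sft_examples = {"GPT-3-style SFT": 13_000, "Llama 3 SFT": 10_000_000}

for name, n_examples in sft_examples.items():
    sft_tokens = n_examples * TOKENS_PER_EXAMPLE
    print(f"{name}: ~{sft_tokens:.0e} tokens, "
          f"~{PRETRAINING_TOKENS / sft_tokens:.0e}x smaller than pre-training")
```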
Okay, great. So now let's try this exercise again. What do you think the answer of our now instruction-tuned model to the same question could be? Anyone want to take a guess? You want to try again?
>> Yeah. So the answer was: yes, you can put the teddy bear in the washer, except that you really shouldn't; you should hand wash it. So it gives you an answer that is helpful to the user and indeed, this time, responds to the query, which is pretty nice.
So now we've talked about all the things that are great about supervised fine-tuning. Now I'm going to detail a bit more what could make it hard or challenging, and then motivate the optimization part that we're going to see afterwards. First, we talked about it a bit: when we have these data sets, these data mixtures for SFT training, we need high quality data, and high quality data often means having humans involved in the loop. You need to make sure it abides by all the rules and all the characteristics that the user will care about, so typically it's highly involved. Originally, for these first models, the data was almost all human-generated, but now you might have some mixture of human and model generation. One good thing is that the data sets they are trained on are reused, so it's work that you do once and can complement over time, but it's still very expensive in time and resources.
The second point I want to mention comes back to a point someone just made regarding the distribution of your SFT data set and how it aligns with the actual inference distribution. In the case of the story about a specific kind of poetry, the distribution of prompts that specify the kind of poetry seems close enough for generalization to happen. But you could think about examples that widely differ. Let's say you ask for a story in the style of some movie plot, and it differs from the stories that you might see in textbooks. This could be slightly out of distribution, and the model could have trouble generalizing to that. So prompt distribution is very important, and aligning that prompt distribution with the target task is of interest. Yep.
Yeah, great point. So the question is: let's say you have SFT-tuned your model, and now you put a training input back into the model. Will you get the same story? This has to do with the phenomenon that Afshine was mentioning regarding memorization. In practice you will most likely not see the same story, because the sampling that you're doing uses a non-zero temperature; it will maybe have the same flavor but not be word for word the same. The flavor of that story, if you have the exact same prompt, might be the same; it depends on what it has seen at pre-training time. So the answer is that, if I had to guess on that precise example, it might be the same flavor but definitely not the same wording. Yeah. And it might be a different story if the sampling goes through some other region of the space. Stories are highly creative, and the pre-training corpora, the data mixture, had all kinds of stories in there, so this specific example of a story is likely to generate different outcomes.
>> Yep. So the question is whether how much the model wanders around depends on the temperature parameter, and yes. Typically, when you have a higher temperature, the model is also said to be more creative, and for that reason it might generate things that are less likely in its output distribution with respect to what it has learned, and this is exactly a case of that. So yeah, the answer is yes.
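As a minimal sketch of what the temperature knob does (assuming standard temperature-scaled sampling from the softmax; the toy logits are made up):

```python
import torch

def sample_next_token(logits: torch.Tensor, temperature: float = 1.0) -> int:
    """Sample a token id from logits after temperature scaling."""
    if temperature <= 0:
        return int(torch.argmax(logits))              # greedy decoding
    probs = torch.softmax(logits / temperature, dim=-1)
    return int(torch.multinomial(probs, num_samples=1))

# Toy three-token vocabulary: higher temperature flattens the distribution,
# so lower-probability ("more creative") tokens get sampled more often.
logits = torch.tensor([2.0, 1.0, 0.1])
print(sample_next_token(logits, temperature=0.2))     # almost always token 0
print(sample_next_token(logits, temperature=2.0))     # much more varied
```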
So the question is: is there any way to tune that? When you tune the temperature, you do exactly that. Yeah.
Or were you thinking about something else?
Yep. So the question is: is there something else you could do to get variations in the response? For that I have a simple but hard solution: you would need more data in that category, which would ground the model into seeing what kind of distribution we target with such queries, and I think that would be one way for the model to generalize more. So I think it comes down to working on the data part.
Was there another question?
All good.
So yes, this is a very relevant question because it touches on the topic of generalization. Data mixture matters a lot, and having points that are diverse enough in the distribution space to give the model the gist of what it needs to learn, rather than repeating the same story again and again, has a lot to do with generalization power. And this is something we are going to see in a second: how do you evaluate such models? The feeling that you get as a user of how helpful a model is tends to be subjective, so how to put a number on it is going to be a key topic of interest. And then, very soon, we're going to see how to make the computations not that expensive. Afshine talked about training optimization techniques, but now that we're at the fine-tuning stage, maybe we can go one step further and see what simplifying assumptions we could make.
Okay, awesome. Now let's dive into one of these challenges, which is the evaluation part. People have decomposed what they care about into categories, so I'm listing here some dimensions that models are evaluated against and that give some quantitative number. For general language, a popular benchmark whose score people generally report is MMLU, Massive Multitask Language Understanding. It has 57 tasks that the model is evaluated against, and you get a score that you can compare with. You also have benchmarks on reasoning, on math reasoning, as well as on code generation. Here you have all sorts of acronyms, and there are even more out there. Setting up benchmarks has been a whole area of research, so you always see more and more benchmarks coming in, because people tend to optimize for those that exist and there are gaps that can be filled by further ones. GSM8K stands for Grade School Math, and 8K refers to the number of examples; it's grade-school-level math word problems, basically.
And one very interesting pattern that people have seen as these models came out and were evaluated against these benchmarks is that sometimes, for the same kind of model, you all of a sudden see a spike in numbers on some of these benchmarks without a clear explanation behind it. There is a paper that I highly recommend taking a look at, which explores the phenomenon of training on the test task, not the test set, the test task. When you have benchmarks regarding, let's say, math reasoning, it matters a lot whether the model has been trained on data that relates to that kind of reasoning or not. This is why, when you look at those kinds of benchmarks, you will see that there are sets of so-called auxiliary training data that can be used to train the model on the same domain, and in that way it enables comparing models with respect to that specific capability. What the paper tries to convey is that if you want to compare models with each other, you need to compare the training mixtures they have been trained on and ensure parity in terms of training on the test task. You need to make sure, for example, that either both have been trained on the test task or neither has. If one has and the other has not, then it might not be a good idea to compare them: it doesn't give you the intrinsic value of the model.
So yeah, that's an interesting phenomenon here. Any questions?
Okay, great. So we can move on. One thing that I was mentioning here is that it's very hard to get a sense of how good a model is, even with respect to benchmark scores, because oftentimes what happens as benchmarks come out is that people design data on the training side that more closely resembles what we try to solve on the benchmark side. So sometimes you end up with models that score great everywhere, but you as a user don't necessarily see the added value. And it's not the fault of the models, it's not the fault of the benchmarks; it's just very hard to give one number that conveys the value of the model to you. This is why people have come up with other techniques to put a number on model evaluation, and you might have heard of Chatbot Arena.
Have any folks heard of it here?
Yeah. So it's a website where models can be submitted, and users come in and ask their questions; they're presented with responses from two models and asked to judge which one is better. Then, with some pairwise computations done on the website side, they come up with a ranking that orders models with respect to, quote unquote, user preference.
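To give a feel for how such pairwise votes can be turned into a ranking, here is a simplified Elo-style sketch; the real leaderboard uses a more careful statistical fit (e.g., a Bradley-Terry model over all votes), and the model names and votes below are made up.

```python
from collections import defaultdict

def elo_update(ratings, winner, loser, k=32.0):
    """One Elo-style rating update from a single 'winner beat loser' vote."""
    ra, rb = ratings[winner], ratings[loser]
    expected_win = 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))
    ratings[winner] = ra + k * (1.0 - expected_win)
    ratings[loser] = rb - k * (1.0 - expected_win)

ratings = defaultdict(lambda: 1000.0)        # every model starts at 1000
votes = [("model_a", "model_b"), ("model_a", "model_c"), ("model_b", "model_c")]
for winner, loser in votes:
    elo_update(ratings, winner, loser)

print(sorted(ratings.items(), key=lambda kv: -kv[1]))
```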
So it's a sort of number that is being put on the vibes sensed by the user. And is it all perfect? Actually, no. It suffers from several issues, another set of issues that are hard to deal with. Among them: when a new model comes in, you have some noise at the beginning with respect to which other models it's being compared against, and these first few steps actually influence the final ranking quite a bit, which makes it kind of a brittle property. And there is a paper that actually shows that it's easily possible to rig such a leaderboard. I'm not sure if there is any evidence that it has been done in the past, but with any model, if you ask the question "who are you," it's just going to say who it is; if you ask GPT "who are you," it's going to say, hey, I'm ChatGPT, I'm a helpful assistant. This paper observed that with this very simple property it was possible to detect which model was being evaluated, and an adversarial player could then rig the ranking just by voting for the right model. So it's not foolproof.
And then, on other aspects: some of the benchmarks that evaluate these models are actually curated by experts who know very well the target distribution of what a given prompt should give. They have a set of guidelines that clearly says, okay, this is factual, this is non-factual, so they're able to determine what is good versus bad, and you as a user might not know about it. Let's just take the example of the teddy bear that you wanted to put in the washer. If I don't know that I need to hand wash my teddy bear, and I get a detailed response telling me to machine wash it cold, I could as a user find these pieces of advice helpful because they are actionable to me. But are they actually factual? That is a whole other set of issues that you as a user wouldn't be able to tell on a lot of queries.
And another challenge that we have here is user preference. So, who here likes it when there are emojis in LLM responses? Yes? No?
Yeah. Personally, I do. And I think there are strong opinions, and some people don't like it at all, so they will downrank such responses. But actually, it's something that the user should be able to tell and choose. And then the distribution of people who choose the best model versus the distribution of the wider population that is going to use these models is going to be different. I think the emoji case is a good example: emojis are generally popular in the wider population, but domain experts might not like them as much, so that mismatch is one that you would see here. And the last point I will mention is the safety side. You as a user don't really like it when a model rejects your prompt, when you ask about something and it says, okay, hey, sorry, I cannot answer that. So there will be a bias towards responses that actually respond to your query rather than respect some safety principle that might ultimately be the intended product decision. So there is also this kind of bias that pops up here.
And as I was mentioning, evaluation is a hard problem. You have all these angles you can explore, and it's not one number that's going to tell you what you care about; it's the combination of all of these, and in the end it's tailored to what you actually need. You would need to see the strengths and weaknesses of a given model and determine which one corresponds to your use case.
Okay. So I'm going to come back to one question about alignment here. In the next lecture, we're going to see a further step in aligning the model to do what you want, in a step called preference tuning. The combination of fine-tuning and preference tuning, which comes after pre-training, is what we call alignment of the model. So these two steps are called alignment.
And I want to call out one other thing. There is one step we didn't mention here, called mid-training, that has been emerging very recently. It consists of a step just after pre-training that aligns the kind of data the model is being trained on with the tasks that you really care about. It's the same pre-training objective, but aligning the kind of task and the kind of data set to something that you care about. So it's an emerging trend; I have not covered it, but just so that you know, mid-training is something that sits between pre-training and fine-tuning.
All good?
Okay, great. So now we're going to tackle one aspect of the challenges we had mentioned regarding fine-tuning, which is computational expense, the fact of being computationally expensive. We're going to look at it with one well-known technique called LoRA, a technique to fine-tune your weights at the fine-tuning stage in an efficient manner. It's widely used and it saves a lot of compute. When you look at your weight matrices, instead of directly fine-tuning the whole weight matrix, the LoRA technique decomposes the fine-tuning between the weights of the pre-trained model and additional weights that it decomposes into a low-rank multiplication.
So in this formulation, the pre-trained weights that you have are frozen: this W0 is frozen, and B and A are the matrices that you tune. Their outer dimensions match W0: the number of rows of B equals the number of rows of W0, and the number of columns of A equals the number of columns of W0. The inner dimension, the number of columns of B and the number of rows of A, is r, the rank of these matrices, which is typically taken to be very small. The dimensions of W0 are typically in the hundreds or thousands, while r is typically on the order of 10 or less. So as you can imagine, it results in far fewer weights to train.
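To make the shapes concrete, the decomposition can be written out as follows (d, k and r are generic dimension names used here for illustration):

```latex
W = W_0 + \Delta W = W_0 + BA,
\qquad
W_0 \in \mathbb{R}^{d \times k},\quad
B \in \mathbb{R}^{d \times r},\quad
A \in \mathbb{R}^{r \times k},\quad
r \ll \min(d, k)
```

so the number of trainable parameters drops from dk for full fine-tuning to r(d + k) for LoRA.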
And then I will just mention other techniques that improve efficiency and ease fine-tuning, methods called prefix tuning and adapters. They're explained in the class textbook, but we're not going to dive deep into them because they're less commonly used, just so that you know.
Okay, so let's walk through in detail how LoRA works. When you want to fine-tune your model, you have all these weights for which you have already learned a distribution at pre-training time. One naive way of doing fine-tuning would be to directly iterate on these weights. But what we're doing here, as we said, is to decompose this into the weights that we have already pre-trained and this product of matrices. What LoRA says is that you can do a forward pass on both these terms and then add the two quantities at the end. A and B are going to be a characterization of your task of interest. So let's say, as Afshine was mentioning, there are tasks you can specialize your model on, for example spam detection. You would take your pre-trained model, instantiate these B and A matrices alongside the weights of your model, fine-tune them, and then B and A are going to be specific to this task of spam detection; similarly for sentiment extraction and so on. So it's a very nice property that you can start from a base model, further tune your weights, and then have A and B matrices that are task-specific.
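Here is a minimal PyTorch sketch of that idea, a LoRA adapter wrapped around a frozen linear layer; the initialization and the alpha/r scaling follow a common convention, but the exact hyperparameters are illustrative.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pre-trained linear layer plus a trainable low-rank update BA."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():                      # W0 (and bias) stay frozen
            p.requires_grad = False
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)    # r x d_in
        self.B = nn.Parameter(torch.zeros(d_out, r))          # d_out x r, so BA starts at 0
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Forward pass on both terms, then add: W0 x + scale * B(Ax).
        return self.base(x) + self.scale * ((x @ self.A.T) @ self.B.T)

layer = LoRALinear(nn.Linear(4096, 4096), r=8)
out = layer(torch.randn(2, 4096))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(out.shape, trainable)                                   # only A and B are trainable
```

Because only A and B receive gradients, you can keep one copy of the base model and swap in different A/B pairs per task (spam detection, sentiment extraction, and so on).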
So I want to comment now on where these matrices are learned. The original LoRA paper mentioned training only on the attention matrices, but later on people realized that this might not be where it has the most positive impact on performance, and there is a blog that came out a few weeks ago that studies this in detail. They found that the feed-forward blocks are actually where putting LoRA is most beneficial. So today, typically both of these components carry LoRA matrices, but the bulk of the performance improvement is actually contained in the feed-forward block.
And then I want to mention two interesting properties of LoRA. When you fine-tune with LoRA weights, you want to use a higher learning rate; typically 10 times higher is the guidance. And one interesting fact is that it doesn't perform as well when you train it with larger batch sizes. I don't have a good theoretical explanation to give you for either of them; they are more empirical observations that have been made. But to give the main lines of the thinking shared by people who study this phenomenon: the first one might be driven by the rank of the LoRA matrices being small, and as a result, given the regions of space that they need to explore, you need a higher learning rate. For the second one, the hypothesis is that the training dynamics of a product of matrices are different from those of a full matrix, and this is where the large-batch-size phenomenon shows up. So those are basically the tentative explanations given there.
So now we're going to explore an optimization of this. But before I dive deep into that part, does anyone have any questions on the LoRA part? Yep.
Yeah, great point. So the question is: do you do a grid search on the rank? You could; the rank r is a design choice. Typically people will have done it before you, so you have an idea of what rank could be well suited for your use case. You could definitely do a grid search, or you could just pick a popular value; I think 4 is commonly used, and you can just go with one of those. We are going to see that the reduction in the number of parameters is already so huge that reducing it even further maybe doesn't matter that much; that initial reduction by orders of magnitude goes a long way. Beyond that it's just hyperparameter tuning, and you can see a given rank in a given setup as a design choice.
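To see how large that initial reduction is, here is a quick comparison; the layer size d = k = 4096 is just an illustrative choice.

```python
# Trainable parameters: full fine-tuning of a d x k matrix vs. LoRA of rank r.
d, k = 4096, 4096
for r in (4, 8, 16):
    full = d * k                      # update the whole weight matrix
    lora = r * (d + k)                # B is d x r, A is r x k
    print(f"r={r:2d}: LoRA trains {lora:,} params vs {full:,} "
          f"({full / lora:.0f}x fewer)")
```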
Okay, so we have two minutes and we're going to quickly cover quantized LoRA. The techniques that Afshine mentioned just now regarding quantizing weights, and thereby reducing the memory footprint, are what we're going to see here. When you look at these matrices W0, A and B, what people have done in this paper is to quantize the weights of W0 into a format that is very smart, and then compute and iterate on the matrices that are being learned, A and B, in higher precision, in that case bf16.
And the quantization of these frozen weights is super smart. It's in a format called NF4, which assumes that the weights are normally distributed and splits the space into quantiles rather than buckets of fixed size, which puts about the same number of values into each bucket. So it kind of optimizes the bits that you use for encoding.
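Here is a toy illustration of that quantile idea (a simplified sketch, not the actual NF4 codebook construction): with 16 levels placed at quantiles of a normal distribution, every 4-bit code covers roughly the same number of weights, whereas uniformly spaced levels leave the tail codes almost unused.

```python
import torch

torch.manual_seed(0)
w = torch.randn(100_000)                              # stand-in for frozen weights

uniform_levels = torch.linspace(float(w.min()), float(w.max()), 16)
quantile_levels = torch.quantile(w, torch.linspace(0.03, 0.97, 16))

def code_usage(x, levels):
    codes = (x[:, None] - levels[None, :]).abs().argmin(dim=1)   # nearest level
    return torch.bincount(codes, minlength=len(levels))

print("uniform: ", code_usage(w, uniform_levels).tolist())   # central codes dominate
print("quantile:", code_usage(w, quantile_levels).tolist())  # roughly even usage
```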
And it also does a double quantization process: it quantizes the weights, and then in a second step it quantizes the quantization constants, which we didn't cover at length here. Basically, when you want to convert your full-precision weights in and out of the quantized state, you have constants that are generated, and they propose to quantize these constants as well. This method yielded around 16x VRAM savings, and the double quantization step gave some extra savings on top, not that much, but it's interesting to know.
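In practice this is commonly set up with the Hugging Face ecosystem; here is a minimal sketch, assuming the transformers, peft and bitsandbytes libraries, with a placeholder model id and illustrative LoRA hyperparameters.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize the frozen base weights
    bnb_4bit_quant_type="nf4",              # NF4: quantile-based 4-bit format
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # do the matmuls in bf16
)

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",              # placeholder model id
    quantization_config=bnb_config,
)

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj", "up_proj", "down_proj"],  # attention + MLP
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)   # only the LoRA A/B matrices are trainable
model.print_trainable_parameters()
```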
And we're exactly on time. Thank you for your time.