Stanford CME295 Transformers & LLMs | Autumn 2025 | Lecture 4 - LLM Training
By Stanford Online
Summary
## Key takeaways

- **LLM Training: Pre-training vs. Fine-tuning**: LLMs are trained in two stages: pre-training on vast, general data to understand language, followed by fine-tuning to adapt the model to specific tasks or desired behaviors like being a helpful assistant. [09:16], [01:02:42]
- **Compute and Data Scale for LLMs**: Training LLMs requires immense computational resources and massive datasets, with models like Llama 3 trained on trillions of tokens, and the compute cost often reaching millions of dollars. [10:37], [13:04]
- **FlashAttention: GPU Memory Optimization**: FlashAttention optimizes GPU performance by minimizing data transfers between high-bandwidth memory (HBM) and on-chip SRAM, processing data in smaller blocks to reduce latency. [38:31], [40:07]
- **Quantization Reduces Model Size and Increases Speed**: Quantization reduces the precision of model weights, significantly decreasing memory footprint and increasing computational speed, which is crucial for deploying large models. [52:35], [55:59]
- **LoRA for Parameter-Efficient Fine-Tuning**: LoRA (Low-Rank Adaptation) fine-tunes LLMs by training only a small number of additional weights, decomposing the update into low-rank matrices, which drastically reduces computational cost and memory usage. [01:37:53], [01:38:37]
- **QLoRA: Quantized LoRA for Further Efficiency**: QLoRA combines LoRA with quantization, freezing the base model weights in a quantized format (like NF4) and fine-tuning only the LoRA adapters, leading to significant memory savings. [01:44:23], [01:45:35]
Topics Covered
- AI shifted from single-task models to general pre-trained models.
- Optimal model size is 20x smaller than its training data.
- How doing more computation can actually save time and memory.
- Why a raw LLM is just a fancy autocomplete.
- Fine-tune massive models by only training tiny adapter matrices.
Full Transcript
Cool.
Hello everyone and welcome to lecture 4
of CME 295. So today is Friday, October
the 17th, which means that the midterm
is one week away. So before we start,
I'm just going to go over some logistics
to make sure you know we're all aligned
on what to expect.
So the midterm will take place next
week, same time. Instead of an hour and
50 minutes, it will be an hour and 30
minutes. So it's 3:30 to 5 in this
classroom.
So it's like business as usual. Um in
terms of topics
the midterm will be about lectures 1,
two and three which we had and this one
which is lecture four.
So just to give you like an overview of
what you can expect in the midterm.
There's going to be uh some multiple-choice
questions along with some free form
questions but they're mainly going to be
about the things that we've seen in
class. So if you watch the recordings or
attend the lectures and you know just go
through the slides and know the
important formulas I think yeah you'll
be you'll be fine.
So I know that you may have questions
until uh next week. So that's why after
this lecture with Shervin we will be
holding office hours. So feel free to
you know come to us and ask us any
questions. And uh of course we'll be
fully available between now and next
week. So in case you have any questions,
feel free to uh ping us on Ed. Um and
yeah, we'll make sure to respond.
Um cool. Uh I also know that a number of
you are auditing this class. So in case
you're still interested to take the
midterm for some reason uh maybe uh
because you have an upcoming interview
uh just uh tell us so that we can just
expect the number of uh like copies to
print. So we'll be printing this on
Monday. So just let us know over the
weekend in case you're interested.
Cool. So that's for the midterm. And
then the second piece of news is the
final. So, we said we were working on
the dates. So, we finally finalized the
dates, which did not change. So, it's
Wednesday, December the 10th. Okay. So,
a little bit late, 7:00 p.m. to 8:30
p.m. Uh, so it's a slot that we have.
Uh, the location is different from this
one. So, it's in this room.
And the final will only cover the second
part of the class which is basically
lectures five to 9.
Any questions on this? Yeah.
Oh yeah, good question. So is it closed
notes? Yes. Yeah. Yes.
So question is what is the format of the
multiple choice? So you'll have uh so we
did not finish writing the exam, but
it's going to be something like you have
a question and you let's say you have
like three four possible answers and
then you just choose the the one that's
that's correct. Something like this. And
you'll also have some free form uh like
you'll have to just like answer in your
own words. Yeah.
>> Yeah. Uh thanks. So question is are we
allowed to take anything? So it's closed
book. So like
like yeah nothing just a pen.
Yeah.
Uh question is no calculator you will
not need calculator
but speaking of the cheat sheet. So I'm
not sure if we mentioned I think we did.
So there is a cheat sheet for this one
which we cannot bring to the exam but
you can use for your uh just for uh you
know just your studying. Uh that's on
the website class website.
I'd suggest looking at it.
Cool. So, super clear for everyone.
Very cool. Well, okay.
As always, we'll be starting the class
just recapping what we saw in the
previous lecture. Um so if you remember
we
basically
studied a new kind of architecture which
was called the mixture of experts uh
which is such that if you have an input
what you want is to not necessarily
activate all the parameters and so you
are in a setting where you have multiple
experts uh and in the forward pass you
only activate some of them so that's a
sparse MOE. You also have the dense MOE
which basically weights the outputs as a
function of the output of the gate. So
we saw that this architecture was used
in LLM
and it was mainly used to be able to
scale these LLMs without incurring an
expensive cost at inference time because
you don't want to activate all the
parameters.
The second thing that we saw was uh just
defining what an LLM was and in
particular how you could
decide on what the next token prediction
is. So we saw three methods. First one
was we called uh greedy decoding which
was always taking the highest probable
token.
The second method we saw was beam search
where we kept track of the k most
probable sequences.
And then the third one was sampling.
So we're not doing a most probable we're
not keeping track of the highest
probable sequences. What we do is we
sample the next token with respect to the
distribution that we get as output. And
then we saw there's this hyperparameter
that's called temperature that allows
you to tweak how spiky you want your
distribution to be versus not.
And we also saw some inference
optimization techniques which are used
in practice to avoid having uh like a
big cost at decoding time. So I'm not
going to just mention everything but I
would say KV cache for instance is a is
an important method. So yeah just
recommend just knowing what it is along
with the other ones.
And with that we're going to start
lecture four and actually I was really
looking forward to today because lecture
one we saw what self attention was what
a transformer was. Second lecture, we
saw
some of the tricks that people use today
and some of the variations from the
transformer. We introduced what an LLM
was last lecture and this lecture we're
finally going to see how these LLMs are
trained. So today we're going to focus
on LLM training.
And the first thing that I'm going to
say is if you've been in the ML field
for let's say more than a few years now
uh you may have noticed that
traditionally
if you had a task what you would do is
train a model specifically for that
task.
So let's suppose like 10 years ago,
let's suppose we had a task which was
around detecting spam. You would train a
model specifically to detect spams. So
you would train on the training set,
eval on the validation set and then test
on the test set. If you had another use
case that suppose sentiment extraction,
you would train a model specifically for
that
and so on and so forth.
But one could argue that these tasks,
they're not completely disjoint.
They're all involving just understanding
the text. So one could argue we could
find a way to somehow leverage the
knowledge that we acquired during
training for let's say one task
and reuse that for another task.
So this method has a name. It's been
around for some time. It's called
transfer learning.
So the goal of transfer learning is to
not always start from scratch. If you
have a new task, it's to start with some
pre-trained model. And we're going to
see what pre-train is and then tune it
for your task instead of starting from
scratch.
Well, it's basically the paradigm on
which LLMs are trained. So the idea here
is that all these tasks, they involve
understanding language. So, what we're
going to do is have what we call a
pre-training stage, which involves
training your LLM on vast amounts of
data to just understand what language,
what code is
and then have a second stage of quote
unquote tuning.
And we're going to see a little bit what
that tuning is. But in that second
stage, we're going to take our
pre-trained model and somehow find a way
to tune the weights to adapt to a
specific task.
So as an example here uh we would
pre-train a huge model and then suppose
for spam detection we would somehow tune
it for that uh sentiment instruction
same we would tune it for that and so
on.
So this is just to take the example from
before. And the idea here is in order to
obtain these models, we're not going to
start from scratch.
Cool. So okay. So now we're going to see
what pre-training is. So pre-training is
by far the most expensive both in terms
of compute cost, you know, everything
part of the training.
So what it does is taking a huge amount
of data and training your LLM to just
predict the next token.
And here by data what I mean is
basically everything you can find. So it
can be uh you know text in English, can
be text in other languages, it can be
even codes,
can be code in different languages, can
be basically the whole internet.
We're going to see some of the data sets
that people use for that. But you can
think of this as just training your
model to try to predict anything that's
written.
And as I mentioned, the objective here
is to predict the next token. So if you
remember, our LLM is a text-to-text model
and most likely a decoder only model in
more than 90% of the cases. So what it
does is it takes some input text and it
tries to always predict the next token
in an iterative basis.
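To make the objective concrete, here is a minimal sketch of the next-token prediction loss on toy tensors (random stand-ins, not any real model): the prediction at position t is scored against the true token at position t + 1.

```python
import torch
import torch.nn.functional as F

# Toy illustration of the pre-training objective: the model outputs a
# distribution over the vocabulary at every position, and position t's
# prediction is compared with the actual token at position t+1.
vocab_size, seq_len, batch = 100, 8, 2
logits = torch.randn(batch, seq_len, vocab_size)         # stand-in for model output
tokens = torch.randint(0, vocab_size, (batch, seq_len))  # stand-in for token ids

# Shift by one: predictions at positions 0..T-2 vs. tokens at 1..T-1.
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),
    tokens[:, 1:].reshape(-1),
)
print(loss.item())
```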
So in terms of the data sets that are
used, you will see the term common crawl
a lot on papers, it's basically a data
set composed of anything you can find on
the internet. So I think they have
something like three billion pages per
month. So if you go on their website,
they have a huge archive. So there's
a bunch of other websites as well that
you can find in there. So for instance
the Wikipedia articles any like social
media as well like Reddit I know there
are a lot of Reddit conversations in
those in those data sets you have a lot
of code and of course you have a bunch
of places for that you have GitHub you
have stack overflow all these like
forums that talk about code so all of
this is meant for your model to just
understand the structure of the language
and code
and in terms of size so it's measured in
terms of token number of tokens and one
order of magnitude that I want you to
remember is is on the order of hundreds
of billions or even trillions or even
tens of trillions of tokens.
So I'll give you an example. So GPT3 was
trained on 300 billion tokens and for
instance Llama 3 which was I believe
published last year was trained on 15
trillion tokens.
So these are huge data sets.
So before we go further, I want to
introduce two notations and I think one
of them I introduced I introduced it
last lecture.
The reason why I want to talk to you
about these notations is they are used
everywhere to talk about how much
compute
uh some model needs. So the first
notation is FLOPs
which stands for floating-point operations
and what it is is it's a unit of
compute. So the higher the flops the
more operations are involved because by
definition flops is the number of
operations that involve floatingoint
numbers. So floatingoint numbers you can
think of them as just like numbers with
decimal points.
So in terms of order of magnitude
training an LLM is on the order of
10^25 FLOPs
and the way you obtain flops. So usually
it's like a complicated formula but in
your mind you can think of it as
something that is a function of the size
of your data. So the number of tokens
that you train it on and the number of
parameters of your model.
So there's not like a universal formula
because it also is a function of the
architecture. So you can think of for
instance MoE-based LLMs as requiring let's
say less compute because only some parts
are activated compared to let's say
dense LLMs.
But you can just think of it as it's a
function of the number of tokens and
parameters. It's like O of the product
between the two more or less.
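To make that concrete, a common back-of-the-envelope rule (an approximation, not the exact accounting any lab uses) is that training a dense transformer costs roughly 6 × parameters × tokens FLOPs. Plugging in the GPT-3 numbers quoted above:

```python
# Rough training-compute estimate with the "~6 * params * tokens" rule
# of thumb for dense transformers (an approximation only).
params = 175e9   # GPT-3 parameter count
tokens = 300e9   # GPT-3 training tokens
flops = 6 * params * tokens
print(f"{flops:.2e} FLOPs")   # ~3.15e+23
```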
And then there's a second notation that
I want to introduce which is also FLOPS
but it's different. So here FLOPS stands
for floating-point operations per
second.
So it's a measure of compute speeds. So
it's basically how fast can your
hardware
execute these operations
and so you also have like some order of
magnitudes here. uh but if you're into
uh let's say GPUs, you will see that in
the description of GPUs, they always
indicate flops and we will see that in a
second.
But I just want to call out that FLOPS
here is usually all caps.
Although you may see some papers that
use one for the other,
which is confusing. So I just recommend
uh just contextualizing
this notation with respect to the
sentence that it is in because sometimes
people actually switch the two but this
is the common notation.
So far so good.
Cool.
Okay. So now we know that we have a
pre-training step. We know it involves a
lot of compute. We know it involves a
lot of data. We know that our model is
large. So what people did was trying to
see how the performance evolves as a
function of model size and training
size.
And there's this one paper called
Scaling Laws for Neural Language Models
that was published in 2020 that
performed a bunch of experiments by
varying these parameters. And what they
found was the more compute you have, the
better your model learns about
predicting the next token. Same for data
set uh size. So the more the bigger your
training set, the better it is. And the
bigger your model, the better it is.
So for some time, I think between 2019
and 2024,
you were seeing models that were larger
and larger, just people just building
things that were bigger and bigger
because according to these experiments,
um the performance was just getting
better.
So something else that they noticed was
bigger models tend to be more what they
call sample efficient.
So what that means is for an equal
amount of tokens that is processed
you will have a better performance with
a bigger model compared to a smaller
one.
But then you can wonder you know um we
don't have unlimited compute you know
compute is expensive it you know it has
a lot of drawbacks. So uh you have a
fixed compute and people also try to
answer the question given a certain
amount of compute.
How can you fix your training set size
and your model size in a way that's more
optimal?
Cuz um here uh you need to decide how
big is your model. So what they did is
they fixed a unit of compute which is
the color of these curves
and they tried training models of
different sizes with different training
set size. And what they saw was that
there was always a sweet spot here
which followed some kind of
relationship.
And in particular, this is a table that
summarizes quote unquote the optimal set
of number of parameters and training set
size, which is sometimes called the
Chinchilla law.
And what they realized was if you have
an amount of training set size that's
about 20 times
the model size then you're spending your
compute in quote unquote like an optimal
way. And in particular,
GPT3 for instance,
I think it was like 175
billion parameters if I remember
correctly, but it's only trained on 300
billion tokens. So this one for instance
is according to this really undertrained
quote unquote.
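As a quick, illustrative application of that ~20 tokens-per-parameter heuristic (a rule of thumb, not an exact law):

```python
# Chinchilla-style rule of thumb: train on ~20 tokens per parameter.
def compute_optimal_tokens(params):
    return 20 * params

print(compute_optimal_tokens(175e9) / 1e12)   # a GPT-3-sized model: ~3.5T tokens
# GPT-3 was trained on ~0.3T tokens, which is why it looks "undertrained"
# by this criterion.
```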
I think there's a question. Yeah.
Um so yeah the question is do they fix
the neural architecture? So I think um
by now everyone agrees that LLMs are
transformerbased decoder only models. So
everyone uses the same model.
Yeah. So you can assume that when I say
LLM here it basically means decoder only
transformer based models.
Yeah. Question is architecture change
does not play a big role. So that's what
they say actually in their paper. They
say the thing that changes the most is
the amount of tokens on which you train
and the size of your model.
Cool. Any other questions?
Yeah.
Oh yeah, good question. So question is,
is there some kind of transfer learning
between different versions of models?
Um, so for a lot of these models,
they're actually closed source, so they
don't exactly reveal these things. But,
uh, I guess it's an interesting
question.
um one that I cannot answer in a general
way. So maybe I think it's the best
answer I can give you. Um but in any
case uh when you look at um some of
these papers they always state how much
it costs to train this and it's always
in the order of you know millions. It's
always an expensive step regardless.
Cool. Um just uh speaking of that um so
pre-training has a lot of challenges.
One of them is cost. So, uh when I say
millions of dollars, it's a minimum. I
think it can even cost tens of millions
of dollars or sometimes hundreds of
millions of dollars. It takes a lot of
time
and um people have been mindful of the
impact on the environment. So,
they've also been including the
ecological cost.
So the other uh challenge is that the
pre-training
step is on data that is up to the time
at which you pre-train your model on. So
what that means is that the knowledge
that you acquire from training on this
data set can only go up until the date
at which you cut your data set.
So this date is called the knowledge
cutoff date. And so what that means is
your base model, your pre-trained base
model does not know has no way to know
by itself knowledge that occurred after
this date.
And speaking of that, a lot of papers
they've tried to edit knowledge, inject
knowledge. It's always tricky because uh
there's not a clean way to um you know
change the weights in a way that does
not penalize
some parts. So I guess what people want
to do is inject knowledge but not
regress in some other domains. And this
is a very hard problem. And of course
you know these models they try to
predict the next token and uh there's
this question of uh what if it just
generates something that it has seen at
training time. So what we call
plagiarism so there's always a risk. So
these are all the challenges I just want
to illustrate when I said the knowledge
cut off dates. So if you go on let's say
the OpenAI website or Google websites to
look at the model cards you will always
see so I'm not sure if you can see from
here but um there is always a line on
knowledge cutoff dates which tells you
on when the pre-training of this model
was done. So here for instance GPT-5 was
released a few weeks ago and here it
says the knowledge cutoff date is
September 30th. So you can guess that
they've done their pre-training around
that stage.
Cool.
Any questions on the first part?
Everyone good?
Perfect. So in this first part, we've
seen that pre-training was a crucial
step of the LLM training process and
we've seen all these big numbers
and one could wonder well how can you
train such a big model on such a big
amount of data like how do people do
that?
So this is what we're going to see here.
So just what I had mentioned so LLMs you
can think of them as decoder only
transformer-based models. So in order to
train your model you need that
you need a lot of data but then if you
look at your architecture
you see that a lot of the operations
involve matrix multiplications.
And I guess I have a question for you.
What is the kind of hardware that loves
matrix multiplications?
GPUs. Yes. So you also need GPUs.
Actually more than one. Yeah. You had a
question.
Oh. Uh question is GPUs for inference.
So this one we're going to focus on
training. But um requirements for GPUs
they differ a little bit between
training and inference. But in this
part, we're solely focused on training.
And speaking of GPUs, uh I guess uh it's
not GPUs everywhere because for
instance, Google, they've developed
their own hardware that's called TPUs.
Uh but any non Google
uh Google based models, they've most
likely been trained on GPUs.
Cool. So in order to train your model,
what do you do? So first of all you have
your LLM which is now so this is this
model but uh now we're representing with
a box just for simplicity you initialize
it uh it's like um you know lot of
parameters so you can think of uh the
scale as being somewhere around like
billions to hundreds of billions of
parameters. a huge model.
And what are the steps involved to train
a model? Well, what you're trying to do
is to tune the weights so that the model
can learn how to generate the next
token.
So, you have one step called the forward
pass where you have a bunch of data that
you're trying to pass through the
network. And um while we do that I just
want to call out things that are
important to note that we need to
somehow save in memory.
So when you do this forward pass you
have something that's called activations
which are basically the values at each
layer that are needed in order to
compute the loss.
So the loss tells you how off you are
compared to uh the label that you want
to train this on. And so the amount of
memory that you will use here is
dependent on a lot of things. It's
dependent on the model size which
impacts the number of activations. It's
dependent on how big your batch of data
is for training and it's dependent on
how large your context length is because
if you remember uh here we have O of n
squared complexity because of this self
attention operation where n is the
sequence length. So you have all these
parameters that come into play.
So once you do the forward pass, let's
suppose you compute the loss, you know
how off you are compared to your label.
Now the next step is to somehow tweak
the weights in a way that minimizes the
loss. So how do you do that? There's
another pass called the backward pass.
So what this pass does is quantify
the direction where the loss is going to
be minimized.
It's called a gradient. You take the
gradient of the loss with respect to
each parameter.
Well gradients they also need to be
saved somewhere in memory.
And then you have finally the weight
update
which is where you know where the
direction at which your loss is going to
be minimized. So you apply that update
to your weights and you typically use
optimizers out there like have you heard
of the Adam optimizer? Yeah. So the Adam
optimizer is just a fancy version that has
some additional quantities
uh which keep track of uh which are
basically a function of the gradient. So
you have the first moment and the second
moment, which are basically moving
averages of the gradient and of the
squared gradient,
and all these quantities. So the first
moment the second moment you also need
to somehow save them somewhere in
memory.
So it's a lot of things to save.
Well,
okay, breaking news. Memory is not
unlimited. Memory is limited. And so
here what we have in front of us is the
description of a GPU.
Uh I think so. Yeah, H100, which is a
very good GPU. And you will see that in
that description there's a line on GPU
memory. So GPU memory is your amount of
memory per GPU. It's uh 80 gig for this
one. It's quite large. So it's in on the
order of tens of gigabytes.
So you need to store all these things in
80 GB
which is not a lot.
So
what are we doing? What will you be
doing?
So I guess the idea is to leverage not
one but several GPUs in order to somehow
distribute the load across GPUs. And in
order to do that you have several
methods
which we will see in a second.
So the first set of methods is called
data parallelism
also known as DP.
So what this set of methods does is it
distributes
data across GPUs
so that this forward pass and backward
pass they can all be done kind of
independently.
And so the idea here is to divide the
batch of data across devices.
And then um in order to do that of
course you need to have a copy of the
model per device
because of course you need to compute
the activations you you need to compute
all these things. Um but when you do
that you're able to reduce the memory
that is linked to the batch size.
So that's called data parallelism. Yes.
uh question is how about the gradient
updates? Well, it's a great question.
So, how what do you do when you have
independent computations here and there?
Well, the gradient is just the average
of the gradients uh for this for this
thing. So, you have some communication
in between the GPUs that basically
aggregate the gradient for the
updates.
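Here is a small CPU-only illustration of why that averaging works (toy tensors, no real multi-GPU communication): the average of the per-shard gradients equals the full-batch gradient.

```python
import torch

# Simulate data parallelism on CPU: two "devices" each compute the
# gradient on their half of the batch; averaging those gradients
# matches the gradient computed on the full batch.
torch.manual_seed(0)
w = torch.randn(4, requires_grad=True)
x, y = torch.randn(8, 4), torch.randn(8)   # full batch of 8 examples

def grad_on(xb, yb):
    loss = ((xb @ w - yb) ** 2).mean()
    return torch.autograd.grad(loss, w)[0]

g_full = grad_on(x, y)
g_avg = (grad_on(x[:4], y[:4]) + grad_on(x[4:], y[4:])) / 2   # the "all-reduce" mean
print(torch.allclose(g_full, g_avg))   # True
```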
So, I have a question for you.
Is
this the answer to everything? Like if
we just scale up like this for I don't
know lot of GPUs is it is it is is it
great always great or do we have like
cons?
Oh yeah uh great point. So yeah you have
to fit one model so yeah that's great
point. So the second point that I will
add is you have an additional cost which
is called communication cost because you
need to somehow communicate between your
GPUs in order to aggregate some
quantities.
So your training is going to be slower.
It's good you you can scale up the
memory. Of course you need to uh fit a
model on a on a device and we will see
what how we can do to do that but you
will be incurring those communic
communication costs so it's not all you
know great
so speaking of the memory and the fact
that we want to I guess be able to at
least store a model per um per uh
device. So people have realized that
there's actually a lot of duplication
and there's been a paper on wanting to
deduplicate this duplicated information
and this method is called ZeRO,
the Zero Redundancy Optimizer,
and the idea is that on each GPU
you know you store the same parameters,
you store the same gradients, you store
the same optimizer states.
So the idea here is how about we shard
we partition those quantities across
GPUs. So the first variation is around
sharding the optimizer
states. So meaning we partition those
states across the GPUs. So this reduces
the memory by a lot. We can also
partition the gradients
and we can also partition the
parameters.
So here we have no redundant
information. Things are just
partitioned. Well, the problem is you're
going to have even more communication
costs, but at least it allows uh for us
to decrease the memory load on each GPU.
So this is ZeRO. So there's ZeRO-1, ZeRO-2, ZeRO-3.
And I guess the variation that you will
choose will be a function of how
sensitive you are to I guess training
time and how big is your model
and whether this will be an actual
problem or are you just fine with just
storing everything.
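To give a feel for the savings, here is a rough, purely illustrative per-GPU memory estimate for each ZeRO stage, using the common accounting of 2 bytes/parameter for fp16 weights, 2 for fp16 gradients, and 12 for the fp32 master weights plus Adam moments (actual numbers depend heavily on the setup):

```python
# Illustrative per-GPU memory (in GB) for weights + gradients + Adam
# states under the different ZeRO stages, for a hypothetical 7B model
# trained in mixed precision across 8 GPUs.
def per_gpu_gb(params=7e9, n_gpus=8, stage=0):
    weights = 2 * params            # fp16 weights
    grads   = 2 * params            # fp16 gradients
    optim   = 12 * params           # fp32 master weights + Adam moments
    if stage >= 1: optim   /= n_gpus   # ZeRO-1: shard optimizer states
    if stage >= 2: grads   /= n_gpus   # ZeRO-2: also shard gradients
    if stage >= 3: weights /= n_gpus   # ZeRO-3: also shard parameters
    return (weights + grads + optim) / 1e9

for s in range(4):
    print(f"ZeRO-{s}: {per_gpu_gb(stage=s):.1f} GB per GPU")
# Roughly: ZeRO-0: 112.0, ZeRO-1: 38.5, ZeRO-2: 26.2, ZeRO-3: 14.0
```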
So that's one set of methods. So this
set of methods is again data
parallelism.
So it's basically you having independent
sets of data that are handled by
different GPUs.
Well, you have another set of methods
that's called model parallelism.
So model parallelism tries to
parallelize
the operation even within one batch.
So there's a bunch of methods. I don't
want to sound too like a catalog. So
we're not go through them all by one by
one, but I will just call out a few that
are worth noting.
So if you remember last lecture we
talked about MOE based LLM and how
sequences were being sent to different
experts.
Well there is a way to distribute that
across GPUs via this expert parallelism
techniques
which is uh having let's say one expert
on a device another one on another
device.
So that's one thing worth noting.
Another one I will say so tensor
parallelism is uh when you have big
matrix multiplications
to somehow cut that in a way that
decreases the uh memory required for
that.
Okay. And maybe the last one I will say
is pipeline parallelism.
It's when
you consider a forward pass as involving
several layers. So you're going to say
that one GPU is going to only be
responsible for let's say layers 1 2 3
and then another one for layers four,
five, six, and so on and so forth
um so you also have that kind of
parallelism but anyways there's a bunch
of techniques and the ones that I
mentioned they fall in the bucket of
model parallelism
make sense
No need to know the details on there,
but I think just like knowing that there
are several methods and just a rough
idea, I think is a is a good good thing
to have in mind.
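As one tiny illustration of the tensor parallelism idea (toy tensors standing in for two hypothetical devices): if you split a weight matrix column-wise, each shard computes its slice of the output, and concatenating the slices recovers the full matrix multiplication.

```python
import torch

# Column-wise tensor parallelism in miniature: each "device" holds half
# of the weight matrix's columns and produces half of the output.
torch.manual_seed(0)
x = torch.randn(4, 16)            # activations
W = torch.randn(16, 32)           # full weight matrix
W0, W1 = W[:, :16], W[:, 16:]     # column shards for "device 0" and "device 1"

out_full = x @ W
out_sharded = torch.cat([x @ W0, x @ W1], dim=-1)   # gather the partial outputs
print(torch.allclose(out_full, out_sharded))        # True
```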
Cool.
So, what did we do? So, we realized that
during the training process, we had to
save a lot of things in memory. So what
we saw was techniques that reduce
the burden of having memory per GPU. So
we are trying to distribute that across
GPUs. So we saw data parallelism and
then the zero method that has some extra
optimizations and we saw model
parallelism as well.
So now we're going to see another
technique that leverages the structure
of the GPU. And you may have heard of
this technique is called flash
attention.
It was actually developed here at
Stanford uh in 2022.
And in order for me to talk to you about
this technique, I want to tell you more
about what GPU is composed of.
So if you look under the hood, well GPU
is very complicated and I'm for sure not
uh I don't know everything either, but
what I know is that we have two kinds of
memories in a GPU. So you have one kind
of memory that's big but relatively slow
that's the HBM
and then another kind of memory that is
fast but much much smaller which is on
chip next to the where the compute
happens that's called the SRAM
so you have HBM and SRAM. HBM has
something around uh you know tens of
gigabytes so it's like the GPU memory
that you saw in the description.
SRAM is much smaller. It's like
something around like several like you
know tens of megabytes let's say. So
it's much smaller but then it is like 10
times faster. So this one is uh a few
terabytes per second let's say and the
SRAM is uh tens of terabytes per second.
So it's like a noticeable difference in
speed.
So what we want is to somehow leverage
the strength of these kinds of memories
in order to speed up the attention
computation in a in an exact way. So
what do I mean by exact way? So what I
mean is we're not making any
approximations to the computation.
What we're doing is we're just
leveraging the strength of these
components and sending the computation
in a in a clever way.
So
if you remember the self attention
computation is done with this very
important uh formula. So it's softmax of
queries and the keys over some scaling
factor times v.
So this allows queries to interact with
everyone else.
Um so in matrix form you can think of
queries as being uh as having the number
of rows equal to the sequence length and
then columns to being the the dimension
of the query and then you same for key
and value. So you have this big matrix
multiplications.
So if you do it, if you do this
computation the standard, the vanilla
way, what you would do is store them in
the big but slow memory component of the
GPU.
So you would store it in the HBM.
So here is what you would do if you were
to not do any optimization. So you would
take those matrices from the big but
slow
HBM,
perform the computation
and then write it back to the HBM
and then you would read that result
again from the HBM, compute the softmax
and then write it back to the HBM
and then you would again load this plus
the value matrix multiply them and then
write them to the HBM.
See there's like a lot of read and write
to the HBM. So it's a lot of uh data
transfer
which actually becomes the bottleneck.
So a GPU is very very fast but then you
spend a lot of time just loading your
matrices from the memory.
The reason why you do that is because of
the softmax operation. So do you
remember what a softmax does? So it
normalizes the quantities so that they
sum to one but it's row dependent
meaning that each row needs to sum up to
one.
So in a sense you need that computation
to happen first before you do your
softmax. Like uh if you just like look
at it like that you you would think yeah
you you need to do the whole thing
first.
Well turns out that you don't need to do
everything
at once.
And this is the core idea behind flash
attention.
So what flash attention does is it tries
to minimize the amount of read and write
from and to the HBM
and instead takes small blocks and it's
called tiling. The method is called
tiling. It takes small blocks that it
sends to the SRAM so that it gets
computed from end to end before being
sent back to the HBM.
Does that make sense? So the idea is
let's send small matrices into the SRAM
so that it does the whole you know full
end to end computation and just send it
back to the HBM because we want to
minimize the amount of read and write
from the HBM.
So here's how how you would do it. So
you remember the softmax
uh computation with the query and the
key and then the value. Well, what you
would do is to cut your matrices
and then proceed step by step.
But then there's a cool trick that I
want to talk to you about which is that
you don't need to compute the whole
matrix inside a softmax
in order to achieve the whole softmax
computation
cuz if you think about it let's suppose
you have a whole matrix and then you
have like different let's say columns or
like submatrices S1 to Sn, well the softmax of
this huge matrix
is equal to this matrix where the
softmax is taken with respect to each of
these submatrices
up to some scaling factor.
So this is the core trick
and if you want to be convinced of it
just look at the softmax formula it's
like exponential of something over some
quantity which is shared across the row.
So this scaling factor will just
fix this with respect to that.
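Here is a small numeric check of that trick on toy numbers (it ignores the running-max bookkeeping that the real kernel also does for numerical stability):

```python
import torch

# Blockwise softmax: the softmax of a full row can be rebuilt from the
# softmaxes of its blocks, each rescaled by (block normalizer / global
# normalizer).
torch.manual_seed(0)
s = torch.randn(10)          # one row of scores Q K^T / sqrt(d_k)
s1, s2 = s[:5], s[5:]        # two blocks of that row

full = torch.softmax(s, dim=0)

z1, z2 = s1.exp().sum(), s2.exp().sum()   # per-block normalizers
z = z1 + z2                               # global normalizer
blockwise = torch.cat([
    torch.softmax(s1, dim=0) * (z1 / z),  # rescale block 1
    torch.softmax(s2, dim=0) * (z2 / z),  # rescale block 2
])
print(torch.allclose(full, blockwise))    # True
```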
So with this in mind, what we will do is
take each respective slices of these
matrices,
do the whole computation and then
populate the corresponding
uh entry in the output matrix.
So we will do that between let's say the
first slice of the query and the first
slice of the keys and the values and
then we will repeat for the other slice
until the end
and then we will repeat for the other
queries as well until the end.
So what the paper explains is how this
scaling factor is being computed. So
this one is some formula that I did not
put on the slide. So it's not necessary
for you to memorize the formula. It's
just the idea
and the idea is exactly this trick.
So once you do that,
you basically end up with only one read
from the HBM
and these like tiled quantities are
stored in the SRAM
and then they're read from the SRAM
which is very fast and then computed and
then back to the SRAM and then at the
end in order to accumulate the results
they're being sent back to the HBM.
So, just to make sure we're clear. So,
in green is basically when it's read from
the SRAM and then in blue is from the
HBM. You have a question? Yeah.
>> Yeah. The question is do you take the
whole row or a portion of it? So you can
take a portion of it but just for
illustrative purposes. Here we take the
like just this is just for illustrative
purposes. You can think of your your
matrix as being completely uh you know
like a grid and then uh you just like
multiply accordingly. But yeah this is
just for illustrative purposes. Yeah.
Yep.
Yeah. So the question is uh are you
computing alpha and all these quantities
on the fly or do you have some
estimation? So all this is exact and
they're computed in an iterative basis.
So the way it works is when you
populate the output you will keep on
having some extra quantity that will
adjust for that.
Think of it as just some formula that
works.
So yeah highly recommend looking at the
paper. They actually make that quite
explicit. So the paper is FlashAttention:
Fast and Memory-Efficient Exact Attention
with IO-Awareness. All
of the links are in the the slides. But
yeah, highly recommend just looking at
the exact formula in case you want to be
convinced.
Cool. But the idea makes sense overall.
Any questions on this?
Cool.
Well, this was flash attention.
But there's actually another idea from
the paper. So this was only the first
idea which was around making the
attention computation
faster.
The second idea is given that the
attention computation is faster.
Now let's let's try to be smarter about
the backwards pass
because when you compute in the
backwards pass when you compute the
gradient of the loss with respect to a
parameter
the chain rule will surface some
activations that you need to have in
memory.
Well, here given that computing these
activations is very fast,
one idea is to just not store
activations from the forward pass or at
least not store everything
but instead in the backwards pass
compute these activations again.
So it's called recomputation.
So when you do the forward pass, you
compute an activation to be able to
compute the loss, but then instead of
saving it to reuse it during the
gradient update, you will just discard
it and then recompute the activation
during the backwards pass with this very
uh fast technique.
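In PyTorch, this recomputation idea is usually exposed as gradient checkpointing; a minimal sketch (assuming a reasonably recent PyTorch version and a hypothetical small block) looks like this:

```python
import torch
from torch.utils.checkpoint import checkpoint

# The block's intermediate activations are not kept after the forward
# pass; they are recomputed during the backward pass instead.
block = torch.nn.Sequential(
    torch.nn.Linear(512, 2048), torch.nn.GELU(), torch.nn.Linear(2048, 512)
)
x = torch.randn(4, 512, requires_grad=True)

y = checkpoint(block, x, use_reentrant=False)   # recompute on backward
y.sum().backward()
```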
And when you do that, it's actually
quite remarkable. So you do more
operations
Uh so gigaflops is uh you know it's just
derived from FLOPs. So it's the
number of operations you're doing. So
with flash attention you're doing more
operations because you're actually
recomputing things.
But then so you also see fewer reads and
writes from the HBM. So here for
instance it was 40.3 in the standard way
and then 4. So it's like almost a 10x
reduction. But then you see that the
runtime is also smaller.
So this is very remarkable because
usually when you recompute things
you are saving memory but at the expense
of
uh runtime you're just taking longer but
here you're not only taking less amount
of time you're also saving memory. So
you're basically having everything it's
the best of all worlds.
So this is what flash attention uh
mentioned. So you will see that there
are some uh I guess variants of flash
attention. So FlashAttention-2, FlashAttention-3 and I
would say that these are more adaptation
of these methods to the current infra
because each new GPU comes with new um
pros and cons, strength and weaknesses
and I guess there's always a way to make
these optimizations better.
Uh but yeah, I would say flash attention
is quite a common trick and I think it's
a very good thing to know.
So does that make sense?
Cool. I take that as a yes.
Okay, cool. Okay, so last thing. No, we
have a few things um that I want to talk
to you about. So you know, you have
your LLM with a bunch of weights. These
weights, they're all floating points. So
you can think of them as, you know, just
being numbers with a bunch of numbers
after the decimal. One natural question
you can ask yourself is, do you really
need to know
that much like precision after the
decimal point to be able to do a good
job? I guess put another way can you
just like cut your precision in some way
to save on memory but then keep the same
performance.
So this is the idea behind quantization.
So quantization is the process of
converting the precision of a number
from let's say one uh setting to another
to in order to better understand that I
think it's important to know how
floating points sorry floating point uh
numbers how they are encoded.
So in practice
they're just a bunch of bits
and here you have some bits that are
responsible for let's say the exponent
some they're responsible for the mantisa
which is basically how granular
your number is and then one that is
about the sign.
So you have a bunch of representations
of floats. So the most common ones are
in this table. So you have uh single
precision, half precision,
floating point 64, and brain float 16,
each having different granularities for
these three dimensions.
So if we take the first two rows which
are the two most common uh
representations that you will see out
there,
floating point 16 is only represented on
16 bits.
compared to 32. So it takes basically
half the memory if if you want to think
about it this way. Uh but then it's less
precise, meaning it's less granular. So
I guess one idea that we could have is
to somehow decrease the granularity of
these weights and these numbers
hoping that it will not impact
performance too much.
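As a quick illustration of the memory side of that trade-off on a toy tensor (values are unchanged up to rounding):

```python
import torch

# The same tensor stored in fp16 takes half the bytes of fp32.
w32 = torch.randn(1000, 1000)             # fp32 by default
w16 = w32.half()                          # cast to fp16

print(w32.element_size() * w32.numel())   # 4,000,000 bytes
print(w16.element_size() * w16.numel())   # 2,000,000 bytes
```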
So another thing that I want to uh
mention is you know back to that uh GPU
description
you see that there is also a bunch of
information around the compute speed. So
I'm not sure if you can see very well on
the can you see very well on the slide.
So I I'll read it. Uh the compute speed
as a function of which kind of
representation you're using. So here if
you're using FP64 which is this super
granular way of representing numbers you
only have 34 teraflops of compute speed
but then if you're using let's say FP32
which is half of it you can uh kind of
double
your compute speed and so on you have
all these like other um numbers as well.
So the idea is you can save on memory,
you can also go faster.
So with this in mind, I just want to
touch on one last technique before
giving it to Shervin, which is mixed
precision training.
So the idea behind mixed precision
training is to leverage different
granularities of float representations
such that it will not hurt the
performance too much
but allow you to save on memory and do
things faster.
So the idea behind mixed precision
training is you have your models or you
have your model,
you keep your weights
in high precision, so FP32
and all the operations that you do in
the forward and backwards pass, you're
going to do it in a lower precision. So
in this case FP16
but then the weight updates will be done
still in FP32.
So the authors of the paper when they
when they did that basically what they
realized was the performance was not
degraded too much but then you had a lot
of savings on memory and then it was
running faster.
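A minimal sketch of such a loop with PyTorch's automatic mixed precision utilities, assuming a CUDA device and using a hypothetical tiny model and dummy data, could look like this:

```python
import torch

# Mixed precision training sketch: weights stay in fp32, most forward
# and backward ops run in fp16 under autocast, and the GradScaler
# protects small fp16 gradients from underflowing.
model = torch.nn.Linear(512, 512).cuda()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()

for _ in range(10):
    x = torch.randn(8, 512, device="cuda")
    opt.zero_grad()
    with torch.cuda.amp.autocast():        # lower-precision forward pass
        loss = model(x).pow(2).mean()
    scaler.scale(loss).backward()          # scaled backward pass
    scaler.step(opt)                       # fp32 weight update
    scaler.update()
```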
So now you may wonder well why are you
keeping the weights at a high
precision but not the activations and so
on. Um so I can offer you just a bit of
intuition.
So whenever you perform a forward pass
you're performing that on a set of data
but the set of data can be noisy in
itself.
So maybe all the decimal points after so
all the numbers after the decimal points
may not be as useful.
So you can think of this update as being
more like you know in which direction
should the weights go to go to a more
optimal state. So it doesn't require
them to be super precise,
but then it's much more important to
keep your weights, so the weights of the
model in high precision
in order to not accumulate
errors due to quantization.
That's one way of just think about it.
But the long story short is if you
reduce the granularity of some numbers
then you will have a lot lot of benefits
and not a lot of disadvantages in terms
of the performance.
Cool.
Yeah.
That's a great question. So the question
is do you apply this uh technique to all
weights of all layers or just to some of
them? So there's been a lot of papers on
that. Uh and the answer is there's some
variations to it. Um and so the answer
is not necessarily. So I do have some
pointers happy to share share them with
you. But um some parts may be more
important than others.
So this strategy is a little bit uh like
a high level idea but there are always
some variations uh from setup to setup.
Yep.
So is your question that people rely on
the like relationship between compute
and sorry not compute parameters and
token from the Chinchilla paper. Uh but
it may be different from their setup
which may introduce something is is
basically your question.
So the question is uh is there also an
optimal precision to use? So I would say
people use different things there's not
like a set way where everyone does
something but um I would say that these
uh scaling law like this Chinchilla law
is actually something that some authors
they try to reproduce for their own
model.
So actually uh for instance the Llama 3
paper there's a whole part around trying
to uh have some relationship between uh
you know given a fixed amount of compute
what is the optimal
uh you know number of tokens and then
number of parameters that is uh unique
to their setup.
So what people do is they do this thing
themselves on a small amount of training
uh training set and on smaller models so
that it doesn't cost them too much. They
try to see what the relationship is for
their setup and then they extrapolate
and I believe that's how Llama 3 did it. So I
think for the Llama 3 405 billion parameter
model I think they had the
whole section just justifying that they
came up to this number with some
experiments that they run. So to your
question um I think it highly depends on
the kind of model that someone is
training and they'll probably
have this analysis I guess with respect
to their own setup.
Does that answer your question?
Cool. Cool.
Okay. Yeah.
Yeah. Yeah. Yeah. The question is uh is
there something you can do about the
range? So there are several types of
quantization. We'll not go into details
here but I can give you like a zero
point quantization and absmax quantization
which are different techniques that play
with these ranges. Uh there's also like
a quantization technique that Shervin will
cover that also talks about that. Um so
yeah stay tuned for this but um I think
0 point quantization and absmax are are
ones that maybe you can look into for
this.
Cool.
So with that I'm going to give it to
Shervin.
Thank you Afin. So now we have seen uh
how pre-training worked. We're going to
see together the next stages of
training. Um so as we saw with Afin
pre-training was a way to um build the
intuition to the model about how
language was constructed and about uh
the main characteristics of what is
contained in the training data that is
general enough. So this is why we had
corpora that span huge amounts of text
that are typically of the size of what
you find on the internet. So typically
very large and raw and conveying um
aspects of language.
So now you might want to ask yourself
how can we make such knowledge actually
helpful. So this is something that we're
going to discuss
just in a second. But before we discuss
about what could motivate further
training, I want to come back to some
very simple example
that will
um surface what could be such a need. So
let's take our f favorite example about
our teddy bear and let's say we have a
very um practical use case. So, we have
thoroughly loved our teddy bear. Uh, but
now it has become a bit dirty. So, we
need to wash it. And you might want to
ask your favorite LLM um if you can put
it in the washer. So, with the
description that Afin mentioned of LLM
pre-training, I wanted to ask you all
what was your opinion about what could
the output to this be?
So, any guesses?
Yep.
Yeah.
Yeah. Excellent. Yeah. So the the answer
the guess was maybe another question. So
I think that is a great guess that could
definitely be it. So here in the example
we put some other sentence that could be
likely but the the gist is the same. it
has been trained on next token
prediction not being an assistant or
someone that is helpful to you. So this
is why uh it will try to mimic what
could be the pattern here and what could
be potential likely next words. So for
example, the example we put here is
something that relates to uh like how
the teddy bear is composed of um and uh
because it's in the same domain as kind
of teddy bear and then when you talk
about washer then maybe it has been
trained on data that includes materials
of teddy bears. So that could be it. Um
and I have a question with you all. Um,
are you happy with it?
No. Yeah, exactly. So, this is why we're
going to see what we can do with the
model to make it helpful to you.
And this stage is called fine-tuning. So
we're going to see um that it enables us
to uh like in the case of general like
the general case of LLMs to make the LLM
a helpful assistant but also if you have
a specific use case you can also use
this technique to tune the general
representation of the language towards
your task of interest. Uh but we're
going to focus first on what is done for
LLMs in general.
Uh so first I want to define some terms.
So model fine-tuning is commonly known
as SFT. SFT is uh short for supervised
fine-tuning. So the supervised part
suggests that we need some labels and
it's actually the case. So it's pairs of
inputs and outputs that we provide the
model to be trained on. And this is why
uh it's called supervised. And then
finetuning is uh the term that refers to
uh refining the weights that have been
already trained. So you you start from
the pre-trained weights and you further
train the model on additional data to
get a fine-tuned model. Um and um and
then one interesting note that I will
tell you all is that the objective
function even though it's the same will
differ a bit from the pre-training task.
So in the pre-training task you start
with the BOS token beginning of sentence
and you feed uh all your corpora of data
trying to predict the next token. But
here since you have uh this supervised
fine-tuning setup where you want the
model to do something useful in cases of
interest. So let's say when I ask the
model about washing my teddy bear. So
washing my teddy bear is actually an
input that I give to the model. It's
it's fixed. I don't want the model to
parrot what I say. I want the model to
be a helpful assistant to what I
condition it on. So in this setup, the
input is not a location where you would
predict the next token. You would not do
teacher forcing on it, but rather start
from this input and then predict the
next token onwards. And then you're
going to tune what the optimal um like
distribution of next token can be based
on this input.
So any questions on the overall idea of
SFT before we dive in a bit deeper?
Yep.
So the question is we don't put a loss
over the input. That's right. So you
start from the input and you start
predicting the next token and then the
loss calculation starts from there.
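A common way to implement that masking (a sketch on toy tensors, assuming the usual convention of setting prompt positions to the label -100, which PyTorch's cross-entropy ignores by default):

```python
import torch
import torch.nn.functional as F

# Only the response tokens contribute to the SFT loss; the instruction
# tokens are masked out of the label sequence.
vocab = 100
tokens = torch.randint(0, vocab, (1, 12))   # [instruction | response]
prompt_len = 5
labels = tokens.clone()
labels[:, :prompt_len] = -100               # mask the instruction part

logits = torch.randn(1, 12, vocab)          # stand-in for model output
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab),
    labels[:, 1:].reshape(-1),              # -100 positions are ignored
)
```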
Yeah,
everything good.
Okay.
So uh what I want uh what I was just
telling you um is that SFT can be done
to tune your own uh task of interest but
also it's something that is being done
at uh the one of the main stages of
training LLMs as you use it every day
and then this transition from uh the
model being a good representation of
language to a useful assistant is a
subcategory of SFT called instruction
tuning. So instruction
tuning comes from the fact that we want
the model to answer instructions.
So now I'm going to like show with the
graph what I was just mentioning
regarding predicting the next word. Uh
okay. So actually just in a bit first
we're going to look at the data. So um a
with a we saw that the composition of
the data for pre-training was basically
the whole internet where you had uh
sources that were carrying knowledge
such as Wikipedia um and in general like
masses of text of English that um that
tells that tells the model about how
English and other languages work as well
as coding and here it's slightly
different for instruction tuning. We're
going to look at data that presents to
the model how it can most helpfully
respond to instructions. So you can
divide this in several categories. So
here we present some categories that
could be helpful to users of LLM. For
example, uh story writing, you know,
you're interested in writing a story.
you give it a set of instructions and
then you have a ground truth that is
attached to it. Other um examples
include poem creation, list generation,
explanation and and many more that are
part of the data mixture of SFT.
So basically you uh run all these
supervised fine-tuning
um training with the input that is given
and then you train uh the model on
predicting the output
and I'm going now to show uh what it
looks like. So let's say your
instruction is here. So you ask your
model to do something. Um, so it could
be formulated in a different way, but
just to keep it short, I just said do X
and then the yellow part is where you
start fitting your objective function
on.
Um, does this make sense?
Great.
So as I was uh mentioning the data
mixture in includes um instruction
um geared data. So for example you had
all of these uh kinds of tasks that uh
users might ask for at inference time.
So it's what I put under the category
assistant dialogues.
Um, and these days, uh, since you have
all these, uh, large models that already
exist out there, and let's say you want
to generate a new one, you don't
necessarily have to start from scratch.
So, all of these that I mentioned were
originally human written. So you had all
these instructions that people gathered
and you had uh expert linguists that had
a set of instructions
on how to write the best answer that was
fluent, helpful and um everything that
was geared towards maximizing your
happiness as a user. But these days you
can use these already trained LLMs to
generate such data. So typically uh when
you have a large model that that is very
performant you can feed it such
instructions, sample some generated
outputs and then have a human or some
other LLM review the quality. So it's
something that speeds up a bit uh the
curation of such data sets. So I just
wanted to emphasize on the fact that
it's not only human generated these days
but you can have some assistance
and besides all of the categories that
you care about. So typically
u math or um you know how you write a
proof, how things follow each other or
code uh like high quality code bases.
You also have other aspects that um that
are very important when you when you
release a product like this to the wider
population that come under the umbrella
of safety. So uh you want your model to
be helpful but also harmless. So you
also have uh in the data mixture often
times a subset that enforces the model's
behavior not to uh repeat some of the
harmful content it might have seen on
its pre-training corpus. So it will
include techniques that might um reject
some um some some user prompts. So you
might have tested you know on your own
the limits of an LLM today. If you
submit queries that might lead to
something that that is considered bad it
would just say okay sorry I cannot
answer this and this sorry I cannot
answer this is you know could be done
via regex but very rarely because it's
not scalable and then people actually
embed this property as part of the
model. So um you have techniques that
like rejects user queries that are seen
as harmful or uh there is some other um
phenomenon such as hedging
that nuance the um the output of the
model in order to not make blanket
statement as well. So you might have
data sets that includes such um flavors
and you can have many more um depending
on the task of interest that you're
dealing with. But what I'm listing here
is what usually the LLM out there that
aim at being general assistants gather.
They tend to be like general tasks um
that there are being trained on.
Any questions on the data?
Yep.
Yep. So the question is the story
writing example doesn't specify what
kind of poetry uh the story should be
about. So it might be ambiguous. How can
it generalize? So that that is a great
point. So we rely on the model's uh
knowledge that it has accumulated at
pre-training time to be able to
generalize beyond the examples that it
has been fed. So let's say now you have
um a story about uh you know poetry and
from pre-training it knows about all
kinds of poetry that exist out there. So
you could imagine after fitting such an
example at supervised fine-tuning time
that you could give more details to this
prompt and then the model has the
ability to generalize the generation of
the story with respect to those other
attributes. So you can have I think like
the example that you mentioned is a
great um example of the
magical ability of LLMs to adapt to
natural language. uh and all of that has
to do with the kind of distribution it
has seen in the past and what we teach
it. So uh I think like the concept
itself of writing stories is the key
learning to the model that it can
generalize.
Does that make sense? Yeah. Okay.
Great. I think there was another question here.
Yep. So the question is about the term alignment: is this post-training what is called being aligned? You are ahead of me; in a few slides we're going to talk about it. Yeah.
Okay. Great questions. Any other questions?
Awesome.
And then, just like Afshine did for the pre-training part, I'm going to give some orders of magnitude. The same GPT-3 and Llama 3 papers that Afshine quoted some numbers from don't actually give their statistics in terms of token counts for the instruction tuning side, but in number of examples. You have about 13k examples used for GPT-3 and about 10 million for Llama 3. So let's do some quick estimation: say each example is about a thousand tokens. When you multiply that by the number of examples, you see that the size of the data sets used for SFT is several orders of magnitude lower than the one used for pre-training. The mental model to have here is that pre-training is a lot of data, where you want to learn general characteristics of language, whereas SFT is exactly what someone was mentioning just now regarding alignment: aligning the goal of the model to be suited for your tasks. So typically very high quality data sets, and much smaller in terms of number of examples, lower by orders of magnitude.
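As a rough illustration of these orders of magnitude, here is a quick back-of-the-envelope calculation in Python; the ~1,000 tokens per example follows the estimate above, while the pre-training corpus size is only an order-of-magnitude placeholder (trillions of tokens), not a figure taken from the papers.

```python
# Back-of-the-envelope: SFT data size vs. pre-training data size.
# Tokens-per-example and the pre-training corpus size are rough assumptions.
TOKENS_PER_EXAMPLE = 1_000
PRETRAINING_TOKENS = 1e13            # order of magnitude: trillions of tokens

sft_examples = {"GPT-3-style SFT": 13_000, "Llama 3 SFT": 10_000_000}

for name, n_examples in sft_examples.items():
    sft_tokens = n_examples * TOKENS_PER_EXAMPLE
    print(f"{name}: ~{sft_tokens:.0e} tokens, "
          f"~{PRETRAINING_TOKENS / sft_tokens:.0e}x smaller than pre-training")
```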
Okay, great. So now let's try this exercise again. What do you think the answer of our now instruction-tuned model to the same question could be? Anyone want to take a guess? You want to try again?
>> Yeah. So the answer was: yes, you can put the teddy bear in the washer, except that you really shouldn't; you should hand wash it. So it gives you an answer that is helpful to the user and indeed, this time, responds to the query, which is pretty nice.
So now we've talked about all the things that are great about supervised fine-tuning. Now I'm going to detail a bit more what could make it hard or challenging, and then motivate the optimization part that we're going to see afterwards. First, we talked about it a bit: when we have these data sets, these data mixtures for SFT training, we need high quality data, and high quality data often means having humans involved in the loop. You need to make sure it abides by all the rules and all the characteristics that the user will care about, so typically it's highly involved. Originally, for these first models, the data was almost all human-generated, but now you might have some mixture of human and model generation. One good thing is that the data sets they are trained on are reused, so it's work that you do once and can complement over time, but it's still very expensive in time and resources.
The second point I want to mention comes back to a point someone just made regarding the distribution of your SFT data set and how it aligns with the actual inference distribution. In the case of the story about a specific kind of poetry, the distribution of prompts that specify the kind of poetry seems close enough for generalization to happen. But you could think about examples that widely differ. Let's say you ask for a story in the style of some movie plot, and it differs from the stories that you might see in textbooks. This could be slightly out of distribution, and the model could have trouble generalizing to that. So prompt distribution is very important, and aligning that prompt distribution with the target task is of interest. Yep.
Yeah, great point. So the question is: let's say you have SFT-tuned your model, and now you put a training input back into the model. Will you get the same story? This has to do with the phenomenon that Afshine was mentioning regarding memorization. In practice you will most likely not see the same story, because the sampling that you're doing uses a non-zero temperature; it will maybe have the same flavor but not be word for word the same. The flavor of that story, if you have the exact same prompt, might be the same; it depends on what it has seen at pre-training time. So the answer is that, if I had to guess on that precise example, it might be the same flavor but definitely not the same wording. Yeah. And it might be a different story if the sampling goes through some other region of the space. Stories are highly creative, and the pre-training corpora, the data mixture, had all kinds of stories in there, so this specific example of a story is likely to generate different outcomes.
>> Yep. So the question is whether how much the model wanders around depends on the temperature parameter, and yes. Typically, when you have a higher temperature, the model is also said to be more creative, and for that reason it might generate things that are less likely in its output distribution with respect to what it has learned, and this is exactly a case of that. So yeah, the answer is yes.
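As a minimal sketch of what the temperature knob does (assuming standard temperature-scaled sampling from the softmax; the toy logits are made up):

```python
import torch

def sample_next_token(logits: torch.Tensor, temperature: float = 1.0) -> int:
    """Sample a token id from logits after temperature scaling."""
    if temperature <= 0:
        return int(torch.argmax(logits))              # greedy decoding
    probs = torch.softmax(logits / temperature, dim=-1)
    return int(torch.multinomial(probs, num_samples=1))

# Toy three-token vocabulary: higher temperature flattens the distribution,
# so lower-probability ("more creative") tokens get sampled more often.
logits = torch.tensor([2.0, 1.0, 0.1])
print(sample_next_token(logits, temperature=0.2))     # almost always token 0
print(sample_next_token(logits, temperature=2.0))     # much more varied
```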
So the question is: is there any way to tune that? When you tune the temperature, you do exactly that. Yeah.
Or were you thinking about something else?
Yep. So the question is: is there something else you could do to get variations in the response? For that I have a simple but hard solution: you would need more data in that category, which would ground the model into seeing what kind of distribution we target with such queries, and I think that would be one way for the model to generalize more. So I think it comes down to working on the data part.
Was there another question?
All good.
So yes, this is a very relevant question because it touches on the topic of generalization. Data mixture matters a lot, and having points that are diverse enough in the distribution space to give the model the gist of what it needs to learn, rather than repeating the same story again and again, has a lot to do with generalization power. And this is something we are going to see in a second: how do you evaluate such models? The feeling that you get as a user of how helpful a model is tends to be subjective, so how to put a number on it is going to be a key topic of interest. And then, very soon, we're going to see how to make the computations not that expensive. Afshine talked about training optimization techniques, but now that we're at the fine-tuning stage, maybe we can go one step further and see what simplifying assumptions we could make.
Okay, awesome. Now let's dive into one of these challenges, which is the evaluation part. People have decomposed what they care about into categories, so I'm listing here some dimensions that models are evaluated against and that give some quantitative number. For general language, a popular benchmark whose score people generally report is MMLU, Massive Multitask Language Understanding. It has 57 tasks that the model is evaluated against, and you get a score that you can compare with. You also have benchmarks on reasoning, on math reasoning, as well as on code generation. Here you have all sorts of acronyms, and there are even more out there. Setting up benchmarks has been a whole area of research, so you always see more and more benchmarks coming in, because people tend to optimize for those that exist and there are gaps that can be filled by further ones. GSM8K stands for Grade School Math, and 8K refers to the number of examples; it's grade-school-level math word problems, basically.
And one very interesting pattern that people have seen as these models came out and were evaluated against these benchmarks is that sometimes, for the same kind of model, you all of a sudden see a spike in numbers on some of these benchmarks without a clear explanation behind it. There is a paper that I highly recommend taking a look at, which explores the phenomenon of training on the test task, not the test set, the test task. When you have benchmarks regarding, let's say, math reasoning, it matters a lot whether the model has been trained on data that relates to that kind of reasoning or not. This is why, when you look at those kinds of benchmarks, you will see that there are sets of so-called auxiliary training data that can be used to train the model on the same domain, and in that way it enables comparing models with respect to that specific capability. What the paper tries to convey is that if you want to compare models with each other, you need to compare the training mixtures they have been trained on and ensure parity in terms of training on the test task. You need to make sure, for example, that either both have been trained on the test task or neither has. If one has and the other has not, then it might not be a good idea to compare them: it doesn't give you the intrinsic value of the model.
So yeah, that's an interesting phenomenon here. Any questions?
Okay, great. So we can move on. One thing that I was mentioning here is that it's very hard to get a sense of how good a model is, even with respect to benchmark scores, because oftentimes what happens as benchmarks come out is that people design data on the training side that more closely resembles what we try to solve on the benchmark side. So sometimes you end up with models that score great everywhere, but you as a user don't necessarily see the added value. And it's not the fault of the models, it's not the fault of the benchmarks; it's just very hard to give one number that conveys the value of the model to you. This is why people have come up with other techniques to put a number on model evaluation, and you might have heard of Chatbot Arena.
Have any folks heard of it here?
Yeah. So it's a website where models can be submitted, and users come in and ask their questions; they're presented with responses from two models and asked to judge which one is better. Then, with some pairwise computations done on the website side, they come up with a ranking that orders models with respect to, quote unquote, user preference.
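To give a feel for how such pairwise votes can be turned into a ranking, here is a simplified Elo-style sketch; the real leaderboard uses a more careful statistical fit (e.g., a Bradley-Terry model over all votes), and the model names and votes below are made up.

```python
from collections import defaultdict

def elo_update(ratings, winner, loser, k=32.0):
    """One Elo-style rating update from a single 'winner beat loser' vote."""
    ra, rb = ratings[winner], ratings[loser]
    expected_win = 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))
    ratings[winner] = ra + k * (1.0 - expected_win)
    ratings[loser] = rb - k * (1.0 - expected_win)

ratings = defaultdict(lambda: 1000.0)        # every model starts at 1000
votes = [("model_a", "model_b"), ("model_a", "model_c"), ("model_b", "model_c")]
for winner, loser in votes:
    elo_update(ratings, winner, loser)

print(sorted(ratings.items(), key=lambda kv: -kv[1]))
```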
So it's a sort of number that is being put on the vibes sensed by the user. And is it all perfect? Actually, no. It suffers from several issues, another set of issues that are hard to deal with. Among them: when a new model comes in, you have some noise at the beginning with respect to which other models it's being compared against, and these first few steps actually influence the final ranking quite a bit, which makes it kind of a brittle property. And there is a paper that actually shows that it's easily possible to rig such a leaderboard. I'm not sure if there is any evidence that it has been done in the past, but with any model, if you ask the question "who are you," it's just going to say who it is; if you ask GPT "who are you," it's going to say, hey, I'm ChatGPT, I'm a helpful assistant. This paper observed that with this very simple property it was possible to detect which model was being evaluated, and an adversarial player could then rig the ranking just by voting for the right model. So it's not foolproof.
And then, on other aspects: some of the benchmarks that evaluate these models are actually curated by experts who know very well the target distribution of what a given prompt should give. They have a set of guidelines that clearly says, okay, this is factual, this is non-factual, so they're able to determine what is good versus bad, and you as a user might not know about it. Let's just take the example of the teddy bear that you wanted to put in the washer. If I don't know that I need to hand wash my teddy bear, and I get a detailed response telling me to machine wash it cold, I could as a user find these pieces of advice helpful because they are actionable to me. But are they actually factual? That is a whole other set of issues that you as a user wouldn't be able to tell on a lot of queries.
And another challenge that we have here is user preference. So, who here likes it when there are emojis in LLM responses? Yes? No?
Yeah. Personally, I do. And I think there are strong opinions, and some people don't like it at all, so they will downrank such responses. But actually, it's something that the user should be able to tell and choose. And then the distribution of people who choose the best model versus the distribution of the wider population that is going to use these models is going to be different. I think the emoji case is a good example: emojis are generally popular in the wider population, but domain experts might not like them as much, so that mismatch is one that you would see here. And the last point I will mention is the safety side. You as a user don't really like it when a model rejects your prompt, when you ask about something and it says, okay, hey, sorry, I cannot answer that. So there will be a bias towards responses that actually respond to your query rather than respect some safety principle that might ultimately be the intended product decision. So there is also this kind of bias that pops up here.
And as I was mentioning, evaluation is a hard problem. You have all these angles you can explore, and it's not one number that's going to tell you what you care about; it's the combination of all of these, and in the end it's tailored to what you actually need. You would need to see the strengths and weaknesses of a given model and determine which one corresponds to your use case.
Okay. So I'm going to come back to one question about alignment here. In the next lecture, we're going to see a further step in aligning the model to do what you want, in a step called preference tuning. The combination of fine-tuning and preference tuning, which comes after pre-training, is what we call alignment of the model. So these two steps are called alignment.
And I want to call out one other thing. There is one step we didn't mention here, called mid-training, that has been emerging very recently. It consists of a step just after pre-training that aligns the kind of data the model is being trained on with the tasks that you really care about. It's the same pre-training objective, but aligning the kind of task and the kind of data set to something that you care about. So it's an emerging trend; I have not covered it, but just so that you know, mid-training is something that sits between pre-training and fine-tuning.
All good?
Okay, great. So now we're going to tackle one aspect of the challenges we had mentioned regarding fine-tuning, which is computational expense, the fact of being computationally expensive. We're going to look at it with one well-known technique called LoRA, a technique to fine-tune your weights at the fine-tuning stage in an efficient manner. It's widely used and it saves a lot of compute. When you look at your weight matrices, instead of directly fine-tuning the whole weight matrix, the LoRA technique decomposes the fine-tuning between the weights of the pre-trained model and additional weights that it decomposes into a low-rank multiplication.
So in this formulation, the pre-trained weights that you have are frozen: this W0 is frozen, and B and A are the matrices that you tune. Their outer dimensions match W0: the number of rows of B equals the number of rows of W0, and the number of columns of A equals the number of columns of W0. The inner dimension, the number of columns of B and the number of rows of A, is r, the rank of these matrices, which is typically taken to be very small. The dimensions of W0 are typically in the hundreds or thousands, while r is typically on the order of 10 or less. So as you can imagine, it results in far fewer weights to train.
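To make the shapes concrete, the decomposition can be written out as follows (d, k and r are generic dimension names used here for illustration):

```latex
W = W_0 + \Delta W = W_0 + BA,
\qquad
W_0 \in \mathbb{R}^{d \times k},\quad
B \in \mathbb{R}^{d \times r},\quad
A \in \mathbb{R}^{r \times k},\quad
r \ll \min(d, k)
```

so the number of trainable parameters drops from dk for full fine-tuning to r(d + k) for LoRA.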
And then I will just mention other techniques that improve efficiency and ease fine-tuning, methods called prefix tuning and adapters. They're explained in the class textbook, but we're not going to dive deep into them because they're less commonly used, just so that you know.
Okay, so let's walk through in detail how LoRA works. When you want to fine-tune your model, you have all these weights for which you have already learned a distribution at pre-training time. One naive way of doing fine-tuning would be to directly iterate on these weights. But what we're doing here, as we said, is to decompose this into the weights that we have already pre-trained and this product of matrices. What LoRA says is that you can do a forward pass on both these terms and then add the two quantities at the end. A and B are going to be a characterization of your task of interest. So let's say, as Afshine was mentioning, there are tasks you can specialize your model on, for example spam detection. You would take your pre-trained model, instantiate these B and A matrices alongside the weights of your model, fine-tune them, and then B and A are going to be specific to this task of spam detection; similarly for sentiment extraction and so on. So it's a very nice property that you can start from a base model, further tune your weights, and then have A and B matrices that are task-specific.
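Here is a minimal PyTorch sketch of that idea, a LoRA adapter wrapped around a frozen linear layer; the initialization and the alpha/r scaling follow a common convention, but the exact hyperparameters are illustrative.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pre-trained linear layer plus a trainable low-rank update BA."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():                      # W0 (and bias) stay frozen
            p.requires_grad = False
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)    # r x d_in
        self.B = nn.Parameter(torch.zeros(d_out, r))          # d_out x r, so BA starts at 0
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Forward pass on both terms, then add: W0 x + scale * B(Ax).
        return self.base(x) + self.scale * ((x @ self.A.T) @ self.B.T)

layer = LoRALinear(nn.Linear(4096, 4096), r=8)
out = layer(torch.randn(2, 4096))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(out.shape, trainable)                                   # only A and B are trainable
```

Because only A and B receive gradients, you can keep one copy of the base model and swap in different A/B pairs per task (spam detection, sentiment extraction, and so on).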
So I want to comment now on where these matrices are learned. The original LoRA paper mentioned training only on the attention matrices, but later on people realized that this might not be where it has the most positive impact on performance, and there is a blog that came out a few weeks ago that studies this in detail. They found that the feed-forward blocks are actually where putting LoRA is most beneficial. So today, typically both of these components carry LoRA matrices, but the bulk of the performance improvement is actually contained in the feed-forward block.
And then I want to mention two interesting properties of LoRA. When you fine-tune with LoRA weights, you want to use a higher learning rate; typically 10 times higher is the guidance. And one interesting fact is that it doesn't perform as well when you train it with larger batch sizes. I don't have a good theoretical explanation to give you for either of them; they are more empirical observations that have been made. But to give the main lines of the thinking shared by people who study this phenomenon: the first one might be driven by the rank of the LoRA matrices being small, and as a result, given the regions of space that they need to explore, you need a higher learning rate. For the second one, the hypothesis is that the training dynamics of a product of matrices are different from those of a full matrix, and this is where the large-batch-size phenomenon shows up. So those are basically the tentative explanations given there.
So now we're going to explore an optimization of this. But before I dive deep into that part, does anyone have any questions on the LoRA part? Yep.
Yeah, great point. So the question is: do you do a grid search on the rank? You could; the rank r is a design choice. Typically people will have done it before you, so you have an idea of what rank could be well suited for your use case. You could definitely do a grid search, or you could just pick a popular value; I think 4 is commonly used, and you can just go with one of those. We are going to see that the reduction in the number of parameters is already so huge that reducing it even further maybe doesn't matter that much; that initial reduction by orders of magnitude goes a long way. Beyond that it's just hyperparameter tuning, and you can see a given rank in a given setup as a design choice.
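To see how large that initial reduction is, here is a quick comparison; the layer size d = k = 4096 is just an illustrative choice.

```python
# Trainable parameters: full fine-tuning of a d x k matrix vs. LoRA of rank r.
d, k = 4096, 4096
for r in (4, 8, 16):
    full = d * k                      # update the whole weight matrix
    lora = r * (d + k)                # B is d x r, A is r x k
    print(f"r={r:2d}: LoRA trains {lora:,} params vs {full:,} "
          f"({full / lora:.0f}x fewer)")
```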
Okay, so we have two minutes and we're going to quickly cover quantized LoRA. The techniques that Afshine mentioned just now regarding quantizing weights, and thereby reducing the memory footprint, are what we're going to see here. When you look at these matrices W0, A and B, what people have done in this paper is to quantize the weights of W0 into a format that is very smart, and then compute and iterate on the matrices that are being learned, A and B, in higher precision, in that case bf16.
And the quantization of these frozen weights is super smart. It's in a format called NF4, which assumes that the weights are normally distributed and splits the space into quantiles rather than buckets of fixed size, which puts about the same number of values into each bucket. So it kind of optimizes the bits that you use for encoding.
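Here is a toy illustration of that quantile idea (a simplified sketch, not the actual NF4 codebook construction): with 16 levels placed at quantiles of a normal distribution, every 4-bit code covers roughly the same number of weights, whereas uniformly spaced levels leave the tail codes almost unused.

```python
import torch

torch.manual_seed(0)
w = torch.randn(100_000)                              # stand-in for frozen weights

uniform_levels = torch.linspace(float(w.min()), float(w.max()), 16)
quantile_levels = torch.quantile(w, torch.linspace(0.03, 0.97, 16))

def code_usage(x, levels):
    codes = (x[:, None] - levels[None, :]).abs().argmin(dim=1)   # nearest level
    return torch.bincount(codes, minlength=len(levels))

print("uniform: ", code_usage(w, uniform_levels).tolist())   # central codes dominate
print("quantile:", code_usage(w, quantile_levels).tolist())  # roughly even usage
```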
And it also does a double quantization process: it quantizes the weights, and then in a second step it quantizes the quantization constants, which we didn't cover at length here. Basically, when you want to convert your full-precision weights in and out of the quantized state, you have constants that are generated, and they propose to quantize these constants as well. This method yielded around 16x VRAM savings, and the double quantization step gave some extra savings on top, not that much, but it's interesting to know.
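In practice this is commonly set up with the Hugging Face ecosystem; here is a minimal sketch, assuming the transformers, peft and bitsandbytes libraries, with a placeholder model id and illustrative LoRA hyperparameters.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize the frozen base weights
    bnb_4bit_quant_type="nf4",              # NF4: quantile-based 4-bit format
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # do the matmuls in bf16
)

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",              # placeholder model id
    quantization_config=bnb_config,
)

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj", "up_proj", "down_proj"],  # attention + MLP
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)   # only the LoRA A/B matrices are trainable
model.print_trainable_parameters()
```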
And we're exactly on time. Thank you for your time.