Stanford CS229 | Machine Learning | Building Large Language Models (LLMs)
By Stanford Online
Summary
## Key takeaways
- **LLM Training: Data, Evaluation, and Systems Matter Most**: While academia often focuses on architecture and training algorithms, practical LLM development hinges on data quality, effective evaluation, and robust systems. Industry prioritizes these latter three components for successful model implementation. [01:50], [02:48]
- **Tokenization Balances Vocabulary and Sequence Length**: Tokenization is crucial for LLMs, moving beyond simple word-based approaches. Techniques like Byte Pair Encoding create tokens from common sub-sequences, aiming for an average of three to four letters per token to manage sequence length and handle variations like typos. [10:48], [12:26]
- **Scaling Laws Predict LLM Performance Improvements**: LLM performance scales predictably with increased compute, data, and model size. These scaling laws, often linear on log-log plots, allow for forecasting future performance and optimizing resource allocation, indicating that architecture nuances are secondary to scale. [41:03], [46:43]
- **RLHF Aligns LLMs Beyond Supervised Fine-Tuning**: Reinforcement Learning from Human Feedback (RLHF) refines LLMs by maximizing human preference, addressing limitations of Supervised Fine-Tuning (SFT) such as behavioral cloning and potential hallucinations. This process involves training a reward model or directly optimizing for human preferences, with DPO offering a simpler alternative to traditional RL. [01:00:07], [01:13:10]
- **GPU Optimization: Mixed Precision and Operator Fusion**: Maximizing GPU utilization requires optimizing for throughput and leveraging matrix multiplication. Techniques like mixed precision (using 16-bit floats) and operator fusion (e.g., via `torch.compile`) reduce communication overhead and memory consumption, significantly speeding up computations. [01:38:00], [01:41:38]
Topics Covered
- Data, evaluation, and systems matter more than model architecture.
- The complex pipeline for cleaning internet data for training.
- Scaling laws allow us to predict future AI performance.
- The surprising $75 million cost of training a model.
- Why pure language models are not useful AI assistants.
Full Transcript
So let's get started. I'll be talking about building LLMs today. I think a lot of you have heard of LLMs before, but just as a quick recap: LLMs, standing for large language models, are basically all the chatbots that you've been hearing about recently, so ChatGPT from OpenAI, Claude from Anthropic, Gemini, Llama, and other models like this, and today we'll be talking about how they actually work. It's going to be an overview, because it's only one lecture and it's hard to compress everything, but hopefully I'll touch a little bit on all the components that are needed to train some of these LLMs. Also, if you have questions, please interrupt me and ask; if you have a question, most likely other people in the room or on Zoom have the same question, so please ask.
Great. So what matters when training LLMs? There are a few key components that matter. One is the architecture: as you probably all know, LLMs are neural networks, and when you think about neural networks you have to think about what architecture you're using. Another component which is really important is the training loss and the training algorithm, so how you actually train these models. Then there's data: what do you train these models on. Then evaluation, which is how you know whether you're actually making progress towards the goal of LLMs. And then the systems component, which is how you actually make these models run on modern hardware. That's really important because these models are really large, so now more than ever, systems is actually a really important topic for LLMs.

So, those five components. You probably all know that LLMs, if you don't know, are all based on Transformers, or at least some version of Transformers. I'm actually not going to talk about the architecture today, one because I gave a separate lecture on Transformers a few weeks ago, and two because you can find so much information online about Transformers, but there's much less information about the other four topics, so I really want to talk about those. Another thing to say is that most of academia actually focuses on the architecture, training algorithms, and losses. As academics, and I've done that for a big part of my career, it's simply that we like thinking that making new architectures and new models is what's very important, but in reality, honestly, what matters in practice is mostly the three other topics: data, evaluation, and systems, which is what most of industry actually focuses on. So that's also one of the reasons why I don't want to talk too much about the architecture, because really the rest is super
important. Great, so here's an overview of the lecture. I'll be talking about pre-training. Pre-training, you've probably heard that word, is the general term for the classical language modeling paradigm, where you basically train your language model to essentially model all of the internet. And then there's post-training, which is a more recent paradigm, which is taking these large language models and making them essentially AI assistants; this is more of a recent trend since ChatGPT. So if you've ever heard of GPT-3 or GPT-2, that's really pre-training land; if you've heard of ChatGPT, which you probably have, that's really post-training land. I'll be talking about both, but I'll start with pre-training, and specifically I'll talk about what the task of pre-training LLMs is and what loss people actually use.
use so language modeling this is a quick
recap uh language models at a high level
are simply models of probability
distribution over sequences of tokens or
of words so it's basically some uh model
of P of X1 to XL where X1 is basically
word one and Excel is the last one in
the sequence or in the sentence um so
very concretely if you have a sentence
like the mouse ate the cheese what the
language model gives you is simply a
probability of this sentence being
uttered by a human or being found on on
online uh so if you have another
sentence like the the mouse at cheese uh
here there's grammatical mistakes so the
model should know that this uh should
have some syntactic knowledge so it
should know that this has less
likelihood of appearing
online uh if you have another sentence
like the cheese ate the mouse uh then
the model should hopefully know about
the fact that usually cheese don't eat
Mouse um so there's some semantic
knowledge and this is less likely than
the first sentence so this is basically
at a high level what language models are
One term that you've probably been hearing a lot in the news is generative models. These are just models that can generate sentences or generate some data. The reason why we say language models are generative models is that once you have a model of a distribution, you can simply sample from this model and generate data, so you can generate sentences using a language
model. So the type of models that people are all currently using are what we call autoregressive language models, and the key idea of autoregressive language models is that you take this distribution over words and you basically decompose it into the distribution of the first word, multiplied by the distribution of the second word given the first word, multiplied by the probability of the third word given the first two words, and so on. There's no approximation here, this is just the chain rule of probability, which you hopefully all know about; really no approximation, this is just one way of modeling a distribution. Slightly more concisely, you can write it as a product of the probabilities of the next word given everything which happened in the past, so given the context: p(x_1, ..., x_L) = ∏_i p(x_i | x_1, ..., x_{i-1}). So this is what we call autoregressive language models. Again, this is really not the only way of modeling a distribution, this is just one way; it has some benefits and some downsides. One downside of autoregressive language models is that when you actually sample from them, you basically have a for loop which generates the next word, then conditions on that next word, and then generates another word. So if you have a longer sentence that you want to generate, it takes more time to generate it. There are some downsides of this current paradigm, but that's what we currently have, so I'm going to talk about this
one. Great, so autoregressive language models at a high level: the task of an autoregressive language model is simply predicting the next word, as I just said. So if you have a sentence like "she likely prefers", one potential next word might be "dogs", and the way we do it is that we first tokenize, so you take these words or subwords, you tokenize them, and then you give an ID to each token, so here you have 1, 2, 3. Then you pass it through this black box (as I already said, we're not going to talk about the architecture), you just pass it through a model, and you then get a probability distribution over the next word, over the next token. Then you sample from this distribution, you get a new token ID, you detokenize, and that's how you basically sample from a language model. One thing which is important to note is that the last two steps are actually only needed during inference. When you do training, you just need to predict the most likely token, and you can just compare to the real token which happened in practice, and then you basically change the weights of your model to increase the probability of generating that token.
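To make that pipeline concrete, here is a minimal sketch of the sampling loop in PyTorch-style Python; `model` and `tokenizer` are hypothetical stand-ins (any model that returns next-token logits and any encode/decode pair would do), not code from the lecture.

```python
import torch

def sample(model, tokenizer, prompt, max_new_tokens=20):
    # Tokenize: text -> token IDs (e.g. "she likely prefers" -> [1, 2, 3]).
    ids = tokenizer.encode(prompt)
    for _ in range(max_new_tokens):
        # The model maps the context to logits over the vocabulary;
        # keep only the logits for the last position (the next token).
        logits = model(torch.tensor([ids]))[0, -1]
        probs = torch.softmax(logits, dim=-1)
        # Sample a token ID from the distribution, then condition on it.
        next_id = torch.multinomial(probs, num_samples=1).item()
        ids.append(next_id)
    # Detokenize: token IDs -> text.
    return tokenizer.decode(ids)
```

Note that the softmax/sample/detokenize steps at the end are the inference-only part; during training you stop at the logits and compare them to the real next token.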
Great, so autoregressive neural language models. So, to be slightly more
specific, still without talking about the architecture: the first thing we do is that we have all of these... oh sorry, yes? "On the previous slide, when you're predicting the probability of the next tokens, does this mean that your final output vector has to be the same dimensionality as the number of tokens that you have?" Yes. "How do you deal with it if you're adding more tokens to your corpus?" Yeah, so we're going to talk about tokenization later, so you will get some sense of this. You basically can deal with adding new tokens... I'm kind of exaggerating, there are methods for doing it, but essentially people don't do it. So it's really important to think about how you tokenize your text, and that's why we'll talk about that later. But it's a very good point to notice that the vocabulary size, so the number of tokens that you have, is essentially the output dimension of your language model, so it's actually pretty
large. Okay, so autoregressive neural language models: the first thing you do is take every word, or every token, and embed it, so you get some vector representation for each of these tokens. You pass them through some neural network, as we said it's a Transformer, and then you get a representation for all the words in the context, so it's basically a representation of the entire sentence. You pass it through a linear layer, as you just said, to map it so that the number of outputs is the number of tokens in the vocabulary. You then pass it through a softmax, and you basically get a probability distribution over the next word given every word in the context,
and the loss that you use: it's essentially a task of classifying the next token, so it's a very simple kind of machine learning task. You use the cross-entropy loss, where you look at the actual target that happened, which is a target distribution that is a one-hot encoding; here, in this case, the real word that happened is "cat", so that's a one-hot distribution over "cat", and this (do you see my mouse? oh yeah) is the distribution that you generated. And basically you do cross-entropy, which really just increases the probability of generating "cat" and decreases the probability of generating all the other tokens. One thing to notice, as you all know, is that this is just equivalent to maximizing the text log-likelihood, because you can rewrite maximizing the probability of this autoregressive language modeling task as minimizing the cross-entropy loss (I just added the log and the minus sign). So basically, minimizing the loss is the same thing as maximizing the likelihood of your text. Any questions? Okay.
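As a rough illustration of that objective, here is a minimal PyTorch sketch of the next-token cross-entropy loss; the tensor shapes and the `model` call are assumptions for illustration, not the lecture's actual code.

```python
import torch
import torch.nn.functional as F

def next_token_loss(model, token_ids):
    # token_ids: (batch, seq_len) integer IDs of the training text.
    inputs, targets = token_ids[:, :-1], token_ids[:, 1:]
    logits = model(inputs)  # (batch, seq_len - 1, vocab_size)
    # Cross-entropy against the one-hot target is exactly the negative
    # log-likelihood of the observed next token, averaged over positions,
    # so minimizing it maximizes the likelihood of the text.
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
    )
```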
Tokenizers. So this is one thing that people usually don't talk that much about, but tokenizers are extremely important, so it's really important that you understand at least what they do at a high level. Why do we need tokenizers in the first place? First, they're more general than words. One simple thing that you might think is: we're just going to take every word and say every word is a token in its own right. But then what happens if there's a typo in a word? You might not have any token associated with this word with a typo, and then you don't know how to actually pass this word into a large language model, so what do you do next? And also, words are fine for Latin-based languages, but if you think about a language like Thai, you won't have a simple way of tokenizing by spaces, because there are no spaces between words. So really, tokens are much more general than words; that's the first thing. The second thing you might think is that you could tokenize every sentence character by character: you might say "a" is one token, "b" is another token. That would actually work, and probably very well. The issue is that then your sequence becomes super long, and as you probably remember from the lecture on Transformers, the complexity grows quadratically with the length of the sequence, so you really don't want a super long sequence. So tokenizers basically try to deal with those two problems and give common subsequences their own token, and usually the way you should think about it is that, on average, every token is around three to four letters.
There are many algorithms for tokenization; I'll just talk about one of them to give you a high-level view, which is what we call byte pair encoding (BPE), which is actually pretty common, one of the two most common tokenizers. The way that you train a tokenizer is that first you start with a very large corpus of text; here I'm really not talking about training a large language model yet, this is purely for the tokenization step. So this is my large corpus of text with these five words. Then you associate with every character in this corpus a different token; so here I just split up every character as a different token, and I color-coded all of those tokens. Then what you do is that you go through your text, and every time you see the most common pair of tokens, you merge them. So here you see the tokens "t" and "o" next to each other three times, so you're just going to say this is a new token, and then you continue, you repeat that. So now you have "to", which happens three times, then "to" with an "e", which happens, sorry, two times, then a "token" token which happens twice, and then "ex" which also happens twice. So if you were to train a tokenizer on this corpus of text, which is very small, that's how you would finish with a trained tokenizer; in reality you do it on much larger corpora of text. And this is the real tokenizer of, I think, GPT-3 or ChatGPT, and here you see how it would actually separate these words. You basically see the same thing as what we gave in the previous example: "token" becomes its own token, so "tokenizer" is actually split up into two tokens, "token" and "izer". So yeah, that's all about tokenizers. Any questions on that?
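To make the merging procedure concrete, here is a toy sketch of BPE training; it is simplified (whitespace pre-splitting, no byte-level handling) and the example corpus is made up, so it is illustrative rather than the GPT tokenizer's actual implementation.

```python
from collections import Counter

def train_bpe(corpus, num_merges):
    # Start with each word split into single-character tokens.
    words = [list(w) for w in corpus.split()]
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent pair of tokens occurs.
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        # Merge the most frequent pair into a single new token.
        (a, b), _count = pairs.most_common(1)[0]
        merges.append((a, b))
        new_words = []
        for w in words:
            merged, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and (w[i], w[i + 1]) == (a, b):
                    merged.append(a + b)
                    i += 2
                else:
                    merged.append(w[i])
                    i += 1
            new_words.append(merged)
        words = new_words
    return merges

# Example (hypothetical corpus): the most frequent pair ("t", "o") merges first.
print(train_bpe("token tokens tokenizer text toast", 5))
```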
how do you deal with spes and how do you
deal
with yeah so actually there's a a step
before tokenizers, which is what we call pre-tokenizers, and which is exactly what you just said. In theory there's no reason to deal with spaces and punctuation separately: you could just say every space gets its own token, every punctuation mark gets its own token, and you can just do all the merging. The problem is that there's an efficiency question: training these tokenizers takes a long time, because you have to consider every pair of tokens. So what you end up doing is saying, if there's a space (this is why pre-tokenizers are very English-specific), we're not going to start looking at the token that came before and the token that came afterwards, so you're not merging across spaces. But this is just a computational optimization; you could theoretically deal with it the same way as you deal with any other character. And yeah. "When you merge tokens, do you delete the tokens that you merged away, or do you keep the smaller tokens that you merged?" You actually keep the smaller tokens. In reality it doesn't matter much, because on a large corpus of text you will actually have everything, but you usually keep the small ones, and the reason you want to do that is that, in case there are, as we said before, some grammatical mistakes, some typos, you still want to be able to represent these words by
character. So yeah. Yes? "Are the tokens unique? I mean, say in this case, is there only one occurrence of 'token', or do you need to have multiple occurrences so they could take on different meanings or something?" Oh, I see what you're saying. No, no, every token has its own unique ID. This is a great question: for example, if you think about "bank", which could be bank as in money or bank as in a river bank, it will have the same token, but the model, the Transformer, will learn, based on the words that are around it, to associate it (I'm being very hand-wavy here) with a representation that is either more like the money-bank side or the water-bank side. But it's the Transformer that does that, it's not the
tokenizer. Yes? "Yeah, so you mentioned during tokenization you keep the smaller tokens you started with, right? Like if you start with a 't' you keep the 't', and then you build your tokenizer up so that you can now encode 'token'. So let's say you didn't train on 'token', but in your data you are trying to encode 'token'; how does the tokenizer know to encode it with the 'token' token or...?" Great question. When you tokenize, so that's after training of the tokenizer, when you actually apply the tokenizer, you basically always choose the largest token that you can apply; so if you can use "token", you will never use "t", you will always use "token". People don't usually talk that much about tokenizers, but there are a lot of computational tricks that you can do to make these things faster. And honestly, I think a lot of people think that we should just get away from tokenizers and just tokenize character by character or byte by byte, but as I said, right now there's this issue of sequence length. Maybe one day, in five or ten years, we will have different architectures that don't scale quadratically with the length of the sequence, and maybe we'll, yeah, move away from tokenizers. "So can you share with us the drawbacks? Why do people want to move away from the tokenizer?"
Oh, yeah. So, one good example is math. If you think about math, numbers right now are not tokenized digit by digit, so for example "327" might have its own token, which means that models, when they see numbers, don't see them the same way as we do. And this is very annoying, because the reason we can generalize with math is that we can deal with every digit separately and then do composition, where you know that adding numbers is just the same thing as adding each digit separately plus whatever you carry, so you can do that. So then you have to do special tokenization, and one of the big changes that GPT-4 made is changing the way that they tokenize code. For example, in code you often have, in Python, these four spaces at the beginning of a line; those were dealt with kind of strangely before, and as a result the model couldn't really understand how to deal with code. So tokenizers actually matter a lot. Okay, I'll move on right now, but we can come back to tokenizers later. Great.
So we talked about the task, the loss, and the tokenizer; let's talk a little bit about evaluation. The way that LLMs are usually evaluated is using what we call perplexity. At a high level, it's basically just your validation loss. The slight difference with perplexity is that we use something that is slightly more interpretable: we use the average per-token loss and then exponentiate it. The reason you exponentiate is, one, the loss has a log inside and humans are actually pretty bad at thinking in log space, and two, logs depend on the base of the log, while when you exponentiate you basically have everything in vocabulary-size units. And the averaging per token is just so that your perplexity is independent of the length of your sequence. So perplexity is just two to the power of the average per-token loss of the sequence. Perplexity is between one and the size of the vocabulary of your tokenizer. One, simply because if you predict every word perfectly, then every word contributes a probability of one, so the best perplexity you can have is one. If you really have no idea, you basically predict each word with probability one divided by the vocabulary size, and then you do simple math and you get a perplexity equal to the vocabulary size. So the intuition for perplexity is that it's the number of tokens that your model is hesitating between: if your model is perfect it doesn't hesitate, it knows exactly the word; if it really has no idea, then it hesitates between all of the vocabulary. So perplexity really
improved: that's perplexity on a standard dataset between 2017 and 2023, and it went from around 70 to less than 10 over those five or six years. That means that the models were previously hesitating between around 70 words every time they generated a word, and now they hesitate between fewer than 10 words, so that's much better.
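Written out, the definition just described is (my notation; base-2 logs to match the "two to the power" phrasing):

```latex
\mathrm{PPL}(x_{1:L}) \;=\; 2^{\,-\frac{1}{L}\sum_{i=1}^{L}\log_2 p\!\left(x_i \mid x_{1:i-1}\right)}
```

which equals 1 for a model that predicts every token perfectly and equals the vocabulary size |V| for a model that predicts uniformly at random.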
Perplexity is actually not used anymore in academic benchmarking, mostly because it depends on the tokenizer that you use and on the actual data that people evaluate on, but it's still very important for the development of LLMs: when you actually train your own LLM, people will still really look at the
perplexity. One other common way, now more common in academia, of evaluating these LLMs is just taking all the classical NLP benchmarks (I'll give you a few examples later) and aggregating everything: collect as many automatically evaluatable benchmarks as possible and evaluate across all of them. Two such benchmark suites are HELM, which is from Stanford, and the Hugging Face Open LLM Leaderboard, which are probably the two most common ones right now. Just to give you an idea, in HELM there are all of these types of tasks, which are mostly things that can be easily evaluated, like question answering; think about many different question answering tasks. The benefit with question answering is that you usually know what the real answer is, so the way that you evaluate these models, and I'll give you a concrete example in one second, is that you can just look at how likely the language model is to generate the real answer compared to some other answers, and that's essentially, at a high level, how you evaluate these models. To give you a specific example, MMLU is probably the most common academic benchmark for
LLMs, and this is just a collection of many questions and answers in all of those domains, for example college medicine, college physics, astronomy, and these types of topics. The questions are things like, in astronomy: what is true for a type Ia supernova? Then you're given four different potential answers, and you just ask the model which one is most likely. There are many different ways of doing it: either you can look at the likelihood of generating each of these answers, or you can ask the model which one is the most likely, so there are different ways that you can prompt the model, but at a high level you know which one is correct and the three others are mistakes. Yes? "What if it's generating unconstrained text as the output? How do you evaluate a model if it gives something that's semantically completely identical but is not the exact token list you expect?" Yeah, that's a great question, and I'll talk more about that later. Here, in this case, we don't do unconstrained generation. The way you would evaluate MMLU is basically: either you ask the question and then you look at the likelihood of the model generating A, the likelihood of the model generating B, C, and D, and you look at which one is the most likely; or you can ask the model, out of A, B, C, D, which one is most likely, and you look at whether the most likely next token is A, B, C, or D. So you can constrain the model so that it can only answer these four things. "When you say you constrain the model, do you mean you constrain the prompt, or do you mean that of its whole probability distribution over outputs, you're only comparing the outputs for A...?" So, in the second case I gave you, you would do both: you would prompt the model saying A, B, C, or D, plus you would constrain it to only look at these four tokens. In the first case, you don't even need to generate anything: given that it's a language model that gives a distribution over sentences, you literally just look at the likelihood of generating the first choice, the likelihood of generating the second choice, and so on, and you look at whether the most likely full sentence is actually the real answer. So you don't actually sample from it, you really just use p(x_1, ..., x_L). Does that make sense? That being said, evaluation of open-ended questions is something we're going to talk about later, and it is actually really important and really challenging. Yes?
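As a sketch of the first (likelihood-based) way of scoring MMLU described above, assuming a hypothetical `log_prob(text)` helper that returns the model's log-probability of a full string:

```python
def score_mcq(question, choices, log_prob):
    # Score each full "question + answer" string by the model's
    # log-likelihood and pick the most likely one (no sampling involved).
    scores = [log_prob(f"{question} {choice}") for choice in choices]
    best = max(range(len(choices)), key=lambda i: scores[i])
    return choices[best]

# Hypothetical usage:
# predicted = score_mcq("What is true for a type Ia supernova?",
#                       ["A ...", "B ...", "C ...", "D ..."],
#                       log_prob=my_model_log_prob)
```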
"Earlier you mentioned that metrics like perplexity are not usually used because they depend on how you do your tokenization and some design choices; I was wondering if you could speak more to that." Oh, yeah. So think about perplexity: I told you perplexity is between one and the vocabulary size. Now imagine that ChatGPT uses a tokenizer that has around 10,000 tokens, but Gemini from Google uses a tokenizer that has 100,000 potential tokens; then the upper bound of the perplexity you can get is actually worse for Gemini than for ChatGPT. Does that make sense? That's just one idea; it's actually a little bit more complicated than that, but it's a first-order way to see that the tokenizer actually matters.
Great. Okay, so evaluation challenges: there are many, and I'll just talk about two really briefly. One: as I told you, there are two ways of doing evaluation for MMLU (actually there are many more than two, but I gave you two examples), and it happens that for a long time, even though it was a very classical benchmark that everyone used, different companies and different organizations were actually using different ways of evaluating MMLU, and as a result you could get completely different results. For example, Llama 65B, which was the first model of Meta in the Llama series, had 63.7 accuracy on HELM but around 48.8 on this other benchmark. So really, the way that you evaluate matters, and this is not even talking about prompting, this is really just the way that you evaluate the models; prompting is another issue. So there are a lot of inconsistencies; it's not as easy as it looks. That's the first thing. "Yeah, sorry, how can we make sure that all these models aren't trained on the benchmark?" Okay, second thing, and this is a great question: train-test contamination. This is something which I would say is really important in academia; given that the talk is mostly about training large language models, for companies it's maybe not that important because they know what they trained on, but for us, we have no idea, so for us it's a real problem. There are many different ways of trying to test whether the test set was actually in the training set. One kind of cute trick that people in the lab have found is that, given that most of the datasets online are not randomized, and language models just predict the next word, you can look at the entire test set and ask: is the model more likely to generate all the examples in their published order than in a different order? If it's more likely to generate them in order, given that there's no real order there, then the test set was probably in the training set. Does that make sense?
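A rough sketch of that ordering heuristic, again assuming a hypothetical `log_prob(text)` that scores a whole string under the model; a clearly positive score suggests the canonical ordering was memorized:

```python
import random

def order_contamination_score(test_examples, log_prob, n_shuffles=10):
    # Log-likelihood of the test set written out in its canonical order.
    canonical = log_prob("\n".join(test_examples))
    # Compare against the same examples in random orders; if the canonical
    # order is consistently more likely, the ordering was probably seen
    # during training, i.e. the test set likely leaked into the training data.
    shuffled_scores = []
    for _ in range(n_shuffles):
        shuffled = test_examples[:]
        random.shuffle(shuffled)
        shuffled_scores.append(log_prob("\n".join(shuffled)))
    return canonical - sum(shuffled_scores) / len(shuffled_scores)
```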
So that's one of them; there are many other ways of doing it. Train-test contamination, again: not that important for development, really important for academic benchmarking. Great, so there are many
other challenges, but I'll move on for now. Great, data. So data is another really big topic. At a high level, people just say: you basically train large language models on all of the internet. What does that even mean? Or people sometimes say all of "clean internet", which is even less well defined. The internet is very dirty and really not representative of what we want in practice: if I downloaded a random website right now, you would be shocked at what is in there; it's definitely not your Wikipedia. So I'll go really briefly over what people do; I can answer some questions, but data on its own is a huge topic. Basically, first what you do is download all of the internet. What that means is that you use web crawlers that go to every web page on the internet, or every web page that is indexed by Google, and that is around 250 billion pages right now, around one petabyte of data. Common Crawl is one such web crawler: some people write their own web crawlers, but what most do is use standard web crawlers, and Common Crawl is one of them; basically every month it adds all the new websites that were added to the internet and found by Google and puts them in a big dataset. So on Common Crawl you have around 250 billion pages right now, so 1e6 gigabytes of data. Once you have this,
this is a random web page, literally random, from this Common Crawl, and what you see is that, one, it really doesn't look like the type of thing that you would usually see. This is an HTML page; it's hard to see, but if you look through it you will see some content, for example here: "...is your ultimate source for the System X high performance server", and then you have three dots, so the sentence is not even finished. That's what a random page of the internet looks like, so of course it's not that useful if you just train a large language model to generate things like this. So what are some of the steps that are needed? First, you extract the text from the HTML; that's what I just tried to do by looking for the actual text. There are a lot of challenges in this: for example, extracting math is actually very complicated but pretty important for training large language models; or, for example, boilerplate: a lot of your forums will have the same type of headers and the same type of footers, and you don't want to repeat all of this in your data. Then you will filter undesirable content: not-safe-for-work content, harmful content, PII. Usually every company has basically a blacklist of websites that they don't want to train the models on; that blacklist is very long, and you basically say, if it comes from there, we don't train on it. Another way of doing these things is that you can train a small model to classify what is PII and remove it. It's hard; every point here that I'm going to show you is a large amount of work, but I'm going to go quickly through it. So: filter undesirable content.
The next step is deduplication. As I said, you might have things like headers and footers in forums that are always the same, and you want to remove those. Another thing you might have is a lot of URLs that are different but actually show the same website, and you might also have a lot of paragraphs that come from common books that are basically duplicated a thousand or ten thousand times on the internet, so you have to deduplicate; that's also very challenging because you have to do it at scale. Once you do deduplication, you will do some heuristic filtering to try to remove low-quality documents. The way you do that is with things like rules-based filtering: for example, if you see that there are some outlier tokens, if the distribution of tokens on the website is very different from the usual distribution of tokens, then it's probably an outlier; if you see that the length of the words on this website is super long, there's something strange going on on that website; if you see that the website has only three words, maybe it's not worth training on; if it has like 10 million words, maybe there's also something wrong going on on that page. So, a lot of rules like this.
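A toy sketch of the kind of rule-based document filter just described; the thresholds are made-up illustrations, not values from the lecture:

```python
def passes_heuristic_filter(text):
    words = text.split()
    # Documents that are too short, or absurdly long, are suspicious.
    if len(words) < 3 or len(words) > 10_000_000:
        return False
    # Extremely long "words" usually mean markup or encoding junk.
    if max((len(w) for w in words), default=0) > 50:
        return False
    # Documents whose character distribution is far from typical text
    # (e.g. mostly symbols) are likely boilerplate or garbage.
    alpha_fraction = sum(c.isalpha() or c.isspace() for c in text) / max(len(text), 1)
    return alpha_fraction > 0.7
```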
Yes? "Why do we filter out undesirable content from our dataset instead of putting it in with something like a supervised loss? Like, can we not just say, here's this hate speech website, let's actively penalize the model for generating it?"
We'll do exactly that, but not at this step; that's where post-training comes in. For pre-training, the idea is just to say: I want to model how humans speak, essentially, and I want to remove all these headers, footers, menus, and things like this. But it's a very good idea that you just had, and that's exactly what we'll
do
later Next Step modelbased filtering so
once you've filtered a lot of data, what you will do, and this is actually a very cute trick, is take all of Wikipedia and look at all the links that are referenced from Wikipedia, because if something is referenced by Wikipedia, it's probably a high-quality website. And you will train a classifier to predict whether a document comes from one of these Wikipedia references or whether it's from the random web, and you will basically say: I want more of the things that look like they come from Wikipedia references. Does that make sense? So yeah, you will train a machine learning model, usually a very simple model, because you need to do this really at scale; just think about the 250 billion pages.
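As a rough sketch of that model-based quality filter, here is what a "Wikipedia-referenced vs. random web" classifier could look like with scikit-learn; the feature choice and library are my assumptions, and in practice much lighter-weight linear classifiers (fastText-style) are used at this scale:

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def train_quality_classifier(wiki_ref_docs, random_web_docs):
    # Positive class: pages referenced from Wikipedia (assumed high quality).
    # Negative class: random Common Crawl pages.
    texts = wiki_ref_docs + random_web_docs
    labels = [1] * len(wiki_ref_docs) + [0] * len(random_web_docs)
    clf = make_pipeline(
        HashingVectorizer(n_features=2**20, ngram_range=(1, 2)),
        LogisticRegression(max_iter=1000),
    )
    clf.fit(texts, labels)
    return clf  # clf.predict_proba(docs)[:, 1] -> "quality" score for filtering
```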
Next, you will try to classify your data into different domains: you will say, okay, this is entertainment, this is books, this is code, these types of domains, and then you will try to either up-weight or down-weight some of the domains. For example, you might see that if you train more on code, your model actually becomes better at reasoning; that's something people usually say in a very hand-wavy way, that training on more code helps reasoning, so you want to up-weight the code distribution because it helps general language modeling skills. Books is usually another one that people up-weight; entertainment they usually down-weight, and so on. People used to do this somewhat heuristically; now there are entire pipelines, which we'll talk about, for how to do these things slightly more
automatically. And then, at the end of training, after training on all of this data that we saw, you usually train on very high-quality data while you decrease your learning rate, and that basically means that you're kind of overfitting your model on very high-quality data. So usually what you do there is something like Wikipedia; you basically overfit on Wikipedia, and you overfit on human data that was collected. There are other things like continual pre-training for getting longer context; I'm going to skip over all of these, but I just wanted to give you a sense of how hard it is: when people just say "oh, I'm going to train on the internet", that's a lot of work, and really we haven't figured it out yet. So collecting web data is a huge part of practical large language models; some might say it's actually the key. Yes?
"A basic question about data: usually when you start with, like, the terabytes of data, after you go through all those steps, what's the typical amount of data you end up with? And then how large a team does it typically take to go through all the steps you talked about?" So the question is, how large is the data after you filter? "Yeah, after you filter. And then, to go through all the steps, how large a team do you need, how slow is it, how many people would you need to be able to do this?" Okay, that's a great question. I'm going to somewhat answer the data part, how large the dataset is, at the end of this slide. For the number of people that work on it, that's a good question; I'm actually not quite sure, but I would say it's probably even bigger than the number of people that work on the tuning of the pre-training of the model, so the data side is bigger than the modeling aspect. I don't think I have a good sense, but I would say, probably in Llama's team, which has around 70 people, maybe 15 work on data. For all these things you don't need that many people; you need a lot of compute, because for data you need a lot of CPUs. And I'll answer the second question
at the end of this slide. So, as I just alluded to, really we haven't solved data at all for pre-training, so there's a lot of research that has to be done: first, how do you process these things super efficiently; second, how do you balance all of these different domains; can you do synthetic data generation, which is actually a big one right now because, and we'll talk about that later, we don't have enough data on the internet; can you use multimodal data instead of just text data, and how does that improve even your text performance? There's a lot of secrecy, because really this is the key to most of the pre-trained large language models, so for competitive-dynamics reasons these companies usually don't talk about how they do the data collection. And there's also a copyright liability issue: they definitely don't want to tell you that they've trained on books, even though they did, because otherwise you can sue them. Common academic
benchmarks: so this will kind of answer what you asked. It started (those are the smaller ones, the names are not that important) from around 150 billion tokens, which is around 800 GB of data; now it's around 15 trillion tokens, which is also the amount the best current models are probably trained on. So 15 trillion tokens, which is, I guess, two orders of magnitude bigger, so around 80e3 GB; that would be around a 100x to 1000x filtering-down of Common Crawl, if I'm not mistaken. So yeah, one very
famous one is the Pile. This is an academic benchmark dataset, and we can just look at what distribution of data it has: things like arXiv, PubMed Central, which is all the biology stuff, here Wikipedia, you see Stack Exchange, some GitHub, some books, and things like this. Again, this is on the smaller side: if we look here, this is 280B tokens, so in reality it's like 100 times bigger, and you cannot have that much GitHub and Wikipedia. In terms of closed-source models, just to give you an idea: Llama 2 was trained on two trillion tokens; Llama 3 on 15 trillion tokens, which is currently the best model for which we know how much it was trained on, and which matches the biggest academic benchmark, 15 trillion tokens. For GPT-4 we don't really know, but it's probably in the same order of magnitude, probably around 13 trillion, from leaks, if the leaks are true.
Great, so scaling laws. Any other questions on data before we go to scaling laws? Sorry, I know I'm giving you a lot of information, but there's a lot that goes into training large language models. Great.
Scaling laws. So, the idea is that what people saw around 2020, or at least they'd known it for a long time but have been able to show it empirically since around 2020, is that the more data you train your models on, and the larger the models, the better the performance. This is actually pretty different from what you've seen in this class: in this class we teach you about overfitting, and overfitting doesn't happen with large language models; larger models give better performance. It's something that really took a long time for the community, who took this type of class, to realize. But for the exam, overfitting exists. Okay, so the idea of scaling laws is: given that you know that more data and larger models will always give you better performance, can we predict how much better your performance will be if you increase the amount of data and the size of your model? And surprisingly, it works. Here you see three plots from a very famous scaling-laws paper from OpenAI. On the x-axis you see compute, so how much compute you spent on training, and on the y-axis you see test loss; this is essentially, I mean it's not perplexity, but it's your validation loss, so it's the log of the perplexity. And if you put these two on a log scale, then you see that the scaling law is linear. That means that if you increase your compute by a certain amount, you can say by how much your test loss will actually decrease. Same thing with data, and same thing for parameters: if you increase the dataset size, your loss will decrease by an amount that is somewhat predictable; if you increase the number of parameters, the loss will decrease by an amount that is somewhat predictable. This is really amazing, very surprising. It looks innocuous when you look at these types of plots, but it's crazy, because it means that you can predict how well we're going to perform in two or three years depending on how much compute we will add, assuming that these trends hold; there's nothing theoretical about it.
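In symbols, "linear on a log-log plot" just means a power law; a typical parameterization (my notation, not the paper's exact form) is:

```latex
L(C) \approx \left(\frac{C_0}{C}\right)^{\alpha}
\quad\Longleftrightarrow\quad
\log L \approx \alpha \log C_0 - \alpha \log C
```

so a fixed multiplicative increase in compute C buys a predictable multiplicative decrease in the loss L.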
"Yes, two things. One, what is the loss that they're using here? Is this perplexity?" So, you know I said perplexity was two to the power of the loss; this is what perplexity is the power of, i.e., the log of the perplexity. "And then the second thing: when you increase the number of parameters or you increase the total dataset size, doesn't that just inherently increase your compute?" No, this is a great question. The compute here is actually a function of two things, the data and the parameters. What I'm showing here is that, well, we're actually going to talk about that in detail, but basically if you increase the number of parameters you should increase the amount of data that you have, and you actually don't go multiple times through the same dataset; no one does epochs with large language models, at least not yet, because we still have kind of enough data. So yeah, this is all the same trend: increase compute, decrease
loss. "Yes, have we seen the numbers for the last two years, or is it still holding?" It is still holding. I don't have good numbers to show you, but it is still holding, surprisingly. "Is there no empirical evidence that you plateau, or an expected plateau?" No empirical evidence of plateauing anytime soon. Why? We don't know. Will it happen? Probably. I mean, it doesn't need to, because it's actually on a log scale, so it's not as if it had to plateau; mathematically it could continue decreasing like this. Most people think that it will probably plateau at some point; we don't know when. Okay, I'll talk more about scaling laws now.
So why are scaling laws really cool? Imagine that, you're very fortunate, I give you 10,000 GPUs for this month: what model will you train? How do you even go about answering that question? This is a hypothetical, but that's exactly what these companies are faced with. The old pipeline was basically: you tune hyperparameters on the big models. So let's say I have 30 days; I will train 30 models for one day each, I will pick the best one, and that will be the final model that I use in production. That means that the model I actually use was only trained for one day. The new pipeline is that you first find a scaling recipe, something that tells you, for example (one common one is that if you increase the size of your model you should decrease your learning rate), if I increase the size of my model, here's what I should do with certain hyperparameters. Then you tune your hyperparameters on smaller models of different sizes: let's say, for 3 days of my 30 days, I will train many different models and do hyperparameter tuning on these small models, each of different sizes. Then I will fit a scaling law and try to extrapolate from these smaller models which one will be the best if I train it as a larger model, and then I will train the final huge model for 27 days instead of just one day. So the new pipeline is: don't do hyperparameter tuning at the real scale of the model that you're going to use in practice, but do things on smaller models at different scales and try to predict how well they will perform once you make them bigger. I will give you a very concrete example
right now. Let's say Transformers versus LSTMs. Let's say you have these 10,000 GPUs and you're not sure which one you should be using: should I be using a Transformer-based model or an LSTM-based model? What I will do is train Transformers at different scales, so here you see different parameter counts on the x-axis, and the y-axis is my test loss. I will then train different LSTMs at different scales. Once I have these points, I will see that they kind of fit a scaling law, I will fit my scaling law, and then I will be able to predict: oh, if I had 10 times more compute, here's how well I would perform. For the LSTM it's actually slightly less linear, but you could probably still try to predict where you would end up, and clearly from this plot you would see that Transformers are better. One thing to notice when you read these types of scaling laws is that there are two things that are important: one is your scaling rate, which is the slope of the scaling law; the other thing is your intercept. You could start worse but actually become better over time; it just happens that LSTMs are worse on both, but I could show you another example where you can predict that after a certain scale you're better off using one type of model than the other. So that's why scaling laws are actually really useful. Any questions on that?
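For concreteness, here is a rough sketch of fitting such a scaling law, i.e., a straight line in log-log space, to a handful of (compute, loss) points; the numbers below are made up for illustration:

```python
import numpy as np

# Hypothetical (compute, test loss) measurements from small training runs.
compute = np.array([1e17, 1e18, 1e19, 1e20])
loss = np.array([3.9, 3.4, 3.0, 2.65])

# Fit log(loss) = a * log(compute) + b, i.e. a power law loss = e^b * compute^a.
a, b = np.polyfit(np.log(compute), np.log(loss), deg=1)

# Extrapolate to a 10x larger compute budget.
predicted = np.exp(a * np.log(10 * compute[-1]) + b)
print(f"slope {a:.3f}, predicted loss at 10x compute: {predicted:.2f}")
```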
"Yeah, so these are all kind of... how sensitive are these to small differences in the architecture, like one Transformer architecture versus another Transformer architecture? Do you basically have to fit your own curve and say, oh, scaling laws tell me there should be some logarithmic function, let me extrapolate that for my own?" Yeah, so, usually, for example, if
you're an academic, and this is pretty recent at least, and you want to propose a new activation function, that's exactly what you will do: you will fit a scaling law, show another scaling law with the standard activation, say GELU, and you will say that yours is better. In reality, once you start thinking about it in scaling-law terms, you really realize that all the small, minor architecture differences that we can make, all they do is maybe change the intercept a little bit, but really that doesn't matter, because you can just train for ten hours longer or wait for the next generation of GPUs; these things are really secondary, which is exactly why I was telling you originally that people spend too much time on the architecture and losses. In reality these things don't matter as much. Data, though: if you use good data you will have a much better scaling law than if you use bad data, so that really matters.
Another really cool thing you can do with scaling laws is ask yourself how to optimally allocate training resources. Should I train larger models, because we saw that it's better when you train larger models, but we also saw that it's better when you use more data; so which one should I do, should I train a smaller model on more data, or should I train a larger model on less data? Chinchilla is a very famous paper that first showed this. The way they did it, and I want to give you a little bit of a sense of what these plots are: here you see training loss on the y-axis; on the x-axis you see the number of parameters, so the size of the model; and all these curves are what we call IsoFLOP curves, which means that all the models on one curve have been trained with the same amount of compute. The way you do that is that you vary the number of tokens that you train on and the size of the models, but you vary them in such a way that the total compute stays constant. Okay, so all these curves that you see with different colors were trained with different amounts of compute. Then you take the best model on each of those curves; once you have the best one for each curve, you can plot how many FLOPs that curve used and how many parameters you actually used for training that specific point. You put that on a log-log scale again, and now you fit a scaling law again. So now I have something which tells me: if I want to train a model with 10^23 FLOPs, here's exactly the number of parameters that I should be using, say 100B. And you can do the same thing with FLOPs and
tokens. So now you can predict: if I tell you exactly, I have one month of compute, what size of model should I be training? Fit your scaling law and I can tell you. Of course, that all looks beautiful; in reality there are a lot of small things, like should you be counting embedding parameters, there are a lot of complexities, but if you do things well, these things actually do hold. So the optimal ratio that the Chinchilla paper found is to use 20 tokens for every parameter that you train: if you add one more parameter, you should train your model on 20 more tokens. One caveat here is that this is optimal for training resources; that is, it tells me, if I have 10^23 FLOPs, or say $5 million, to train the best model that gets the lowest loss, what should I train? In reality, these companies also need to think about inference: if you have a smaller model, they will spend less over time. So if you consider the inference cost, there are other papers that tried to show that it's around 150 tokens per parameter, because you prefer having a smaller model since over time you're going to spend less money on inference of these models. So 150 to one: that's around what the best models are trained on right now, at least the ones that are used in practice, in production.
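As a back-of-the-envelope sketch of that allocation rule, using the FLOPs ≈ 6·N·D approximation that comes up later in the lecture and a chosen tokens-per-parameter ratio (around 20 for training-optimal, around 150 if you also weigh inference); this is a rough illustration, not the Chinchilla paper's actual fit:

```python
def allocate(compute_flops, tokens_per_param=20):
    # FLOPs ~ 6 * N (params) * D (tokens), with D = ratio * N,
    # so N = sqrt(FLOPs / (6 * ratio)) and D = ratio * N.
    n_params = (compute_flops / (6 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Example: a 1e23-FLOP budget at a Chinchilla-style 20:1 ratio.
n, d = allocate(1e23, tokens_per_param=20)
print(f"~{n/1e9:.0f}B parameters, ~{d/1e12:.1f}T tokens")
```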
Great. Any questions on Chinchilla? Great. "Oh, sorry, in practice how expensive is inference for these models relative to training?" Actually, very expensive. I will not talk about inference because that would be another entire lecture, but just think about ChatGPT, where they have, I don't know how much it is now, like 600 million people that have used it; that's a lot. So it's actually very expensive. There's a lot of optimization you can do for inference, though, and that's an entire other lecture, so I'm going to skip it this time, but it's very
interesting. Okay, tuning. As I said, there are many things that you can answer with scaling laws; I just tried to give you two examples, but really there are many: what data do you use, what data mixture, what data-mixing weighting do you use (data mixtures are what we talked about before), what architecture you use, whether you should make your models wider or deeper, whether you should pay for more GPUs or actually collect more data. All these things are things you can try to answer with scaling
laws. One thing I want to mention is the bitter lesson, if you've ever heard of it, Richard Sutton's very famous blog post from 2019. What he realized, which I think not enough people realize, and I definitely did not realize at that time, is that once you see these types of scaling laws, you know that the more compute you have, the better models you will get, so with scale you will get better models. You also know, by Moore's law or variants of Moore's law, that you will always have better compute. Then the only thing that matters is to have architectures that can leverage computation. So what matters is basically systems, data, and less so the architecture, the small architecture differences like your activation function and things like this. I think that's one of the reasons why most research focuses on things that matter less for industry, and I was one of those researchers for a large part of my career. So don't spend time over-complicating: do the simple things, do them well, and scale them. That's really what OpenAI taught us with ChatGPT and with all the GPTs
before okay I want to give you some
backup the envelope computation so I
might be off by a few factors here but I
just want to give you a sense of how
costly it is to train some of these
models I'll give as an example
Lama 3 400b which is currently the best
open source model that you can get uh it
was trained on 15.6 tokens it has 45
billion parameters so just now that you
know what is like this uh optimal tokens
per parameter that's around 40 so that's
a little bit more than chinchilla but
less than this like inference uh optimal
um model so they went for training
optimality uh flops for this model so
one simple uh way to compute flops is
six uh times the number of parameters
times the number of data you train on uh
so if you do the simple calculation here
it's 3.8 e25 flops the reason why this
is important is that if you follow the
little bit the news there's an executive
order from Biden that basically says
that once you have uh 1 e26 parameters
uh sorry flops uh then you have special
scrutiny on your models so they went 2x
less than that so they really went right
below this to not have special scrutiny
so 3.8e25 uh I might be off by a little bit but it's definitely under the 1e26 oh um so P is the number of parameters N is the data the number of tokens this is just an approximation
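As a sanity check of the 6 times parameters times tokens approximation, here is the arithmetic for the Llama 3 405B numbers quoted above; this is just a sketch of the same back-of-the-envelope estimate, nothing more.

```python
# Training FLOPs approximation: FLOPs ~= 6 * P (parameters) * N (training tokens)
P = 405e9    # parameters
N = 15.6e12  # training tokens
flops = 6 * P * N
print(f"{flops:.2e} FLOPs")  # ~3.79e+25, just under the 1e26 reporting threshold
```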
yeah okay uh compute and we know that
they trained on 16,000
h100s um and we know the throughput but
they they said it too uh so if you do
the computation it takes around 70 days
um or 26 million GPU hours at least
that's with my uh back of the envelope
computation they actually said that they
use 30 million instead of 26 million GPU
hours um so maybe they had like some uh
some challenges I don't really know but
if you follow the simple computation
it's around 70 days um cost uh I mean
this it's hard to to approximate but I'm
just going to say it's kind of the rent
like what if I were to rent h100s that
many h100s for that many days how much
will I pay uh a lower bound on the renting uh cost of an H100 is around uh $2 per hour so if you
multiply this by 26 million uh hours uh
you get 52 million uh dollars so they
probably pay less than that but not
actually much less because all these um
all these services that actually rent
gpus they don't make that much money so
it's it's probably slightly less but not
that much less um now salary I said 50
employees 500k per
year say yeah it's probably the right
ballpark 25 million uh so if you put all
together around 75 million um dollars
for
training uh this Llama model I'm probably off by like 10 million but that's kind of right okay
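Continuing the back-of-the-envelope math, here is a rough sketch of the GPU-hours, wall-clock, and rental-cost estimates above; the ~400 TFLOP/s effective throughput per H100 is an assumed round number (after utilization losses), chosen so the totals land near the figures quoted in the lecture.

```python
# Back-of-the-envelope for Llama 3 405B training cost (rough, assumed throughput).
total_flops   = 3.8e25        # from 6 * P * N above
n_gpus        = 16_000        # H100s, as stated in the lecture
flops_per_gpu = 400e12        # assumed ~400 TFLOP/s effective per H100

seconds   = total_flops / (n_gpus * flops_per_gpu)
days      = seconds / 86_400
gpu_hours = n_gpus * seconds / 3_600
rent_cost = gpu_hours * 2.0   # ~$2/hour lower bound on H100 rental
salaries  = 50 * 500_000      # 50 employees at $500k/year

print(f"~{days:.0f} days, ~{gpu_hours/1e6:.0f}M GPU-hours, "
      f"~${(rent_cost + salaries)/1e6:.0f}M total")
# roughly 69 days, ~26M GPU-hours, ~$78M with these rounded inputs,
# in the same ballpark as the ~$75M quoted in the lecture
```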
carbon emitted um a lot of people might
ask like also the cost is not the only
thing that is important so I did the
computation um it's around 4 uh 4,000 um
tons of CO2 equivalent that is actually
only 2,000 return tickets from JFK to uh
London so right now uh carbon emitted is
actually not uh I mean it's huge but
it's not like um meaningful yeah yet I
think in maybe GPT-6 GPT-7 once you
multiply this by 100 that might become a
real issue right now it's still not uh I
think um an issue in the grand scheme of
things next model the way you should be
thinking about these models is that
every new generation the number of flops
essentially uh multiplies 10x or at
least that's what they try uh if they
have enough energy and if they can buy
enough
gpus uh great any question on these back
of the envelope math
no
okay so now we talked about pre-training
I wanted to also chat about systems
because now we know compute is really important so there's a question of how do you optimize your compute I will leave that for the
end because I'm not sure how much time
we will have I think it's important but
hopefully I I'll be able to to talk
about it later it's slightly different
than what we've been talking about right
now so I'll move on to post training for
now
so the task of post-training uh the
reason why we need to do Post training
is as I told you before um it's to make
AI assistants so language modeling is
not uh really the thing that you want
when you have an AI assistant uh for
example if you ask GPT-3 which is a purely language model a pure language model not an aligned one if you
ask a question like explain the moon
landing to a
six-year-old the completion that you
would get is something like explain the
theory of gravity to a six-year-old
because what it learned is that on on on
internet if you have one question you
usually have maybe another bullet point
of other similar questions you don't
usually have question and then answer
later uh this is not what you want from
an AI assistant so how do we uh do this
alignment which is this post training
and making these models
assistance um so the goal of this
alignment is to basically get LLMs to follow the instructions that are given um by users and maybe some designers' desires um so think about moderation you don't want the model like OpenAI
definitely doesn't want the model to say
stuff that is very
toxic um so here you see on the left
hand side uh that when you ask a
question it actually provides a a real
answer so it's not like uh before the
llm and on the right hand side you see
that it would if you ask to write a
tweet describing how a certain part of
the population are evil it will say that
it cannot do that um so that's kind of
this
alignment uh the background here is that
uh basically the data that you want for
training some of these models um is like
we know what we want which is just
asking humans this is a question this is
the answer that you want uh but the
thing is that it's very expensive to
collect that data and it's hard to find
it online uh in contrast pre-training
data is not what you want but there's a
lot of it um so what what we will do a
the main idea is simply take a pre-train
large language model pre-train all of
internet and then you just fine tune so
you just change a little bit of weights
on the type of data that you actually
want and hopefully given it you already
pre-train it on all of Internet it
basically learns or knows how to speak
in English and knows standard um language syntax uh then you can really fine-tune it with very little data okay SFT so supervised fine-tuning is
really exactly what I just said which is
the idea of fine-tuning the large
language model on uh basically the
desired answers that are collected from
humans um so why is it called supervised fine-tuning because you basically want to do language modeling on the real answers so language modeling is this like
next word prediction and and that's the
fine-tuning part and then you want to do
it on desired answers given by humans so
that's why we call it supervised fine-tuning
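Since SFT is literally the language-modeling loss restricted to the desired answers, here is a minimal sketch of how that masking is typically done; the tensors are toy stand-ins (a real setup would tokenize a prompt and response and run a pretrained model), and -100 is the PyTorch convention for positions that should be ignored by the loss.

```python
import torch
import torch.nn.functional as F

# Toy stand-ins: 10 token ids for the prompt, 5 for the human-written answer,
# and random "model" logits over a vocabulary of 100 tokens.
vocab_size = 100
prompt_ids   = torch.randint(vocab_size, (10,))
response_ids = torch.randint(vocab_size, (5,))
input_ids = torch.cat([prompt_ids, response_ids])
logits = torch.randn(len(input_ids), vocab_size)  # would come from the LLM

# Labels: next-token targets, but mask out the prompt so the loss (and hence
# the gradient) only comes from the desired answer tokens.
labels = input_ids.clone()
labels[: len(prompt_ids)] = -100          # ignore prompt positions
shift_logits = logits[:-1]                # predict token t+1 from position t
shift_labels = labels[1:]

sft_loss = F.cross_entropy(shift_logits, shift_labels, ignore_index=-100)
print(sft_loss)
```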
so how do we collect this data well I just said it you just ask
humans uh to to tell you this is the
this is a question this is the answer
that you uh you would want from some of
these models so this is an example um
sorry I can't read very well on my
computer but uh my kid uh needs to do a
science um no let's read this one can
you write a short introduction about the
relevance of the term monopsony and then
it says monopsony refers to a market
structure blah blah blah and that's a
human that wrote that um so actually
this is open Assistant which was a a way
to collect um uh data online by
humans so this type of supervised fine
tuning or alignment is really the key of ChatGPT this is what made uh the big jump from GPT-3 which was mostly something that was known by AI researchers to ChatGPT which became
known by basically
everyone
um so the problem with uh human data is
that it's uh very slow to collect and
very expensive um so
one possible simple idea is to use llms
to scale data collection uh so that's
exactly what we did with alpaca uh one
year ago what we did is that we asked uh
humans or we use a data set of human uh
question answers so there were 175 uh
question answers here and we asked the
best model at the time so text-davinci-003 to
basically generate many more of these
question and answers so all we did is
like this is what humans would write now
write similar answers and similar
questions and we collected 52,000 LM
generated question answers and then what
we did is simply we took LLaMA 7B which
was the best pre-train model at the time
and we just fine-tuned this with supervised fine-tuning as I told you and that's how we got um the Alpaca 7B
model uh and this is the type of data
that we collected so things like what
does algorithm mean an algorithm is a step-by-step uh set of instructions used to solve a problem or
achieve a goal blah blah blah blah so
the data is not actually it's actually
pretty good given it was LM generated by
LMS from essentially two generations ago
um so that really started at least for
us kind of as an academic replication of
chat GPT uh now it really there's a big
field of like synthetic data generation
of how to use llms to basically make
development of llms faster um and by
basically by decreasing the amount of human hours that you need
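Here is a rough sketch of that Alpaca-style loop; the llm_complete function and the prompt wording are hypothetical placeholders standing in for a call to a strong model, not the actual Alpaca pipeline code.

```python
# Sketch of LLM-assisted instruction data generation (Alpaca-style).
import json
import random

def llm_complete(prompt: str) -> str:
    # Placeholder for a call to a strong model / API; returns a canned pair
    # here so the sketch runs end to end.
    return json.dumps({"instruction": "What is an algorithm?",
                       "output": "A step-by-step set of instructions..."})

seed_examples = [
    {"instruction": "What does monopsony mean?", "output": "A market with one buyer..."},
    # ... a small set of human-written (instruction, answer) pairs
]

def generate_synthetic_examples(n: int) -> list[dict]:
    examples = []
    for _ in range(n):
        demos = random.sample(seed_examples, k=min(3, len(seed_examples)))
        prompt = (
            "Here are example instruction/answer pairs written by humans:\n"
            + "\n".join(json.dumps(d) for d in demos)
            + "\nWrite one new, different pair in the same JSON format."
        )
        examples.append(json.loads(llm_complete(prompt)))
    return examples

# The resulting examples would then be used for supervised fine-tuning,
# exactly like human-written data.
print(generate_synthetic_examples(2))
```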
quantity of data so we talked about what
type of data and how we collect it um
one thing which is surprising with sft
is that you don't need that much data uh
so what this paper showed this is called LIMA is that if you scale the amount of data that you use for uh supervised fine-tuning from 2,000 to 32,000 it really doesn't help much so here scaling laws definitely don't help
um so the the intuition here is that all
you learn um is is you learn how to
format your desired answers another way
of saying it is that your pre-trained
models they essentially model the
distribution of every user on internet
one that might write bullet points
another one that might answer a question with an answer so all you tell your model is like wait you should actually be optimizing more for this type of user than another one so you're not actually teaching it anything through this um SFT uh so with supervised fine-tuning all you do is
you tell the model to kind of optimize
for one type of user that it saw already
in a pre-train data set so the knowledge
is already in the pre-train llm uh and
you basically just specialize to one
type of
user great any questions on SFT yes so I know it's a big issue with
synthetic data where uh if you keep
generating data from the same
distribution eventually you're not
learning a new distribution you're
essentially playing with it it just
bootstrapping that yeah surely
you can't scale that forever right you
can't keep going on and generating from
the same distribution you hope to learn
something new yeah uh so are there it's
an active area of research but any
thoughts that you have around how people
are maybe thinking around this and uh
better ways to bootstrap or to give up
on this idea and and realize that the
chart shows you don't need that many so
just get humans to generate 2,000 really
good uh yeah so that's a very good
question uh so for the data stuff so I'm
saying it's not that important for sft
but there will be another thing we'll
talk about right after where actually
data does
matter my intuition based on not that
much empirical results is that you can
still get um even though you use your
LMS if you use purely LM generated text
and you do that for like three four
generations of llms I agree with you
that probably you won't improve much but
for me what is important is how do you
use like human in the loop with llms not
purely LMS not purely uh humans but
maybe what you can do is just have the
model generate some new text and just uh
humans write a few Edits edits are much
faster than writing the entire text and
I think that if you have that type of
collaboration then from like kind of an
information theoretical point of view
you still get additional information but
you still much faster than if you use
humans and I think that as a field we'll
probably move towards these type of
things uh which is um really just
finding the examples that are important
and and asking humans it's kind of
active learning just asking humans
exactly when uh you need to to get
inputs yes do we train with like the
same loss function the same like General
training algorithm for the supervised tuning bit as we do for the pre-training right because like the examples you showed I think the important thing of the good examples is they're like super accurate they're more complex is it still just the same so that's why here I yeah I didn't
maybe didn't emphasize enough this is
just language modeling fine-tune the LLM with language modeling on the desired
answers so this is literally the same
loss um it will be different in two
seconds but the first step of sft is
literally the same loss where you just
say Okay I want to actually specialize
on that type of data so there's even a
question of like what is pre-training
what is post-training because in reality
it's just like a different data that you
use the reason why we usually call it
post training is that the way we collect
that data is very
different great great questions uh yes
maybe it's the same question but why
would these 2,000 examples have such an
overweighted
influence so that's why we uh also that's another reason why we call it post-training is that we use different types of hyperparameters so you know I told you basically at the end of pre-training you essentially end up with a learning rate of zero and here you're going to increase your learning rate so like 1e-5 yeah and and
so um the weight that you give to them
is actually
different
um okay uh Second Step or second part of
this post training um is what we call
reinforcement learning from Human
feedback or RLHF uh some of you might
have heard of that um the idea is that
sft has a problem namely that uh you do
behavioral cloning which means that you
just try to clone what the humans would
say and that had that has many issues
one of them is that you're bound by
human abilities so if um like humans
actually humans won't generate the
things that they think is actually the
best thing to generate so if you ask me
to write a book I mean I can definitely
enjoy a book I can probably say one book
is better than another but I'm
definitely not going to be as good as
writing the book that I want to read uh
so you're going to be bound by the human
ability to generate things even though
the humans might be better at
distinguishing between things that's one
issue issue number two uh I find that
actually pretty interesting is that it
might if you ever heard of the word
hallucination so this is llms generating
F like false information
hallucination might these people have um
hypothesized that that can come from the
supervised fine tuning even if you do
supervised fine tuning on data that is
correct and the reason why that is is
that if uh given I told you that basically SFT is with very little data and it's with data where the model doesn't learn anything new so what
if the human gives an answer that the
model didn't know was true from the
model perspective you the human
basically is telling the the model uh
generate this thing that seems plausible
but actually have no idea if it's true
or not um so just to give you a very
concrete example if we go back to this
uh monopsony example can you write blah
blah blah about monopsony uh imagine
that a human uh wrote a reference on
this type of book um and that book might
exist that might be a correct reference
but what if the llm never saw this
reference during pre-training then it
doesn't know that it's a correct
reference so really what you tell the
model is to generate or make up some
plausibly sounding reference um rather
than actually tell the real reference
that it saw during pre-training uh so
hallucination might be um like might be caused by this SFT that's
problem number two does that all make
sense great problem number three price
generating the ideal answers is very
pricey and that comes back to your
question um of like humans writing
answer is actually pretty
expensive um so that's where RLHF comes
in the idea is that instead of cloning
the behaviors of humans we're going to
maximize human preference um and the way
we're going to do that so the pipeline
is that for a certain for every
instruction you're going to ask a model
to generate two answers um and usually
use a pretty good model so you usually
don't use a base LLM here you use an SFT you use a fine-tuned LLM
already to give like pretty good answers
and then you ask labelers which of these
two answers was better so select the
preferred one and then with different
type of algorithms we're going to talk
about the algorithms um you just
fine-tune the model to generate more of
the green thing than the red thing so
more of the good stuff uh so now the
question is how and we're going to talk
about that right
now so there are two ways that we're
going to talk about and two that are
mainly used in the community um the
first one is simply the idea of of using
reinforcement learning so hopefully you
all know what reinforcement learning is
now um so when you think about using
reinforcement learning one important
question is like what is the reward that
we're optimizing uh so in this case
there are really two options that I
could think about the first one you
could just say I'm going to compare the
output generated by some baseline the
output generated by my model U and I'm
just going to ask the human to say which
one is better and I'm going to use this
as a reward so if I'm better than the
Baseline this is a plus one if not it's
a minus one uh so now it's a binary
reward the problem with binary reward is
that it's very sparse and you don't get
much information out of it uh like maybe
your answer was slightly better maybe it
was like way better and you don't really
know from this um how much better it was
so option two is that you can train what
we call a reward model which is simply a
classifier uh so you use machine
learning to to classify how much better
uh two outputs are from the preference
from the perspective of the human um so
this is a little bit meta but what you basically do is that you take um a reward model R which is also a large classifier and you basically ask this reward model you give it the input and the actual output that you have one of the two outputs uh and you just um exponentiate that so that's the softmax that you all know about and now you divide by um the exponential reward on the first output plus the exponential reward on the second output and you basically train so
the reason why you do that is that you
train your your model you train this
reward model to be able to classify um
how much better one output is to another
one so another uh slightly less convoluted way of saying it is that your reward model will output some reward that will be used as the logits of your softmax so now if you have high logits in your softmax it means that highly likely this um output is better uh so that's what we call the Bradley-Terry model
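Here is a minimal sketch of that Bradley-Terry objective: the probability that the preferred ("green") answer beats the rejected ("red") one is a softmax over the two scalar rewards, which reduces to a sigmoid of their difference; the reward values below are toy tensors standing in for a real reward model's outputs.

```python
import torch
import torch.nn.functional as F

# Toy scalar rewards r(x, y) for a batch of 4 comparisons; in practice these
# come from a large classifier (often an LLM with a scalar head).
r_chosen   = torch.randn(4, requires_grad=True)  # reward of the preferred answer
r_rejected = torch.randn(4, requires_grad=True)  # reward of the other answer

# Bradley-Terry: p(chosen beats rejected) = exp(r_c) / (exp(r_c) + exp(r_r))
#                                         = sigmoid(r_c - r_r)
# Training the reward model = maximizing the log of this probability.
loss = -F.logsigmoid(r_chosen - r_rejected).mean()
loss.backward()
print(loss)
```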
yes is this reward model going
over the entire output or is it
going um so this takes the
entire uh yeah this takes the entire
output at once so it takes all the input
and all the output and it gives one
number
yes would human be sorry with the reward
model where would a human be like oh I
see okay sorry maybe I wasn't clear um
you train this reward model to fit this
green and and red preference from humans
so basically you train a classifier to
say whether the humans prefer red or
green uh but instead of using the binary
reward which is what the human would
tell you you basically use the logits of
the soft Max and the thing with the
logits is that that logits are
continuous so now you know that if your
reward model said it has high logits
then in some ways the human highly
prefer this answer to some other
answer great um so as I just said
continuous information so it's better so
that's what people uh use in practice or
at least used to use in practice I'll
tell you about uh the other algorithm
later uh so what you do at the end is
that you basically try to just use
reinforcement learning that you know
about now we know we have reward what
you sample through is the generation
from your large language model um and
then you just use some regularization
term so the reason why you do this
regularization term is for avoiding what
we call over optimization so this reward
model might not really represent might not perfectly model human preferences so you don't want to maximize this thing to essentially infinity um and you do it using PPO which is a common uh reinforcement learning algorithm um one thing to note
here because it will be important for
later is that when we use maximum
likelihood
um sorry now the large language models
are actually a policy for your
reinforcement learning it's not
maximizing maximum likelihood anymore
which means that you're not modeling any
distribution anymore and the reason why
this is important is that models that
went through this type of Po actually
don't give you likelihoods of text that
are meaningful cuz what you optimize
them to do is basically just optimize for generating the most likely thing not
optimize for modeling like all the
answers that humans might say another
way of saying that is that there's
nothing that incentivizes here the model
to not give a like a um a single
possible generation nothing here says
it's good if you have some distribution
with some
entropy um okay if you haven't followed it's not that important but just good to know
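One common way this objective shows up in code is as a per-sequence reward that mixes the reward model's score with a KL penalty against the frozen SFT model; the sketch below just computes that regularized reward on toy log-probabilities (the actual PPO machinery of rollouts, advantages, and clipping is omitted, which is exactly the messy part the lecture alludes to).

```python
import torch

# Toy per-token log-probs for one generated answer under the current policy
# and under the frozen reference (SFT) model, plus a reward-model score.
logp_policy = torch.randn(12)           # log pi_theta(y_t | x, y_<t)
logp_ref    = torch.randn(12)           # log pi_ref(y_t | x, y_<t)
reward_model_score = torch.tensor(1.3)  # R_phi(x, y), one scalar per answer
beta = 0.1                              # strength of the KL regularization

# KL-regularized objective per sample: R_phi(x, y) - beta * KL(pi_theta || pi_ref),
# where the KL is estimated on the sampled tokens as sum(logp_policy - logp_ref).
kl_estimate = (logp_policy - logp_ref).sum()
regularized_reward = reward_model_score - beta * kl_estimate
print(regularized_reward)  # this is what PPO then tries to maximize in expectation
```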
great so PPO is exactly what ChatGPT did originally so here's the blog
post or what they have is step one do supervised fine-tuning which now you all know about step two train a reward model on human preferences step three do PPO for multiple steps which is where you see this blue arrow so you continue you train the model once with PPO you collect new data you continue uh and that's exactly what ChatGPT did uh that was a big breakthrough between GPT-3 and ChatGPT
one thing to note is that uh PPO has many challenges reinforcement learning is something that's super nice theoretically in practice anyone who ever worked with reinforcement learning knows it's such a mess uh there's a lot of things like rollouts outer loops clipping so many complications um so it's messy this is the idealized PPO used for LLM settings so that's already much
more complicated than this expectation
we saw before and in practice it's
actually much more complicated so we
have one implementation of it that we
had to do and I'm not going to go
through it but basically you have like
so much stuff that you have to think
about when you implement that type of of
uh PPO algorithm so you have clipping
everywhere you have a lot of
complexities and things are not well
documented all this to say um that we're
going to there was a new method that was proposed uh also from Stanford one year ago called DPO which is essentially a simplification of PPO um and the way uh
what they did or the idea that they have
is that instead of using reinforcement
learning you can just maximize the
probability of generating the stuff that
you like and minimizing the probability
of the stuff that you don't like uh so
if you think about the human preference
the red and green maximize uh green
minimize red um so the loss is actually
this one uh where what you see this is
simply um some log of the model so this
is the likelihood of a model generating
the things that the human preferred
given the the inputs um and what you try
to do is basically
maximize uh the likelihood of generating
the things that you like minimize the
likelihood of the things that you don't
like um all the rest of the terms here
it's not too important it's actually
really not that complicated to
understand but at a high level it's
really just maximizing the things you
like minimizing the the rest um and one
thing to note uh which I was going to
say just here is that actually all the rest is chosen such that um the global minima of PPO and the global minima of this DPO under some assumptions are essentially equivalent so this is the right thing to do mathematically I'm not going to go through the derivations but that's the right thing to do
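For reference, a minimal sketch of the DPO loss itself: it only needs sequence log-probabilities of the preferred and rejected answers under the model being trained and under the frozen reference model; the tensors below are toy stand-ins for those quantities.

```python
import torch
import torch.nn.functional as F

beta = 0.1  # temperature controlling how far the policy can move from the reference

# Toy sequence log-probs log p(answer | prompt) for a batch of 4 preference pairs.
logp_chosen       = torch.randn(4, requires_grad=True)  # policy, preferred answer
logp_rejected     = torch.randn(4, requires_grad=True)  # policy, rejected answer
ref_logp_chosen   = torch.randn(4)                       # frozen reference model
ref_logp_rejected = torch.randn(4)

# DPO: maximize the margin between the (reference-corrected) log-probs of the
# answer humans preferred and the one they didn't.
margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
dpo_loss = -F.logsigmoid(beta * margin).mean()
dpo_loss.backward()
print(dpo_loss)
```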
it's pretty different from PPO in the sense that with PPO what you had to do
is collect the human preferences then
train a uh reward model with maximum
likelihood then use reinforcement
learning now all you do is basically
maximum likelihood much simpler yes I mean yeah so it seems like this is much simpler and like what you would just intuitively do so why did they start with this reward model like what led them to doing that I think it's a
great question uh I don't really know what I can tell you is that at OpenAI the people who did basically ChatGPT initially are the ones who actually wrote PPO and I think they were just like
there are a lot of reinforcement
learning people and I think that for
them it was very intuitive um so there's
also some additional like potential
benefits for example I don't want to
yeah for example if you use the reward
model uh the cool thing here with
reinforcement learning is that you can
use unlabeled data with the reward model
so here you can only use the labeled data for doing DPO um for PPO you first train your reward model and then you can use unlabeled data uh where the reward model will basically label this unlabeled data so there's additional kind of potential uh there could be potential improvements in practice and I think just that a lot of people in this team were reinforcement learning experts including uh the main author of PPO John Schulman um so DPO is much simpler than PPO and basically performs as well uh so now
this is the standard uh thing that
people use at least in the open source
Community I believe it's actually the
standard also in in Industry so that's
called DPO gains
um so those are all the papers on the
left here this is on a summarization
task you see all I want to show you is
that basically the pre-train models uh
were okay and they improve with scale if
you do supervised fine tuning you
improve them a little bit more if you do PPO or something with RLHF with human feedback you get performance that is oftentimes depending on the benchmark even better than uh humans so this is the human uh reference summaries same thing this is on a uh on a paper that we have AlpacaFarm where we see uh the evaluation here is not too important but basically you see pre-trained model you jump to SFT and then you jump to PPO and DPO and PPO have the exact same performance so basically RLHF helps
that's kind of the conclusion and DPO is
simple uh data uh the way that you
collect that type of data um first idea
is just use humans as we already talked
about uh guidelines are very complicated
for what humans should be labeling and
and it's really not that easy and
actually if you ever do some of the
labeling you will see that it's
extremely complicated like if I zoom in
to this uh here I have a question tell
tell me about self-driving cars and you
read both self-driving cars are vehicles
that are capable of detecting their
surroundings blah blah blah self-driving
cars are cars that are equipped with
sensors blah blah blah to navigate
without the need for a driver I mean
both seem okay like which one is better
it's actually hard to say at a glance um
and as a result uh the problem with
humans is that you will start optimizing
a lot of like high level features for
example the second one is longer I can
guarantee you that most humans will
choose second one even though I mean
maybe the first one is better I don't
know I haven't read it carefully so
challenges with humans first slow and
expensive uh second as I just mentioned
it's hard to focus on things that matter
like correctness and people uh usually
look at things that don't matter as much
like the form like length uh and as a result so what I show here is that uh when you do RLHF the more you do of RLHF the longer the output of the models becomes so if you've ever been annoyed at ChatGPT answering you with super long sentences this is because of RLHF um annotator distribution shift uh
like the distribution of annotators that
you use matters a lot and you have to
think like what is what is even the
humans that we want to represent in
these models uh now the question is like
crowdsourcing ethics uh like usually
these basically a lot of the the
labeling that is done um like the people
who do them are not paid well and they
have to go through a lot of toxic data
uh because you basically want the model
to avoid saying the toxic data um so
crowdsourcing ethics
too so many challenges with human data
um so what we did also last year is
again the same thing as alpaca just the
idea of like oh well they're challenges
with humans maybe we can just replace
them with llms uh so what we did is
simply replace
um oh I see that I'm just realizing that
the slides are not sented anyways uh you
replace a human preference with LM
preferences uh so here on this uh figure
you see on the xaxis the price that we
paid uh for collecting human data it's
around
$300 for 1,000 examples and this is on
mechanical turkers which are usually
like cheaper than than maybe some of the
other um companies that you could go
through and on the y-axis it's basically the agreement with uh other humans with the mode of other humans and what you see is that actually as I told you before labeling is really complicated humans agree with themselves only around 66% of the time on a binary task and it's
not that the humans are not good here
because uh we were five main authors on
this paper we tried to label this data
ourselves and we only had like say 67 or
68% accuracy even though we talk like we
talk for like 3 hours of how we should
be doing labeling really it's
complicated it's not an easy task um and
here I just showed many different models
and um basically you see that models are
much cheaper and they can actually get
higher agreement with the mode of humans than humans themselves and the reason why is because humans have a lot of variance models have no variance so they might be a little bit more biased but have less variance uh so it works surprisingly well and now it's kind of the standard in the open source community I think even in industry a lot of people use both humans and LLMs for improving uh the collection of RLHF data
um and this is like this is the paper
from last year but honestly now it's
more like that llms would be around this
agreement and this cost so around I
would say 50x cheaper than humans and
better agreement with human than humans
themselves okay so that gets us to
evaluation of post
training um that goes back to your
initial question at the beginning of the
lecture how do you evaluate something
like chpt uh the answers that chpt could
give are basically unbounded and it's
not that there one right answer there
are many answers that are just as good
um so there are many challenges one you
can't use validation loss because one
method might use po the other one might
use DPO validation loss is not
comparable second you can't use uh sorry perplexity that's the thing I told you before these models uh are not calibrated they don't give distributions they just optimize for one thing so you can't use perplexity for actually evaluating uh these types of models once they're aligned third
uh there's a large diversity of
questions that human might ask to these
models generation open QA like some
question answering some summarization
and all of these things so there's so
many things you have to cover um then
the tasks are really open-ended so it's
very hard to automate so that's what you
were alluding to before so the idea uh
is that instead of trying to come up
with really easily automated uh
benchmarks uh it's just we're going to
ask questions that that users actually
ask to these models in practice and
we're just going to ask annotators to
say between these two models which one
is better like what's the what's the
better output so basically do exact same
thing as um basically the data from rhf
but you use it now for evaluation yes
I'm not sure I understand what you mean
by like can't use perplexity and not
calibrated right like LM is still doing
like next token
prediction so think about um the optimal solution after doing PPO is
basically one model that gives you uh
essentially a Delta um like basically
says that there's only one sentence that
is that could be generated for that
question so now if you use it on
something that is slightly semantically
differently different it would actually
give a likelihood of zero for that
answer so in reality it's not that
extreme because as you say it's still a
distribution but it just shows you that there's a fundamental issue with perplexity once these models are not LLMs anymore they were not trained at least with PPO they were not trained to do maximum likelihood anymore they were trained to be
policies okay um so probably the most
common or like the most um yeah the most
common Benchmark or the most trusted one
is what we call Chad uh sorry chatbot
Arena uh which is basically go on
internet have random users on the
internet blindly talk with two chat Bots
just ask many questions see the two
answers and rate which one is better and
and you do that over hundred of
thousands of users and then you get uh
the actual preferences and you get
rankings of models uh so you can go
right now on chatbot Arena and actually
interact with these models um one
potential issue just to highlight is
that while people who want to do these
type of things are usually more like
Tech driven um or like techsavvy uh so a
lot of the questions that you will ask
are more like Tech stuff discussing
software errors inquiries about AI tools
and all these things um so another issue
is cost and speed if you really want to
use something like this for development
process um it will be too costly because
you would need to basically pay a lot of
humans to do that so one simple idea is
again as we said many times just use LM
instead of humans uh you probably know
the drill at this point uh steps for
every instruction generate outputs by
some baseline and the model that you
want to evaluate um so here you imagine
that I'm comparing an answer from ChatGPT and from the model I want to evaluate I'm just asking another model uh which one is better and I just basically average that out uh yeah I ask GPT-4 which one is better I average
that out over my entire distribution
over my entire Benchmark or data set and
that gives me a win rate so a win probability for one model compared to another one and now you can rank models
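Here is a sketch of that win-rate computation with the judge call abstracted away: judge_pick is a hypothetical function wrapping a call to GPT-4 or any other strong model, and randomizing the presentation order is a common trick to reduce the judge's position bias, not something specific to AlpacaEval.

```python
import random

def judge_pick(instruction: str, answer_1: str, answer_2: str) -> int:
    """Hypothetical judge: ask a strong LLM which answer is better; return 1 or 2.
    A real version would format a judging prompt and call an API."""
    raise NotImplementedError

def win_rate(instructions, baseline_outputs, candidate_outputs) -> float:
    wins = 0
    for x, base, cand in zip(instructions, baseline_outputs, candidate_outputs):
        # Randomize presentation order to mitigate the judge's position bias.
        if random.random() < 0.5:
            wins += judge_pick(x, base, cand) == 2   # candidate shown second
        else:
            wins += judge_pick(x, cand, base) == 1   # candidate shown first
    return wins / len(instructions)
```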
uh and this is the AlpacaEval
leaderboard so the benefits of this is
that actually we show we get 98%
correlation with Chatbot Arena so very
high correlation with humans um so this
is yeah comparison with correlation with
other benchmarks and it takes less than
three minutes and less than $10 to run
so it's pretty cheap um there are
downsides though uh one of them is spurious correlation um so as we already saw before this is one spurious correlation there are many I'll just talk about one LLMs prefer longer outputs
actually humans also prefer longer
outputs but the problem or the issue
once you use llms is that once there
bias you will continue optimizing that
humans at some point I can guarantee you
if I ask a simple question and you give
me five pages of answers I'll be like no
I don't like that answer but LLMs if they have this bias and they were trained for
that they will continue preferring
longer outputs so uh here we see um the
the preference just showing that like
humans and models prefer longer outputs
um and here is another view of the initial AlpacaEval uh benchmark where when we rank GPT-4 when we look at the win rate of GPT-4 versus actually uh GPT-4 itself if we use the standard GPT-4 it gets 50% kind of by definition because we're comparing GPT-4 versus GPT-4 but if we ask GPT-4 to be slightly more verbose so we just say in the prompt be verbose in your answers then it gets a win rate of 64.4% so really there's a huge variance and if we ask it to be concise it gets 20% so there's a huge variance depending on um whether you ask it to be concise or not
that's very annoying um so one possible
solution which is what we did is uh just
use some regression analysis I'm not going to go into details but basically use causal inference tools to control for length and right now uh actually length matters much less so if you ask it to be verbose we still get some gains but much less great so that's all about post
training and now for the next eight
minutes I might talk about systems or
just answer questions yes can you um go
back to your post training in terms of
post training how did we tune those
parameters using the small body of
fine-tuning data and have such big
effect on the model you mentioned
earlier that there's a different set of
hyperparameters are we changing just
some of the weights the later weights or
all the weights what's actually
happening yeah uh I kind of skimmed through all of this you change all the weights actually um industry would change all the weights in open source land you might have heard of LoRA which is going to change basically only some of the weights or actually to be more specific it's going to add some differences to the output of every layer but in industry you're going to just fine-tune all the weights um and also to say something else about the data actually for RLHF you're usually going to collect a lot more data than with SFT so if SFT is like 5,000 10,000 maybe 50,000 with RLHF I think you're going to be more around like the 1 million
uh order of magnitude it's still much
less than pre-training though yeah
because pre-training is 15 trillion
tokens I mean this is like that's not
even a drop and yet you influence the
weight a lot so because you do it I mean
you have to think that how you do it is
you use um I mean as I said the learning
rate that you're going to use is going
to be different but also you only do
that so just imagine even if I train on one sentence but over and over again at some point my model will only output that sentence even if uh it was just one sentence instead of the 15 trillion tokens so if you use a large enough learning rate and for enough time you will basically overfit that sentence
so the the the key thing to to remember
is that um the data is not I it's not as
if you mix some posttraining data and
some pre-training data you do
pre-training and then you just start
fine-tuning only on the post trining so
another way maybe another perspective is
that the post the pre-training is just
the initialization of your model
and once you view it that way that this
is just initialization of Weights then
there's nothing special like you don't
need to remember that you train a lot of
data before the only thing that matters
is that you had an initialization and
now I actually train a model so maybe
think about it that way like there's a Markov property in some way
just like you had your weights this is
my initialization now I'm training that
one does that kind of answer your
question kind of but you said something
just now about it's almost the
equivalence of just rerunning the find
tuning data many times is it actually is
that what actually happens in order to
give so much more preference
um I actually don't know right now how they do it in industry when we did Alpaca we had to do three epochs so you did run it through three times
um but I mean even the number of times
that you run it through it's actually
not important the only thing like the
only thing is the is kind of the
effective learning rate that what
matters
um so
yeah
great so I think I have five minutes
right okay I might try to give a high
level overview of at least one of the systems tricks systems as we said uh for everyone the bottleneck sorry compute is the huge bottleneck uh one question you
might ask is why not buy more gpus uh
gpus are expensive but also are scarce
even if you have $10 million right now
you cannot buy the best gpus um
there's oh yeah there's also some
physical limitations when you have when
you have multiple gpus you have to
communicate between them that takes time
um so just buying more gpus is not that
easy um so it's really important to
think about how do you allocate
resources and how do you optimize your
pipeline so system 101 on gpus I'm sorry
I'm going slightly faster I hope for
that some of you at least can follow uh
gpus are basically optimized for
throughput CPUs are optimized uh for
latency so gpus the way you have to
think about it is that there's one command that is run on many many cores at the same time on different types of data um so this is how you see a GPU you see there are many different cores we call them streaming multiprocessors which is very different from the usual CPU architecture so just think high throughput parallelization for
gpus uh gpus are optimized for fast
matrix multiplication so every time you
will do uh you will do something on GPU
if you can do it with a a matrix
multiplication it's going to be 10 times
faster than with anything else uh that
is a little bit annoying because it
means that we're kind of uh bottlenecked
to doing anything with Matrix
multiplications um another thing to note
with gpus is that compute has been
improving faster than memory and
communication so right now gpus usually
are hard to keep uh like the data that
you send that send to gpus is actually
hard to keep up with the processess so
most of your gpus are actually going to
be idle if you just run normal code if
you don't optimize your code so
communication and this will continue
over time another thing to know about
gpus is that there's a memory hierarchy
this is the same thing actually with
CPUs but basically the closer you are to
your cores the less memory there is but the faster things run if you're further away more memory but slower
um okay I'm going to skip that okay
actually I'm going to say it I told you
about this uh the fact of communication
uh the metric that people usually look at is model FLOP utilization so what is the theoretical maximum that the GPU could run at the number of flops that you could use per second sorry it's the observed throughput divided by this theoretical um maximum and in general if you reach 50% you're very happy like Facebook I looked at Llama it was at 45% or something like this so that means that data doesn't come fast enough even for these big companies
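Here is a quick sketch of how you would compute MFU from training throughput, reusing the 6 FLOPs-per-parameter-per-token rule of thumb from earlier; the peak-FLOPs number is the published H100 dense BF16 figure and the throughput is an illustrative value, not Meta's actual number.

```python
# Model FLOP utilization (MFU) = observed FLOP/s / theoretical peak FLOP/s.
# Observed FLOP/s is estimated from throughput via the ~6 * P FLOPs-per-token rule.
n_params           = 405e9    # model parameters
tokens_per_second  = 2.9e6    # illustrative cluster-wide training throughput
n_gpus             = 16_000
peak_flops_per_gpu = 989e12   # H100 dense BF16 peak per the spec sheet

observed = 6 * n_params * tokens_per_second
peak     = n_gpus * peak_flops_per_gpu
print(f"MFU ~ {observed / peak:.0%}")  # ~45% with these illustrative numbers
```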
so one simple trick and that might be the only one I'm going to tell
you about is low Precision one simple
idea is that well if I'm going to put my
floats in lower Precision then there's
going to be fewer bits that I have to
send to my gpus if there's fewer bits
it's faster communication lower memory
consumption things are going to go
faster uh and for deep learning it just
happens that the decimals are not that important uh so when you do matrix
multiplication when you do like for
example SGD there's already so much
noise that if you update something by
0.01 or
0.015 who cares uh so basically instead
of using uh 32 bits per float which is
um what people used to use or 64 for
example which is what you would use in
other domains you use 16 bits uh for
matrix multiplication so for every float
you use 16 bits um and for training you
have this type of like uh what we call
automatic mixed precision which is that uh some of the things are in 32 bits others are in 16 bits um generally the way you should be thinking about it is that the weights of your model are stored in 32 bits um but just before the computation you put everything in 16 bits like this you do computation super fast and at the end you update your weights in 32 bits and
the reason why you do all the updates in
32 bits it's just think that if your
learning rate for example is very small
you still want to be able to like make a
difference in your weights uh so all the
computation is done in 16 bits but the
weights are actually stored in 32 bits
so that's like the standard way that people are doing it
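A minimal sketch of what that looks like with PyTorch's automatic mixed precision, assuming a CUDA GPU is available; the model and data are toy placeholders, and the GradScaler is there because fp16 gradients can underflow without loss scaling.

```python
import torch

# Toy model and data; assumes a CUDA GPU is available.
model = torch.nn.Linear(1024, 1024).cuda()       # master weights stay in fp32
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()             # rescales the loss so fp16 grads don't underflow
x = torch.randn(32, 1024, device="cuda")
target = torch.randn(32, 1024, device="cuda")

for _ in range(10):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():              # matmuls run in fp16, reductions stay in fp32
        loss = torch.nn.functional.mse_loss(model(x), target)
    scaler.scale(loss).backward()
    scaler.step(optimizer)                       # weight update happens on the fp32 master weights
    scaler.update()
```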
um okay I'll actually talk just about this and then
I'll skip all the rest operator Fusion
because I think this is actually pretty
cool as I just said communication is
very slow and actually every time you run a PyTorch line it basically moves a variable to the global memory of your GPU so when you have something like x1 = x.cos() and then you do x1.cos() what is happening behind the scenes is that you take the x which is data you ship it to the actual processors of your GPU you apply the cosine you ship it back to the main memory of your GPU and then you see the next cosine you ship it back to the GPU processor you apply another cosine and you ship it back
again um so another way to see that is
that you go from your DRAM which is your global memory in your GPU and you ship
it to compute you ship it back for every
line This is a naive way of doing it
this seems very wasteful um so the idea
simple idea of operative Fusion is just
communicate do all the computation ship
it back once and this is exactly what
fused kernels are um so if you ever want to make your computations in PyTorch much faster just apply torch.compile on your model this is going to make your model around two times faster and what it does is simply that it rewrites your code uh your PyTorch code basically in C++ and CUDA uh to do the communication only once then do all the operations then uh ship it back
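A minimal sketch of the difference: eager mode launches one kernel (and one round trip to GPU global memory) per line, while torch.compile can fuse the two cosines into a single kernel. The speedup you actually see depends on the model and hardware, and the roughly two-times figure from the lecture is for full models, not a toy like this.

```python
import torch

def double_cos(x):
    x1 = x.cos()   # eager mode: kernel launch + round trip to GPU global memory
    x2 = x1.cos()  # ...and a second round trip for the second op
    return x2

# torch.compile traces the function and can fuse both ops into one kernel,
# so the intermediate x1 never has to travel back to global memory.
compiled_double_cos = torch.compile(double_cos)

x = torch.randn(4096, 4096, device="cuda" if torch.cuda.is_available() else "cpu")
out_eager = double_cos(x)
out_fused = compiled_double_cos(x)
assert torch.allclose(out_eager, out_fused, atol=1e-5)
```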
okay I'm not going to have time to talk about tiling tiling is important parallelization parallelization is important um and
mixture of experts mixture of experts is
important Outlook there are many things
we haven't talked about we haven't
talked about architectures we definitely
haven't talked about inference um there
are many other things that are important
with LLMs what is the UI that you use I mean arguably with ChatGPT the big novelty was just having a simple UI to use it
multimodality what are all the misuses
you could have uh the fact that there
might not be enough data on the internet
to train all these models legality of
data collection so many other things if
you are interested in all these topics
uh I would suggest three classes cs224n
is probably the one that touches the
least on uh LLMs uh but it gives some
background and historical context um of
all the LMS and gives kind of some
adjacent material CS 324 I think it's
called Uh I think it's just called large
language models uh more in-depth reading
and lectures on everything I talked
about CS 336 which is large language
model from scratch you actually build
your own llm uh it's an amazing class
also given by my two supervisors very
heavy workload so be careful and um
great