Stanford CS229 | Machine Learning | Building Large Language Models (LLMs)
By Stanford Online
Summary
## Key takeaways
- **LLM Training: Data, Evaluation, and Systems Matter Most**: While academia often focuses on architecture and training algorithms, practical LLM development hinges on data quality, effective evaluation, and robust systems. Industry prioritizes these latter three components for successful model implementation. [01:50], [02:48]
- **Tokenization Balances Vocabulary and Sequence Length**: Tokenization is crucial for LLMs, moving beyond simple word-based approaches. Techniques like Byte Pair Encoding create tokens from common sub-sequences, aiming for an average of three to four letters per token to manage sequence length and handle variations like typos. [10:48], [12:26]
- **Scaling Laws Predict LLM Performance Improvements**: LLM performance scales predictably with increased compute, data, and model size. These scaling laws, often linear on log-log plots, allow for forecasting future performance and optimizing resource allocation, indicating that architecture nuances are secondary to scale. [41:03], [46:43]
- **RLHF Aligns LLMs Beyond Supervised Fine-Tuning**: Reinforcement Learning from Human Feedback (RLHF) refines LLMs by maximizing human preference, addressing limitations of Supervised Fine-Tuning (SFT) such as behavioral cloning and potential hallucinations. This process involves training a reward model or directly optimizing for human preferences, with DPO offering a simpler alternative to traditional RL. [01:00:07], [01:13:10]
- **GPU Optimization: Mixed Precision and Operator Fusion**: Maximizing GPU utilization requires optimizing for throughput and leveraging matrix multiplication. Techniques like mixed precision (using 16-bit floats) and operator fusion (e.g., via `torch.compile`) reduce communication overhead and memory consumption, significantly speeding up computations. [01:38:00], [01:41:38]
Topics Covered
- Data, evaluation, and systems matter more than model architecture.
- The complex pipeline for cleaning internet data for training.
- Scaling laws allow us to predict future AI performance.
- The surprising $75 million cost of training a model.
- Why pure language models are not useful AI assistants.
Full Transcript
So let's get started. I'll be talking about building LLMs today. I think a lot of you have heard of LLMs before, but just as a quick recap: LLMs, standing for large language models, are basically all the chatbots that you've been hearing about recently, so ChatGPT from OpenAI, Claude from Anthropic, Gemini, Llama, and other models like this, and today we'll be talking about how they actually work. It's going to be an overview, because it's only one lecture and it's hard to compress everything, but hopefully I'll touch a little bit on all the components that are needed to train some of these LLMs. Also, if you have questions, please interrupt me and ask; if you have a question, most likely other people in the room or on Zoom have the same question, so please ask.
Great. So what matters when training LLMs? There are a few key components that matter. One is the architecture: as you probably all know, LLMs are neural networks, and when you think about neural networks you have to think about what architecture you're using. Another component which is really important is the training loss and the training algorithm, so how you actually train these models. Then there's data: what do you train these models on. Then evaluation, which is how you know whether you're actually making progress towards the goal of LLMs. And then the systems component, which is how you actually make these models run on modern hardware. That's really important because these models are really large, so now more than ever, systems is actually a really important topic for LLMs.

So, those five components. You probably all know that LLMs, if you don't know, are all based on Transformers, or at least some version of Transformers. I'm actually not going to talk about the architecture today, one because I gave a separate lecture on Transformers a few weeks ago, and two because you can find so much information online about Transformers, but there's much less information about the other four topics, so I really want to talk about those. Another thing to say is that most of academia actually focuses on the architecture, training algorithms, and losses. As academics, and I've done that for a big part of my career, it's simply that we like thinking that making new architectures and new models is what's very important, but in reality, honestly, what matters in practice is mostly the three other topics: data, evaluation, and systems, which is what most of industry actually focuses on. So that's also one of the reasons why I don't want to talk too much about the architecture, because really the rest is super
important. Great, so here's an overview of the lecture. I'll be talking about pre-training. Pre-training, you've probably heard that word, is the general term for the classical language modeling paradigm, where you basically train your language model to essentially model all of the internet. And then there's post-training, which is a more recent paradigm, which is taking these large language models and making them essentially AI assistants; this is more of a recent trend since ChatGPT. So if you've ever heard of GPT-3 or GPT-2, that's really pre-training land; if you've heard of ChatGPT, which you probably have, that's really post-training land. I'll be talking about both, but I'll start with pre-training, and specifically I'll talk about what the task of pre-training LLMs is and what loss people actually use.
use so language modeling this is a quick
recap uh language models at a high level
are simply models of probability
distribution over sequences of tokens or
of words so it's basically some uh model
of P of X1 to XL where X1 is basically
word one and Excel is the last one in
the sequence or in the sentence um so
very concretely if you have a sentence
like the mouse ate the cheese what the
language model gives you is simply a
probability of this sentence being
uttered by a human or being found on on
online uh so if you have another
sentence like the the mouse at cheese uh
here there's grammatical mistakes so the
model should know that this uh should
have some syntactic knowledge so it
should know that this has less
likelihood of appearing
online uh if you have another sentence
like the cheese ate the mouse uh then
the model should hopefully know about
the fact that usually cheese don't eat
Mouse um so there's some semantic
knowledge and this is less likely than
the first sentence so this is basically
at a high level what language models are
One term that you've probably been hearing a lot in the news is generative models. These are just models that can generate sentences or generate some data. The reason why we say language models are generative models is that once you have a model of a distribution, you can simply sample from this model and generate data, so you can generate sentences using a language
model. So the type of models that people are all currently using are what we call autoregressive language models, and the key idea of autoregressive language models is that you take this distribution over words and you basically decompose it into the distribution of the first word, multiplied by the distribution of the second word given the first word, multiplied by the probability of the third word given the first two words, and so on. There's no approximation here, this is just the chain rule of probability, which you hopefully all know about; really no approximation, this is just one way of modeling a distribution. Slightly more concisely, you can write it as a product of the probabilities of the next word given everything which happened in the past, so given the context: p(x_1, ..., x_L) = ∏_i p(x_i | x_1, ..., x_{i-1}). So this is what we call autoregressive language models. Again, this is really not the only way of modeling a distribution, this is just one way; it has some benefits and some downsides. One downside of autoregressive language models is that when you actually sample from them, you basically have a for loop which generates the next word, then conditions on that next word, and then generates another word. So if you have a longer sentence that you want to generate, it takes more time to generate it. There are some downsides of this current paradigm, but that's what we currently have, so I'm going to talk about this
one. Great, so autoregressive language models at a high level: the task of an autoregressive language model is simply predicting the next word, as I just said. So if you have a sentence like "she likely prefers", one potential next word might be "dogs", and the way we do it is that we first tokenize, so you take these words or subwords, you tokenize them, and then you give an ID to each token, so here you have 1, 2, 3. Then you pass it through this black box (as I already said, we're not going to talk about the architecture), you just pass it through a model, and you then get a probability distribution over the next word, over the next token. Then you sample from this distribution, you get a new token ID, you detokenize, and that's how you basically sample from a language model. One thing which is important to note is that the last two steps are actually only needed during inference. When you do training, you just need to predict the most likely token, and you can just compare to the real token which happened in practice, and then you basically change the weights of your model to increase the probability of generating that token.
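To make that pipeline concrete, here is a minimal sketch of the sampling loop in PyTorch-style Python; `model` and `tokenizer` are hypothetical stand-ins (any model that returns next-token logits and any encode/decode pair would do), not code from the lecture.

```python
import torch

def sample(model, tokenizer, prompt, max_new_tokens=20):
    # Tokenize: text -> token IDs (e.g. "she likely prefers" -> [1, 2, 3]).
    ids = tokenizer.encode(prompt)
    for _ in range(max_new_tokens):
        # The model maps the context to logits over the vocabulary;
        # keep only the logits for the last position (the next token).
        logits = model(torch.tensor([ids]))[0, -1]
        probs = torch.softmax(logits, dim=-1)
        # Sample a token ID from the distribution, then condition on it.
        next_id = torch.multinomial(probs, num_samples=1).item()
        ids.append(next_id)
    # Detokenize: token IDs -> text.
    return tokenizer.decode(ids)
```

Note that the softmax/sample/detokenize steps at the end are the inference-only part; during training you stop at the logits and compare them to the real next token.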
Great, so autoregressive neural language models. So, to be slightly more
specific, still without talking about the architecture: the first thing we do is that we have all of these... oh sorry, yes? "On the previous slide, when you're predicting the probability of the next tokens, does this mean that your final output vector has to be the same dimensionality as the number of tokens that you have?" Yes. "How do you deal with it if you're adding more tokens to your corpus?" Yeah, so we're going to talk about tokenization later, so you will get some sense of this. You basically can deal with adding new tokens... I'm kind of exaggerating, there are methods for doing it, but essentially people don't do it. So it's really important to think about how you tokenize your text, and that's why we'll talk about that later. But it's a very good point to notice that the vocabulary size, so the number of tokens that you have, is essentially the output dimension of your language model, so it's actually pretty
large. Okay, so autoregressive neural language models: the first thing you do is take every word, or every token, and embed it, so you get some vector representation for each of these tokens. You pass them through some neural network, as we said it's a Transformer, and then you get a representation for all the words in the context, so it's basically a representation of the entire sentence. You pass it through a linear layer, as you just said, to map it so that the number of outputs is the number of tokens in the vocabulary. You then pass it through a softmax, and you basically get a probability distribution over the next word given every word in the context,
and the loss that you use: it's essentially a task of classifying the next token, so it's a very simple kind of machine learning task. You use the cross-entropy loss, where you look at the actual target that happened, which is a target distribution that is a one-hot encoding; here, in this case, the real word that happened is "cat", so that's a one-hot distribution over "cat", and this (do you see my mouse? oh yeah) is the distribution that you generated. And basically you do cross-entropy, which really just increases the probability of generating "cat" and decreases the probability of generating all the other tokens. One thing to notice, as you all know, is that this is just equivalent to maximizing the text log-likelihood, because you can rewrite maximizing the probability of this autoregressive language modeling task as minimizing the cross-entropy loss (I just added the log and the minus sign). So basically, minimizing the loss is the same thing as maximizing the likelihood of your text. Any questions? Okay.
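As a rough illustration of that objective, here is a minimal PyTorch sketch of the next-token cross-entropy loss; the tensor shapes and the `model` call are assumptions for illustration, not the lecture's actual code.

```python
import torch
import torch.nn.functional as F

def next_token_loss(model, token_ids):
    # token_ids: (batch, seq_len) integer IDs of the training text.
    inputs, targets = token_ids[:, :-1], token_ids[:, 1:]
    logits = model(inputs)  # (batch, seq_len - 1, vocab_size)
    # Cross-entropy against the one-hot target is exactly the negative
    # log-likelihood of the observed next token, averaged over positions,
    # so minimizing it maximizes the likelihood of the text.
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
    )
```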
Tokenizers. So this is one thing that people usually don't talk that much about, but tokenizers are extremely important, so it's really important that you understand at least what they do at a high level. Why do we need tokenizers in the first place? First, they're more general than words. One simple thing that you might think is: we're just going to take every word and say every word is a token in its own right. But then what happens if there's a typo in a word? You might not have any token associated with this word with a typo, and then you don't know how to actually pass this word into a large language model, so what do you do next? And also, words are fine for Latin-based languages, but if you think about a language like Thai, you won't have a simple way of tokenizing by spaces, because there are no spaces between words. So really, tokens are much more general than words; that's the first thing. The second thing you might think is that you could tokenize every sentence character by character: you might say "a" is one token, "b" is another token. That would actually work, and probably very well. The issue is that then your sequence becomes super long, and as you probably remember from the lecture on Transformers, the complexity grows quadratically with the length of the sequence, so you really don't want a super long sequence. So tokenizers basically try to deal with those two problems and give common subsequences their own token, and usually the way you should think about it is that, on average, every token is around three to four letters.
There are many algorithms for tokenization; I'll just talk about one of them to give you a high-level view, which is what we call byte pair encoding (BPE), which is actually pretty common, one of the two most common tokenizers. The way that you train a tokenizer is that first you start with a very large corpus of text; here I'm really not talking about training a large language model yet, this is purely for the tokenization step. So this is my large corpus of text with these five words. Then you associate with every character in this corpus a different token; so here I just split up every character as a different token, and I color-coded all of those tokens. Then what you do is that you go through your text, and every time you see the most common pair of tokens, you merge them. So here you see the tokens "t" and "o" next to each other three times, so you're just going to say this is a new token, and then you continue, you repeat that. So now you have "to", which happens three times, then "to" with an "e", which happens, sorry, two times, then a "token" token which happens twice, and then "ex" which also happens twice. So if you were to train a tokenizer on this corpus of text, which is very small, that's how you would finish with a trained tokenizer; in reality you do it on much larger corpora of text. And this is the real tokenizer of, I think, GPT-3 or ChatGPT, and here you see how it would actually separate these words. You basically see the same thing as what we gave in the previous example: "token" becomes its own token, so "tokenizer" is actually split up into two tokens, "token" and "izer". So yeah, that's all about tokenizers. Any questions on that?
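To make the merging procedure concrete, here is a toy sketch of BPE training; it is simplified (whitespace pre-splitting, no byte-level handling) and the example corpus is made up, so it is illustrative rather than the GPT tokenizer's actual implementation.

```python
from collections import Counter

def train_bpe(corpus, num_merges):
    # Start with each word split into single-character tokens.
    words = [list(w) for w in corpus.split()]
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent pair of tokens occurs.
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        # Merge the most frequent pair into a single new token.
        (a, b), _count = pairs.most_common(1)[0]
        merges.append((a, b))
        new_words = []
        for w in words:
            merged, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and (w[i], w[i + 1]) == (a, b):
                    merged.append(a + b)
                    i += 2
                else:
                    merged.append(w[i])
                    i += 1
            new_words.append(merged)
        words = new_words
    return merges

# Example (hypothetical corpus): the most frequent pair ("t", "o") merges first.
print(train_bpe("token tokens tokenizer text toast", 5))
```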
how do you deal with spes and how do you
deal
with yeah so actually there's a a step
before tokenizers, which is what we call pre-tokenizers, and which is exactly what you just said. In theory there's no reason to deal with spaces and punctuation separately: you could just say every space gets its own token, every punctuation mark gets its own token, and you can just do all the merging. The problem is that there's an efficiency question: training these tokenizers takes a long time, because you have to consider every pair of tokens. So what you end up doing is saying, if there's a space (this is why pre-tokenizers are very English-specific), we're not going to start looking at the token that came before and the token that came afterwards, so you're not merging across spaces. But this is just a computational optimization; you could theoretically deal with it the same way as you deal with any other character. And yeah. "When you merge tokens, do you delete the tokens that you merged away, or do you keep the smaller tokens that you merged?" You actually keep the smaller tokens. In reality it doesn't matter much, because on a large corpus of text you will actually have everything, but you usually keep the small ones, and the reason you want to do that is that, in case there are, as we said before, some grammatical mistakes, some typos, you still want to be able to represent these words by
character. So yeah. Yes? "Are the tokens unique? I mean, say in this case, is there only one occurrence of 'token', or do you need to have multiple occurrences so they could take on different meanings or something?" Oh, I see what you're saying. No, no, every token has its own unique ID. This is a great question: for example, if you think about "bank", which could be bank as in money or bank as in a river bank, it will have the same token, but the model, the Transformer, will learn, based on the words that are around it, to associate it (I'm being very hand-wavy here) with a representation that is either more like the money-bank side or the water-bank side. But it's the Transformer that does that, it's not the
tokenizer. Yes? "Yeah, so you mentioned during tokenization you keep the smaller tokens you started with, right? Like if you start with a 't' you keep the 't', and then you build your tokenizer up so that you can now encode 'token'. So let's say you didn't train on 'token', but in your data you are trying to encode 'token'; how does the tokenizer know to encode it with the 'token' token or...?" Great question. When you tokenize, so that's after training of the tokenizer, when you actually apply the tokenizer, you basically always choose the largest token that you can apply; so if you can use "token", you will never use "t", you will always use "token". People don't usually talk that much about tokenizers, but there are a lot of computational tricks that you can do to make these things faster. And honestly, I think a lot of people think that we should just get away from tokenizers and just tokenize character by character or byte by byte, but as I said, right now there's this issue of sequence length. Maybe one day, in five or ten years, we will have different architectures that don't scale quadratically with the length of the sequence, and maybe we'll, yeah, move away from tokenizers. "So can you share with us the drawbacks? Why do people want to move away from the tokenizer?"
Oh, yeah. So, one good example is math. If you think about math, numbers right now are not tokenized digit by digit, so for example "327" might have its own token, which means that models, when they see numbers, don't see them the same way as we do. And this is very annoying, because the reason we can generalize with math is that we can deal with every digit separately and then do composition, where you know that adding numbers is just the same thing as adding each digit separately plus whatever you carry, so you can do that. So then you have to do special tokenization, and one of the big changes that GPT-4 made is changing the way that they tokenize code. For example, in code you often have, in Python, these four spaces at the beginning of a line; those were dealt with kind of strangely before, and as a result the model couldn't really understand how to deal with code. So tokenizers actually matter a lot. Okay, I'll move on right now, but we can come back to tokenizers later. Great.
So we talked about the task, the loss, and the tokenizer; let's talk a little bit about evaluation. The way that LLMs are usually evaluated is using what we call perplexity. At a high level, it's basically just your validation loss. The slight difference with perplexity is that we use something that is slightly more interpretable: we use the average per-token loss and then exponentiate it. The reason you exponentiate is, one, the loss has a log inside and humans are actually pretty bad at thinking in log space, and two, logs depend on the base of the log, while when you exponentiate you basically have everything in vocabulary-size units. And the averaging per token is just so that your perplexity is independent of the length of your sequence. So perplexity is just two to the power of the average per-token loss of the sequence. Perplexity is between one and the size of the vocabulary of your tokenizer. One, simply because if you predict every word perfectly, then every word contributes a probability of one, so the best perplexity you can have is one. If you really have no idea, you basically predict each word with probability one divided by the vocabulary size, and then you do simple math and you get a perplexity equal to the vocabulary size. So the intuition for perplexity is that it's the number of tokens that your model is hesitating between: if your model is perfect it doesn't hesitate, it knows exactly the word; if it really has no idea, then it hesitates between all of the vocabulary. So perplexity really
improved: that's perplexity on a standard dataset between 2017 and 2023, and it went from around 70 to less than 10 over those five or six years. That means that the models were previously hesitating between around 70 words every time they generated a word, and now they hesitate between fewer than 10 words, so that's much better.
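Written out, the definition just described is (my notation; base-2 logs to match the "two to the power" phrasing):

```latex
\mathrm{PPL}(x_{1:L}) \;=\; 2^{\,-\frac{1}{L}\sum_{i=1}^{L}\log_2 p\!\left(x_i \mid x_{1:i-1}\right)}
```

which equals 1 for a model that predicts every token perfectly and equals the vocabulary size |V| for a model that predicts uniformly at random.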
Perplexity is actually not used anymore in academic benchmarking, mostly because it depends on the tokenizer that you use and on the actual data that people evaluate on, but it's still very important for the development of LLMs: when you actually train your own LLM, people will still really look at the
perplexity. One other common way, now more common in academia, of evaluating these LLMs is just taking all the classical NLP benchmarks (I'll give you a few examples later) and aggregating everything: collect as many automatically evaluatable benchmarks as possible and evaluate across all of them. Two such benchmark suites are HELM, which is from Stanford, and the Hugging Face Open LLM Leaderboard, which are probably the two most common ones right now. Just to give you an idea, in HELM there are all of these types of tasks, which are mostly things that can be easily evaluated, like question answering; think about many different question answering tasks. The benefit with question answering is that you usually know what the real answer is, so the way that you evaluate these models, and I'll give you a concrete example in one second, is that you can just look at how likely the language model is to generate the real answer compared to some other answers, and that's essentially, at a high level, how you evaluate these models. To give you a specific example, MMLU is probably the most common academic benchmark for
LLMs, and this is just a collection of many questions and answers in all of those domains, for example college medicine, college physics, astronomy, and these types of topics. The questions are things like, in astronomy: what is true for a type Ia supernova? Then you're given four different potential answers, and you just ask the model which one is most likely. There are many different ways of doing it: either you can look at the likelihood of generating each of these answers, or you can ask the model which one is the most likely, so there are different ways that you can prompt the model, but at a high level you know which one is correct and the three others are mistakes. Yes? "What if it's generating unconstrained text as the output? How do you evaluate a model if it gives something that's semantically completely identical but is not the exact token list you expect?" Yeah, that's a great question, and I'll talk more about that later. Here, in this case, we don't do unconstrained generation. The way you would evaluate MMLU is basically: either you ask the question and then you look at the likelihood of the model generating A, the likelihood of the model generating B, C, and D, and you look at which one is the most likely; or you can ask the model, out of A, B, C, D, which one is most likely, and you look at whether the most likely next token is A, B, C, or D. So you can constrain the model so that it can only answer these four things. "When you say you constrain the model, do you mean you constrain the prompt, or do you mean that of its whole probability distribution over outputs, you're only comparing the outputs for A...?" So, in the second case I gave you, you would do both: you would prompt the model saying A, B, C, or D, plus you would constrain it to only look at these four tokens. In the first case, you don't even need to generate anything: given that it's a language model that gives a distribution over sentences, you literally just look at the likelihood of generating the first choice, the likelihood of generating the second choice, and so on, and you look at whether the most likely full sentence is actually the real answer. So you don't actually sample from it, you really just use p(x_1, ..., x_L). Does that make sense? That being said, evaluation of open-ended questions is something we're going to talk about later, and it is actually really important and really challenging. Yes?
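As a sketch of the first (likelihood-based) way of scoring MMLU described above, assuming a hypothetical `log_prob(text)` helper that returns the model's log-probability of a full string:

```python
def score_mcq(question, choices, log_prob):
    # Score each full "question + answer" string by the model's
    # log-likelihood and pick the most likely one (no sampling involved).
    scores = [log_prob(f"{question} {choice}") for choice in choices]
    best = max(range(len(choices)), key=lambda i: scores[i])
    return choices[best]

# Hypothetical usage:
# predicted = score_mcq("What is true for a type Ia supernova?",
#                       ["A ...", "B ...", "C ...", "D ..."],
#                       log_prob=my_model_log_prob)
```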
"Earlier you mentioned that metrics like perplexity are not usually used because they depend on how you do your tokenization and some design choices; I was wondering if you could speak more to that." Oh, yeah. So think about perplexity: I told you perplexity is between one and the vocabulary size. Now imagine that ChatGPT uses a tokenizer that has around 10,000 tokens, but Gemini from Google uses a tokenizer that has 100,000 potential tokens; then the upper bound of the perplexity you can get is actually worse for Gemini than for ChatGPT. Does that make sense? That's just one idea; it's actually a little bit more complicated than that, but it's a first-order way to see that the tokenizer actually matters.
Great. Okay, so evaluation challenges: there are many, and I'll just talk about two really briefly. One: as I told you, there are two ways of doing evaluation for MMLU (actually there are many more than two, but I gave you two examples), and it happens that for a long time, even though it was a very classical benchmark that everyone used, different companies and different organizations were actually using different ways of evaluating MMLU, and as a result you could get completely different results. For example, Llama 65B, which was the first model of Meta in the Llama series, had 63.7 accuracy on HELM but around 48.8 on this other benchmark. So really, the way that you evaluate matters, and this is not even talking about prompting, this is really just the way that you evaluate the models; prompting is another issue. So there are a lot of inconsistencies; it's not as easy as it looks. That's the first thing. "Yeah, sorry, how can we make sure that all these models aren't trained on the benchmark?" Okay, second thing, and this is a great question: train-test contamination. This is something which I would say is really important in academia; given that the talk is mostly about training large language models, for companies it's maybe not that important because they know what they trained on, but for us, we have no idea, so for us it's a real problem. There are many different ways of trying to test whether the test set was actually in the training set. One kind of cute trick that people in the lab have found is that, given that most of the datasets online are not randomized, and language models just predict the next word, you can look at the entire test set and ask: is the model more likely to generate all the examples in their published order than in a different order? If it's more likely to generate them in order, given that there's no real order there, then the test set was probably in the training set. Does that make sense?
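A rough sketch of that ordering heuristic, again assuming a hypothetical `log_prob(text)` that scores a whole string under the model; a clearly positive score suggests the canonical ordering was memorized:

```python
import random

def order_contamination_score(test_examples, log_prob, n_shuffles=10):
    # Log-likelihood of the test set written out in its canonical order.
    canonical = log_prob("\n".join(test_examples))
    # Compare against the same examples in random orders; if the canonical
    # order is consistently more likely, the ordering was probably seen
    # during training, i.e. the test set likely leaked into the training data.
    shuffled_scores = []
    for _ in range(n_shuffles):
        shuffled = test_examples[:]
        random.shuffle(shuffled)
        shuffled_scores.append(log_prob("\n".join(shuffled)))
    return canonical - sum(shuffled_scores) / len(shuffled_scores)
```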
So that's one of them; there are many other ways of doing it. Train-test contamination, again: not that important for development, really important for academic benchmarking. Great, so there are many
other challenges, but I'll move on for now. Great, data. So data is another really big topic. At a high level, people just say: you basically train large language models on all of the internet. What does that even mean? Or people sometimes say all of "clean internet", which is even less well defined. The internet is very dirty and really not representative of what we want in practice: if I downloaded a random website right now, you would be shocked at what is in there; it's definitely not your Wikipedia. So I'll go really briefly over what people do; I can answer some questions, but data on its own is a huge topic. Basically, first what you do is download all of the internet. What that means is that you use web crawlers that go to every web page on the internet, or every web page that is indexed by Google, and that is around 250 billion pages right now, around one petabyte of data. Common Crawl is one such web crawler: some people write their own web crawlers, but what most do is use standard web crawlers, and Common Crawl is one of them; basically every month it adds all the new websites that were added to the internet and found by Google and puts them in a big dataset. So on Common Crawl you have around 250 billion pages right now, so 1e6 gigabytes of data. Once you have this,
this is a random web page, literally random, from this Common Crawl, and what you see is that, one, it really doesn't look like the type of thing that you would usually see. This is an HTML page; it's hard to see, but if you look through it you will see some content, for example here: "...is your ultimate source for the System X high performance server", and then you have three dots, so the sentence is not even finished. That's what a random page of the internet looks like, so of course it's not that useful if you just train a large language model to generate things like this. So what are some of the steps that are needed? First, you extract the text from the HTML; that's what I just tried to do by looking for the actual text. There are a lot of challenges in this: for example, extracting math is actually very complicated but pretty important for training large language models; or, for example, boilerplate: a lot of your forums will have the same type of headers and the same type of footers, and you don't want to repeat all of this in your data. Then you will filter undesirable content: not-safe-for-work content, harmful content, PII. Usually every company has basically a blacklist of websites that they don't want to train the models on; that blacklist is very long, and you basically say, if it comes from there, we don't train on it. Another way of doing these things is that you can train a small model to classify what is PII and remove it. It's hard; every point here that I'm going to show you is a large amount of work, but I'm going to go quickly through it. So: filter undesirable content.
The next step is deduplication. As I said, you might have things like headers and footers in forums that are always the same, and you want to remove those. Another thing you might have is a lot of URLs that are different but actually show the same website, and you might also have a lot of paragraphs that come from common books that are basically duplicated a thousand or ten thousand times on the internet, so you have to deduplicate; that's also very challenging because you have to do it at scale. Once you do deduplication, you will do some heuristic filtering to try to remove low-quality documents. The way you do that is with things like rules-based filtering: for example, if you see that there are some outlier tokens, if the distribution of tokens on the website is very different from the usual distribution of tokens, then it's probably an outlier; if you see that the length of the words on this website is super long, there's something strange going on on that website; if you see that the website has only three words, maybe it's not worth training on; if it has like 10 million words, maybe there's also something wrong going on on that page. So, a lot of rules like this.
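A toy sketch of the kind of rule-based document filter just described; the thresholds are made-up illustrations, not values from the lecture:

```python
def passes_heuristic_filter(text):
    words = text.split()
    # Documents that are too short, or absurdly long, are suspicious.
    if len(words) < 3 or len(words) > 10_000_000:
        return False
    # Extremely long "words" usually mean markup or encoding junk.
    if max((len(w) for w in words), default=0) > 50:
        return False
    # Documents whose character distribution is far from typical text
    # (e.g. mostly symbols) are likely boilerplate or garbage.
    alpha_fraction = sum(c.isalpha() or c.isspace() for c in text) / max(len(text), 1)
    return alpha_fraction > 0.7
```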
Yes? "Why do we filter out undesirable content from our dataset instead of putting it in with something like a supervised loss? Like, can we not just say, here's this hate speech website, let's actively penalize the model for generating it?"
We'll do exactly that, but not at this step; that's where post-training comes in. For pre-training, the idea is just to say: I want to model how humans speak, essentially, and I want to remove all these headers, footers, menus, and things like this. But it's a very good idea that you just had, and that's exactly what we'll
do
later Next Step modelbased filtering so
once you've filtered a lot of data, what you will do, and this is actually a very cute trick, is take all of Wikipedia and look at all the links that are referenced from Wikipedia, because if something is referenced by Wikipedia, it's probably a high-quality website. And you will train a classifier to predict whether a document comes from one of these Wikipedia references or whether it's from the random web, and you will basically say: I want more of the things that look like they come from Wikipedia references. Does that make sense? So yeah, you will train a machine learning model, usually a very simple model, because you need to do this really at scale; just think about the 250 billion pages.
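As a rough sketch of that model-based quality filter, here is what a "Wikipedia-referenced vs. random web" classifier could look like with scikit-learn; the feature choice and library are my assumptions, and in practice much lighter-weight linear classifiers (fastText-style) are used at this scale:

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def train_quality_classifier(wiki_ref_docs, random_web_docs):
    # Positive class: pages referenced from Wikipedia (assumed high quality).
    # Negative class: random Common Crawl pages.
    texts = wiki_ref_docs + random_web_docs
    labels = [1] * len(wiki_ref_docs) + [0] * len(random_web_docs)
    clf = make_pipeline(
        HashingVectorizer(n_features=2**20, ngram_range=(1, 2)),
        LogisticRegression(max_iter=1000),
    )
    clf.fit(texts, labels)
    return clf  # clf.predict_proba(docs)[:, 1] -> "quality" score for filtering
```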
Next, you will try to classify your data into different domains: you will say, okay, this is entertainment, this is books, this is code, these types of domains, and then you will try to either up-weight or down-weight some of the domains. For example, you might see that if you train more on code, your model actually becomes better at reasoning; that's something people usually say in a very hand-wavy way, that training on more code helps reasoning, so you want to up-weight the code distribution because it helps general language modeling skills. Books is usually another one that people up-weight; entertainment they usually down-weight, and so on. People used to do this somewhat heuristically; now there are entire pipelines, which we'll talk about, for how to do these things slightly more
automatically. And then, at the end of training, after training on all of this data that we saw, you usually train on very high-quality data while you decrease your learning rate, and that basically means that you're kind of overfitting your model on very high-quality data. So usually what you do there is something like Wikipedia; you basically overfit on Wikipedia, and you overfit on human data that was collected. There are other things like continual pre-training for getting longer context; I'm going to skip over all of these, but I just wanted to give you a sense of how hard it is: when people just say "oh, I'm going to train on the internet", that's a lot of work, and really we haven't figured it out yet. So collecting web data is a huge part of practical large language models; some might say it's actually the key. Yes?
"A basic question about data: usually when you start with, like, the terabytes of data, after you go through all those steps, what's the typical amount of data you end up with? And then how large a team does it typically take to go through all the steps you talked about?" So the question is, how large is the data after you filter? "Yeah, after you filter. And then, to go through all the steps, how large a team do you need, how slow is it, how many people would you need to be able to do this?" Okay, that's a great question. I'm going to somewhat answer the data part, how large the dataset is, at the end of this slide. For the number of people that work on it, that's a good question; I'm actually not quite sure, but I would say it's probably even bigger than the number of people that work on the tuning of the pre-training of the model, so the data side is bigger than the modeling aspect. I don't think I have a good sense, but I would say, probably in Llama's team, which has around 70 people, maybe 15 work on data. For all these things you don't need that many people; you need a lot of compute, because for data you need a lot of CPUs. And I'll answer the second question
at the end of this slide. So, as I just alluded to, really we haven't solved data at all for pre-training, so there's a lot of research that has to be done: first, how do you process these things super efficiently; second, how do you balance all of these different domains; can you do synthetic data generation, which is actually a big one right now because, and we'll talk about that later, we don't have enough data on the internet; can you use multimodal data instead of just text data, and how does that improve even your text performance? There's a lot of secrecy, because really this is the key to most of the pre-trained large language models, so for competitive-dynamics reasons these companies usually don't talk about how they do the data collection. And there's also a copyright liability issue: they definitely don't want to tell you that they've trained on books, even though they did, because otherwise you can sue them. Common academic
benchmarks: so this will kind of answer what you asked. It started (those are the smaller ones, the names are not that important) from around 150 billion tokens, which is around 800 GB of data; now it's around 15 trillion tokens, which is also the amount the best current models are probably trained on. So 15 trillion tokens, which is, I guess, two orders of magnitude bigger, so around 80e3 GB; that would be around a 100x to 1000x filtering-down of Common Crawl, if I'm not mistaken. So yeah, one very
famous one is the Pile. This is an academic benchmark dataset, and we can just look at what distribution of data it has: things like arXiv, PubMed Central, which is all the biology stuff, here Wikipedia, you see Stack Exchange, some GitHub, some books, and things like this. Again, this is on the smaller side: if we look here, this is 280B tokens, so in reality it's like 100 times bigger, and you cannot have that much GitHub and Wikipedia. In terms of closed-source models, just to give you an idea: Llama 2 was trained on two trillion tokens; Llama 3 on 15 trillion tokens, which is currently the best model for which we know how much it was trained on, and which matches the biggest academic benchmark, 15 trillion tokens. For GPT-4 we don't really know, but it's probably in the same order of magnitude, probably around 13 trillion, from leaks, if the leaks are true.
Great, so scaling laws. Any other questions on data before we go to scaling laws? Sorry, I know I'm giving you a lot of information, but there's a lot that goes into training large language models. Great.
Scaling laws. So, the idea is that what people saw around 2020, or at least they'd known it for a long time but have been able to show it empirically since around 2020, is that the more data you train your models on, and the larger the models, the better the performance. This is actually pretty different from what you've seen in this class: in this class we teach you about overfitting, and overfitting doesn't happen with large language models; larger models give better performance. It's something that really took a long time for the community, who took this type of class, to realize. But for the exam, overfitting exists. Okay, so the idea of scaling laws is: given that you know that more data and larger models will always give you better performance, can we predict how much better your performance will be if you increase the amount of data and the size of your model? And surprisingly, it works. Here you see three plots from a very famous scaling-laws paper from OpenAI. On the x-axis you see compute, so how much compute you spent on training, and on the y-axis you see test loss; this is essentially, I mean it's not perplexity, but it's your validation loss, so it's the log of the perplexity. And if you put these two on a log scale, then you see that the scaling law is linear. That means that if you increase your compute by a certain amount, you can say by how much your test loss will actually decrease. Same thing with data, and same thing for parameters: if you increase the dataset size, your loss will decrease by an amount that is somewhat predictable; if you increase the number of parameters, the loss will decrease by an amount that is somewhat predictable. This is really amazing, very surprising. It looks innocuous when you look at these types of plots, but it's crazy, because it means that you can predict how well we're going to perform in two or three years depending on how much compute we will add, assuming that these trends hold; there's nothing theoretical about it.
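In symbols, "linear on a log-log plot" just means a power law; a typical parameterization (my notation, not the paper's exact form) is:

```latex
L(C) \approx \left(\frac{C_0}{C}\right)^{\alpha}
\quad\Longleftrightarrow\quad
\log L \approx \alpha \log C_0 - \alpha \log C
```

so a fixed multiplicative increase in compute C buys a predictable multiplicative decrease in the loss L.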
"Yes, two things. One, what is the loss that they're using here? Is this perplexity?" So, you know I said perplexity was two to the power of the loss; this is what perplexity is the power of, i.e., the log of the perplexity. "And then the second thing: when you increase the number of parameters or you increase the total dataset size, doesn't that just inherently increase your compute?" No, this is a great question. The compute here is actually a function of two things, the data and the parameters. What I'm showing here is that, well, we're actually going to talk about that in detail, but basically if you increase the number of parameters you should increase the amount of data that you have, and you actually don't go multiple times through the same dataset; no one does epochs with large language models, at least not yet, because we still have kind of enough data. So yeah, this is all the same trend: increase compute, decrease
loss. "Yes, have we seen the numbers for the last two years, or is it still holding?" It is still holding. I don't have good numbers to show you, but it is still holding, surprisingly. "Is there no empirical evidence that you plateau, or an expected plateau?" No empirical evidence of plateauing anytime soon. Why? We don't know. Will it happen? Probably. I mean, it doesn't need to, because it's actually on a log scale, so it's not as if it had to plateau; mathematically it could continue decreasing like this. Most people think that it will probably plateau at some point; we don't know when. Okay, I'll talk more about scaling laws now.
So why are scaling laws really cool? Imagine that, you're very fortunate, I give you 10,000 GPUs for this month: what model will you train? How do you even go about answering that question? This is a hypothetical, but that's exactly what these companies are faced with. The old pipeline was basically: you tune hyperparameters on the big models. So let's say I have 30 days; I will train 30 models for one day each, I will pick the best one, and that will be the final model that I use in production. That means that the model I actually use was only trained for one day. The new pipeline is that you first find a scaling recipe, something that tells you, for example (one common one is that if you increase the size of your model you should decrease your learning rate), if I increase the size of my model, here's what I should do with certain hyperparameters. Then you tune your hyperparameters on smaller models of different sizes: let's say, for 3 days of my 30 days, I will train many different models and do hyperparameter tuning on these small models, each of different sizes. Then I will fit a scaling law and try to extrapolate from these smaller models which one will be the best if I train it as a larger model, and then I will train the final huge model for 27 days instead of just one day. So the new pipeline is: don't do hyperparameter tuning at the real scale of the model that you're going to use in practice, but do things on smaller models at different scales and try to predict how well they will perform once you make them bigger. I will give you a very concrete example
right now. Let's say Transformers versus LSTMs. Let's say you have these 10,000 GPUs and you're not sure which one you should be using: should I be using a Transformer-based model or an LSTM-based model? What I will do is train Transformers at different scales, so here you see different parameter counts on the x-axis, and the y-axis is my test loss. I will then train different LSTMs at different scales. Once I have these points, I will see that they kind of fit a scaling law, I will fit my scaling law, and then I will be able to predict: oh, if I had 10 times more compute, here's how well I would perform. For the LSTM it's actually slightly less linear, but you could probably still try to predict where you would end up, and clearly from this plot you would see that Transformers are better. One thing to notice when you read these types of scaling laws is that there are two things that are important: one is your scaling rate, which is the slope of the scaling law; the other thing is your intercept. You could start worse but actually become better over time; it just happens that LSTMs are worse on both, but I could show you another example where you can predict that after a certain scale you're better off using one type of model than the other. So that's why scaling laws are actually really useful. Any questions on that?
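For concreteness, here is a rough sketch of fitting such a scaling law, i.e., a straight line in log-log space, to a handful of (compute, loss) points; the numbers below are made up for illustration:

```python
import numpy as np

# Hypothetical (compute, test loss) measurements from small training runs.
compute = np.array([1e17, 1e18, 1e19, 1e20])
loss = np.array([3.9, 3.4, 3.0, 2.65])

# Fit log(loss) = a * log(compute) + b, i.e. a power law loss = e^b * compute^a.
a, b = np.polyfit(np.log(compute), np.log(loss), deg=1)

# Extrapolate to a 10x larger compute budget.
predicted = np.exp(a * np.log(10 * compute[-1]) + b)
print(f"slope {a:.3f}, predicted loss at 10x compute: {predicted:.2f}")
```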
"Yeah, so these are all kind of... how sensitive are these to small differences in the architecture, like one Transformer architecture versus another Transformer architecture? Do you basically have to fit your own curve and say, oh, scaling laws tell me there should be some logarithmic function, let me extrapolate that for my own?" Yeah, so, usually, for example, if
you're an academic, and this is pretty recent at least, and you want to propose a new activation function, that's exactly what you will do: you will fit a scaling law, show another scaling law with the standard activation, say GELU, and you will say that yours is better. In reality, once you start thinking about it in scaling-law terms, you really realize that all the small, minor architecture differences that we can make, all they do is maybe change the intercept a little bit, but really that doesn't matter, because you can just train for ten hours longer or wait for the next generation of GPUs; these things are really secondary, which is exactly why I was telling you originally that people spend too much time on the architecture and losses. In reality these things don't matter as much. Data, though: if you use good data you will have a much better scaling law than if you use bad data, so that really matters.
Another really cool thing you can do with scaling laws is ask yourself how to optimally allocate training resources. Should I train larger models, because we saw that it's better when you train larger models, but we also saw that it's better when you use more data; so which one should I do, should I train a smaller model on more data, or should I train a larger model on less data? Chinchilla is a very famous paper that first showed this. The way they did it, and I want to give you a little bit of a sense of what these plots are: here you see training loss on the y-axis; on the x-axis you see the number of parameters, so the size of the model; and all these curves are what we call IsoFLOP curves, which means that all the models on one curve have been trained with the same amount of compute. The way you do that is that you vary the number of tokens that you train on and the size of the models, but you vary them in such a way that the total compute stays constant. Okay, so all these curves that you see with different colors were trained with different amounts of compute. Then you take the best model on each of those curves; once you have the best one for each curve, you can plot how many FLOPs that curve used and how many parameters you actually used for training that specific point. You put that on a log-log scale again, and now you fit a scaling law again. So now I have something which tells me: if I want to train a model with 10^23 FLOPs, here's exactly the number of parameters that I should be using, say 100B. And you can do the same thing with FLOPs and
tokens. So now you can predict: if I tell you exactly, I have one month of compute, what size of model should I be training? Fit your scaling law and I can tell you. Of course, that all looks beautiful; in reality there are a lot of small things, like should you be counting embedding parameters, there are a lot of complexities, but if you do things well, these things actually do hold. So the optimal ratio that the Chinchilla paper found is to use 20 tokens for every parameter that you train: if you add one more parameter, you should train your model on 20 more tokens. One caveat here is that this is optimal for training resources; that is, it tells me, if I have 10^23 FLOPs, or say $5 million, to train the best model that gets the lowest loss, what should I train? In reality, these companies also need to think about inference: if you have a smaller model, they will spend less over time. So if you consider the inference cost, there are other papers that tried to show that it's around 150 tokens per parameter, because you prefer having a smaller model since over time you're going to spend less money on inference of these models. So 150 to one: that's around what the best models are trained on right now, at least the ones that are used in practice, in production.
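As a back-of-the-envelope sketch of that allocation rule, using the FLOPs ≈ 6·N·D approximation that comes up later in the lecture and a chosen tokens-per-parameter ratio (around 20 for training-optimal, around 150 if you also weigh inference); this is a rough illustration, not the Chinchilla paper's actual fit:

```python
def allocate(compute_flops, tokens_per_param=20):
    # FLOPs ~ 6 * N (params) * D (tokens), with D = ratio * N,
    # so N = sqrt(FLOPs / (6 * ratio)) and D = ratio * N.
    n_params = (compute_flops / (6 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Example: a 1e23-FLOP budget at a Chinchilla-style 20:1 ratio.
n, d = allocate(1e23, tokens_per_param=20)
print(f"~{n/1e9:.0f}B parameters, ~{d/1e12:.1f}T tokens")
```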
Great. Any questions on Chinchilla? Great. "Oh, sorry, in practice how expensive is inference for these models relative to training?" Actually, very expensive. I will not talk about inference because that would be another entire lecture, but just think about ChatGPT, where they have, I don't know how much it is now, like 600 million people that have used it; that's a lot. So it's actually very expensive. There's a lot of optimization you can do for inference, though, and that's an entire other lecture, so I'm going to skip it this time, but it's very
interesting. Okay, tuning. As I said, there are many things that you can answer with scaling laws; I just tried to give you two examples, but really there are many: what data do you use, what data mixture, what data-mixing weighting do you use (data mixtures are what we talked about before), what architecture you use, whether you should make your models wider or deeper, whether you should pay for more GPUs or actually collect more data. All these things are things you can try to answer with scaling
laws. One thing I want to mention is the bitter lesson, if you've ever heard of it, Richard Sutton's very famous blog post from 2019. What he realized, which I think not enough people realize, and I definitely did not realize at that time, is that once you see these types of scaling laws, you know that the more compute you have, the better models you will get, so with scale you will get better models. You also know, by Moore's law or variants of Moore's law, that you will always have better compute. Then the only thing that matters is to have architectures that can leverage computation. So what matters is basically systems, data, and less so the architecture, the small architecture differences like your activation function and things like this. I think that's one of the reasons why most research focuses on things that matter less for industry, and I was one of those researchers for a large part of my career. So don't spend time over-complicating: do the simple things, do them well, and scale them. That's really what OpenAI taught us with ChatGPT and with all the GPTs
before okay I want to give you some
backup the envelope computation so I
might be off by a few factors here but I
just want to give you a sense of how
costly it is to train some of these
models I'll give as an example
Lama 3 400b which is currently the best
open source model that you can get uh it
was trained on 15.6 tokens it has 45
billion parameters so just now that you
know what is like this uh optimal tokens
per parameter that's around 40 so that's
a little bit more than chinchilla but
less than this like inference uh optimal
um model so they went for training
optimality uh flops for this model so
one simple uh way to compute flops is
six uh times the number of parameters
times the number of data you train on uh
so if you do the simple calculation here
it's 3.8 e25 flops the reason why this
is important is that if you follow the
little bit the news there's an executive
order from Biden that basically says
that once you have uh 1 e26 parameters
uh sorry flops uh then you have special
scrutiny on your models so they went 2x
less than that so they really went right
below this to not have special scrutiny
so 3.8e25 uh I might be off by a little bit but it's definitely under the 1e26 oh um so P is the number of parameters N is the data the number of tokens this is just an approximation
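As a sanity check of the 6 times parameters times tokens approximation, here is the arithmetic for the Llama 3 405B numbers quoted above; this is just a sketch of the same back-of-the-envelope estimate, nothing more.

```python
# Training FLOPs approximation: FLOPs ~= 6 * P (parameters) * N (training tokens)
P = 405e9    # parameters
N = 15.6e12  # training tokens
flops = 6 * P * N
print(f"{flops:.2e} FLOPs")  # ~3.79e+25, just under the 1e26 reporting threshold
```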
yeah okay uh compute and we know that
they trained on 16,000
h100s um and we know the throughput but
they they said it too uh so if you do
the computation it takes around 70 days
um or 26 million GPU hours at least
that's with my uh back of the envelope
computation they actually said that they
use 30 million instead of 26 million GPU
hours um so maybe they had like some uh
some challenges I don't really know but
if you follow the simple computation
it's around 70 days um cost uh I mean
this it's hard to to approximate but I'm
just going to say it's kind of the rent
like what if I were to rent h100s that
many h100s for that many days how much
will I pay uh a lower bound on the renting uh cost of an H100 is around uh $2 per hour so if you
multiply this by 26 million uh hours uh
you get 52 million uh dollars so they
probably pay less than that but not
actually much less because all these um
all these services that actually rent
gpus they don't make that much money so
it's it's probably slightly less but not
that much less um now salary I said 50
employees 500k per
year say yeah it's probably the right
ballpark 25 million uh so if you put all
together around 75 million um dollars
for
training uh this Llama model I'm probably off by like 10 million but that's kind of right okay
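Continuing the back-of-the-envelope math, here is a rough sketch of the GPU-hours, wall-clock, and rental-cost estimates above; the ~400 TFLOP/s effective throughput per H100 is an assumed round number (after utilization losses), chosen so the totals land near the figures quoted in the lecture.

```python
# Back-of-the-envelope for Llama 3 405B training cost (rough, assumed throughput).
total_flops   = 3.8e25        # from 6 * P * N above
n_gpus        = 16_000        # H100s, as stated in the lecture
flops_per_gpu = 400e12        # assumed ~400 TFLOP/s effective per H100

seconds   = total_flops / (n_gpus * flops_per_gpu)
days      = seconds / 86_400
gpu_hours = n_gpus * seconds / 3_600
rent_cost = gpu_hours * 2.0   # ~$2/hour lower bound on H100 rental
salaries  = 50 * 500_000      # 50 employees at $500k/year

print(f"~{days:.0f} days, ~{gpu_hours/1e6:.0f}M GPU-hours, "
      f"~${(rent_cost + salaries)/1e6:.0f}M total")
# roughly 69 days, ~26M GPU-hours, ~$78M with these rounded inputs,
# in the same ballpark as the ~$75M quoted in the lecture
```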
carbon emitted um a lot of people might
ask like also the cost is not the only
thing that is important so I did the
computation um it's around 4 uh 4,000 um
tons of CO2 equivalent that is actually
only 2,000 return tickets from JFK to uh
London so right now uh carbon emitted is
actually not uh I mean it's huge but
it's not like um meaningful yeah yet I
think in maybe GPT-6 GPT-7 once you
multiply this by 100 that might become a
real issue right now it's still not uh I
think um an issue in the grand scheme of
things next model the way you should be
thinking about these models is that
every new generation the number of flops
essentially uh multiplies 10x or at
least that's what they try uh if they
have enough energy and if they can buy
enough
gpus uh great any question on these back
of the envelope math
no
okay so now we talked about pre-training
I wanted to also chat about systems
because now we know compute is really important so there's a question of how do you optimize your compute I will leave that for the
end because I'm not sure how much time
we will have I think it's important but
hopefully I I'll be able to to talk
about it later it's slightly different
than what we've been talking about right
now so I'll move on to post training for
now
so the task of post-training uh the
reason why we need to do Post training
is as I told you before um it's to make
AI assistants so language modeling is
not uh really the thing that you want
when you have an AI assistant uh for
example if you ask GPT-3 which is a purely language model a pure language model not an aligned one if you
ask a question like explain the moon
landing to a
six-year-old the completion that you
would get is something like explain the
theory of gravity to a six-year-old
because what it learned is that on on on
internet if you have one question you
usually have maybe another bullet point
of other similar questions you don't
usually have question and then answer
later uh this is not what you want from
an AI assistant so how do we uh do this
alignment which is this post training
and making these models
assistance um so the goal of this
alignment is to basically get LLMs to follow the instructions that are given um by users and maybe some designers' desires um so think about moderation you don't want the model like OpenAI
definitely doesn't want the model to say
stuff that is very
toxic um so here you see on the left
hand side uh that when you ask a
question it actually provides a a real
answer so it's not like uh before the
llm and on the right hand side you see
that it would if you ask to write a
tweet describing how a certain part of
the population are evil it will say that
it cannot do that um so that's kind of
this
alignment uh the background here is that
uh basically the data that you want for
training some of these models um is like
we know what we want which is just
asking humans this is a question this is
the answer that you want uh but the
thing is that it's very expensive to
collect that data and it's hard to find
it online uh in contrast pre-training
data is not what you want but there's a
lot of it um so what what we will do a
the main idea is simply take a pre-train
large language model pre-train all of
internet and then you just fine tune so
you just change a little bit of weights
on the type of data that you actually
want and hopefully given it you already
pre-train it on all of Internet it
basically learns or knows how to speak
in English and knows standard um language syntax uh then you can really fine-tune it with very little data okay SFT so supervised fine-tuning is
really exactly what I just said which is
the idea of fine-tuning the large
language model on uh basically the
desired answers that are collected from
humans um so why is it called supervised fine-tuning because you basically want to do language modeling on the real answers so language modeling is this like
next word prediction and and that's the
fine-tuning part and then you want to do
it on desired answers given by humans so
that's why we call it supervised fine-tuning
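Since SFT is literally the language-modeling loss restricted to the desired answers, here is a minimal sketch of how that masking is typically done; the tensors are toy stand-ins (a real setup would tokenize a prompt and response and run a pretrained model), and -100 is the PyTorch convention for positions that should be ignored by the loss.

```python
import torch
import torch.nn.functional as F

# Toy stand-ins: 10 token ids for the prompt, 5 for the human-written answer,
# and random "model" logits over a vocabulary of 100 tokens.
vocab_size = 100
prompt_ids   = torch.randint(vocab_size, (10,))
response_ids = torch.randint(vocab_size, (5,))
input_ids = torch.cat([prompt_ids, response_ids])
logits = torch.randn(len(input_ids), vocab_size)  # would come from the LLM

# Labels: next-token targets, but mask out the prompt so the loss (and hence
# the gradient) only comes from the desired answer tokens.
labels = input_ids.clone()
labels[: len(prompt_ids)] = -100          # ignore prompt positions
shift_logits = logits[:-1]                # predict token t+1 from position t
shift_labels = labels[1:]

sft_loss = F.cross_entropy(shift_logits, shift_labels, ignore_index=-100)
print(sft_loss)
```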
so how do we collect this data well I just said it you just ask
humans uh to to tell you this is the
this is a question this is the answer
that you uh you would want from some of
these models so this is an example um
sorry I can't read very well on my
computer but uh my kid uh needs to do a
science um no let's read this one can
you write a short introduction about the
relevance of the term monopsony and then
it says monopsony refers to a market
structure blah blah blah and that's a
human that wrote that um so actually
this is open Assistant which was a a way
to collect um uh data online by
humans so this type of supervised fine
tuning or alignment is really the key of ChatGPT this is what made uh the big jump from GPT-3 which was mostly something that was known by AI researchers to ChatGPT which became
known by basically
everyone
um so the problem with uh human data is
that it's uh very slow to collect and
very expensive um so
one possible simple idea is to use llms
to scale data collection uh so that's
exactly what we did with alpaca uh one
year ago what we did is that we asked uh
humans or we use a data set of human uh
question answers so there were 175 uh
question answers here and we asked the
best model at the time so text-davinci-003 to
basically generate many more of these
question and answers so all we did is
like this is what humans would write now
write similar answers and similar
questions and we collected 52,000 LM
generated question answers and then what
we did is simply we took LLaMA 7B which
was the best pre-train model at the time
and we just fine-tuned this with supervised fine-tuning as I told you and that's how we got um the Alpaca 7B
model uh and this is the type of data
that we collected so things like what
does algorithm mean an algorithm is a step-by-step uh set of instructions used to solve a problem or
achieve a goal blah blah blah blah so
the data is not actually it's actually
pretty good given it was LM generated by
LMS from essentially two generations ago
um so that really started at least for
us kind of as an academic replication of
chat GPT uh now it really there's a big
field of like synthetic data generation
of how to use llms to basically make
development of llms faster um and by
basically by decreasing the amount of human hours that you need
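Here is a rough sketch of that Alpaca-style loop; the llm_complete function and the prompt wording are hypothetical placeholders standing in for a call to a strong model, not the actual Alpaca pipeline code.

```python
# Sketch of LLM-assisted instruction data generation (Alpaca-style).
import json
import random

def llm_complete(prompt: str) -> str:
    # Placeholder for a call to a strong model / API; returns a canned pair
    # here so the sketch runs end to end.
    return json.dumps({"instruction": "What is an algorithm?",
                       "output": "A step-by-step set of instructions..."})

seed_examples = [
    {"instruction": "What does monopsony mean?", "output": "A market with one buyer..."},
    # ... a small set of human-written (instruction, answer) pairs
]

def generate_synthetic_examples(n: int) -> list[dict]:
    examples = []
    for _ in range(n):
        demos = random.sample(seed_examples, k=min(3, len(seed_examples)))
        prompt = (
            "Here are example instruction/answer pairs written by humans:\n"
            + "\n".join(json.dumps(d) for d in demos)
            + "\nWrite one new, different pair in the same JSON format."
        )
        examples.append(json.loads(llm_complete(prompt)))
    return examples

# The resulting examples would then be used for supervised fine-tuning,
# exactly like human-written data.
print(generate_synthetic_examples(2))
```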
quantity of data so we talked about what
type of data and how we collect it um
one thing which is surprising with sft
is that you don't need that much data uh
so what this paper showed this is called LIMA is that if you scale the amount of data that you use for uh supervised fine-tuning from 2,000 to 32,000 it really doesn't help much so here scaling laws definitely don't help
um so the the intuition here is that all
you learn um is is you learn how to
format your desired answers another way
of saying it is that your pre-trained
models they essentially model the
distribution of every user on internet
one that might write bullet points
another one that might answer a question with an answer so all you tell your model is like wait you should actually be optimizing more for this type of user than another one so you're not actually teaching it anything through this um SFT uh so with supervised fine-tuning all you do is
you tell the model to kind of optimize
for one type of user that it saw already
in a pre-train data set so the knowledge
is already in the pre-train llm uh and
you basically just specialize to one
type of
user great any questions on SFT yes so I know it's a big issue with
synthetic data where uh if you keep
generating data from the same
distribution eventually you're not
learning a new distribution you're
essentially playing with it it just
bootstrapping that yeah surely
you can't scale that forever right you
can't keep going on and generating from
the same distribution you hope to learn
something new yeah uh so are there it's
an active area of research but any
thoughts that you have around how people
are maybe thinking around this and uh
better ways to bootstrap or to give up
on this idea and and realize that the
chart shows you don't need that many so
just get humans to generate 2,000 really
good uh yeah so that's a very good
question uh so for the data stuff so I'm
saying it's not that important for sft
but there will be another thing we'll
talk about right after where actually
data does
matter my intuition based on not that
much empirical results is that you can
still get um even though you use your
LMS if you use purely LM generated text
and you do that for like three four
generations of llms I agree with you
that probably you won't improve much but
for me what is important is how do you
use like human in the loop with llms not
purely LMS not purely uh humans but
maybe what you can do is just have the
model generate some new text and just uh
humans write a few Edits edits are much
faster than writing the entire text and
I think that if you have that type of
collaboration then from like kind of an
information theoretical point of view
you still get additional information but
you still much faster than if you use
humans and I think that as a field we'll
probably move towards these type of
things uh which is um really just
finding the examples that are important
and and asking humans it's kind of
active learning just asking humans
exactly when uh you need to to get
inputs yes do we train with like the
same loss function the same like General
training algorithm for the supervised tuning bit as we do for the pre-training right because like the examples you showed I think the important thing of the good examples is they're like super accurate they're more complex is it still just the same so that's why here I yeah I didn't
maybe didn't emphasize enough this is
just language modeling fine-tune the LLM with language modeling on the desired
answers so this is literally the same
loss um it will be different in two
seconds but the first step of sft is
literally the same loss where you just
say Okay I want to actually specialize
on that type of data so there's even a
question of like what is pre-training
what is post-training because in reality
it's just like a different data that you
use the reason why we usually call it
post training is that the way we collect
that data is very
different great great questions uh yes
maybe it's the same question but why
would these 2,000 examples have such an
overweighted
influence so that's why we uh also that's another reason why we call it post-training is that we use different types of hyperparameters so you know I told you basically at the end of pre-training you essentially end up with a learning rate of zero and here you're going to increase your learning rate so like 1e-5 yeah and and
so um the weight that you give to them
is actually
different
um okay uh Second Step or second part of
this post training um is what we call
reinforcement learning from Human
feedback or RLHF uh some of you might
have heard of that um the idea is that
sft has a problem namely that uh you do
behavioral cloning which means that you
just try to clone what the humans would
say and that had that has many issues
one of them is that you're bound by
human abilities so if um like humans
actually humans won't generate the
things that they think is actually the
best thing to generate so if you ask me
to write a book I mean I can definitely
enjoy a book I can probably say one book
is better than another but I'm
definitely not going to be as good as
writing the book that I want to read uh
so you're going to be bound by the human
ability to generate things even though
the humans might be better at
distinguishing between things that's one
issue issue number two uh I find that
actually pretty interesting is that it
might if you ever heard of the word
hallucination so this is llms generating
F like false information
hallucination might these people have um
hypothesized that that can come from the
supervised fine tuning even if you do
supervised fine tuning on data that is
correct and the reason why that is is
that if uh given I told you that basically SFT is with very little data and it's with data where the model doesn't learn anything new so what
if the human gives an answer that the
model didn't know was true from the
model perspective you the human
basically is telling the the model uh
generate this thing that seems plausible
but actually have no idea if it's true
or not um so just to give you a very
concrete example if we go back to this
uh monopsony example can you write blah
blah blah about monopsony uh imagine
that a human uh wrote a reference on
this type of book um and that book might
exist that might be a correct reference
but what if the llm never saw this
reference during pre-training then it
doesn't know that it's a correct
reference so really what you tell the
model is to generate or make up some
plausibly sounding reference um rather
than actually tell the real reference
that it saw during pre-training uh so
hallucination might be um like might be caused by this SFT that's
problem number two does that all make
sense great problem number three price
generating the ideal answers is very
pricey and that comes back to your
question um of like humans writing
answer is actually pretty
expensive um so that's where RLHF comes
in the idea is that instead of cloning
the behaviors of humans we're going to
maximize human preference um and the way
we're going to do that so the pipeline
is that for a certain for every
instruction you're going to ask a model
to generate two answers um and usually
use a pretty good model so you usually
don't use a base LLM here you use an SFT you use a fine-tuned LLM
already to give like pretty good answers
and then you ask labelers which of these
two answers was better so select the
preferred one and then with different
type of algorithms we're going to talk
about the algorithms um you just
fine-tune the model to generate more of
the green thing than the red thing so
more of the good stuff uh so now the
question is how and we're going to talk
about that right
now so there are two ways that we're
going to talk about and two that are
mainly used in the community um the
first one is simply the idea of of using
reinforcement learning so hopefully you
all know what reinforcement learning is
now um so when you think about using
reinforcement learning one important
question is like what is the reward that
we're optimizing uh so in this case
there are really two options that I
could think about the first one you
could just say I'm going to compare the
output generated by some baseline the
output generated by my model U and I'm
just going to ask the human to say which
one is better and I'm going to use this
as a reward so if I'm better than the
Baseline this is a plus one if not it's
a minus one uh so now it's a binary
reward the problem with binary reward is
that it's very sparse and you don't get
much information out of it uh like maybe
your answer was slightly better maybe it
was like way better and you don't really
know from this um how much better it was
so option two is that you can train what
we call a reward model which is simply a
classifier uh so you use machine
learning to to classify how much better
uh two outputs are from the preference
from the perspective of the human um so
this is a little bit meta but what you basically do is that you take um a reward model R which is also a large classifier and you basically ask this reward model you give it the input and the actual output that you have one of the two outputs uh and you just um exponentiate that so that's the softmax that you all know about and now you divide by um the exponential reward on the first output plus the exponential reward on the second output and you basically train so
the reason why you do that is that you
train your your model you train this
reward model to be able to classify um
how much better one output is to another
one so another uh slightly less convoluted way of saying it is that your reward model will output some reward that will be used as the logits of your softmax so now if you have high logits in your softmax it means that highly likely this um output is better uh so that's what we call the Bradley-Terry model
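Here is a minimal sketch of that Bradley-Terry objective: the probability that the preferred ("green") answer beats the rejected ("red") one is a softmax over the two scalar rewards, which reduces to a sigmoid of their difference; the reward values below are toy tensors standing in for a real reward model's outputs.

```python
import torch
import torch.nn.functional as F

# Toy scalar rewards r(x, y) for a batch of 4 comparisons; in practice these
# come from a large classifier (often an LLM with a scalar head).
r_chosen   = torch.randn(4, requires_grad=True)  # reward of the preferred answer
r_rejected = torch.randn(4, requires_grad=True)  # reward of the other answer

# Bradley-Terry: p(chosen beats rejected) = exp(r_c) / (exp(r_c) + exp(r_r))
#                                         = sigmoid(r_c - r_r)
# Training the reward model = maximizing the log of this probability.
loss = -F.logsigmoid(r_chosen - r_rejected).mean()
loss.backward()
print(loss)
```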
yes is this reward model going
over the entire output or is it
going um so this takes the
entire uh yeah this takes the entire
output at once so it takes all the input
and all the output and it gives one
number
yes would human be sorry with the reward
model where would a human be like oh I
see okay sorry maybe I wasn't clear um
you train this reward model to fit this
green and and red preference from humans
so basically you train a classifier to
say whether the humans prefer red or
green uh but instead of using the binary
reward which is what the human would
tell you you basically use the logits of
the soft Max and the thing with the
logits is that that logits are
continuous so now you know that if your
reward model said it has high logits
then in some ways the human highly
prefer this answer to some other
answer great um so as I just said
continuous information so it's better so
that's what people uh use in practice or
at least used to use in practice I'll
tell you about uh the other algorithm
later uh so what you do at the end is
that you basically try to just use
reinforcement learning that you know
about now we know we have reward what
you sample through is the generation
from your large language model um and
then you just use some regularization
term so the reason why you do this
regularization term is for avoiding what
we call over optimization so this reward
model might not really represent might not perfectly model human preferences so you don't want to maximize this thing to essentially infinity um and you do it using PPO which is a common uh reinforcement learning algorithm um one thing to note
here because it will be important for
later is that when we use maximum
likelihood
um sorry now the large language models
are actually a policy for your
reinforcement learning it's not
maximizing maximum likelihood anymore
which means that you're not modeling any
distribution anymore and the reason why
this is important is that models that
went through this type of Po actually
don't give you likelihoods of text that
are meaningful cuz what you optimize
them to do is basically just optimize for generating the most likely thing not
optimize for modeling like all the
answers that humans might say another
way of saying that is that there's
nothing that incentivizes here the model
to not give a like a um a single
possible generation nothing here says
it's good if you have some distribution
with some
entropy um okay if you haven't followed it's not that important but just good to know
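One common way this objective shows up in code is as a per-sequence reward that mixes the reward model's score with a KL penalty against the frozen SFT model; the sketch below just computes that regularized reward on toy log-probabilities (the actual PPO machinery of rollouts, advantages, and clipping is omitted, which is exactly the messy part the lecture alludes to).

```python
import torch

# Toy per-token log-probs for one generated answer under the current policy
# and under the frozen reference (SFT) model, plus a reward-model score.
logp_policy = torch.randn(12)           # log pi_theta(y_t | x, y_<t)
logp_ref    = torch.randn(12)           # log pi_ref(y_t | x, y_<t)
reward_model_score = torch.tensor(1.3)  # R_phi(x, y), one scalar per answer
beta = 0.1                              # strength of the KL regularization

# KL-regularized objective per sample: R_phi(x, y) - beta * KL(pi_theta || pi_ref),
# where the KL is estimated on the sampled tokens as sum(logp_policy - logp_ref).
kl_estimate = (logp_policy - logp_ref).sum()
regularized_reward = reward_model_score - beta * kl_estimate
print(regularized_reward)  # this is what PPO then tries to maximize in expectation
```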
great so PPO is exactly what ChatGPT did originally so here's the blog
post or what they have is step one do supervised fine-tuning which now you all know about step two train a reward model on human preferences step three do PPO for multiple steps which is where you see this blue arrow so you continue you train the model once with PPO you collect new data you continue uh and that's exactly what ChatGPT did uh that was a big breakthrough between GPT-3 and ChatGPT
one thing to note is that uh PPO has many challenges reinforcement learning is something that's super nice theoretically in practice anyone who ever worked with reinforcement learning knows it's such a mess uh there's a lot of things like rollouts outer loops clipping so many complications um so it's messy this is the idealized PPO used for LLM settings so that's already much
more complicated than this expectation
we saw before and in practice it's
actually much more complicated so we
have one implementation of it that we
had to do and I'm not going to go
through it but basically you have like
so much stuff that you have to think
about when you implement that type of of
uh PPO algorithm so you have clipping
everywhere you have a lot of
complexities and things are not well
documented all this to say um that we're
going to there was a new method that was proposed uh also from Stanford one year ago called DPO which is essentially a simplification of PPO um and the way uh
what they did or the idea that they have
is that instead of using reinforcement
learning you can just maximize the
probability of generating the stuff that
you like and minimizing the probability
of the stuff that you don't like uh so
if you think about the human preference
the red and green maximize uh green
minimize red um so the loss is actually
this one uh where what you see this is
simply um some log of the model so this
is the likelihood of a model generating
the things that the human preferred
given the the inputs um and what you try
to do is basically
maximize uh the likelihood of generating
the things that you like minimize the
likelihood of the things that you don't
like um all the rest of the terms here
it's not too important it's actually
really not that complicated to
understand but at a high level it's
really just maximizing the things you
like minimizing the the rest um and one
thing to note uh which I was going to
say just here is that actually all the rest is chosen such that um the global minima of PPO and the global minima of this DPO under some assumptions are essentially equivalent so this is the right thing to do mathematically I'm not going to go through the derivations but that's the right thing to do
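For reference, a minimal sketch of the DPO loss itself: it only needs sequence log-probabilities of the preferred and rejected answers under the model being trained and under the frozen reference model; the tensors below are toy stand-ins for those quantities.

```python
import torch
import torch.nn.functional as F

beta = 0.1  # temperature controlling how far the policy can move from the reference

# Toy sequence log-probs log p(answer | prompt) for a batch of 4 preference pairs.
logp_chosen       = torch.randn(4, requires_grad=True)  # policy, preferred answer
logp_rejected     = torch.randn(4, requires_grad=True)  # policy, rejected answer
ref_logp_chosen   = torch.randn(4)                       # frozen reference model
ref_logp_rejected = torch.randn(4)

# DPO: maximize the margin between the (reference-corrected) log-probs of the
# answer humans preferred and the one they didn't.
margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
dpo_loss = -F.logsigmoid(beta * margin).mean()
dpo_loss.backward()
print(dpo_loss)
```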
it's pretty different from PPO in the sense that with PPO what you had to do
is collect the human preferences then
train a uh reward model with maximum
likelihood then use reinforcement
learning now all you do is basically
maximum likelihood much simpler yes I mean yeah so it seems like this is much simpler and like what you would just intuitively do so why did they start with this reward model like what led them to doing that I think it's a
great question uh I don't really know what I can tell you is that at OpenAI the people who did basically ChatGPT initially are the ones who actually wrote PPO and I think they were just like
there are a lot of reinforcement
learning people and I think that for
them it was very intuitive um so there's
also some additional like potential
benefits for example I don't want to
yeah for example if you use the reward
model uh the cool thing here with
reinforcement learning is that you can
use unlabeled data with the reward model
so here you can only use the labeled data for doing DPO um for PPO you first train your reward model and then you can use unlabeled data uh where the reward model will basically label this unlabeled data so there's additional kind of potential uh there could be potential improvements in practice and I think just that a lot of people in this team were reinforcement learning experts including uh the main author of PPO John Schulman um so DPO is much simpler than PPO and basically performs as well uh so now
this is the standard uh thing that
people use at least in the open source
Community I believe it's actually the
standard also in in Industry so that's
called DPO gains
um so those are all the papers on the
left here this is on a summarization
task you see all I want to show you is
that basically the pre-train models uh
were okay and they improve with scale if
you do supervised fine tuning you
improve them a little bit more if you do PPO or something with RLHF with human feedback you get performance that is oftentimes depending on the benchmark even better than uh humans so this is the human uh reference summaries same thing this is on a uh on a paper that we have AlpacaFarm where we see uh the evaluation here is not too important but basically you see pre-trained model you jump to SFT and then you jump to PPO and DPO and PPO have the exact same performance so basically RLHF helps
that's kind of the conclusion and DPO is
simple uh data uh the way that you
collect that type of data um first idea
is just use humans as we already talked
about uh guidelines are very complicated
for what humans should be labeling and
and it's really not that easy and
actually if you ever do some of the
labeling you will see that it's
extremely complicated like if I zoom in
to this uh here I have a question tell
tell me about self-driving cars and you
read both self-driving cars are vehicles
that are capable of detecting their
surroundings blah blah blah self-driving
cars are cars that are equipped with
sensors blah blah blah to navigate
without the need for a driver I mean
both seem okay like which one is better
it's actually hard to say at a glance um
and as a result uh the problem with
humans is that you will start optimizing
a lot of like high level features for
example the second one is longer I can
guarantee you that most humans will
choose second one even though I mean
maybe the first one is better I don't
know I haven't read it carefully so
challenges with humans first slow and
expensive uh second as I just mentioned
it's hard to focus on things that matter
like correctness and people uh usually
look at things that don't matter as much
like the form like length uh and as a result so what I show here is that uh when you do RLHF the more you do of RLHF the longer the output of the models becomes so if you've ever been annoyed at ChatGPT answering you with super long sentences this is because of RLHF um annotator distribution shift uh
like the distribution of annotators that
you use matters a lot and you have to
think like what is what is even the
humans that we want to represent in
these models uh now the question is like
crowdsourcing ethics uh like usually
these basically a lot of the the
labeling that is done um like the people
who do them are not paid well and they
have to go through a lot of toxic data
uh because you basically want the model
to avoid saying the toxic data um so
crowdsourcing ethics
too so many challenges with human data
um so what we did also last year is
again the same thing as alpaca just the
idea of like oh well they're challenges
with humans maybe we can just replace
them with llms uh so what we did is
simply replace
um oh I see that I'm just realizing that
the slides are not sented anyways uh you
replace a human preference with LM
preferences uh so here on this uh figure
you see on the xaxis the price that we
paid uh for collecting human data it's
around
$300 for 1,000 examples and this is on
mechanical turkers which are usually
like cheaper than than maybe some of the
other um companies that you could go
through and on the y-axis it's basically the agreement with uh other humans with the mode of other humans and what you see is that actually as I told you before labeling is really complicated humans agree with themselves only around 66% of the time on a binary task and it's
not that the humans are not good here
because uh we were five main authors on
this paper we tried to label this data
ourselves and we only had like say 67 or
68% accuracy even though we talk like we
talk for like 3 hours of how we should
be doing labeling really it's
complicated it's not an easy task um and
here I just showed many different models
and um basically you see that models are
much cheaper and they can actually get
higher agreement with the mode of humans than humans themselves and the reason why is because humans have a lot of variance models have no variance so they might be a little bit more biased but have less variance uh so it works surprisingly well and now it's kind of the standard in the open source community I think even in industry a lot of people use both humans and LLMs for improving uh the collection of RLHF data
um and this is like this is the paper
from last year but honestly now it's
more like that llms would be around this
agreement and this cost so around I
would say 50x cheaper than humans and
better agreement with human than humans
themselves okay so that gets us to
evaluation of post
training um that goes back to your
initial question at the beginning of the
lecture how do you evaluate something
like chpt uh the answers that chpt could
give are basically unbounded and it's
not that there one right answer there
are many answers that are just as good
um so there are many challenges one you
can't use validation loss because one
method might use po the other one might
use DPO validation loss is not
comparable second you can't use uh sorry perplexity that's the thing I told you before these models uh are not calibrated they don't give distributions they just optimize for one thing so you can't use perplexity for actually evaluating uh these types of models once they're aligned third
uh there's a large diversity of
questions that human might ask to these
models generation open QA like some
question answering some summarization
and all of these things so there's so
many things you have to cover um then
the tasks are really open-ended so it's
very hard to automate so that's what you
were alluding to before so the idea uh
is that instead of trying to come up
with really easily automated uh
benchmarks uh it's just we're going to
ask questions that that users actually
ask to these models in practice and
we're just going to ask annotators to
say between these two models which one
is better like what's the what's the
better output so basically do exact same
thing as um basically the data from rhf
but you use it now for evaluation yes
I'm not sure I understand what you mean
by like can't use perplexity and not
calibrated right like LM is still doing
like next token
prediction so think about um the optimal solution after doing PPO is
basically one model that gives you uh
essentially a Delta um like basically
says that there's only one sentence that
is that could be generated for that
question so now if you use it on
something that is slightly semantically
differently different it would actually
give a likelihood of zero for that
answer so in reality it's not that
extreme because as you say it's still a
distribution but it just shows you that there's a fundamental issue with perplexity once these models are not LLMs anymore they were not trained at least with PPO they were not trained to do maximum likelihood anymore they were trained to be
policies okay um so probably the most
common or like the most um yeah the most
common Benchmark or the most trusted one
is what we call Chad uh sorry chatbot
Arena uh which is basically go on
internet have random users on the
internet blindly talk with two chat Bots
just ask many questions see the two
answers and rate which one is better and
and you do that over hundred of
thousands of users and then you get uh
the actual preferences and you get
rankings of models uh so you can go
right now on chatbot Arena and actually
interact with these models um one
potential issue just to highlight is
that while people who want to do these
type of things are usually more like
Tech driven um or like techsavvy uh so a
lot of the questions that you will ask
are more like Tech stuff discussing
software errors inquiries about AI tools
and all these things um so another issue
is cost and speed if you really want to
use something like this for development
process um it will be too costly because
you would need to basically pay a lot of
humans to do that so one simple idea is
again as we said many times just use LM
instead of humans uh you probably know
the drill at this point uh steps for
every instruction generate outputs by
some baseline and the model that you
want to evaluate um so here you imagine
that I'm comparing an answer from ChatGPT and from the model I want to evaluate I'm just asking another model uh which one is better and I just basically average that out uh yeah I ask GPT-4 which one is better I average
that out over my entire distribution
over my entire Benchmark or data set and
that gives me a win rate so a win probability for one model compared to another one and now you can rank models
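Here is a sketch of that win-rate computation with the judge call abstracted away: judge_pick is a hypothetical function wrapping a call to GPT-4 or any other strong model, and randomizing the presentation order is a common trick to reduce the judge's position bias, not something specific to AlpacaEval.

```python
import random

def judge_pick(instruction: str, answer_1: str, answer_2: str) -> int:
    """Hypothetical judge: ask a strong LLM which answer is better; return 1 or 2.
    A real version would format a judging prompt and call an API."""
    raise NotImplementedError

def win_rate(instructions, baseline_outputs, candidate_outputs) -> float:
    wins = 0
    for x, base, cand in zip(instructions, baseline_outputs, candidate_outputs):
        # Randomize presentation order to mitigate the judge's position bias.
        if random.random() < 0.5:
            wins += judge_pick(x, base, cand) == 2   # candidate shown second
        else:
            wins += judge_pick(x, cand, base) == 1   # candidate shown first
    return wins / len(instructions)
```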
uh and this is the AlpacaEval
leaderboard so the benefits of this is
that actually we show we get 98%
correlation with Chatbot Arena so very
high correlation with humans um so this
is yeah comparison with correlation with
other benchmarks and it takes less than
three minutes and less than $10 to run
so it's pretty cheap um there are
downsides though uh one of them is spurious correlation um so as we already saw before this is one spurious correlation there are many I'll just talk about one LLMs prefer longer outputs
actually humans also prefer longer
outputs but the problem or the issue
once you use llms is that once there
bias you will continue optimizing that
humans at some point I can guarantee you
if I ask a simple question and you give
me five pages of answers I'll be like no
I don't like that answer but LLMs if they have this bias and they were trained for
that they will continue preferring
longer outputs so uh here we see um the
the preference just showing that like
humans and models prefer longer outputs
um and here is another view of the initial AlpacaEval uh benchmark where when we rank GPT-4 when we look at the win rate of GPT-4 versus actually uh GPT-4 itself if we use the standard GPT-4 it gets 50% kind of by definition because we're comparing GPT-4 versus GPT-4 but if we ask GPT-4 to be slightly more verbose so we just say in the prompt be verbose in your answers then it gets a win rate of 64.4% so really there's a huge variance and if we ask it to be concise it gets 20% so there's a huge variance depending on um whether you ask it to be concise or not
that's very annoying um so one possible
solution which is what we did is uh just
use some regression analysis I'm not going to go into details but basically use causal inference tools to control for length and right now uh actually length matters much less so if you ask it to be verbose we still get some gains but much less great so that's all about post
training and now for the next eight
minutes I might talk about systems or
just answer questions yes can you um go
back to your post training in terms of
post training how did we tune those
parameters using the small body of
fine-tuning data and have such big
effect on the model you mentioned
earlier that there's a different set of
hyperparameters are we changing just
some of the weights the later weights or
all the weights what's actually
happening yeah uh I kind of skimmed through all of this you change all the weights actually um industry would change all the weights in open source land you might have heard of LoRA which is going to change basically only some of the weights or actually to be more specific it's going to add some differences to the output of every layer but in industry you're going to just fine-tune all the weights um and also to say something else about the data actually for RLHF you're usually going to collect a lot more data than with SFT so if SFT is like 5,000 10,000 maybe 50,000 with RLHF I think you're going to be more around like the 1 million
uh order of magnitude it's still much
less than pre-training though yeah
because pre-training is 15 trillion
tokens I mean this is like that's not
even a drop and yet you influence the
weight a lot so because you do it I mean
you have to think that how you do it is
you use um I mean as I said the learning
rate that you're going to use is going
to be different but also you only do
that so just imagine even if I train on one sentence but over and over again at some point my model will only output that sentence even if uh it was just one sentence instead of the 15 trillion tokens so if you use a large enough learning rate and for enough time you will basically overfit that sentence
so the the the key thing to to remember
is that um the data is not I it's not as
if you mix some posttraining data and
some pre-training data you do
pre-training and then you just start
fine-tuning only on the post trining so
another way maybe another perspective is
that the post the pre-training is just
the initialization of your model
and once you view it that way that this
is just initialization of Weights then
there's nothing special like you don't
need to remember that you train a lot of
data before the only thing that matters
is that you had an initialization and
now I actually train a model so maybe
think about it that way like there's a Markov property in some way
just like you had your weights this is
my initialization now I'm training that
one does that kind of answer your
question kind of but you said something
just now about it's almost the
equivalence of just rerunning the find
tuning data many times is it actually is
that what actually happens in order to
give so much more preference
um I actually don't know right now how they do it in industry when we did Alpaca we had to do three epochs so you did run it through three times
um but I mean even the number of times
that you run it through it's actually
not important the only thing like the
only thing is the is kind of the
effective learning rate that what
matters
um so
yeah
great so I think I have five minutes
right okay I might try to give a high
level overview of at least one of the systems tricks systems as we said uh for everyone the bottleneck sorry compute is the huge bottleneck uh one question you
might ask is why not buy more gpus uh
gpus are expensive but also are scarce
even if you have $10 million right now
you cannot buy the best gpus um
there's oh yeah there's also some
physical limitations when you have when
you have multiple gpus you have to
communicate between them that takes time
um so just buying more gpus is not that
easy um so it's really important to
think about how do you allocate
resources and how do you optimize your
pipeline so system 101 on gpus I'm sorry
I'm going slightly faster I hope for
that some of you at least can follow uh
gpus are basically optimized for
throughput CPUs are optimized uh for
latency so gpus the way you have to
think about it is that there's one command that is run on many many cores at the same time on different types of data um so this is how you see a GPU you see there are many different cores we call them streaming multiprocessors which is very different from the usual CPU architecture so just think high throughput parallelization for
gpus uh gpus are optimized for fast
matrix multiplication so every time you
will do uh you will do something on GPU
if you can do it with a a matrix
multiplication it's going to be 10 times
faster than with anything else uh that
is a little bit annoying because it
means that we're kind of uh bottlenecked
to doing anything with Matrix
multiplications um another thing to note
with gpus is that compute has been
improving faster than memory and
communication so right now gpus usually
are hard to keep uh like the data that
you send that send to gpus is actually
hard to keep up with the processess so
most of your gpus are actually going to
be idle if you just run normal code if
you don't optimize your code so
communication and this will continue
over time another thing to know about
gpus is that there's a memory hierarchy
this is the same thing actually with
CPUs but basically the closer you are to
your cores the less memory there is but the faster things run if you're further away more memory but slower
um okay I'm going to skip that okay
actually I'm going to say it I told you
about this uh the fact of communication
uh the metric that people usually look at is model FLOP utilization so what is the theoretical maximum that the GPU could run at the number of flops that you could use per second sorry it's the observed throughput divided by this theoretical um maximum and in general if you reach 50% you're very happy like Facebook I looked at Llama it was at 45% or something like this so that means that data doesn't come fast enough even for these big companies
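Here is a quick sketch of how you would compute MFU from training throughput, reusing the 6 FLOPs-per-parameter-per-token rule of thumb from earlier; the peak-FLOPs number is the published H100 dense BF16 figure and the throughput is an illustrative value, not Meta's actual number.

```python
# Model FLOP utilization (MFU) = observed FLOP/s / theoretical peak FLOP/s.
# Observed FLOP/s is estimated from throughput via the ~6 * P FLOPs-per-token rule.
n_params           = 405e9    # model parameters
tokens_per_second  = 2.9e6    # illustrative cluster-wide training throughput
n_gpus             = 16_000
peak_flops_per_gpu = 989e12   # H100 dense BF16 peak per the spec sheet

observed = 6 * n_params * tokens_per_second
peak     = n_gpus * peak_flops_per_gpu
print(f"MFU ~ {observed / peak:.0%}")  # ~45% with these illustrative numbers
```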
so one simple trick and that might be the only one I'm going to tell
you about is low Precision one simple
idea is that well if I'm going to put my
floats in lower Precision then there's
going to be fewer bits that I have to
send to my gpus if there's fewer bits
it's faster communication lower memory
consumption things are going to go
faster uh and for deep learning it just
happens that the decimals are not that important uh so when you do matrix
multiplication when you do like for
example SGD there's already so much
noise that if you update something by
0.01 or
0.015 who cares uh so basically instead
of using uh 32 bits per float which is
um what people used to use or 64 for
example which is what you would use in
other domains you use 16 bits uh for
matrix multiplication so for every float
you use 16 bits um and for training you
have this type of like uh what we call
automatic mixed precision which is that uh some of the things are in 32 bits others are in 16 bits um generally the way you should be thinking about it is that the weights of your model are stored in 32 bits um but just before the computation you put everything in 16 bits like this you do computation super fast and at the end you update your weights in 32 bits and
the reason why you do all the updates in
32 bits it's just think that if your
learning rate for example is very small
you still want to be able to like make a
difference in your weights uh so all the
computation is done in 16 bits but the
weights are actually stored in 32 bits
so that's like the standard way that people are doing it
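A minimal sketch of what that looks like with PyTorch's automatic mixed precision, assuming a CUDA GPU is available; the model and data are toy placeholders, and the GradScaler is there because fp16 gradients can underflow without loss scaling.

```python
import torch

# Toy model and data; assumes a CUDA GPU is available.
model = torch.nn.Linear(1024, 1024).cuda()       # master weights stay in fp32
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()             # rescales the loss so fp16 grads don't underflow
x = torch.randn(32, 1024, device="cuda")
target = torch.randn(32, 1024, device="cuda")

for _ in range(10):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():              # matmuls run in fp16, reductions stay in fp32
        loss = torch.nn.functional.mse_loss(model(x), target)
    scaler.scale(loss).backward()
    scaler.step(optimizer)                       # weight update happens on the fp32 master weights
    scaler.update()
```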
um okay I'll actually talk just about this and then
I'll skip all the rest operator Fusion
because I think this is actually pretty
cool as I just said communication is
very slow and actually every time you run a PyTorch line it basically moves a variable to the global memory of your GPU so when you have something like x1 = x.cos() and then you do x1.cos() what is happening behind the scenes is that you take the x which is data you ship it to the actual processors of your GPU you apply the cosine you ship it back to the main memory of your GPU and then you see the next cosine you ship it back to the GPU processor you apply another cosine and you ship it back
again um so another way to see that is
that you go from your DRAM which is your global memory in your GPU and you ship
it to compute you ship it back for every
line This is a naive way of doing it
this seems very wasteful um so the idea
simple idea of operative Fusion is just
communicate do all the computation ship
it back once and this is exactly what
fused kernels are um so if you ever want to make your computations in PyTorch much faster just apply torch.compile on your model this is going to make your model around two times faster and what it does is simply that it rewrites your code uh your PyTorch code basically in C++ and CUDA uh to do the communication only once then do all the operations then uh ship it back
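A minimal sketch of the difference: eager mode launches one kernel (and one round trip to GPU global memory) per line, while torch.compile can fuse the two cosines into a single kernel. The speedup you actually see depends on the model and hardware, and the roughly two-times figure from the lecture is for full models, not a toy like this.

```python
import torch

def double_cos(x):
    x1 = x.cos()   # eager mode: kernel launch + round trip to GPU global memory
    x2 = x1.cos()  # ...and a second round trip for the second op
    return x2

# torch.compile traces the function and can fuse both ops into one kernel,
# so the intermediate x1 never has to travel back to global memory.
compiled_double_cos = torch.compile(double_cos)

x = torch.randn(4096, 4096, device="cuda" if torch.cuda.is_available() else "cpu")
out_eager = double_cos(x)
out_fused = compiled_double_cos(x)
assert torch.allclose(out_eager, out_fused, atol=1e-5)
```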
okay I'm not going to have time to talk about tiling tiling is important parallelization parallelization is important um and
mixture of experts mixture of experts is
important Outlook there are many things
we haven't talked about we haven't
talked about architectures we definitely
haven't talked about inference um there
are many other things that are important
with LLMs what is the UI that you use I mean arguably with ChatGPT the big novelty was just having a simple UI to use it
multimodality what are all the misuses
you could have uh the fact that there
might not be enough data on the internet
to train all these models legality of
data collection so many other things if
you are interested in all these topics
uh I would suggest three classes cs224n
is probably the one that touches the
least on uh LLMs uh but it gives some
background and historical context um of
all the LMS and gives kind of some
adjacent material CS 324 I think it's
called Uh I think it's just called large
language models uh more in-depth reading
and lectures on everything I talked
about CS 336 which is large language
model from scratch you actually build
your own llm uh it's an amazing class
also given by my two supervisors very
heavy workload so be careful and um
great