
Let's build GPT: from scratch, in code, spelled out.

By Andrej Karpathy

Summary

## Key takeaways

- **Transformer architecture is the core of GPT**: The "Attention Is All You Need" paper from 2017 introduced the Transformer architecture, which forms the core of modern large language models like ChatGPT. This architecture has been widely adopted across AI applications. [00:00], [22:11]
- **Character-level tokenization simplifies understanding**: The video uses character-level tokenization for the Tiny Shakespeare dataset, converting each character into an integer. While subword tokenization (like BPE or SentencePiece) is common in practice for larger models, character-level tokenization simplifies the learning process. [08:39], [47:11]
- **Self-attention enables token communication**: Self-attention allows tokens within a sequence to communicate by calculating affinities (weights) based on queries and keys. This enables tokens to aggregate information from relevant past tokens, a key mechanism for understanding context. [01:00:18], [01:04:00]
- **Residual connections and layer normalization stabilize deep networks**: To train deep Transformer networks effectively, residual connections (skip connections) and layer normalization are crucial. These techniques help stabilize training and prevent vanishing gradients, allowing for deeper and more performant models. [01:30:31], [01:33:03]
- **Scaling up Transformer parameters improves performance**: Increasing the number of layers, heads, embedding dimensions, and context length (block size) significantly improves the Transformer's performance. For instance, scaling up to 6 layers, 6 heads, and a context length of 256 reduced validation loss substantially. [01:37:39], [01:40:34]
- **Pre-training vs. fine-tuning for chatbot capabilities**: Large language models like ChatGPT are first pre-trained on vast amounts of text to learn language patterns, then fine-tuned through supervised learning and reinforcement learning (like RLHF) to align their behavior as helpful assistants. [01:48:53], [01:53:20]

Topics Covered

  • Language models are just sophisticated sequence completers.
  • The Transformer began as a random translation paper.
  • AI's tokenization choice is a fundamental design trade-off.
  • Self-attention uses 'query' and 'key' vector dot products.
  • Pre-training creates a document completer, not an assistant.

Full Transcript

Hi everyone. So by now you have probably heard of ChatGPT. It has taken the world and the AI community by storm, and it is a system that allows you to interact with an AI and give it text-based tasks. So for example, we can ask ChatGPT to write us a small haiku about how important it is that people understand AI, and then they can use it to improve the world and make it more prosperous. So when we run this: "AI knowledge brings prosperity, for all to see, embrace its power." Okay, not bad. And so you can see that ChatGPT went from left to right and generated all these words sort of sequentially.

Now, I asked it the exact same prompt a little bit earlier, and it generated a slightly different outcome: "AI's power to grow, ignorance holds us back, learn, prosperity waits." So, pretty good in both cases, and slightly different. So you can see that ChatGPT is a probabilistic system, and for any one prompt it can give us multiple answers in reply.

Now, this is just one example of a prompt. People have come up with many, many examples, and there are entire websites that index interactions with ChatGPT, and many of them are quite humorous: explain HTML to me like I'm a dog, write release notes for chess 2, write a note about Elon Musk buying Twitter, and so on. So as an example: please write a breaking news article about a leaf falling from a tree. "In a shocking turn of events, a leaf has fallen from a tree in the local park. Witnesses report that the leaf, which was previously attached to a branch of a tree, detached itself and fell to the ground." Very dramatic.

So you can see that this is a pretty remarkable system, and it is what we call a language model, because it models the sequence of words, or characters, or tokens more generally, and it knows how words follow each other in the English language. So from its perspective, what it is doing is completing the sequence: I give it the start of a sequence, and it completes the sequence with the outcome. It's a language model in that sense. Now, I would like to focus on the under-the-hood components of what makes ChatGPT work.

So what is the neural network under the hood that models the sequence of these words? That comes from a paper called "Attention Is All You Need," from 2017, a landmark paper in AI that proposed the Transformer architecture. GPT is short for Generative Pre-trained Transformer, so the Transformer is the neural net that actually does all the heavy lifting under the hood, and it comes from this paper in 2017. Now, if you read this paper, it reads like a pretty random machine translation paper, and that's because I think the authors didn't fully anticipate the impact that the Transformer would have on the field. This architecture, which they produced in the context of machine translation in their case, actually ended up taking over the rest of AI in the next five years. And so this architecture, with minor changes, was copy-pasted into a huge number of applications in AI in more recent years, and that includes the core of ChatGPT.

What I'd like to do now is build out something like ChatGPT, but of course we're not going to be able to reproduce ChatGPT. This is a very serious production-grade system; it is trained on a good chunk of the internet, and then there's a lot of pre-training and fine-tuning stages to it, so it's very complicated. What I'd like to focus on is just training a Transformer-based language model, and in our case it's going to be a character-level language model. I still think that is very educational with respect to how these systems work. I don't want to train on a chunk of the internet; we need a smaller dataset. In this case I propose that we work with my favorite toy dataset, called Tiny Shakespeare. What it is, basically, is a concatenation of all of the works of Shakespeare, in my understanding. So this is all of Shakespeare in a single file; the file is about 1 megabyte, and it's just all of Shakespeare.

What we are going to do now is basically model how these characters follow each other. So for example, given a chunk of these characters, given some context of characters in the past, the Transformer neural network will look at the characters that I've highlighted and is going to predict that 'g' is likely to come next in the sequence. It's going to do that because we're going to train the Transformer on Shakespeare, and it's just going to try to produce character sequences that look like this, and in that process it's going to model all the patterns inside this data.

Once we've trained the system, just to give you a preview, we can generate infinite Shakespeare. Of course it's a fake thing that looks kind of like Shakespeare. Apologies, there's some jank in here that I'm not able to resolve, but you can see how this is going character by character, and it's kind of like predicting Shakespeare-like language: "verily my lord the sights have left thee again the king coming with my curses with precious pale," and then Tranio says something else, etc. This is just coming out of the Transformer, in a very similar manner as it would come out of ChatGPT. In our case it's character by character; in ChatGPT it comes out at the token-by-token level, and tokens are these little subword pieces, so they're not word level, they're kind of like word-chunk level.

Now, I've already written this entire code to train these Transformers, and it is in a GitHub repository that you can find, called nanoGPT. So nanoGPT is a repository you can find on my GitHub, and it's a repository for training Transformers on any given text. What I think is interesting about it is that, while there are many ways to train Transformers, this is a very simple implementation: it's just two files of about 300 lines of code each. One file defines the GPT model, the Transformer, and one file trains it on some given text dataset. Here I'm showing that if you train it on an OpenWebText dataset, which is a fairly large dataset of web pages, then it reproduces the performance of GPT-2. GPT-2 is an earlier version of OpenAI's GPT, from 2019, and I've only so far reproduced the smallest, 124 million parameter model. But basically this is just proving that the codebase is correctly arranged, and I'm able to load the neural network weights that OpenAI has released.

So you can take a look at the finished code in nanoGPT, but what I would like to do in this lecture is write this repository from scratch. We're going to begin with an empty file, we're going to define a Transformer piece by piece, we're going to train it on the Tiny Shakespeare dataset, and we'll see how we can then generate infinite Shakespeare. Of course this can copy-paste to any arbitrary text dataset that you like. But my goal here really is to just make you understand and appreciate how, under the hood, ChatGPT works. Really, all that's required is proficiency in Python and some basic understanding of calculus and statistics. It would also help if you've seen my previous videos on the same YouTube channel, in particular my makemore series, where I define smaller and simpler neural network language models, multilayer perceptrons and so on; it really introduces the language modeling framework. Then here, in this video, we're going to focus on the Transformer neural network itself.

Okay, so I created a new Google Colab Jupyter notebook here, and this will allow me to later easily share this code that we're going to develop together with you, so you can follow along. The link will be in the video description later. Now, here I've just done some preliminaries: I downloaded the dataset, the Tiny Shakespeare dataset, at this URL, and you can see that it's about a 1 megabyte file. Then here I open the input.txt file and just read in all the text as a string, and we see that we are working with roughly 1 million characters. The first 1,000 characters, if we just print them out, are basically what you would expect: this is the first 1,000 characters of the Tiny Shakespeare dataset, roughly up to here. So far so good.

Next, we're going to take this text, and the text is a sequence of characters in Python, so when I call the set constructor on it, I'm just going to get the set of all the characters that occur in this text. Then I call list on that to create a list of those characters instead of just a set, so that I have an ordering, an arbitrary ordering, and then I sort that. So basically we get all the characters that occur in the entire dataset, and they're sorted. Now, the number of them is going to be our vocabulary size; these are the possible elements of our sequences. We see that when I print the characters, there are 65 of them in total: there's a space character, then all kinds of special characters, and then capitals and lowercase letters. So that's our vocabulary, and those are the possible characters that the model can see or emit.

Okay, so next we would like to develop some strategy to tokenize the input text. When people say "tokenize," they mean converting the raw text, as a string, into some sequence of integers according to some vocabulary of possible elements. As an example, here we are going to be building a character-level language model, so we're simply going to be translating individual characters into integers. Let me show you a chunk of code that does that for us; we're building both the encoder and the decoder, and let me just talk through what's happening here.

When we encode an arbitrary text, like "hii there", we're going to receive a list of integers that represents that string, so for example 46, 47, etc. And then we also have the reverse mapping, so we can take this list and decode it to get back the exact same string. So it's really just a translation to integers and back, for arbitrary strings, and for us it is done at the character level. The way this was achieved is that we just iterate over all the characters here and create a lookup table from the character to the integer and vice versa. Then, to encode some string, we simply translate all the characters individually, and to decode it back, we use the reverse mapping and concatenate all of it.
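As a concrete illustration, here is a minimal sketch of what such a character-level tokenizer might look like; it assumes `text` already holds the Shakespeare string, and names like `stoi`/`itos` are just illustrative choices:

```python
# build the sorted character vocabulary from the raw text
chars = sorted(list(set(text)))
vocab_size = len(chars)

# lookup tables: character -> integer and integer -> character
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}

encode = lambda s: [stoi[c] for c in s]          # string -> list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # list of integers -> string

print(encode("hii there"))
print(decode(encode("hii there")))  # recovers "hii there"
```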

Now, this is only one of many possible encodings, or many possible tokenizers, and it's a very simple one, but there are many other schemas that people have come up with in practice. For example, Google uses SentencePiece. SentencePiece will also encode text into integers, but with a different schema and a different vocabulary, and SentencePiece is a subword tokenizer. What that means is that you're not encoding entire words, but you're also not encoding individual characters; it's a subword unit level, and that's usually what's adopted in practice. For example, OpenAI also has this library called tiktoken that uses a byte-pair encoding tokenizer, and that's what GPT uses. As an example, I'm using the tiktoken library here, getting the encoding that was used for GPT-2. Instead of just 65 possible characters or tokens, they have about 50,000 tokens, and so when they encode the exact same string, "hii there", we only get a list of three integers, but those integers are not between 0 and 64, they are between 0 and 50,256.

So basically you can trade off the codebook size and the sequence length: you can have very long sequences of integers with very small vocabularies, or you can have short sequences of integers with very large vocabularies. Typically people use these subword encodings in practice, but I'd like to keep our tokenizer very simple, so we're using a character-level tokenizer. That means we have very small codebooks and very simple encode and decode functions, but we do get very long sequences as a result. That's the level we're going to stick with for this lecture, because it's the simplest thing.

thing okay so now that we have an

encoder and a decoder effectively a

tokenizer we can tokenize the entire

training set of Shakespeare so here's a

chunk of code that does that and I'm

going to start to use the pytorch

library and specifically the torch.

tensor from the pytorch library so we're

going to take all of the text in tiny

Shakespeare encode it and then wrap it

into a torch. tensor to get the data

tensor so here's what the data tensor

looks like when I look at just the first

1,000 characters or the 1,000 elements

of it so we see that we have a massive

sequence of integers and this sequence

of integers here is basically an

identical translation of the first

10,000 characters

here so I believe for example that zero

is a new line character and maybe one

one is a space not 100% sure but from

now on the entire data set of text is

re-represented as just it's just

stretched out as a single very large uh

sequence of

Let me do one more thing before we move on here. I'd like to separate out our dataset into a train and a validation split. In particular, we're going to take the first 90% of the dataset and consider that to be the training data for the Transformer, and we're going to withhold the last 10% at the end to be the validation data. This will help us understand to what extent our model is overfitting. We're going to basically hide and keep the validation data on the side, because we don't want just a perfect memorization of this exact Shakespeare; we want a neural network that creates Shakespeare-like text. And so the stowed-away, true Shakespeare text that the model never trains on gives us a sense of the overfitting.
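A minimal sketch of both steps, encoding the full text into a tensor and carving off the last 10% for validation (this mirrors the description above; the exact variable names are just illustrative):

```python
import torch

# encode the entire text dataset into one long tensor of integers
data = torch.tensor(encode(text), dtype=torch.long)
print(data.shape, data.dtype)

# first 90% for training, last 10% held out for validation
n = int(0.9 * len(data))
train_data = data[:n]
val_data = data[n:]
```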

Okay, so now we would like to start plugging these text sequences, or integer sequences, into the Transformer so that it can train and learn those patterns. Now, the important thing to realize is that we're never going to actually feed the entire text into a Transformer all at once; that would be computationally very expensive and prohibitive. So when we actually train a Transformer on a lot of these datasets, we only work with chunks of the dataset: we basically sample random little chunks out of the training set and train on just chunks at a time. These chunks have some maximum length; the maximum length, at least in the code I usually write, is called block size, but you can find it under different names like context length or something like that. Let's start with a block size of just eight, and let me look at the first block size plus one characters of the training data; I'll explain why plus one in a second.

So this is the first nine characters in the sequence, in the training set. What I'd like to point out is that when you sample a chunk of data like this, say these nine characters out of the training set, it actually has multiple examples packed into it, and that's because all of these characters follow each other. So what this is going to mean when we plug it into a Transformer is that we're actually going to simultaneously train it to make a prediction at every one of these positions. In a chunk of nine characters there are actually eight individual examples packed in there: there's the example that in the context of 18, 47 likely comes next; in the context of 18 and 47, 56 comes next; in the context of 18, 47, 56, 57 can come next; and so on. So those are the eight individual examples.

Let me actually spell it out with code. Here's a chunk of code to illustrate it: x are the inputs to the Transformer, which will just be the first block size characters; y will be the next block size characters, so it's offset by one, and that's because y are the targets for each position in the input. Then here I'm iterating over the block size of eight, and the context is always all the characters in x up to and including t, and the target is always the t-th character, but in the targets array y. So let me just run this, and basically it spells out what I said in words: these are the eight examples hidden in a chunk of nine characters that we sampled from the training set.
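Here's a small sketch of what that illustration might look like in code (following the description above; the printed pairs are the eight context/target examples):

```python
block_size = 8

x = train_data[:block_size]      # inputs: the first block_size characters
y = train_data[1:block_size+1]   # targets: the same characters shifted by one

for t in range(block_size):
    context = x[:t+1]            # everything up to and including position t
    target = y[t]                # the character that should come next
    print(f"when input is {context.tolist()} the target is {target}")
```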

I want to mention one more thing: we train on all eight examples here, with context lengths between one and block size. We train on that not just for computational reasons, not just because we happen to have the sequence already, or for efficiency; it's also done to make the Transformer network used to seeing contexts all the way from as little as one up to block size. We'd like the Transformer to be used to seeing everything in between, and that's going to be useful later during inference, because while we're sampling we can start the generation with as little as one character of context, and the Transformer knows how to predict the next character from contexts of every size up to block size. After block size we have to start truncating, because the Transformer will never receive more than block size inputs when it's predicting the next character.

Okay, so we've looked at the time dimension of the tensors that are going to be feeding into the Transformer. There's one more dimension to care about, and that is the batch dimension. As we're sampling these chunks of text, every time we feed them into the Transformer we're going to have many batches of multiple chunks of text all stacked up in a single tensor, and that's just done for efficiency, just so we can keep the GPUs busy, because they are very good at parallel processing of data. So we just want to process multiple chunks all at the same time, but those chunks are processed completely independently; they don't talk to each other.

So let me generalize this and introduce a batch dimension. Here's a chunk of code; let me just run it, and then I'm going to explain what it does. Because we're going to start sampling random locations in the dataset to pull chunks from, I am setting the seed of the random number generator, so that the numbers I see here are going to be the same numbers you see later if you try to reproduce this. Now, the batch size here is how many independent sequences we are processing in every forward/backward pass of the Transformer; the block size, as I explained, is the maximum context length used to make those predictions. So let's say batch size four, block size eight, and then here's how we get a batch for any arbitrary split: if the split is a training split, then we're going to look at the train data, otherwise at the validation data. That gives us the data array, and then when I generate random positions to grab a chunk out of, I actually generate batch size number of random offsets. Because this is four, ix is going to be four numbers that are randomly generated between zero and len(data) minus block size, so it's just random offsets into the training set. Then the x's, as I explained, are the first block size characters starting at i, and the y's are offset by one from that, so we just add plus one. We're going to get those chunks for every one of the integers i in ix, and use torch.stack to take all those one-dimensional tensors, as we saw here, and stack them up as rows, so they all become a row in a 4 by 8 tensor.
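Here's a minimal sketch of a batching function along the lines described above; the name `get_batch`, the seed, and the constants are illustrative, and it assumes the `train_data`/`val_data` tensors from the split earlier:

```python
torch.manual_seed(1337)  # for reproducibility; the exact seed is arbitrary
batch_size = 4  # how many independent sequences we process in parallel
block_size = 8  # maximum context length for predictions

def get_batch(split):
    # grab batch_size random chunks from the chosen split
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])      # inputs  (B, T)
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])  # targets (B, T)
    return x, y

xb, yb = get_batch('train')
```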

So here's where I'm printing it: when I sample a batch, xb and yb, the input to the Transformer is x, the 4 by 8 tensor, four rows of eight columns, and each one of these is a chunk of the training set. The targets are in the associated array y, and they will come into the Transformer all the way at the end, to create the loss function; they give us the correct answer for every single position inside x. These are the four independent rows. So, spelled out as we did before, this 4 by 8 array contains a total of 32 examples, and they're completely independent as far as the Transformer is concerned. When the input is 24, the target is 43 (43 here in the y array); when the input is 24, 43, the target is 58; when the input is 24, 43, 58, the target is 5, and so on; or when the input is 52, 58, 1, the target is 58. So you can see this spelled out: these are the 32 independent examples packed into a single batch of the input x, and the desired targets are in y. And so now this integer tensor x is going to feed into the Transformer, and the Transformer is going to simultaneously process all these examples and then look up the correct integers to predict in every one of these positions, in the tensor y.

Okay, so now that we have our batch of input that we'd like to feed into a Transformer, let's start basically feeding this into neural networks. We're going to start off with the simplest possible neural network, which in the case of language modeling, in my opinion, is the bigram language model. We've covered the bigram language model in my makemore series in a lot of depth, so here I'm going to go a bit faster; let's just implement the PyTorch module directly that implements the bigram language model.

I'm importing the PyTorch nn module, setting a seed for reproducibility, and then here I'm constructing a BigramLanguageModel, which is a subclass of nn.Module. Then I'm calling it, passing it the inputs and the targets, and I'm just printing. When the inputs and targets come here, you see that I'm just taking the inputs x, which I rename to idx, and passing them into this token embedding table. What's going on here is that in the constructor we are creating a token embedding table, and it is of size vocab size by vocab size; we're using nn.Embedding, which is a very thin wrapper around a tensor of shape vocab size by vocab size. What's happening is that when we pass in idx, every single integer in our input is going to refer to this embedding table and pluck out the row of that embedding table corresponding to its index. So 24 here will go into the embedding table and pluck out the 24th row, then 43 will pluck out the 43rd row, etc. Then PyTorch is going to arrange all of this into a batch by time by channel tensor; in this case batch is 4, time is 8, and C, the channels, is vocab size, or 65. So we're just going to pluck out all those rows, arrange them in a B by T by C, and now we're going to interpret this as the logits, which are basically the scores for the next character in the sequence.

So what's happening here is that we are predicting what comes next based on just the individual identity of a single token. You can do that because, currently, the tokens are not talking to each other and they're not seeing any context; they're just seeing themselves. So, I'm token number five, and I can actually make pretty decent predictions about what comes next just by knowing that I'm token five, because some characters follow other characters in typical scenarios. We saw a lot of this in a lot more depth in the makemore series. And here, if I just run this, then we currently get the predictions, the scores, the logits, for every one of the 4 by 8 positions.
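A tiny sketch of that embedding-table lookup; each integer in idx indexes a row of the table, and the result is interpreted as the logits (it reuses the xb batch from above, and the shapes follow the description):

```python
import torch
import torch.nn as nn

vocab_size = 65
token_embedding_table = nn.Embedding(vocab_size, vocab_size)

idx = xb                              # integer tensor of shape (B, T) = (4, 8)
logits = token_embedding_table(idx)   # each integer plucks out a row -> (B, T, vocab_size)
print(logits.shape)                   # torch.Size([4, 8, 65])
```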

Now that we've made predictions about what comes next, we'd like to evaluate the loss function. In the makemore series we saw that a good way to measure the loss, or the quality of the predictions, is to use the negative log likelihood loss, which is also implemented in PyTorch under the name cross entropy. So what we'd like to do here is: loss is the cross entropy on the predictions and the targets. This measures the quality of the logits with respect to the targets; in other words, we have the identity of the next character, so how well are we predicting the next character based on the logits? Intuitively, the correct dimension of the logits, the one corresponding to the target, should have a very high number, and all the other dimensions should have a very low number.

Now, the issue is that this won't actually run. This is what we want, we want to output the logits and the loss, but unfortunately we get an error message. When we go to the PyTorch cross entropy documentation (we're calling cross entropy in its functional form, so we don't have to create a module for it), you have to look into the details of how PyTorch expects these inputs. Basically, the issue is that if you have multi-dimensional input, which we do because we have a B by T by C tensor, then PyTorch really wants the channels to be the second dimension: it wants a B by C by T instead of a B by T by C. It's just a detail of how PyTorch treats these kinds of inputs, and we don't actually want to deal with that, so what we're going to do instead is reshape our logits. Here's what I like to do: I like to give names to the dimensions, so logits.shape is B by T by C, and we unpack those numbers; then let's say that logits equals logits.view(B*T, C), so just a two-dimensional array. We're going to take all of these positions and stretch them out into a one-dimensional sequence, preserving the channel dimension as the second dimension, so we're just kind of stretching out the array so it's two-dimensional; in that case it conforms to what PyTorch expects in its dimensions. Now we have to do the same to the targets, because currently targets are of shape B by T, and we want them to be just B*T, so one-dimensional. Alternatively, you could just write minus one, because PyTorch will infer what this should be, but let me just be explicit and say B*T. Once we've reshaped this, it will match the cross entropy case, and then we should be able to evaluate our loss.
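As a sketch, the reshaping step described above might look like this inside the model's forward pass (using the functional cross entropy; `token_embedding_table`, `idx`, and `targets` are the names from the surrounding discussion):

```python
import torch.nn.functional as F

logits = token_embedding_table(idx)    # (B, T, C)

B, T, C = logits.shape
logits = logits.view(B*T, C)           # stretch batch and time out into one dimension
targets = targets.view(B*T)            # targets go from (B, T) to (B*T,)

# negative log likelihood of the correct next character at every position
loss = F.cross_entropy(logits, targets)
```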

Okay, so that runs now, and we can compute the loss. Currently we see that the loss is 4.87. Because we have 65 possible vocabulary elements, we can actually guess what the loss should be. In particular, we covered negative log likelihood in a lot of detail: we are expecting negative ln(1/65), so the loss should be about 4.17, but we're getting 4.87. That's telling us that the initial predictions are not super diffuse, they've got a little bit of entropy, and so we're guessing wrong. But yes, we are able to evaluate the loss.

Okay, so now that we can evaluate the quality of the model on some data, we'd also like to be able to generate from the model. So let's do the generation. I'm going to go a little bit faster here, because I covered all of this already in previous videos.

So here's a generate function for the model. We take the same kind of input, idx, and basically this is the current context of some characters in some batch, so it's also B by T. The job of generate is to take this B by T and extend it to be B by T+1, plus 2, plus 3; it just continues the generation, in all the batch dimensions, in the time dimension, and it will do that for max new tokens. You can see on the bottom that whatever is predicted is concatenated on top of the previous idx along the time dimension to create a B by T+1, and that becomes the new idx. So the job of generate is to take a B by T and make it a B by T plus 1, plus 2, plus 3, as many as we want, max new tokens. So this is the generation from the model.

Now, inside the generation, what are we doing? We're taking the current indices and getting the predictions; those are in the logits. The loss here is going to be ignored, because we're not using it and we have no ground-truth targets to compare with. Then, once we get the logits, we focus only on the last step: instead of a B by T by C, we pluck out the -1, the last element in the time dimension, because those are the predictions for what comes next. That gives us the logits, which we then convert to probabilities via softmax, and then we use torch.multinomial to sample from those probabilities; we ask PyTorch to give us one sample, and so idx_next will become a B by 1, because in each one of the batch dimensions we're going to have a single prediction for what comes next (this num_samples equals one makes it a one). Then we take those integers that come from the sampling process, according to the probability distribution given here, and those integers get concatenated on top of the current, running stream of integers, and this gives us a B by T+1. And then we can return that.

Now, one thing here: you see how I'm calling self(idx), which will end up going to the forward function, and I'm not providing any targets. So currently this would give an error, because targets is not given; targets has to be optional. So targets is None by default, and if targets is None then there's no loss to create, so loss is just None; else all of this happens and we can create a loss. This makes it so that if we have the targets, we provide them and get a loss, and if we have no targets, we'll just get the logits. So this here will generate from the model; let's take it for a ride.
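Here is a minimal sketch of a generate loop along these lines, together with how it might be kicked off from a single newline token. It's written as a standalone function for brevity (in the lecture it lives on the model), and it assumes a `model` whose forward returns (logits, loss) as described above, plus the `decode` helper from earlier:

```python
import torch
import torch.nn.functional as F

def generate(model, idx, max_new_tokens):
    # idx is a (B, T) tensor of indices in the current context
    for _ in range(max_new_tokens):
        logits, loss = model(idx)                      # loss is ignored here
        logits = logits[:, -1, :]                      # focus on the last time step -> (B, C)
        probs = F.softmax(logits, dim=-1)              # convert scores to probabilities
        idx_next = torch.multinomial(probs, num_samples=1)  # sample one token -> (B, 1)
        idx = torch.cat((idx, idx_next), dim=1)        # append along the time dimension
    return idx

# kick off generation from a single newline character (index 0)
context = torch.zeros((1, 1), dtype=torch.long)
print(decode(generate(model, context, max_new_tokens=100)[0].tolist()))
```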

So I have another code chunk here which will generate from the model, and okay, this looks kind of crazy, so let me break it down. These are the idx. I'm creating a batch of just one, and time will be just one, so I'm creating a little one-by-one tensor, and it's holding a zero, and the data type is integer. Zero is going to be how we kick off the generation, and remember that zero is the element standing for the newline character, so it's kind of a reasonable thing to feed in as the very first character of a sequence. So that's going to be the idx which we feed in; then we're going to ask for 100 tokens, and generate will continue that. Now, because generate works on the level of batches, we then have to index into the zeroth row to basically unplug the single batch dimension that exists. That gives us the time steps, just a one-dimensional array of all the indices, which we convert to a simple Python list from the PyTorch tensor, so that it can feed into our decode function and convert those integers into text. So let me bring this back; we're generating 100 tokens, let's run. And here's the generation that we achieved: obviously it's garbage, and the reason it's garbage is that this is a totally random model. So next up, we're going to want to train this model.

Now, one more thing I wanted to point out here: this function is written to be general, but it's kind of ridiculous right now, because we're building out this whole context, concatenating it all, and always feeding all of it into the model. That's kind of ridiculous, because this is just a simple bigram model: to make the prediction for the next character, we only needed the very last character, but what we fed into the model is the entire sequence, and then we only looked at the very last piece and predicted from that. The only reason I'm writing it this way is that right now this is a bigram model, but I'd like to keep this function fixed, and I'd like it to work later, when our characters actually look further into the history. So right now the history is not used, and this looks silly, but eventually the history will be used, and that's why we want to do it this way. Just a quick comment on that. So now we see that this is random; let's train the model so it becomes a bit less random.

becomes a bit less random okay let's Now

train the model so first what I'm going

to do is I'm going to create a pyour

optimization object so here we are using

the optimizer ATM W um now in a make

more series we've only ever use tastic

gradi in descent the simplest possible

Optimizer which you can get using the

SGD instead but I want to use Adam which

is a much more advanced and popular

Optimizer and it works extremely well

for uh typical good setting for the

learning rate is roughly 3 E4 uh but for

very very small networks like is the

case here you can get away with much

much higher learning rates R3 or even

higher probably but let me create the

optimizer object which will basically

take the gradients and uh update the

parameters using the

gradients and then here our batch size

up above was only four so let me

actually use something bigger let's say

32 and then for some number of steps um

we are sampling a new batch of data

we're evaluating the loss uh we're

zeroing out all the gradients from the

previous step getting the gradients for

all the parameters and then using those

gradients to up update our parameters so

typical training loop as we saw in the

make more series so let me now uh run
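A sketch of that training loop, under the same assumptions (an AdamW optimizer, the higher learning rate for this tiny model, and the `model` and `get_batch` helpers from above):

```python
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

batch_size = 32
for steps in range(10000):
    xb, yb = get_batch('train')              # sample a new batch of data
    logits, loss = model(xb, yb)             # evaluate the loss
    optimizer.zero_grad(set_to_none=True)    # clear gradients from the previous step
    loss.backward()                          # compute gradients for all parameters
    optimizer.step()                         # use the gradients to update the parameters

print(loss.item())
```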

So let me now run this for, say, 100 iterations, and let's see what kind of losses we're going to get. We started around 4.7, and now we're getting down to 4.6, 4.5, etc., so the optimization is definitely happening. But let's try to increase the number of iterations and only print at the end, because we probably want to train for longer. Okay, so we're down to 3.6, roughly, and then roughly down to 3; this is the most janky optimization, but okay, it's working. Let's just do 10,000 iterations, and then from here we want to copy the generation code, and hopefully we're going to get something reasonable. Of course it's not going to be Shakespeare from a bigram model, but at least we see that the loss is improving, and hopefully we're expecting something a bit more reasonable. Okay, so we're down to about 2.5-ish; let's see what we get. Okay: a dramatic improvement, certainly, on what we had before. Let me just increase the number of tokens. So we see that we're starting to get something at least vaguely reasonable. It's certainly not Shakespeare, but the model is making progress. So that is the simplest possible model.

So now, what I'd like to do is this: obviously this is a very simple model, because the tokens are not talking to each other. Given the previous context of whatever was generated, we're only looking at the very last character to make the predictions about what comes next. So now these tokens have to start talking to each other and figuring out what is in the context, so that they can make better predictions for what comes next. And this is how we're going to kick off the Transformer.

Okay, so next I took the code that we developed in this Jupyter notebook and I converted it to be a script, and I'm doing this because I just want to simplify our intermediate work into the final product that we have at this point. At the top here I put all the hyperparameters that we defined; I introduced a few new ones, and I'm going to speak to that in a little bit. Otherwise a lot of this should be recognizable: reproducibility, reading the data, getting the encoder and the decoder, creating the train and validation splits, and the kind of data loader that gets a batch of the inputs and targets. This part is new, and I'll talk about it in a second. Then this is the bigram language model that we developed; it can forward and give us logits and loss, and it can generate. Then here we are creating the optimizer, and this is the training loop, so everything here should look pretty familiar.

Now, some of the small things that I added. Number one, I added the ability to run on a GPU if you have it: if you have a GPU, this will use CUDA instead of just the CPU, and everything will be a lot faster. Now, when the device becomes cuda, we need to make sure that when we load the data we move it to the device, and when we create the model we want to move the model parameters to the device. As an example, we have the nn.Embedding table, and it's got a weight inside it which stores the lookup table, so that would be moved to the GPU, so that all the calculations here happen on the GPU and can be a lot faster. And then finally, when I'm creating the context that feeds into generate, I have to make sure that I create it on the device.

Number two, what I introduced is the following: in the training loop here, I was just printing loss.item(), but this is a very noisy measurement of the current loss, because every batch will be more or less lucky. So what I usually want to do is have an estimate_loss function, and estimate_loss basically goes up here and averages up the loss over multiple batches. In particular, we're going to iterate eval_iters times, we're going to get our loss, and then we're going to get the average loss for both splits, so this will be a lot less noisy. So here, when we call estimate_loss, we're going to report pretty accurate train and validation losses.

Now, when we come back up, you'll notice a few things here: I'm setting the model to evaluation phase, and down here I'm resetting it back to training phase. Right now, for our model as it is, this doesn't actually do anything, because the only thing inside this model is the nn.Embedding, and this network would behave the same in both evaluation mode and training mode; we have no dropout layers, we have no batch norm layers, etc. But it is good practice to think through what mode your neural network is in, because some layers will have different behavior at inference time versus training time. There's also this context manager, torch.no_grad, and this is just telling PyTorch that, for everything that happens inside this function, we will not call .backward(). So PyTorch can be a lot more efficient with its memory use, because it doesn't have to store all the intermediate variables, since we're never going to call backward; it can be a lot more memory efficient in that way. So it's also good practice to tell PyTorch when we don't intend to do backpropagation.
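Putting those pieces together, a sketch of what an estimate_loss helper along these lines might look like (eval_iters, the no_grad decorator, and the eval/train mode toggling follow the description above):

```python
eval_iters = 200

@torch.no_grad()  # we never call backward in here, so PyTorch can skip storing intermediates
def estimate_loss():
    out = {}
    model.eval()  # switch to evaluation mode (a no-op for this model, but good practice)
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()  # average over many batches for a less noisy estimate
    model.train()  # switch back to training mode
    return out
```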

So right now this script is about 120 lines of code, and that's kind of our starter code. I'm calling it bigram.py, and I'm going to release it later. Now, running this script gives us output in the terminal, and it looks something like this: as I ran this code, it was giving me the train loss and the val loss, and we see that we converge to somewhere around 2.5 with the bigram model, and then here's the sample that we produced at the end. So we have everything packaged up in the script, and we're in a good position now to iterate on this.

position now to iterate on this okay so

we are almost ready to start writing our

very first self attention block for

processing these uh tokens now before we

actually get there I want to get you

used to a mathematical trick that is

used in the self attention inside a

Transformer and is really just like at

the heart of an an efficient

implementation of self attention and so

I want to work with this toy example to

just get you used to this operation and

then it's going to make it much more

clear once we actually get to um to it

uh in the script

again so let's create a b BYT by C where

BT and C are just 48 and two in the toy

example and these are basically channels

and we have uh batches and we have the

time component and we have information

at each point in the sequence so

see now what we would like to do is we

would like these um tokens so we have up

to eight tokens here in a batch and

these eight tokens are currently not

talking to each other and we would like

them to talk to each other we'd like to

couple them and in particular we don't

we we want to couple them in a very

specific way so the token for example at

the fifth location it should not

communicate with tokens in the sixth

seventh and eighth location

because uh those are future tokens in

the sequence the token on the fifth

location should only talk to the one in

the fourth third second and first so

it's only so information only flows from

previous context to the current time

step and we cannot get any information

from the future because we are about to

try to predict the

So what is the easiest way for tokens to communicate? The easiest way, I would say, is this: if I'm the fifth token and I'd like to communicate with my past, the simplest way we can do that is to just do an average of all the preceding elements. So, for example, if I'm the fifth token, I would like to take the channels that make up the information at my step, but then also the channels from the fourth step, third step, second step and first step; I'd like to average those up, and that would become sort of a feature vector that summarizes me in the context of my history. Now, of course, just doing a sum, or an average, is an extremely weak form of interaction; this communication is extremely lossy, and we've lost a ton of information about the spatial arrangement of all those tokens. But that's okay for now; we'll see how we can bring that information back later. For now, what we would like to do is: for every single batch element independently, for every t-th token in that sequence, we'd like to calculate the average of the vectors of all the previous tokens, and also at this token.

So let's write that out. I have a small snippet here, and instead of just fumbling around, let me just copy-paste it and talk through it. We're going to create xbow, and "bow" is short for bag of words, because bag of words is kind of a term that people use when you are just averaging things up: there's a word stored at every one of these eight locations, and we're doing a bag of words, we're just averaging. In the beginning it's just initialized at zero, and then I'm doing a for loop here, so we're not being efficient yet, that's coming, but for now we're just iterating over all the batch dimensions independently, iterating over time, and then the previous tokens are at this batch dimension, and everything up to and including the t-th token. So when we slice out x in this way, xprev becomes of shape: however many t elements there were in the past, by C, so all the two-dimensional information from these little tokens. That's the previous chunk of tokens from my current sequence, and then I'm just doing the average, the mean over the zeroth dimension, so I'm averaging out the time here, and I'm just going to get a little C, a one-dimensional vector, which I'm going to store in x bag of words.

So I can run this, and this is not going to be very informative on its own, but let's see: this is x[0], the zeroth batch element, and then xbow[0]. You see how, at the first location here, the two are equal, and that's because we're just doing an average of this one token; but this one here is now an average of these two, and this one is an average of these three, and so on. And this last one is the average of all of these elements. So a vertical average, just averaging up all the tokens, now gives this outcome here.

here so this is all well and good uh but

this is very inefficient now the trick

is that we can be very very efficient

about doing this using matrix

multiplication so that's the

mathematical trick and let me show you

what I mean let's work with the toy

example here let me run it and I'll

explain I have a simple Matrix here that

is a 3X3 of all ones a matrix B of just

random numbers and it's a 3x2 and a

matrix C which will be 3x3 multip 3x2

which will give out a 3x2 so here we're

just using um matrix multiplication so a

multiply B gives us

C okay so how are these numbers in C um

achieved right so this number in the top

left is the first row of a dot product

with the First Column of B and since all

the the row of a right now is all just

ones then the do product here with with

this column of B is just going to do a

sum of these of this column so 2 + 6 + 6

is

14 the element here in the output of C

is also the first column here the first

row of a multiplied now with the second

column of B so 7 + 4 + 5 is 16 now you

see that there's repeating elements here

so this 14 again is because this row is

again all ones and it's multiplying the

First Column of B so we get 14 and this

one is and so on so this last number

here is the last row do product last

Now, the trick is the following: this is just a boring array of all ones, but torch has this function called tril, which is short for triangular (lower), and you can apply it to torch.ones, and it will just return the lower triangular portion of it. Okay, so now it will basically zero out these elements up here, and we just get the lower triangular part. Well, what happens if we do that? Now we'll have a like this and b like this, and what are we getting here in c? Well, what is this number? This is the first row times the first column, and because these are zeros, those elements are now ignored, so we just get a two. Then this number here is the first row times the second column, and because these are zeros they get ignored, and it's just the seven; this seven multiplies this one. But look what happened here: because this row is a one and then zeros, what ended up happening is that we're just plucking out this row of b, and that's what we got. Now here we have 1, 1, 0, so [1, 1, 0] dot product with these two columns will give us 2 + 6, which is 8, and 7 + 4, which is 11. And because this last row is 1, 1, 1, we ended up with the addition of all of them. So basically, depending on how many ones and zeros we have here, we are doing a sum of a variable number of these rows, and that gets deposited into c.

So currently we're doing sums, because these are ones, but we can also do averages, right? You can start to see how we could do an average of the rows of b in an incremental fashion, because we can just normalize these rows so that they sum to one, and then we're going to get an average. So if we took a, and then we did a = a / torch.sum(a, 1, keepdim=True), with keepdim as True so that the broadcasting works out, then if I rerun this you see that these rows now sum to one: this row is 1; this row is 0.5, 0.5, 0; and here we get 1/3, 1/3, 1/3. And now, when we do a multiply b, what are we getting? Here we are just getting the first row of b; here we are now getting the average of the first two rows, so 2 and 6 average to 4, and 7 and 4 average to 5.5; and on the bottom here we are now getting the average of all three rows, so the average of all the elements of b is deposited there. So you can see that by manipulating the elements of this multiplying matrix, and then multiplying it with any given matrix, we can do these averages in an incremental fashion, because we just manipulate the elements of a. Okay, so that's very convenient.
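A sketch of that toy example, under the same setup (a 3x3 lower-triangular matrix normalized so its rows sum to one, multiplying a random 3x2 matrix; the seed and the randint range are illustrative):

```python
import torch
torch.manual_seed(42)

# lower-triangular ones, then normalize each row so it sums to 1
a = torch.tril(torch.ones(3, 3))
a = a / torch.sum(a, 1, keepdim=True)

b = torch.randint(0, 10, (3, 2)).float()
c = a @ b   # row t of c is the average of rows 0..t of b

print('a=', a)
print('b=', b)
print('c=', c)
```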

So let's swing back up here and see how we can vectorize this and make it much more efficient using what we've learned. In particular, we are going to produce an array like a, but here I'm going to call it wei, short for weights. So this is our a, and this is how much of every row we want to average up, and it's going to be an average because you can see that these rows sum to one. So this is our a, and then our b in this example is of course x. What's going to happen now is that we are going to have an xbow2, and this xbow2 is going to be wei multiplying x. So let's think this through: wei is T by T, and it is matrix-multiplying, in PyTorch, a B by T by C, and giving us what shape? PyTorch will come here and see that these shapes are not the same, so it will create a batch dimension here; this is a batched matrix multiply, so it will apply this matrix multiplication to all the batch elements in parallel and individually, and then for each batch element there will be a T by T multiplying a T by C, exactly as we had below. So this will create a B by T by C, and xbow2 will become identical to xbow. We can see that torch.allclose of xbow and xbow2 is true, and this kind of convinces us that these are in fact the same. If I just print xbow and xbow2, okay, we're not going to be able to just stare it down; but let me try xbow at the zeroth element and xbow2 at the zeroth element, so just the first batch, and we should see that this and that are identical, which they are.
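A sketch of that vectorized version (version 2), reusing the x tensor and the T dimension from the loop version above:

```python
# version 2: weighted aggregation via matrix multiplication
wei = torch.tril(torch.ones(T, T))
wei = wei / wei.sum(1, keepdim=True)   # each row sums to 1, so we get averages

# (T, T) @ (B, T, C) -> PyTorch broadcasts a batch dimension -> (B, T, C)
xbow2 = wei @ x

print(torch.allclose(xbow, xbow2))     # True: identical to the loop version
```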

Right, so what happened here? The trick is that we were able to use a batched matrix multiply to do this aggregation, really. And it's a weighted aggregation: the weights are specified in this T by T array, and we're basically doing weighted sums, according to the weights inside here, which take on this triangular form. And that means that a token at the t-th position will only get information from the tokens preceding it. So that's exactly what we want.

want and finally I would like to rewrite

it in one more way and we're going to

see why that's useful so this is the

third version and it's also identical to

the first and second but let me talk

through it it uses

softmax so Trill here is this Matrix

lower triangular

ones way begins as all

zero okay so if I just print way in the

beginning it's all zero then I

used masked fill so what this is doing

is we. masked fill it's all zeros and

I'm saying for all the elements where

Trill is equal equal Z make them be

negative Infinity so all the elements

where Trill is zero will become negative

Infinity now so this is what we get and

then the final line here is

softmax so if I take a softmax along

every single so dim is negative one so

along every single row if I do softmax

what is that going to

do well softmax is um is also like a

normalization operation right and so

spoiler alert you get the exact same

Matrix let me bring back to

softmax and recall that in softmax we're

going to exponentiate every single one

of these and then we're going to divide

by the sum and so if we exponentiate

every single element here we're going to

get a one and here we're going to get uh

basically zero 0 z0 Z everywhere else

and then when we normalize we just get

one here we're going to get one one and

then zeros and then softmax will again

divide and this will give us 5.5 and so

on and so this is also the uh the same

way to produce uh this mask now the
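A sketch of that third version, written out with the masking and softmax described above (F refers to torch.nn.functional; x, T, and xbow come from the earlier versions):

```python
import torch.nn.functional as F

# version 3: the same averaging, written with a mask and a softmax
tril = torch.tril(torch.ones(T, T))
wei = torch.zeros((T, T))                        # affinities start at zero
wei = wei.masked_fill(tril == 0, float('-inf'))  # future positions cannot be attended to
wei = F.softmax(wei, dim=-1)                     # exp(-inf) = 0, rows renormalize to averages

xbow3 = wei @ x
print(torch.allclose(xbow, xbow3))               # True: same result as versions 1 and 2
```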

reason that this is a bit more

interesting and the reason we're going

to end up using it in self

attention is that these weights here

begin uh with zero and you can think of

this as like an interaction strength or

like an affinity so basically it's

telling us how much of each uh token

from the past do we want to Aggregate

and average up

and then this line is saying tokens from

the past cannot communicate by setting

them to negative Infinity we're saying

that we will not aggregate anything from

those

tokens and so basically this then goes

through softmax and through the weighted

and this is the aggregation through

matrix

multiplication and so what this is now

is you can think of these as um these

zeros are currently just set by us to be

zero but a quick preview is that these

affinities between the tokens are not

going to be just constant at zero

they're going to be data dependent these

tokens are going to start looking at

each other and some tokens will find

other tokens more or less interesting

and depending on what their values are

they're going to find each other

interesting to different amounts and I'm

going to call those affinities I think

and then here we are saying the future

cannot communicate with the past we're

we're going to clamp them and then when

we normalize and sum we're going to

aggregate uh sort of their values

depending on how interesting they find

each other and so that's the preview for

self attention. And basically, long story short from this entire section: you can do weighted aggregations of your past elements by using matrix multiplication in a lower-triangular fashion, and then the elements in the lower-triangular part are telling you how much of each element fuses into this position.
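As a reference point, here is a minimal sketch of that equivalence, with variable names mirroring the ones used on screen (tril, wei); the shapes are just the toy values from this section:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(1337)
B, T, C = 4, 8, 2          # batch, time, channels
x = torch.randn(B, T, C)

# version 2: weighted aggregation via a normalized lower-triangular matrix
wei = torch.tril(torch.ones(T, T))
wei = wei / wei.sum(1, keepdim=True)
xbow2 = wei @ x            # (T, T) @ (B, T, C) -> (B, T, C)

# version 3: the same thing expressed with masked_fill and softmax
tril = torch.tril(torch.ones(T, T))
wei = torch.zeros((T, T))                        # affinities start at zero
wei = wei.masked_fill(tril == 0, float('-inf'))  # the future cannot communicate
wei = F.softmax(wei, dim=-1)                     # rows become the same triangular averages
xbow3 = wei @ x

print(torch.allclose(xbow2, xbow3))  # True
```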

So we're going to use this trick now to develop the self-attention block. But first, let's get some quick preliminaries out of the way.

First, the thing I'm kind of bothered by is that you see how we're passing in vocab_size into the constructor; there's no need to do that, because vocab_size is already defined up top as a global variable, so there's no need to pass this stuff

around next what I want to do is I don't

want to actually create I want to create

like a level of indirection here where

we don't directly go to the embedding

for the um logits but instead we go

through this intermediate phase because

we're going to start making that bigger

so let me introduce a new variable n

embed it shorted for number of embedding

Dimensions so

nbed here will be say 32 that was a

suggestion from GitHub co-pilot by the

way um it also suest 32 which is a good

number so this is an embedding table and

only 32 dimensional

embeddings so then here this is not

going to give us logits directly instead

this is going to give us token

embeddings that's I'm going to call it

and then to go from the token Tings to

the logits we're going to need a linear

layer so self. LM head let's call it

short for language modeling head is n

and linear from n ined up to vocap size

and then when we swing over here we're

actually going to get the loits by

exactly what the co-pilot says now we

have to be careful here because this C

and this C are not equal um this is nmed

C and this is vocap size so let's just

say that n ined is equal to

C and then this just creates one spous

layer of interaction through a linear

layer but uh this should basically

run so we see that this runs and uh this

currently looks kind of spous but uh

we're going to build on top of this now

next up so far we've taken these indices

and we've encoded them based on the

identity of the uh tokens in inside idx

the next thing that people very often do

is that we're not just encoding the

identity of these tokens but also their

position so we're going to have a second

position uh embedding table here so

self. position embedding table is an an

embedding of block size by an embed and

so each position from zero to block size

minus one will also get its own

embedding vector. And then here, first let me decode B and T from idx.shape, and then here we're also going to have pos_emb, which is the positional embedding, and this is torch.arange, so this will be basically just integers from 0 to T minus 1, and all of those integers from 0 to T minus 1 get embedded through the table to create a T by C. And then here this gets renamed to

just say x and x will be the addition of

the token embeddings with the positional

embeddings and here the broadcasting

note will work out so B by T by C plus T

by C

this gets right aligned a new dimension

of one gets added and it gets

broadcasted across

batch so at this point x holds not just

the token identities but the positions

at which these tokens occur and this is

currently not that useful because of

course we just have a simple bigram model

so it doesn't matter if you're in the

fifth position the second position or

wherever it's all translation invariant

at this stage uh so this information

currently wouldn't help uh but as we

work on the self attention block we'll

see that this starts to matter
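As a rough sketch of where the model stands at this point (token embeddings plus positional embeddings feeding a linear language-modeling head; the loss computation and generation are omitted, and the concrete numbers are just the illustrative values mentioned so far):

```python
import torch
import torch.nn as nn

vocab_size, n_embd, block_size = 65, 32, 8  # illustrative; vocab_size comes from the dataset

class BigramLanguageModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.lm_head = nn.Linear(n_embd, vocab_size)  # language-modeling head

    def forward(self, idx):
        B, T = idx.shape
        tok_emb = self.token_embedding_table(idx)                                     # (B, T, C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=idx.device))   # (T, C)
        x = tok_emb + pos_emb          # broadcast: (B, T, C) + (T, C)
        logits = self.lm_head(x)       # (B, T, vocab_size)
        return logits
```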

okay so now we get the Crux of self

attention so this is probably the most

important part of this video to

understand we're going to implement a

small self attention for a single

individual head as they're called so we

start off with where we were so all of

this code is familiar so right now I'm

working with an example where I changed the number of channels from 2 to 32. So we have a 4 by 8 arrangement of tokens, and the information in each token is currently 32-dimensional, but we just

are working with random

numbers now we saw here that the code as

we had it before does a uh simple weight

simple average of all the past tokens

and the current token so it's just the

previous information and current

information is just being mixed together

in an average, and that's what this code currently achieves. It does so by creating this lower-triangular structure, which allows us to mask out this wei matrix that we create; so we mask it out and then we normalize it, and

currently when we initialize the

affinities between all the different

sort of tokens or nodes I'm going to use

those terms

interchangeably so when we initialize

the affinities between all the different

tokens to be zero, then we see that wei

gives us this um structure where every

single row has these um uniform numbers

and so that's what that's what then uh

in this Matrix multiply makes it so that

we're doing a simple

average now we don't actually want this

to be all uniform because different uh

tokens will find different other tokens

more or less interesting and we want

that to be data dependent so for example

if I'm a vowel then maybe I'm looking

for consonants in my past and maybe I

want to know what those consonants are

and I want that information to flow to

me and so I want to now gather

information from the past but I want to

do it in the data dependent way and this

is the problem that self attention

solves now the way self attention solves

this is the following every single node

or every single token at each position

will emit two vectors it will emit a

query and it will emit a

key now the query Vector roughly

speaking is what am I looking for and

the key Vector roughly speaking is what

do I

contain and then the way we get

affinities between these uh tokens now

in a sequence is we basically just do a dot product between the keys and the queries. So my query dot-products with all the keys of all the other tokens, and that dot product now becomes wei. And so if the key and the query

are sort of aligned they will interact

to a very high amount and then I will

get to learn more about that specific

token as opposed to any other token in

the sequence

so let's implement this

now we're going to implement a

single what's called head of self

attention so this is just one head

there's a hyper parameter involved with

these heads which is the head size and

then here I'm initializing linear

modules and I'm using bias equals false

so these are just going to apply a

matrix multiply with some fixed

weights. And now let me produce a key and a query, k and q, by forwarding these modules on x. So the size of this will now become B by T by 16, because that is the head size, and the same here: B by T by 16, this being the head size. So you see here that when I forward this linear on top of my x, all the tokens in all the positions in the B by T arrangement, all of them in parallel and

independently produce a key and a query

so no communication has happened

yet. But the communication comes now: all the queries will dot-product with all the keys. So basically what we want is we want wei now, or the affinities between these, to be query multiplying key. But we have to be careful: we can't matrix-multiply this directly, we actually need to transpose k, and we have to be careful because these have the batch dimension. So in particular we want to transpose the last two dimensions, dimension -2 and dimension -1, so transpose(-2, -1). And so this matrix multiply now will basically do the following: B by T by 16 matrix-multiplies B by 16 by T to give us B by T by T, right,

so for every row of B we're now going to

have a t Square Matrix giving us the

affinities, and these are now the wei, so

they're not zeros they are now coming

from this dot product between the keys

and the queries so this can now run I

can I can run this and the weighted

aggregation now is a function, in a data-dependent manner, of the keys and

queries of these nodes so just

inspecting what happened

here, wei takes on this form, and you see that before, wei was just a constant, so it was applied in the same way to all the batch elements. But now every single batch element will have a different wei, because every single batch element contains different tokens at different positions, and so this is now data dependent. So when we

look at just the zeroth uh Row for

example in the input these are the

weights that came out and so you can see

now that they're not just exactly

uniform um and in particular as an

example here for the last row this was

the eighth token and the eighth token

knows what content it has and it knows

at what position it's in, and now the eighth token, based on that, creates a query: hey, I'm looking for this kind of stuff. I'm a vowel, I'm at the eighth position, I'm looking for any consonants at positions

up to four and then all the nodes get to

emit keys and maybe one of the channels

could be I am a I am a consonant and I

am in a position up to four and that

that key would have a high number in

that specific Channel and that's how the

query and the key, when they dot-product,

they can find each other and create a

high affinity and when they have a high

Affinity like say uh this token was

pretty interesting to uh to this eighth

token when they have a high Affinity

then through the softmax I will end up

aggregating a lot of its information

into my position and so I'll get to

learn a lot about

it. Now, we're just looking at wei

after this has already happened um let

me erase this operation as well so let

me erase the masking and the softmax

just to show you the under the hood

internals and how that works so without

the masking and the softmax, wei comes

out like this right this is the outputs

of the do products um and these are the

raw outputs and they take on values from

negative you know two to positive two

Etc so that's the raw interactions and

raw affinities between all the nodes but

now if I'm going if I'm a fifth node I

will not want to aggregate anything from

the sixth node seventh node and the

eighth node so actually we use the upper

triangular masking so those are not

allowed to

communicate and now we actually want to

have a nice uh distribution uh so we

don't want to aggregate negative .11 of

this node that's crazy so instead we

exponentiate and normalize and now we

get a nice distribution that sums to one

and this is telling us now in the data

dependent manner how much of information

to aggregate from any of these tokens in

the

past. So that's wei, and it's not zeros

anymore but but it's calculated in this

way now there's one more uh part to a

single self attention head and that is

that when we do the aggregation we don't

actually aggregate the tokens exactly we

aggregate we produce one more value here

and we call that the

value so in the same way that we

produced p and query we're also going to

create a value

and

then here we don't

aggregate X we calculate a v which is

just achieved by uh propagating this

linear on top of X again and then we

output way multiplied by V so V is the

elements that we aggregate or the the

vectors that we aggregate instead of the

raw

X and now of course uh this will make it

so that the output here of this single

head will be 16 dimensional because that

is the head

size so you can think of X as kind of

like private information to this token

if you if you think about it that way so

X is kind of private to this token so

I'm the fifth token at some position, and I have

some identity and uh my information is

kept in Vector X and now for the

purposes of the single head here's what

I'm interested in here's what I have and

if you find me interesting here's what I

will communicate to you and that's

stored in v and so V is the thing that

gets aggregated for the purposes of this

single head between the different

notes and that's uh basically the self

attention mechanism this is this is what

it does there are a few notes that I

would make like to make about attention

number one attention is a communication

mechanism you can really think about it

as a communication mechanism where you

have a number of nodes in a directed

graph where basically you have edges

pointed between nodes, like

this and what happens is every node has

some Vector of information and it gets

to aggregate information via a weighted

sum from all of the nodes that point to

it and this is done in a data dependent

manner, so depending on whatever data is actually stored at each node at any point in time. Now, our graph doesn't

look like this our graph has a different

structure we have eight nodes because

the block size is eight and there's

always eight

tokens and uh the first node is only

pointed to by itself the second node is

pointed to by the first node and itself

all the way up to the eighth node which

is pointed to by all the previous nodes

and itself and so that's the structure

that our directed graph has or happens

happens to have in Auto regressive sort

of scenario like language modeling but

in principle attention can be applied to

any arbitrary directed graph and it's

just a communication mechanism between

the nodes the second note is that notice

that there is no notion of space so

attention simply acts over like a set of

vectors in this graph and so by default

these nodes have no idea where they are

positioned in the space and that's why

we need to encode them positionally and

sort of give them some information that

is anchored to a specific position so

that they sort of know where they are

and this is different than for example

from convolution because if you're run

for example a convolution operation over

some input there's a very specific sort

of layout of the information in space

and the convolutional filters sort of

act in space, and so it's not like attention; attention is just a set

of vectors out there in space they

communicate and if you want them to have

a notion of space you need to

specifically add it which is what we've

done when we calculated the um relative

the positional encode encodings and

added that information to the vectors

the next thing that I hope is very clear

is that the elements across the batch

Dimension which are independent examples

never talk to each other they're always

processed independently and this is a

batched matrix multiply that applies

basically a matrix multiplication uh

kind of in parallel across the batch

dimension so maybe it would be more

accurate to say that in this analogy of

a directed graph we really have because

the back size is four we really have

four separate pools of eight nodes and

those eight nodes only talk to each

other but in total there's like 32 nodes

that are being processed uh but there's

um sort of four separate pools of eight

you can look at it that way the next

note is that here in the case of

language modeling uh we have this

specific uh structure of directed graph

where the future tokens will not

communicate to the Past tokens but this

doesn't necessarily have to be the

constraint in the general case and in

fact in many cases you may want to have

all of the nodes talk to each other

fully so as an example if you're doing

sentiment analysis or something like

that with a Transformer you might have a

number of tokens and you may want to

have them all talk to each other fully

because later you are predicting for

example the sentiment of the sentence

and so it's okay for these nodes to talk

to each other and so in those cases you

will use an encoder block of self

attention and uh all it means that it's

an encoder block is that you will delete

this line of code, allowing all the nodes

to completely talk to each other what

we're implementing here is sometimes

called a decoder block and it's called a

decoder because it is sort of like a

decoding language and it's got this

autor regressive format where you have

to mask with the Triangular Matrix so

that uh nodes from the future never talk

to the Past because they would give away

the answer

and so basically in encoder blocks you

would delete this, allowing all the nodes to

talk in decoder blocks this will always

be present so that you have this

triangular structure uh but both are

allowed and attention doesn't care

attention supports arbitrary

connectivity between nodes the next

thing I wanted to comment on is you keep

me you keep hearing me say attention

self attention Etc there's actually also

something called cross attention what is

the

difference

so basically the reason this attention

is self attention is because because the

keys queries and the values are all

coming from the same Source from X so

the same Source X produces Keys queries

and values so these nodes are self

attending but in principle attention is

much more General than that so for

example an encoder decoder Transformers

uh you can have a case where the queries

are produced from X but the keys and the

values come from a whole separate

external source and sometimes from uh

encoder blocks that encode some context

that we'd like to condition on

and so the keys and the values will

actually come from a whole separate

Source those are nodes on the side and

here we're just producing queries and

we're reading off information from the

side so cross attention is used when

there's a separate source of nodes we'd

like to pull information from into our

nodes and it's self attention if we just

have nodes that would like to look at

each other and talk to each other so

this attention here happens to be self

attention but in principle um attention

is a lot more General okay and the last

note at this stage is if we come to the

attention is all need paper here we've

already implemented attention so given

query, key, and value, we've multiplied the query and the key, we've softmaxed it,

and then we are aggregating the values

there's one more thing that we're

missing here which is the dividing by

one / square root of the head size the

DK here is the head size why are they

doing this, and why is this important? So they

call it the scaled attention and it's

kind of like an important normalization

to basically

have. The problem is, if you have unit-Gaussian inputs, so zero mean, unit variance, K and Q are unit Gaussian, then if you just compute wei naively, you see that the variance of wei will actually be on the order of head size, which in our case is 16. But if you multiply by one over the square root of head size, then the variance of wei will be one,

so it will be

preserved now why is this important

you'll notice that wei here will feed into softmax, and so it's really important, especially at initialization, that wei be fairly diffuse. So in our case here we sort of lucked out and we had fairly diffuse numbers here, so like

this. Now the problem is that, because of softmax, if wei takes on very positive

and very negative numbers inside it

softmax will actually converge towards

one hot vectors and so I can illustrate

that here um say we are applying softmax

to a tensor of values that are very

close to zero then we're going to get a

diffuse thing out of

softmax but the moment I take the exact

same thing and I start sharpening it

making it bigger by multiplying these

numbers by eight for example you'll see

that the softmax will start to sharpen

and in fact it will sharpen towards the

max so it will sharpen towards whatever

number here is the highest and so um

basically we don't want these values to

be too extreme especially at

initialization otherwise softmax will be

way too peaky and um you're basically

aggregating um information from like a

single node every node just agregates

information from a single other node; that's not what we want, especially at initialization. And so the scaling is used just to control the variance at initialization.

initialization okay so having said all

that let's now take our self attention

knowledge and let's uh take it for a

spin so here in the code I created this

head module and it implements a single

head of self attention so you give it a

head size and then here it creates the

key query and the value linear layers

typically people don't use biases in

these uh so those are the linear

projections that we're going to apply to

all of our nodes now here I'm creating

this Trill variable Trill is not a

parameter of the module so in sort of

pytorch naming conventions uh this is

called a buffer it's not a parameter and

you have to call it you have to assign

it to the module using a register buffer

so that creates the trill uh the triang

lower triangular Matrix and we're given

the input X this should look very

familiar now we calculate the keys the

queries we C calculate the attention

scores inside way uh we normalize it so

we're using scaled attention here then

we make sure that uh future doesn't

communicate with the past so this makes

it a decoder block and then softmax and

then aggregate the values and produce the output.
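That Head module looks roughly like this (a sketch reusing the torch / nn / F imports and the n_embd, block_size globals from the earlier snippets):

```python
class Head(nn.Module):
    """One head of self-attention (sketch)."""

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        # tril is not a parameter, so it is registered as a buffer
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        k = self.key(x)    # (B, T, head_size)
        q = self.query(x)  # (B, T, head_size)
        # scaled attention scores ("affinities")
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5           # (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))  # decoder mask
        wei = F.softmax(wei, dim=-1)
        v = self.value(x)   # (B, T, head_size)
        return wei @ v      # (B, T, head_size)
```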

Then here, in the language model, I'm creating a Head in the constructor, and I'm calling it the self-attention head, and the head size I'm going to keep the same as n_embd just for

now and then here once we've encoded the

information with the token embeddings

and the position embeddings we're simply

going to feed it into the self attention

head and then the output of that is

going to go into uh the decoder language

modeling head and create the logits so

this the sort of the simplest way to

plug in a self attention component uh

into our Network right now I had to make

one more change which is that here in

the generate uh we have to make sure

that our idx that we feed into the model

because now we're using positional

embeddings we can never have more than

block size coming in because if idx is

more than block size then our position

embedding table is going to run out of

scope because it only has embeddings for

up to block size and so therefore I

added some uh code here to crop the

context that we're going to feed into

self um so that uh we never pass in more

than block_size elements.
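That cropping lives inside generate and looks roughly like this (a sketch consistent with the simplified forward above that returns only logits; the real version also carries the loss):

```python
def generate(self, idx, max_new_tokens):
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -block_size:]           # crop to the last block_size tokens
        logits = self(idx_cond)                   # positions now always fit the embedding table
        logits = logits[:, -1, :]                 # focus on the last time step
        probs = F.softmax(logits, dim=-1)
        idx_next = torch.multinomial(probs, num_samples=1)
        idx = torch.cat((idx, idx_next), dim=1)   # append and continue
    return idx
```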

so those are the changes and let's Now

train the network okay so I also came up

to the script here and I decreased the

learning rate because uh the self

attention can't tolerate very very high

learning rates and then I also increased

number of iterations because the

learning rate is lower and then I

trained it and previously we were only

able to get to up to 2.5 and now we are

down to 2.4 so we definitely see a

little bit of an improvement from 2.5 to

2.4 roughly uh but the text is still not

amazing so clearly the self attention

head is doing some useful communication

but um we still have a long way to go

okay, so now we've implemented scaled dot-product attention. Now, next up, in the

attention is all you need paper there's

something called multi-head attention

and what is multi-head attention it's

just applying multiple attentions in

parallel and concatenating their results

so they have a little bit of diagram

here I don't know if this is super clear

it's really just multiple attentions in

parallel so let's Implement that fairly

straightforward

if we want a multi-head attention then

we want multiple heads of self attention

running in parallel so in pytorch we can

do this by simply creating multiple

heads so however heads how however many

heads you want and then what is the head

size of each and then we run all of them

in parallel into a list and simply

concatenate all of the outputs and we're

concatenating over the channel

Dimension so the way this looks now is

we don't have just a single ATT

that uh has a hit size of 32 because

remember n Ed is

32 instead of having one Communication

channel we now have four communication

channels in parallel and each one of

these communication channels typically

will be uh smaller uh correspondingly so

because we have four communication

channels we want eight dimensional self

attention and so from each Communication

channel we're going to together eight

dimensional vectors and then we have

four of them and that concatenates to

give us 32 which is the original and

embed and so this is kind of similar to

um if you're familiar with convolutions

this is kind of like a group convolution

uh because basically instead of having

one large convolution we do convolution

in groups, and that's multi-headed self-attention.
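A sketch of that multi-head wrapper: it just runs several Head modules and concatenates along the channel dimension (the projection layer that comes up later with the residual connections is not shown yet):

```python
class MultiHeadAttention(nn.Module):
    """Multiple heads of self-attention in parallel (sketch)."""

    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])

    def forward(self, x):
        # each head returns (B, T, head_size); concatenate over the channel dimension
        return torch.cat([h(x) for h in self.heads], dim=-1)

# e.g. four 8-dimensional communication channels, 4 * 8 = 32 = n_embd
sa_heads = MultiHeadAttention(4, n_embd // 4)
```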

So then here we just use sa_heads, the self-attention heads, instead.

now I actually ran it and uh scrolling

down I ran the same thing and then we

now get this down to 2.28 roughly and

the output is still the generation is

still not amazing but clearly the

validation loss is improving because we

were at 2.4 just now and so it helps to

have multiple communication channels

because obviously these tokens have a

lot to talk about they want to find the

consonants the vowels they want to find

the vowels just from certain positions

uh they want to find any kinds of

different things and so it helps to

create multiple independent channels of

communication gather lots of different

types of data and then uh decode the

output now going back to the paper for a

second of course I didn't explain this

figure in full detail but we are

starting to see some components of what

we've already implemented we have the

positional encodings the token encodings

that add we have the masked multi-headed

attention implemented now here's another

multi-headed attention which is a cross

attention to an encoder which we haven't

we're not going to implement in this

case I'm going to come back to that

later but I want you to notice that

there's a feed forward part here and

then this is grouped into a block that

gets repeat it again and again now the

feedforward part here is just a simple

uh multi-layer perceptron

um so the multi-headed so here position

wise feed forward networks is just a

simple little MLP so I want to start

basically in a similar fashion also

adding computation into the network and

this computation is on a per node level

so I've already implemented it and you

can see the diff highlighted on the left

here when I've added or changed things

now before we had the self multi-headed

self attention that did the

communication but we went way too fast

to calculate the logits so the tokens

looked at each other but didn't really

have a lot of time to think on what they

found from the other tokens and so what

I've implemented here is a little feed-forward single layer, and this little layer is just a linear followed by a ReLU nonlinearity, and that's it. So it's just a little layer, and then I call it FeedForward, with n_embd,

and then this feed forward is just

called sequentially right after the self

attention so we self attend then we feed

forward and you'll notice that the feet

forward here when it's applying linear

this is on a per token level all the

tokens do this independently so the self

attention is the communication and then

once they've gathered all the data now

they need to think on that data

individually, and so that's what feed-forward is doing, and that's why I've added it here.
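The feed-forward piece at this point is just this (a sketch; the 4x inner expansion and the projection back into the residual pathway come a bit later):

```python
class FeedForward(nn.Module):
    """A simple per-token linear layer followed by a ReLU (sketch)."""

    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, n_embd),
            nn.ReLU(),
        )

    def forward(self, x):
        # applied to every token independently: (B, T, C) -> (B, T, C)
        return self.net(x)
```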

Now when I train this, the validation loss actually continues to go down, now to 2.24, which is down from 2.28. The outputs still look kind of

terrible but at least we've improved the

situation and so as a preview we're

going to now start to intersperse the

communication with the computation and

that's also what the Transformer does

when it has blocks that communicate and

then compute and it groups them and

replicates them okay so let me show you

what we'd like to do we'd like to do

something like this we have a block and

this block is is basically this part

here except for the cross

attention now the block basically

intersperses communication and then computation: the communication is done using multi-headed self-attention, and then the computation is done using a feed-forward

Network on all the tokens

independently now what I've added here

also is you'll

notice this takes the number of

embeddings in the embedding Dimension

and number of heads that we would like

which is kind of like group size in

group convolution and and I'm saying

that number of heads we'd like is four

and so because this is 32 we calculate

that because this is 32 the number of

heads should be four um the head size

should be eight so that everything sort

of works out Channel wise um so this is

how the Transformer structures uh sort

of the uh the sizes typically so the

head size will become eight and then

this is how we want to intersperse them

and then here I'm trying to create

blocks which is just a sequential

application of block block block so that

we're interspersing communication feed

forward many many times and then finally

we decode now I actually tried to run

this and the problem is this doesn't

actually give a very good uh answer and

very good result and the reason for that

is we're start starting to actually get

like a pretty deep neural net and deep

neural Nets uh suffer from optimization

issues and I think that's what we're

kind of like slightly starting to run

into so we need one more idea that we

can borrow from the um Transformer paper

to resolve those difficulties now there

are two optimizations that dramatically

help with the depth of these networks

and make sure that the networks remain

optimizable let's talk about the first

one the first one in this diagram is you

see this Arrow here and then this arrow

and this Arrow those are skip

connections or sometimes called residual

connections they come from this paper uh

the Deep Residual Learning for Image Recognition paper from about 2015 that introduced the concept. Now,

these are basically what it means is you

transform data but then you have a skip

connection with addition from the

previous features now the way I like to

visualize it uh that I prefer is the

following here the computation happens

from the top to bottom and basically you

have this uh residual pathway and you

are free to Fork off from the residual

pathway perform some computation and

then project back to the residual

pathway via addition and so you go from

the the uh inputs to the targets only

via plus and plus plus and the reason

this is useful is because during back

propagation, remember from our micrograd video earlier, addition distributes

gradients equally to both of its

branches that that fed as the input and

so the supervision or the gradients from

the loss basically hop through every

addition node all the way to the input

and then also Fork off into the residual

blocks but basically you have this

gradient Super Highway that goes

directly from the supervision all the

way to the input unimpeded and then

these residual blocks are usually

initialized in the beginning so they

contribute very very little if anything

to the residual pathway they they are

initialized that way so in the beginning

they are sort of almost kind of like not

there but then during the optimization

they come online over time and they uh

start to contribute but at least at the

initialization you can go from directly

supervision to the input gradient is

unimpeded and just flows and then the

blocks over time

kick in and so that dramatically helps

with the optimization so let's implement

this so coming back to our block here

basically what we want to do is we want

to do x = x + self-attention(x) and x = x + self.feed_forward(x). So this is x, and then we fork

off and do some communication and come

back and we Fork off and we do some

computation and come back so those are

residual connections and then swinging

back up here, we also have to introduce this projection, so nn.Linear, and this is going to be from, after we concatenate, n_embd. So this is the output of the self-attention itself, but then we actually want to apply the projection, and that's the result. So the projection is just a

linear transformation of the outcome of

this

layer. So that's the projection back into the residual pathway. And then here in the feed-forward it's going to be the same thing; I could have a self.proj projection here as well, but let me just simplify it and couple it inside the same sequential container, and

so this is the projection layer going

back into the residual

pathway and

so, that's it. Now we can train this.
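The Block with the residual connections looks roughly like this (a sketch, before the layer norms are added; it assumes MultiHeadAttention and FeedForward now end with the projection layers just described):

```python
class Block(nn.Module):
    """Transformer block: communication then computation, on a residual pathway (sketch)."""

    def __init__(self, n_embd, n_head):
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedForward(n_embd)

    def forward(self, x):
        x = x + self.sa(x)    # fork off, communicate, add back to the residual pathway
        x = x + self.ffwd(x)  # fork off, compute per token, add back
        return x
```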

I implemented one more small change: when you look into the paper again, you see that the

dimensionality of input and output is

512 for them and they're saying that the

inner layer here in the feed-forward has a dimensionality of 2048, so there's a

multiplier of four and so the inner

layer of the feet forward Network should

be multiplied by four in terms of

Channel sizes so I came here and I

multiplied 4 * n_embd here for the feed-forward, and then from 4 * n_embd coming back down to n_embd when we go back to the projection, so adding a bit of computation here and growing that layer that is in the residual block, on the side of the residual pathway.
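So FeedForward becomes roughly this (a sketch with the 4x inner expansion and the projection back down to n_embd):

```python
class FeedForward(nn.Module):
    """Per-token MLP: expand 4x, nonlinearity, project back (sketch)."""

    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),  # projection back into the residual pathway
        )

    def forward(self, x):
        return self.net(x)
```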

And then I trained this, and we actually get down all the way to 2.08 validation loss, and we also see that the

network is starting to get big enough

that our train loss is getting ahead of

validation loss so we're starting to see

like a little bit of

overfitting and um our our

um uh Generations here are still not

amazing but at least you see that we can

see like is here this now grief syn like

this starts to almost look like English

so um yeah we're starting to really get

there okay and the second Innovation

that is very helpful for optimizing very

deep neural networks is right here so we

have this addition now that's the

residual part but this Norm is referring

to something called layer Norm so layer

Norm is implemented in pytorch it's a

paper that came out a while back here

and layer norm is very, very similar to batch norm. So remember back to our makemore series part three, we implemented batch normalization, and batch normalization basically just made sure that, across the batch dimension, any individual neuron had a unit Gaussian distribution, so it was zero mean and one standard deviation output. So what I did here is I'm copy-pasting the BatchNorm1d that we developed in our makemore series, and see here, we can initialize this module, and we can have a batch of 32 hundred-dimensional vectors feeding through the batchnorm layer. So what this does is it

guarantees that when we look at just the

zeroth column it's a zero mean one

standard deviation so it's normalizing

every single column of this uh input now

the rows are not uh going to be

normalized by default because we're just

normalizing columns so let's now

Implement layer Norm uh it's very

complicated look we come here we change

this from zero to one so we don't

normalize The Columns we normalize the

rows and now we've implemented layer

Norm
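A minimal sketch of that change, i.e. the makemore-style BatchNorm1d with the normalization dimension flipped from 0 (columns) to 1 (rows) and the running buffers removed:

```python
import torch

class LayerNorm1d:
    def __init__(self, dim, eps=1e-5):
        self.eps = eps
        self.gamma = torch.ones(dim)   # trainable scale
        self.beta = torch.zeros(dim)   # trainable shift

    def __call__(self, x):
        xmean = x.mean(1, keepdim=True)   # mean over the features of each example
        xvar = x.var(1, keepdim=True)     # variance over the features of each example
        xhat = (x - xmean) / torch.sqrt(xvar + self.eps)
        self.out = self.gamma * xhat + self.beta
        return self.out

    def parameters(self):
        return [self.gamma, self.beta]

module = LayerNorm1d(100)
x = torch.randn(32, 100)          # batch of 32 hundred-dimensional vectors
x = module(x)
print(x[0].mean(), x[0].std())    # each row comes out ~zero mean, ~unit std
```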

so now the columns are not going to be

normalized um but the rows are going to

be normalized for every individual

example it's 100 dimensional Vector is

normalized uh in this way and because

our computation Now does not span across

examples we can delete all of this

buffers stuff uh because uh we can

always apply this operation and don't

need to maintain any running buffers so

we don't need the

buffers uh we

don't There's no distinction between

training and test

time uh and we don't need these running

buffers we do keep gamma and beta we

don't need the momentum we don't care if

it's training or not and this is now a

layer

norm and it normalizes the rows instead

of the columns and this here is

identical to basically this here so

let's now Implement layer Norm in our

Transformer before I incorporate the

layer Norm I just wanted to note that as

I said very few details about the

Transformer have changed in the last 5

years but this is actually something

that slightly departs from the original

paper you see that the ADD and Norm is

applied after the

transformation but um in now it is a bit

more uh basically common to apply the

layer Norm before the transformation so

there's a reshuffling of the layer Norms

so this is called the pre-norm formulation, and that's the one that we're going to implement as well, a slight deviation from the original paper. Basically, we need two layer norms: layer norm one is nn.LayerNorm, and we tell it what the embedding dimension is, and we need the second layer norm. And then here the

layer Norms are applied immediately on X

so self. layer Norm one applied on X and

self. layer Norm two applied on X before

it goes into self attention and feed

forward. And the size of the layer norm here is n_embd, so 32. So when the layer norm is normalizing our features, the mean and the variance are taken over 32 numbers, so the batch and the time both act as batch dimensions. So this is kind of like a per-token transformation that just normalizes the features and makes them unit Gaussian, zero mean and unit standard deviation, at

initialization but of course because

these layer Norms inside it have these

gamma and beta training

parameters uh the layer Norm will U

eventually create outputs that might not

be unit Gaussian, but the optimization will determine that. So for now, this is incorporating the layer norms, and let's train it.
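With the pre-norm formulation, the Block becomes roughly this (a sketch):

```python
class Block(nn.Module):
    """Transformer block with pre-norm: layer norm is applied before each sub-layer (sketch)."""

    def __init__(self, n_embd, n_head):
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedForward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))     # norm, communicate, add back
        x = x + self.ffwd(self.ln2(x))   # norm, compute, add back
        return x
```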

Okay, so I let it run, and we see that we get down to 2.06, which is better than the previous 2.08,

so a slight Improvement by adding the

layer norms and I'd expect that they

help uh even more if we had bigger and

deeper Network one more thing I forgot

to add is that there should be a layer

Norm here also typically as at the end

of the Transformer and right before the

final uh linear layer that decodes into

vocabulary so I added that as well so at

this stage we actually have a pretty

complete uh Transformer according to the

original paper and it's a decoder only

Transformer I'll I'll talk about that in

a second uh but at this stage uh the

major pieces are in place so we can try

to scale this up and see how well we can

push this number now in order to scale

out the model I had to perform some

cosmetic changes here to make it nicer

so I introduced this variable called n

layer which just specifies how many

layers of the blocks we're going to have

I created a bunch of blocks and we have

a new variable number of heads as well I

pulled out the layer Norm here and uh so

this is identical now one thing that I

did briefly change is I added a Dropout

so Dropout is something that you can add

right before the residual connection

back right before the connection back

into the residual pathway. So we can add dropout as the last layer here, we can drop out here at the end of the multi-headed attention as well, and we can also drop out here when we calculate the

basically affinities and after the

softmax we can drop out some of those so

we can randomly prevent some of the

nodes from

communicating and so Dropout uh comes

from this paper from 2014 or so and

basically it takes your neural

net, and it randomly, every forward

backward pass shuts off some subset of

uh neurons so randomly drops them to

zero and trains without them and what

this does effectively is because the

mask of what's being dropped out is

changed every single forward backward

pass it ends up kind of uh training an

ensemble of sub networks and then at

test time everything is fully enabled

and kind of all of those sub networks

are merged into a single Ensemble if you

can if you want to think about it that

way so I would read the paper to get the

full detail for now we're just going to

stay on the level of this is a

regularization technique and I added it

because I'm about to scale up the model

quite a bit and I was concerned about

overfitting so now when we scroll up to

the top uh we'll see that I changed a

number of hyper parameters here about

our neural net. So I made the batch size

be much larger now it's 64 I changed the

block size to be 256 so previously it

was just eight eight characters of

context now it is 256 characters of

context to predict the 257th

uh I brought down the learning rate a

little bit because the neural net is now

much bigger so I brought down the

learning rate the embedding Dimension is

now 384 and there are six heads so 384

divide 6 means that every head is 64

dimensional as it as a standard and then

there's going to be six layers of that

and the dropout will be at 0.2, so every

forward backward pass 20% of all of

these um intermediate calculations are

disabled and dropped to zero
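Collected in one place, the scaled-up hyperparameters described here look roughly like this (a sketch; the exact learning-rate value is an assumption, since only "brought it down a little" is stated):

```python
# hyperparameters for the scaled-up run
batch_size = 64       # independent sequences processed in parallel
block_size = 256      # context length: 256 characters to predict the 257th
n_embd = 384
n_head = 6            # 384 / 6 = 64 dimensions per head
n_layer = 6
dropout = 0.2         # 20% of intermediate activations dropped each forward/backward pass
learning_rate = 3e-4  # assumption: a "lowered" rate; the exact value isn't given here
```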

and then I already trained this and I

ran it so uh drum roll how well does it

perform so let me just scroll up

here we get a validation loss of

1.48 which is actually quite a bit of an

improvement on what we had before which

I think was 2.07 so it went from 2.07

all the way down to 1.48 just by scaling

up this neural net with the code that we

have and this of course ran for a lot

longer this maybe trained for I want to

say, about 15 minutes on my A100 GPU, so that's a pretty good GPU. And if you don't have a GPU, you're not going to be able to reproduce this; I would not run this on a CPU or a MacBook or something like that. You'll have to bring down the number of layers and the embedding dimension and

so on uh but in about 15 minutes we can

get this kind of a result and um I'm

printing some of the Shakespeare here

but what I did also is I printed 10,000

characters so a lot more and I wrote

them to a file and so here we see some

of the outputs

so it's a lot more recognizable as the

input text file so the input text file

just for reference looked like this so

there's always like someone speaking in

this manner and uh our predictions now

take on that form except of course

they're they're nonsensical when you

actually read them

so it is every crimp tap be a house oh

those

prepation we give

heed um you know

Oho sent me you mighty

Lord anyway so you can read through this

um it's nonsensical of course but this

is just a Transformer trained on a

character level for 1 million characters

that come from Shakespeare so there's

sort of like blabbers on in Shakespeare

like manner but it doesn't of course

make sense at this scale uh but I think

I think still a pretty good

demonstration of what's

possible so now

I think uh that kind of like concludes

the programming section of this video we

basically kind of uh did a pretty good

job and um of implementing this

Transformer uh but the picture doesn't

exactly match up to what we've done so

what's going on with all these additional parts here? So let me finish explaining

this architecture and why it looks so

funky basically what's happening here is

what we implemented here is a decoder

only Transformer so there's no component

here this part is called the encoder and

there's no cross attention block here

our block only has a self attention and

the feet forward so it is missing this

third in between piece here this piece

does cross attention so we don't have it

and we don't have the encoder we just

have the decoder and the reason we have

a decoder only uh is because we are just

uh generating text and it's

unconditioned on anything we're just

we're just blabbering on according to a

given data set what makes it a decoder

is that we are using the Triangular mask

in our transformer, so it has this

Auto regressive property where we can

just uh go and sample from it so the

fact that it's using the Triangular

triangular mask to mask out the

attention makes it a decoder and it can

be used for language modeling now the

reason that the original paper had an

encoder-decoder architecture is because

it is a machine translation paper so it

is concerned with a different setting in

particular it expects some uh tokens

that encode say for example French and

then it is expecting to decode the

translation in English so so you

typically these here are special tokens

so you are expected to read in this and

condition on it and then you start off

the generation with a special token

called start so this is a special new

token um that you introduce and always

place in the beginning and then the

network is expected to Output neural

networks are awesome and then a special

end token to finish the

generation so this part here will be

decoded exactly as we we've done it

neural networks are awesome will be

identical to what we did but unlike what

we did, they want to condition the

generation on some additional

information and in that case this

additional information is the French

sentence that they should be

translating so what they do now is they

bring in the encoder now the encoder

reads this part here so we're only going

to take the part of French and we're

going to uh create tokens from it

exactly as we've seen in our video and

we're going to put a Transformer on it

but there's going to be no triangular

mask and so all the tokens are allowed

to talk to each other as much as they

want and they're just encoding

whatever's the content of this French uh

sentence once they've encoded it they

they basically come out in the top here

and then what happens here is in our

decoder which does the uh language

modeling there's an additional

connection here to the outputs of the

encoder

and that is brought in through a cross

attention so the queries are still

generated from X but now the keys and

the values are coming from the side the

keys and the values are coming from the top, generated by the nodes that came out of the encoder, and those keys and values from the top feed in on the side into every single block of the decoder, and so

that's why there's an additional cross

attention and really what it's doing is

it's conditioning the decoding

not just on the past of this current

decoding but also on having seen the

full fully encoded French um prompt sort

of and so it's an encoder decoder model

which is why we have those two

Transformers an additional block and so

on so we did not do this because we have

no we have nothing to encode there's no

conditioning we just have a text file

and we just want to imitate it and

that's why we are using a decoder only

Transformer exactly as done in

GPT. Okay, so now I wanted to do a very brief walkthrough of nanoGPT, which you can find on my GitHub. nanoGPT is basically two files of interest: there's train.py and model.py.

train.py is all the boilerplate code for

training the network it is basically all

the stuff that we had here it's the

training loop it's just that it's a lot

more complicated because we're saving

and loading checkpoints and pre-trained

weights and we are uh decaying the

learning rate and compiling the model

and using distributed training across multiple nodes or GPUs, so train.py gets a little bit more hairy and complicated; there are more options, etc.,

but the model.py should look very very

um similar to what we've done here in

fact the model is is almost identical so

first here we have the causal self

attention block and all of this should

look very very recognizable to you we're

producing queries, keys, values, we're doing dot products, we're masking, applying softmaxes, optionally dropping out, and here we are aggregating the values with wei. What is different here is that in our code I have separated out the multi-headed attention into just a single individual head, and then here I

have multiple heads and I explicitly

concatenate them whereas here uh all of

it is implemented in a batched manner

inside a single causal self attention

and so we don't just have a b and a T

and A C Dimension we also end up with a

fourth dimension which is the heads and

so it just gets a lot more sort of hairy

because we have four dimensional array

um tensors now but it is um equivalent

mathematically so the exact same thing

is happening as what we have it's just

it's a bit more efficient because all

the heads are now treated as a batch

Dimension as

well. Then we have the multilayer perceptron; it's using the GELU nonlinearity, which is defined here, instead of ReLU, and this is done just because OpenAI used

it and I want to be able to load their

checkpoints uh the blocks of the

Transformer are identical, with the communicate and then compute phases as we saw, and then

the GPT will be identical we have the

position encodings token encodings the

blocks the layer Norm at the end uh the

final linear layer and this should look

all very recognizable and there's a bit

more here because I'm loading

checkpoints and stuff like that I'm

separating out the parameters into those

that should be weight decayed and those

that

shouldn't um but the generate function

should also be very very similar so a

few details are different but you should

definitely be able to look at this uh

file and be able to understand little

the pieces now so let's now bring things

back to chat GPT what would it look like

if we wanted to train chat GPT ourselves

and how does it relate to what we

learned today well to train in chat GPT

there are roughly two stages first is

the pre-training stage and then the

fine-tuning stage in the pre-training

stage uh we are training on a large

chunk of internet and just trying to get

a first decoder only Transformer to

babble text so it's very very similar to

what we've done ourselves except we've

done like a tiny little baby

pre-training step um and so in our case

uh this is how you print a number of

parameters I printed it and it's about

10 million so this Transformer that I

created here to create little

Shakespeare um Transformer was about 10

million parameters our data set is

roughly 1 million uh characters so

roughly 1 million tokens. But you have to remember that OpenAI uses a different vocabulary; they're not on the character level, they use these subword chunks of words, and so they have a vocabulary of roughly 50,000 elements, and so their

sequences are a bit more condensed so

our data set the Shakespeare data set

would be probably around 300,000 uh

tokens in the open AI vocabulary roughly

so we trained about 10 million parameter

model on roughly 300,000 tokens now when

you go to the gpt3

paper and you look at the Transformers

that they trained they trained a number

of Transformers of different sizes,

but the biggest Transformer here has 175

billion parameters uh so ours is again

10 million they used this number of

layers in the Transformer this is the

nmed this is the number of heads and

this is the head size and then this is

the batch size uh so ours was

65 and the learning rate is similar now

when they train this Transformer they

trained on 300 billion tokens so again

remember ours is about 300,000

so this is uh about a millionfold

increase and this number would not be

even that large by today's standards

you'd be going up uh 1 trillion and

above so they are training a

significantly larger

model on uh a good chunk of the internet

and that is the pre-training stage but

otherwise these hyper parameters should

be fairly recognizable to you and the

architecture is actually like nearly

identical to what we implemented

ourselves but of course it's a massive

infrastructure challenge to train this

you're talking about typically thousands

of gpus having to you know talk to each

other to train models of this size so

that's just a pre-training stage now

after you complete the pre-training

stage, you don't yet get something that responds to your questions with answers and is helpful and so on; you get a

document

completer right so it babbles but it

doesn't Babble Shakespeare it babbles

internet it will create arbitrary news

articles and documents and it will try

to complete documents because that's

what it's trained for it's trying to

complete the sequence so when you give

it a question it would just uh

potentially just give you more questions

it would follow with more questions it

will do whatever it looks like the some

close document would do in the training

data on the internet and so who knows

you're getting kind of like undefined

Behavior it might basically answer with

to questions with other questions it

might ignore your question it might just

try to complete some news article it's

totally unaligned, as we say. So the second

fine-tuning stage is to actually align

it to be an assistant and uh this is the

second stage. And so this ChatGPT blog post from OpenAI talks a little bit about

how the stage is achieved we basically

um there's roughly three steps to to

this stage uh so what they do here is

they start to collect training data that

looks specifically like what an

assistant would do so these are

documents that have to format where the

question is on top and then an answer is

below and they have a large number of

these but probably not on the order of

the internet uh this is probably on the

of maybe thousands of examples and so

they they then fine-tune the model to

basically only focus on documents that

look like that and so you're starting to

slowly align it so it's going to expect

a question at the top and it's going to

expect to complete the answer and uh

these very very large models are very

sample efficient during their

fine-tuning so this actually somehow

works but that's just step one that's

just fine tuning so then they actually

have more steps where okay the second

step is you let the model respond and

then different raters look at the

different responses and rank them for

their preference as to which one is

better than the other they use that to

train a reward model so they can predict

uh basically using a different network

how much of any candidate

response would be desirable and then

once they have a reward model they run

PPO, which is a form of policy gradient reinforcement learning

Optimizer to uh fine-tune this sampling

policy, so that the answers that ChatGPT now generates are expected

to score a high reward according to the

reward model and so basically there's a

whole aligning stage here or fine-tuning

stage it's got multiple steps in between

there as well and it takes the model

from being a document completer to a

question answerer and that's like a

whole separate stage a lot of this data

is not available publicly it is internal

to open AI and uh it's much harder to

replicate this stage um and so that's

roughly what would give you a chat GPT

and nanoGPT focuses on the

pre-training stage okay and that's

everything that I wanted to cover today

so we trained to summarize a decoder

only Transformer following this famous

paper attention is all you need from

2017 and so that's basically a GPT we

trained it on Tiny Shakespeare and got

sensible results

all of the training code is

roughly 200 lines of code I will be

releasing this um code base so also it

comes with all the git log commits along

the way as we built it

up in addition to this code I'm going to

release the um notebook of course the

Google collab and I hope that gave you a

sense for how you can train um these

models like say gpt3 that will be um

architecturally basically identical to

what we have but they are somewhere

between 10,000 and 1 million times

bigger depending on how you count and so

uh that's all I have for now uh we did

not talk about any of the fine-tuning

stages that would typically go on top of

this so if you're interested in

something that's not just language

modeling but you actually want to you

know say perform tasks um or you want

them to be aligned in a specific way or

you want um to detect sentiment or

anything like that basically anytime you

don't want something that's just a

document completer you have to complete

further stages of fine tuning which did

not cover uh and that could be simple

supervised fine tuning or it can be

something more fancy, like we see in ChatGPT, where we actually train a reward model and then do rounds of PPO to

align it with respect to the reward

model so there's a lot more that can be

done on top of it I think for now we're

starting to get to about two hours Mark

uh so I'm going to um kind of finish

here uh I hope you enjoyed the lecture

uh and uh yeah go forth and transform

see you later
