Code, Write & Publish AI Research Paper - Full Course - LLM From Scratch - Muon vs Adam Optimizer
By Vuk Rosić
Summary
## Key takeaways
- **Hardest research skill: picking questions**: Usually the most difficult thing in research is figuring out what to even do, what questions to answer. Answering questions is easier than figuring out which questions to answer. [01:22], [01:29]
- **Publish research publicly, always**: I recommend keeping all of your research public. This makes things so much better: people can learn about you, know about you, and you can build your network. [06:36], [06:38]
- **Router noise prevents expert collapse**: There are issues when choosing an expert; the neural network can learn to always choose the same experts, those experts improve, and the router gets stuck choosing only a few experts. [17:02], [17:12]
- **Auxiliary loss balances expert usage**: The auxiliary loss tries to make the LLM give an equal number of tokens to each expert. If some experts get more tokens on average than others, this auxiliary loss will be higher. [31:19], [31:29]
- **Muon orthogonalizes gradient matrices**: The Muon optimizer keeps the rows of the weight matrix perpendicular to each other. Neural networks learn faster with less data if the weight matrix rows are perpendicular. [02:15:00], [02:15:39]
Topics Covered
- Hardest Research Skill: Picking Questions
- Research Demands Patient Thinking
- Publish Research Publicly Always
- Auxiliary Loss Balances Expert Usage
- Muon Keeps Weight Matrices Orthogonal
Full Transcript
We will code and write this entire AI research paper, which I wrote just for this video, to show you how to write and code an entire AI research paper. You can put your name on this, and it will be very easy for you to change it a little bit, do your own experiments, and do your own real research, step by step, with everything explained in this course. This research paper is very beginner friendly to create. You don't need any tough mathematics, and it will be very easy for you to create your own experiments and do your own research on top of what I will teach right now.
This GitHub repository is the repository for this research. It contains the large language model that we will code from scratch and all of the experiments that we will do. In this experiments folder, I have the base that I used for this research, and it will be very easy for you to add your own experiments.
My goal is to create an AI research lab, hopefully the best in the world. And for every paper we make, we're going to make a full course on how we did it. Right now, it's me and my Discord community, but I'm going to be moving to different cities, meeting people, and hiring people soon. By making these videos, I can pay for the research, for salaries, and for compute. So just by watching this video, you are helping us advance science.

Usually the most difficult thing in research is figuring out what to even do, what questions to answer. Answering questions is easier than figuring out what questions to answer. This is also the problem with large language models and Humanity's Last Exam, where large language models are tested on whether they can answer these questions, with the assumption that if they can, they're going to be good at doing research. But in reality, coming up with the questions themselves is even more difficult. This skill of knowing what to work on is something you just build over years, and even the most experienced researchers struggle with it. But by writing research papers, you will eventually become better and better at this.
So when doing research, you will spend a lot of time thinking. It's like reasoning and thinking. You will maybe be walking around the street or around the forest, thinking about your research. That's what a lot of people do. You cannot rush research and publish it in a week. You need to spend a lot of time thinking about it. Of course, in this case, because this is a course and tutorial and we already did a lot of the thinking, it will be a lot easier and faster for you. But if you are looking to create groundbreaking research, which is actually a goal I have for our company, to eventually start doing research that's accepted to these conferences, then we will spend a lot more time on each paper.
Also decide what your goal for doing research is. It's not going to be to get hired at some company like OpenAI or DeepMind. If you get hired there, there are some benefits. But if you spend every day, the whole day, on this just because you want to get hired, know that there are easier ways to make money: usually you only get a high salary after 5-10 years of experience, and you can earn a lot in other fields even faster, for example by developing a product. So your purpose for doing research needs to be different than making money or getting hired, because it's a more difficult way to make money and get hired than other ways. Your purpose can be to contribute to science long term, or to find answers to questions you are curious about. Also, there is no hidden mentor who, once you meet them, will discover your potential. I highly doubt it, and Robert Greene talks about this. So it's all about you sitting down and figuring out your own life. Maybe it's good to have friends from the AI research field, talk to them, get their opinions, see what everybody else is doing. But at the end of the day, it will be you sitting down in your room and figuring out what you want to research and do.
You can also do what I do. You can make YouTube videos or publish blog posts and LinkedIn posts, and either create your own company or an AI research lab, or join one. It's also very good to join OpenAI or DeepMind, but you need to understand why you are joining. Anyway, whatever your goal is, I do recommend you do your research in public: publish videos, blogs, or posts on social media. For me, I want to keep improving these research papers and courses. I want to publish everything and make courses so everybody can learn, and this will accelerate science. As I said, eventually we want to publish in conferences, hire people, and create a proper lab. But it took me a couple of years to figure out what exactly I want, so you will also take some time to figure out what you want to do and why you want to do research.
Then let's start with the course. You can join my school community to become an AI researcher. We have exclusive courses and a community there that are not available on YouTube, and it helps me fund my research and make better videos as well. There is everything from math, PyTorch, neural networks, attention, LLMs, and more courses. Click the link below and watch the video there to learn what the school contains.

I'm going to assume you have some basics in AI research, large language models, machine learning, or PyTorch. If you don't have any basics, I recommend watching my "Become an AI Researcher From Scratch" full course; I'm going to link it below in the description. It's also good to watch the coding Llama 4 from scratch video, which is also very beginner friendly for large language models, because in this paper and this course I'm going to assume you know at least a little bit about large language models. Or check out the coding DeepSeek from scratch video. You don't need to watch the entire thing, maybe just a little bit to familiarize yourself. All of the links are below the video.
Go to your GitHub and let's create a new repository. I will name it "no teams," for novel optimizers or new optimizers; you can name it whatever you want. All of the code and my paper will be here. I'm going to add the description "development of new neural network optimizers."
I'm going to make it public. I recommend you share your research and everything else publicly. This is going to make things so much better: people can learn about you, know about you, and you can build your network. Keeping it private is not going to work well for you, because you will do research that nobody knows about. The same philosophy goes for my channel: I'm sure a lot of people have watched my videos, and this helps me a lot in building my network. So I recommend keeping all of your research public. Then I'm going to add a README so I don't need to add it manually later, choose the .gitignore template for Python, and pick the MIT license, which is one of the most permissive licenses. Then create the repository.
This is the repository we created. I'm going to copy its URL and clone the repository locally to my computer. I'm using Google Antigravity right now, so I can click here, paste the URL, and then select where to clone it. After you clone it, it will offer to open the new repository in a new window or the same window, so you can open it just like I opened it here. If you don't know how to clone from GitHub, you can watch a Git and GitHub basics introduction on YouTube so you learn what cloning is, or you can ask ChatGPT or another AI to help you. You need to be able to set up your environment like this; it's basic Python setup and opening a project in a code editor. You can watch YouTube videos or ask ChatGPT if you don't know how to do this. Anyway, once we open this, we're going to start creating some new folders. Let's create a new folder, models, and inside it a new file, layers.py.
Then next to that file, I'm going to create a new file, components.py, and a new file, llm.py. This is where our large language model will reside; it's a mixture-of-experts LLM, or you can name it whatever you want. Now let's open components.py and start writing some code. I'm going to import torch, import torch.nn as nn and torch.nn.functional as F, and from typing import Tuple and Optional.
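As a rough sketch, the imports described above could look like this (the exact import lines in the repository may differ slightly):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from typing import Tuple, Optional
```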
Now you may type this all out, or you may copy these imports from my code if you want, but I recommend you type all of the code here manually as we write it, so you understand how it works and how it's written. First we will code a single expert. Let's create a class Expert which inherits from nn.Module, and I'm just going to add the comment "single expert network, essentially a feed-forward layer." So a single expert is just a feed-forward layer. I believe you understand how experts and mixture of experts work; if you don't, I listed videos and courses you should watch to understand how a large language model works. But I can also just tell you: each feed-forward layer in the neural network looks like this. You have expert one, two, three, and each of them is just a classic feed-forward neural network, a multi-layer perceptron, and then we have a router to select the expert based on the token that is being generated right now.
Then I'm going to initialize the expert. We pass in self, the dimension of the model, and the dimension of the feed-forward layer. Imagine this as having a first input layer: this neural network has a first input layer and then expands to a hidden layer. d_model is the input, the token; it's the size of the token embedding that this expert gets. Then it expands into this middle layer, which is going to process the token. This middle layer is usually four times bigger than the token embedding input size, and then we contract back. So it's input, hidden, and then back to the token size, with the hidden layer four times bigger. This hidden layer of neurons contains a lot of the facts, knowledge, and rules for processing the token. If you are talking about some sport, this hidden layer would contain knowledge about the rules of the sport, famous players in that sport, etc. So the token size is the first and last dimension, and the feed-forward dimension is the middle, usually four times bigger. Then dropout: this is good for preventing overfitting.
And we're going to call super(); this just initializes the nn.Module parent class. Then this is the first linear layer, our input layer. As I said, it goes from d_model, which is the token size, to the hidden layer, which is four times bigger. That's our first linear layer, and I'm going to show you on the image. It looks similar to this; just imagine that this middle layer has eight neurons instead of three, because whatever the input token is, this layer is four times bigger, and then we go back to the token size. This middle layer contains a lot of knowledge, and each expert is going to specialize in something. The large language model will learn to assign specific knowledge to each expert: this expert is for math, this one for sports, this one for history, etc. I explained all of this in the courses, and you can watch more courses about mixture of experts on YouTube.

Then the second linear layer will go from the hidden size back to the token size, and we don't need bias for these layers. We will also initialize dropout so we prevent overfitting; some people don't use dropout, but I use it because it worked better in my experiments. Then let's define the forward method. The forward method is how we use the expert: here we just initialized this network, and forward is where we use it. It's going to look like this: this is the input, and it gets passed through the first linear layer. After we pass through the first linear layer, it's then passed through the activation function, SELU in this case. So imagine now we have values for the middle, hidden layer. Dropout means we will ignore some of the activations, making some of the neurons in the middle layer zero. Then we multiply this middle layer with the weights of the second linear layer to get the output at the token size; this linear layer condenses the hidden layer back to the token size. That's how we process the token; this is the classic feed-forward. So this is just one expert.
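A minimal sketch of the Expert class as just described. The names (Expert, d_model, d_ff) and the activation follow the video's description; the code in the repository may differ slightly:

```python
class Expert(nn.Module):
    """Single expert network: essentially a feed-forward layer."""

    def __init__(self, d_model: int, d_ff: int, dropout: float = 0.1):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff, bias=False)   # expand: token size -> hidden (usually 4x)
        self.linear2 = nn.Linear(d_ff, d_model, bias=False)   # contract: hidden -> token size
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = F.selu(self.linear1(x))    # activation as described in the video
        h = self.dropout(h)            # zero out some hidden activations to prevent overfitting
        return self.linear2(h)
```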
Then let's make sure we save all of this and commit it to GitHub. Assuming you have connected your GitHub, I'm just going to press generate, and it will generate a commit message and sync. You can learn more about GitHub in other YouTube tutorials if you don't know about it. By the way, I switched to Cursor because I just couldn't disable the AI-generated autocomplete suggestions in my previous editor, so let's continue with Cursor.
Let's now create a top-k router. The router will choose which expert, one or multiple, to use based on the current token. Create a class that inherits from nn.Module: a router that selects the top-k experts, the top-k most relevant experts, for each token. First I'm going to initialize the top-k router. I'm going to pass in the dimension of the model, which is the token embedding size, the number of experts, and how many of those experts we should choose; the default is two. Then I call super().__init__() to initialize the parent class, and I assign these values to the local variables top_k and num_experts.

Then we define the gate, which is just a linear layer that goes from the token embedding size to the number of experts. It will look at the token and convert it to some numbers, and we will use those numbers to assign a probability, likelihood, or affinity for each expert, i.e. how much the router wants to choose each particular expert. We also have noise. There are some issues when choosing an expert: the neural network can learn to always choose the same experts, and then those experts learn and become good, and when it even tries to choose different experts, they haven't learned, so it gets punished. It then gets stuck choosing only a few experts. But this is suboptimal, because if it instead balances out choosing all of the experts, it will perform better in the end and be able to store more knowledge across all of the experts.
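A minimal sketch of the router's constructor as described so far: the gate layer plus a noise scale for exploration. Names like TopKRouter and noise_std are my labels for what the video describes; the repository may use different ones:

```python
class TopKRouter(nn.Module):
    """Router that selects the top-k experts for each token."""

    def __init__(self, d_model: int, num_experts: int, top_k: int = 2, noise_std: float = 0.1):
        super().__init__()
        self.top_k = top_k
        self.num_experts = num_experts
        self.noise_std = noise_std                    # random jitter on logits during training
        self.gate = nn.Linear(d_model, num_experts)   # token embedding -> one affinity per expert
```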
Now let's define forward for the top-k router. We start with the forward signature: x is our tensor that contains all of the tokens. It's a three-dimensional tensor. Each token is an embedding, and all of the tokens are in a sequence: the sequence length, context window, or conversation is just a single conversation with all of the tokens in it. That's the sequence length. Then there's the batch size: you can have multiple conversations, and all conversations are independent. So usually you have these three dimensions: the token embedding dimension, then multiple of those tokens in an array, which is the sequence of tokens for that conversation, and then multiple independent conversations. The forward will return a Tuple of three torch tensors.

Gemini gave a good explanation of the three things this router returns. Top-k indices is the second thing: it's just the indices of the selected experts, say expert two and expert five, because we are selecting two by default. Each of those experts will have some weight, or importance, saying how much it contributes to the final token; maybe this one has 70% weight and this one has 30%, and you will understand this better later. Those weights are what the expert outputs get multiplied by, so those are the weights for the top-k experts. And router probabilities returns all of the probabilities across all of the experts; this is necessary for the auxiliary loss later. So the first two are returned only for the top selected experts, and the third is the probabilities for all of the experts, which we will learn more about later. Then I will just add some comments: the input x, the input tensor, is batch size by sequence length by d_model. You need to be able to understand intuitively what these dimensions mean; I explained that well in my AI research from scratch course. And it returns the router weights, the expert indices, and the router probabilities.
The first thing we want to do is take the x input tensor and unpack its shape into three variables: batch size, sequence length, and dimension of the model. That is, how many conversations, how many tokens in the sequence, and how many dimensions per token; the sequence length is the maximum number of tokens possible. Then we want to compute the router logits, because now we want the router to select experts. I'm going to say router_logits equals, and don't worry, these names will get highlighted later when we start using them. We just pass x through the gate we defined: the gate is a linear layer that goes from d_model to the number of experts. We can easily pass x even though it has this bigger shape; the leading dimensions are kept, and only the last dimension goes from d_model to the number of experts. So it returns batch size by sequence length by number of experts. Those are our logits, and the logits show affinity: the bigger the number, the bigger the affinity for that expert, so that expert will be selected.

We're going to add some noise during training only, for exploration. As I said, we randomize the logits a little bit so the router doesn't get stuck reinforcing the same few experts every time. So, if we are training and if the noise we want to add is more than zero, we create the noise: random numbers with the same shape as router_logits, so we have a random value for each logit, multiplied by the standard deviation, here 0.1. Then router_logits equals router_logits plus noise, which adds noise to all of the logits.

Next, get the full probability distribution for the load balancing loss. We need this later, and it's the third thing we return. For the router probabilities we do a softmax over all of the logits. How softmax works: if you have some logit numbers like one, five, six, seven, it will squish all of those numbers between zero and one and make them all add up to one, so you can treat them as probabilities. We take the softmax along the last dimension, the number-of-experts dimension, so for each token we just take its expert logits and softmax over the experts, separately for each token.

Then we select the top-k experts; these are the first and second things we return from this router. There is the torch.topk function, we pass in the logits, and the result gets assigned to the top-k values and indices. Because router_logits has this bigger shape, we want to select over the last dimension, so for each token separately. This top_k means we are selecting the top two experts, two out of whatever the number of experts is, which could be for example eight. Then for the top-k weights, just like we calculated probabilities with the softmax, we use the same thing to calculate the weights, which will also add up to one. As I said, we will multiply the output of each expert by a certain weight for that expert; the weight says how important this expert was. So if we have two selected experts, one gets 70% and the other gets 30%: the first one will have 70% influence when multiplied, and the other one 30%.
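Here is a small, self-contained demo of the softmax and top-k selection just described, using made-up logits for one token and four experts (the numbers are only illustrative):

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([1.0, 5.0, 6.0, 7.0])       # one token, affinity for each of 4 experts
probs = F.softmax(logits, dim=-1)                  # squished into [0, 1], sums to 1
top_vals, top_idx = torch.topk(logits, k=2, dim=-1)
weights = F.softmax(top_vals, dim=-1)              # re-normalize just the top-2 so they sum to 1
print(probs)     # probabilities over all experts (used later for the auxiliary loss)
print(top_idx)   # indices of the two selected experts
print(weights)   # relative importance of the two selected experts
```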
And then we return all three things that I mentioned. This is a bit difficult for me to explain, but don't worry: I'm going to repeat all of it as we code the following pieces, because it all refers back to itself and you will understand it. So the top-k router will choose, for each token, for example two experts out of the eight possible experts (or however many we have), and it will return the indices of those experts, for example the second and the fifth, and the weights of those experts, for example 70% and 30%. That's based on the logits, basically on how the neural network scores importance, and the weights add up to 100%. It also returns the router probabilities over all of the experts for that token, which we need later to make sure expert selection is balanced and the router doesn't just select the same two experts all the time. A rough sketch of the whole forward pass is below.
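This sketch continues the hypothetical TopKRouter class from earlier; the exact variable names and noise handling in the repository may differ:

```python
    def forward(self, x: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
        # x: [batch_size, seq_len, d_model]
        router_logits = self.gate(x)                          # [batch, seq, num_experts]

        if self.training and self.noise_std > 0:
            noise = torch.randn_like(router_logits) * self.noise_std
            router_logits = router_logits + noise             # exploration: jitter the affinities

        router_probs = F.softmax(router_logits, dim=-1)       # full distribution, used by the aux loss

        top_k_logits, top_k_indices = torch.topk(router_logits, self.top_k, dim=-1)
        top_k_weights = F.softmax(top_k_logits, dim=-1)       # per-token weights that sum to 1

        return top_k_weights, top_k_indices, router_probs
```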
The next class will be a lot simpler, I believe. It's just the mixture of experts class, which also inherits from nn.Module; we will combine our router and experts here. Let's first initialize it: self, the dimension of the model (the token embedding size), the dimension of the feed-forward hidden layers (usually four times larger than the model size, although not always), the number of experts (eight as the default), and top_k (we are selecting two experts out of eight for each token). Dropout is 0.1, i.e. 10%, and there is a load balancing weight, which we'll talk about later; its goal is to make sure that the router doesn't always select the same experts, so we want to make it select different experts. Then initialize super() and assign all of these variables. Then create the experts: experts will be an nn.ModuleList, and we just call the Expert class with d_model, d_ff, and dropout for every expert. With eight experts, this creates eight experts in the module list. Then we create the router, and that's all we need: for the TopKRouter that we defined, we just pass in d_model, the number of experts, and the top_k number.
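Putting that together, a minimal sketch of the mixture-of-experts constructor. The class and attribute names follow the video's description; the default for the load balancing weight is a placeholder, since no value is stated here:

```python
class MixtureOfExperts(nn.Module):
    """Combines a TopKRouter with a list of Expert feed-forward networks."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int = 8,
                 top_k: int = 2, dropout: float = 0.1, load_balancing_weight: float = 0.01):
        super().__init__()
        self.num_experts = num_experts
        self.top_k = top_k
        self.load_balancing_weight = load_balancing_weight  # penalty strength; default is a placeholder
        self.experts = nn.ModuleList(
            [Expert(d_model, d_ff, dropout) for _ in range(num_experts)]
        )
        self.router = TopKRouter(d_model, num_experts, top_k)
```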
Then let's define the forward method for the mixture of experts: how does the data go through this class? The input is x, a torch tensor, and the output is a Tuple of a torch tensor and an optional torch tensor. Let's see what this means. The input is our classic batch size by sequence length, which is just our list of conversations, and it returns an output which is the processed list of conversations; it's the same list of conversations, but processed.

So what does a mixture of experts do? It comes after the attention mechanism; I'm just repeating the structure of an LLM now. For each token, it will just modify it a little bit, process it, based maybe on context from other tokens. You know that in the attention mechanism, information from all of the tokens is blended, added, into every token, although attention only looks at the previous tokens for every token, so for every token it just adds information from previous tokens. But that information is just added with a plus sign; it's just adding a bit of this vector to that vector. The feed-forward networks, or mixture of experts, take this just-added vector and process those numbers somehow: refine them, inject more information into them. Imagine it as processing: attention adds information, and feed-forward processes the information, adds more data, analyzes it. The output is the same size as the input, the same shape, so every token just gets processed. There is no next-token generation here; this is not generating the next token, that's only the output head at the end. This is just for processing information, maybe for the LLM to understand what each token means and how it relates to other tokens, injecting new knowledge and analysis, etc.

Then we have the auxiliary loss. This is just a number that the LLM wants to make as low as possible; it shows how imbalanced the expert selection is. To put it simply, using this we are trying to make the large language model choose every expert an equal number of times. We don't want it to choose some expert more frequently; we want each expert to get an equal number of tokens on average. If some experts are getting more tokens on average than others, then this auxiliary loss will be a higher number, so the loss, the error, will be higher, and this loss will later be added to the general large language model loss. Since the model is trying to minimize the general loss, it also tries to minimize this one, because it's added in. In short: the auxiliary loss tries to make the LLM give an equal number of tokens to each expert. In this example, you can see that our mixture of experts layer returns this auxiliary loss besides the output, the processed tokens that I just explained, and this auxiliary loss is added to the total loss at the end. That's the total loss, and then we do backpropagation with the total loss; we want to minimize this total loss, this total error.
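As a toy illustration of that last point (made-up numbers, just showing that the auxiliary loss is simply added to the main loss before backpropagation):

```python
import torch

ce_loss = torch.tensor(2.31)      # e.g. the cross-entropy loss from the output head
aux_loss = torch.tensor(0.04)     # imbalance penalty returned by the MoE layer(s)
total_loss = ce_loss + aux_loss   # minimizing this also pushes the router toward balanced usage
print(total_loss)                 # tensor(2.3500)
```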
So again, we're just going to unpack the shape into these variables, and we will pass x, our input, through the router to get our expert selection, the router probabilities, and everything I described about the router. Then we initialize the output tensor: this is the output that we will get after processing, it starts as all zeros, and its shape is the same as x.
Let's explain the next piece of code with Gemini; it's easier to just listen to the theory first, then we will code it, and then I'll repeat and explain it again briefly. We are looping through the experts: for each expert, we go one by one and process the tokens that that expert got. In our code, I'm going to add the comment "process each expert," and then write: for expert_index in range of all of the experts. Then I'll add the comment "find tokens routed to this expert," and this is how we do it: expert_mask equals (expert_indices equal to expert_index), with .any() along the last dimension.

What does this mean? Remember, in the first iteration we are gathering all of the tokens routed to the first expert and then processing them. Remember also that each token has two experts assigned. Here we are checking, for the current token, whether this current expert is at either of the two places: the first selected expert or the second selected expert. In other words, is the current expert among the two experts selected for the current token? This expert mask will have the shape batch size by sequence length, so for every token in our context window, in our input, it will have True if that token needs to be processed by the current expert and False if it doesn't. Previously we got, for every token, two experts, two indices; for this particular expert in the for loop, we are checking, in that list of tokens where each token has two experts, whether the current expert is contained within that list, and we check for all of the tokens at once. This may be a bit confusing, but if you rewatch this or ask AI to help, you will eventually understand it. So expert_mask is batch size by sequence length, and for each token it has True or False: this token does or does not need to be processed by this expert. Then, if expert_mask.any(), which just checks whether any of the values in the expert mask is True, so if this expert needs to process even a single token, we will process.
Then: get the tokens for this expert. expert_input is equal to x indexed with expert_mask. This part is easiest to show on an example. Let's say we have our x, with a first token, second token, and so on; we have a list of tokens, which is one conversation, and a second conversation. For expert zero, let's say the first token is sent to expert zero, the next one is not, and so on; each of the tokens has its own True or False, and the shape of this expert mask will be two by three: two conversations, each having three tokens. So it's going to be first token, second token, third token, and so on, and I will show you this example in Python. This is what I just explained, and this is our mask: True, False, True for the first conversation, and False, False, True for the second conversation. If we apply this mask to x, let's see what we get: we get the first token, third token, and sixth token, because the first, third, and sixth have True. It's now just one sequence of tokens; the masking has flattened the batch and sequence dimensions into a single dimension, so it's just a sequence of the tokens that had True. If the full input x is 100 tokens and only 10 tokens are selected, expert_input will be a smaller tensor of size 10.
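A small runnable demo of that masking step, using the toy shapes from the example (2 conversations, 3 tokens each, a 2-dimensional embedding to keep it readable; the specific numbers and expert assignments are illustrative only):

```python
import torch

x = torch.arange(12.0).reshape(2, 3, 2)        # [batch=2, seq=3, d_model=2]
expert_indices = torch.tensor([[[0, 1], [1, 2], [0, 2]],
                               [[1, 2], [1, 2], [0, 1]]])   # top-2 expert ids per token
expert_index = 0
expert_mask = (expert_indices == expert_index).any(dim=-1)  # [[True, False, True], [False, False, True]]
expert_input = x[expert_mask]                   # flattens batch+seq: shape [3, 2], tokens 1, 3 and 6
print(expert_mask)
print(expert_input)
```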
Then in the code we add: expert_output equals self.experts indexed by this particular expert. experts is just a list of experts, an nn.ModuleList; we choose this particular expert and pass in expert_input, which is the list of tokens for this particular expert. That passes the selected tokens through our expert, which is a feed-forward network: linear, activation, linear. We defined this at the beginning of the video, but in reality we are using SELU, not ReLU; it's more complex and a better function, and you would use ReLU only for some simpler things, for example in DeepSeek's sparse attention and a few other places. Remember, we just need to pass in a single token when we are processing it; that token is not looking at the previous tokens; that was happening in the attention mechanism. In this feed-forward network, we don't need to pass the token and then look at all of the previous tokens. That was actually confusing to me at first as well, but I realized that's the attention mechanism's job; here we just pass the token itself, the vector and nothing else, for that particular token, through this expert. Then comes a maybe slightly trickier part.
I said each token is passed through two experts, and for each of those experts we need to multiply its output by some weight and then add the results. So if the first expert for that token has weight 0.8, we multiply the output of that expert for that token by 0.8 and add that to 0.2 times the output of the second expert for that token. When we add those weighted outputs, that's our final processed token.

To understand the next piece of code, look at this scenario: we are of course selecting the top two experts as always, and let's say we are currently processing expert five, so all of the tokens routed to expert five. Let's say the current token has experts two and five, with weight 90% for expert two and 10% for expert five. That's the data we already have. Here we need to grab this 0.1, or 10%, and multiply expert five's output by it; we must not accidentally grab the 0.9 and multiply expert five's output by that. Expert five's weight is 0.1. The first line of code we add is: mask_for_expert is equal to (expert_indices == expert_index). Looking at this line of code: if expert_indices is two and five and expert_index is five, this yields False for the first one and, since 5 == 5, True for the second. So mask_for_expert will be [False, True], and that gets assigned for this particular token.
Now I'm just going to show how to add this in the code. We are here, and I'm just going to write mask_for_expert equals the line of code I just explained. Let's see the next line of code: positions is equal to mask_for_expert indexed with expert_mask. Let's see what this does. Remember, two experts are chosen for this token and each of them has its weight, so 0.9 and 0.1, and we have [False, True] because we are currently processing expert five. Now we need to turn this [False, True] into zero and one for indices; we will use the one to select the second weight, 0.1. You see here: this is the weight at index zero, this is the weight at index one, and this entry is False while this one is True. We need to convert the [False, True] that we have into just an index saying that from this pair we want to pick the entry at index one.

This may get a bit confusing, so look at this line of code again: positions is equal to mask_for_expert, indexed with expert_mask. Let's first remind ourselves what these two are. expert_mask is for this particular expert that we are processing, expert five, showing True for every token it needs to process and False for every token it doesn't. And mask_for_expert: remember, if token zero has experts two and five, it will be [False, True]; it's False where expert five is not contained and True where it is. So mask_for_expert says, for each token, at which of its two slots this particular expert appears, and expert_mask says, for this particular expert, which tokens in the sequence it is used for.

Now check this out, this is interesting: both of these arrays are lists over tokens. The first token, is it using this expert or not; the second token, is it using this expert or not; the third; and the same here. What we will do is: the second token is not using this expert, so we will just remove it completely. That's one way to do it; you could also remove rows where both entries are False, but it's maybe faster and easier to use this array to remove the tokens that are not being used. So this line in our code removes those tokens, and then we are only left with tokens that are being used, processed by this expert. It's also explained here, what I just described; you can read it if you want. At the end, in this example, instead of having three tokens we will just have two, because the middle token is not processed by the current expert five from the example. We will see later how we use these positions, so it will also be easier for you to understand.

Next we have .float() and .argmax() along the last dimension. Let me continue with the example to explain this float and argmax. Now we have this filtered tensor, a filtered list of just the two tokens that are using expert five: it's the second selected expert for one token and the first selected expert for the other. When we call .float(), it converts True/False into one or zero: False becomes zero, True becomes one. Now we have numbers, and because we converted to numbers we can use argmax, and -1 means along the last, innermost dimension. We compare this zero and this one: which one is bigger? Argmax finds that this one is bigger, and its index is one. So the result of argmax for this token is one, because it looks at the index of the biggest number, and the result of argmax for the other row is zero; it's just the index of the biggest number there. You can read through this; I just explained it. So positions will be [1, 0]: for the first token, the weight for this particular expert is at index one, and for the second token, the weight of expert five is at index zero. We will pick those weights to multiply the output of expert five later.
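Here is a tiny runnable demo of those two lines, continuing the worked example (three tokens in one sequence, top-2 routing, currently processing expert 5; the token-to-expert assignments are the made-up ones from the example):

```python
import torch

expert_indices = torch.tensor([[[2, 5], [1, 3], [5, 7]]])       # [batch=1, seq=3, top_k=2]
expert_index = 5
expert_mask = (expert_indices == expert_index).any(dim=-1)       # [[True, False, True]]

mask_for_expert = (expert_indices == expert_index)               # [[[F, T], [F, F], [T, F]]]
positions = mask_for_expert[expert_mask].float().argmax(dim=-1)  # keep tokens 0 and 2 -> tensor([1, 0])
print(positions)
```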
Next, gather the weights only for the relevant tokens. Let me type all of this; we also have a squeeze at the end. Let's explain this part of the code, continuing with the same example. Token zero and token two are processed with expert five, and these are their weights; as I said, the index here is one and the index there is zero, which is what we calculated previously. Token one is not processed, so it's filtered out. The positions we just calculated are the vector [1, 0]: index one and index zero. And router_weights is the full tensor of weights for all tokens.

In the first step, we compute the filtered weights: we grab the weights only for the tokens that we are processing with this expert, using expert_mask. If you remember, expert_mask shows which tokens are being processed by this expert, so it will be token zero and token two. So if these are the router weights, we ignore the middle token; it's not being processed, because only token zero and token two are. When we index router_weights with expert_mask, we get an array for token zero and token two. Now this is a 2x2 matrix, or tensor. But if you look at the indices above, it's just one array of two values. We want to pick this value and this value, and we can do that using gather. But to use gather, we need to match their shapes: if this is 2x2, we make the indices match by putting each of them into its own array; that's why we unsqueeze. If I scroll down, we unsqueeze the last dimension, which means each index will be in its own array. Now we can use gather to pick whatever is at index one here and whatever is at index zero there. If I scroll down to show you: .gather(-1, indices), where indices is actually just our positions, position one and position zero. That's how we select, and -1 means along the last, innermost dimension: for this row, pick index one; for this row, pick index zero. This is the innermost dimension.

I recommend you play around with this; you need to get a feeling for these dimensions if you don't have strong basics. You can also check my AI research from scratch course, or ask ChatGPT to give you more examples and some tasks to run in Google Colab or in Python in your environment, to get the feeling for squeezing, unsqueezing, indexing, and all this good stuff. At the end, we get a 2x1 matrix. Did I earlier say the indices were 2x2? I meant 2x1; it's going to be a 2x1 matrix. This is the weight that we multiply expert five's output by for the first token, and this is the weight we multiply expert five's output by for the second token. Finally, we have .squeeze() on the last dimension. This removes the innermost dimension: our current tensor is 2x1, but we want to remove the innermost dimension so we just get an array of these numbers, and now the shape is two. It was 2x1, but we removed the last dimension with squeeze, so now it's just two, and you can see what it looks like when you squeeze this innermost, last dimension.
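A small runnable demo of the filter, unsqueeze, gather, squeeze chain, with the weights from the worked example (0.9/0.1 for the first token; the middle and third tokens' weights are arbitrary filler):

```python
import torch

router_weights = torch.tensor([[[0.9, 0.1], [0.5, 0.5], [0.7, 0.3]]])  # [batch=1, seq=3, top_k=2]
expert_mask = torch.tensor([[True, False, True]])                       # tokens routed to expert 5
positions = torch.tensor([1, 0])                                        # slot of expert 5 per kept token

filtered = router_weights[expert_mask]                 # [[0.9, 0.1], [0.7, 0.3]] -> shape [2, 2]
picked = filtered.gather(-1, positions.unsqueeze(-1))  # [[0.1], [0.7]]           -> shape [2, 1]
expert_weights = picked.squeeze(-1)                    # [0.1, 0.7]               -> shape [2]
print(expert_weights)
```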
Then, as I showed you earlier, we already have this output tensor that we are filling in while processing each expert individually; we are just adding outputs to it. This output tensor is the same one defined before the expert loop. Now we are processing experts and adding the output of each expert to it, but only for the tokens that are being processed by this expert; let's ignore the unsqueeze for now. We have expert_weights, which is just our array of weights, one for each of the selected tokens, to multiply the expert output by. I explained this already and don't want to repeat too much, as it may be boring for some people; you can also ask ChatGPT to explain it again if you want. We need the unsqueeze so we can multiply properly, and then we add the result to the existing output. We are accumulating every expert's output in this total output. This works well because, for a particular token, the weighted output from one expert gets added, and the weighted output of the other expert also gets added for that same token. At the end, that token is, as I said, the weighted output of one expert plus the weighted output of the other expert, and that's our processed token.
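A minimal sketch of that accumulation step, continuing the same hypothetical variable names from inside the per-expert loop (output was initialized as zeros with the same shape as x):

```python
        # Weight this expert's output per token and add it back into the full output tensor.
        # expert_output: [num_selected_tokens, d_model], expert_weights: [num_selected_tokens]
        output[expert_mask] += expert_weights.unsqueeze(-1) * expert_output
```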
By the way, if this is not 100% clear, don't worry. If you are just beginning with this, it will be difficult the first couple of times; for people who've been doing this for a few years, it will be easy to understand. But don't get discouraged thinking "others understand it and I don't"; that's expected if you don't have much experience yet. So that's it for the experts.
Now let's do the auxiliary loss. The goal of this, again, is to make sure that every expert gets an equal amount of tokens. If some experts are getting more tokens, then the loss will be higher, and the neural network will learn to minimize the loss, so it will learn to divide tokens equally among the experts. Let's define the starting value. If we are training, we say auxiliary_loss is equal to self.compute_load_balancing_loss; we will define this function later, we don't have it yet, and we will pass it the router probabilities and the expert indices. I will explain; for now just see that we return output and auxiliary_loss from this forward method of the mixture of experts, so we return the output, the processed tokens, and the auxiliary loss.

Remember that the router probabilities are, for each token, the probabilities over every expert for that token, so the shape is batch size by sequence length by number of experts: all of the experts for every token in the sequence. It can look like this, the probabilities for a particular token, and we will use this in our auxiliary loss to make sure that every expert gets an equal amount of tokens. This is also useful for backpropagation, because we are passing the full probability distribution, so PyTorch can adjust the weights to produce a better probability distribution. It answers: how strongly did the router prefer expert X on average? We don't want it to prefer any expert; we want it to be equal for each expert. Again, let's remind ourselves: expert_indices is a tensor containing the integer IDs of the top-k experts that were actually selected for each token. For each token it says, for example, two and five, meaning that token got experts two and five selected. So the loss is simply actual usage, which we get from the indices I just explained, times average confidence, which we get from the probabilities, the router probabilities. But don't worry, we will understand all of this better; let's see how it's coded.

In our code, let's define the function we just used: compute_load_balancing_loss. It takes self, the router probabilities we mentioned, and the expert indices; we'll see later how these are used so you understand them better. It returns a torch.Tensor which is actually just a single number, the loss. Let me add this comment: compute the auxiliary loss to ensure balanced expert usage; this encourages the router to distribute tokens evenly across experts.
Let's first code it and then see on an example what it does. First: compute the fraction of tokens routed to each expert. F.one_hot: we are one-hot encoding, passing in the expert indices, with the number of classes equal to the number of experts, and converting the result to float. Let's see what this does with an example. Say that for this particular token, experts one and three are selected. The output of this line will be two one-hot encoded vectors; let's say we are selecting two out of four experts. The first vector means expert zero is not selected, expert one is selected, and the remaining entries are zero. That's the first selected expert for this particular token, and the second selected expert is at index three. So for every token we generate these two one-hot encoded vectors, to show which of the maximum number of experts are selected for this particular token. This line converts the shape, which was batch size by sequence length by top-k (which is two), into batch size by sequence length by top-k by number of experts. Each of these top-k entries now has a vector of size number of experts, a one-hot encoded vector.

To summarize, I think it's very easy: imagine expert_indices is just an array of two numbers, one and five. Instead of having one and five, we will have them one-hot encoded: for the one, 0 1 0 0 0 0 0 0, since there are eight experts, and for the five, 0 0 0 0 0 1 0 0. Those vectors are what this expert mask contains; since we are selecting out of eight experts, these vectors will be of length eight, not four. Then we convert all of these integers into floating point numbers using .float(), because later in the code we need to use this in a fraction, and Python needs this to be a float. We will use these one-hot encoded vectors to count how many times each expert has been used.
So, tokens_per_expert is equal to expert_mask.sum along dims 0, 1, 2, which I will explain, divided by expert_mask.sum(). Let's understand it with an example. Say we have just one batch, which means just one conversation, and that conversation has two tokens, and each token picks two experts. And say there are three possible experts in total, so out of three, each of the two tokens gets two experts. The total number of slots is four: two tokens, each having two experts. Let's see what these slots are. Say token zero gets experts zero and one, and token one gets experts one and two. The expert mask is then the one-hot encoded vectors we just generated. So we have one batch, which is one conversation, one sequence; that sequence has two tokens; each of the two tokens has two experts selected; and each of the experts is encoded with a one-hot vector, as we just did. This is what it looks like: for the first token, we have expert zero encoded like this and expert one like this, and the second token has expert one and expert two.

Then .sum over dimensions 0, 1, 2 sums across the batch, the sequence length, and the top-k experts. What this does is sum up how many times expert zero appears: it just sums all of the numbers in the expert-zero column, then all of the numbers for expert one, then all of the numbers for expert two. I know it's a bit confusing: the code says we are summing across the first three dimensions, but when you look at how we are summing, it looks like we are summing over the experts themselves, which looks weird and confusing. You just need to develop intuition by looking at this a couple of times and thinking about it. At the end, after this sum, we collapse all of the information about token position; we are just left with how many times each expert was selected. Expert zero was selected one time, expert one two times, and expert two one time, so the resulting vector is [1, 2, 1]. That's what results from this sum, which again is kind of counterintuitive: we specify the three dimensions we are going to collapse, and then it looks like we are summing across the last dimension, but we are not. Summing across the last dimension would mean we sum 1 + 0 + 0 within a single one-hot vector; instead we are summing across all of the other dimensions.

That sum is the first part, and then we divide by expert_mask.sum(): we take this array, which was [1, 2, 1], and divide by the total number of selections. Our goal is to convert this [1, 2, 1] into something that adds up to one, so we get probabilities, or percentages I should say: 0.25, 0.5, 0.25. The first expert is selected 25% of the time, the second expert 50% of the time, and the third 25%. The way we get that is by dividing the count for every expert by the total amount of these counts, and there are four of them in total: one, two, three, four. That's how we get them to add up to one. We can look at them as percentages, and our load balancing loss, the auxiliary loss, will try to push the weights of the gate so that this vector becomes roughly 0.33, 0.33, 0.33, equal for every expert.
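That whole fraction can be reproduced with a few lines, using the exact toy example above (one conversation, two tokens, three experts, top-2 routing). The video reuses the name expert_mask for this one-hot tensor; I call it expert_onehot here to avoid clashing with the boolean mask from earlier:

```python
import torch
import torch.nn.functional as F

expert_indices = torch.tensor([[[0, 1], [1, 2]]])                  # [batch=1, seq=2, top_k=2]
expert_onehot = F.one_hot(expert_indices, num_classes=3).float()   # [1, 2, 2, 3]

tokens_per_expert = expert_onehot.sum(dim=(0, 1, 2)) / expert_onehot.sum()
print(tokens_per_expert)   # tensor([0.2500, 0.5000, 0.2500]) -> fraction of slots per expert
```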
Let's see the next line of code: compute the average probability of routing to each expert. So router_prob_mean is computed from the router probabilities we already passed in. Let's remind ourselves: we take the mean across dimensions zero and one. Remember that the router probabilities are the output of the softmax before we pick the top k, so they form the raw probability distribution over all eight experts for each token, and they sum to one per token. Taking the mean across the batch and the sequence length says, on average, how strong was the router's desire to send tokens to expert X. Let's see an example. Say we have a batch, or a sequence, of two tokens and three experts; for the first token these are the probabilities (probabilities, not logits, I should say), and for the second token these are the probabilities. We average these columns vertically; that's what the mean across batch and sequence does. On the first spot the average is 0.1, on the second spot 0.6, and on the last spot 0.3. So router_prob_mean is this vector: the first expert has this average probability, the second this one, and the third this one.
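Here is the same example as a small sketch, just to pin down the shapes; again the names are illustrative.

```python
import torch

# raw softmax output over all experts, one row per token: (batch, seq_len, num_experts)
router_probs = torch.tensor([[[0.1, 0.7, 0.2],
                              [0.1, 0.5, 0.4]]])

# average over batch and sequence -> mean routing probability for each expert
router_prob_mean = router_probs.mean(dim=(0, 1))   # tensor([0.10, 0.60, 0.30])
print(router_prob_mean)
```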
Comparing this to our tokens_per_expert, which we calculated earlier: tokens_per_expert comes from the top-k selection, so we cannot backpropagate through it; it's not differentiable. But PyTorch can backpropagate through the router probabilities, because there we keep all of the probabilities. When the router's weights change slightly, the logits, and the probabilities created from those logits, change slightly for every expert, say from 0.80 to 0.81, so that path is differentiable. Check out my AI Researcher From Scratch course to learn more about derivatives and gradients. But when you are just selecting an index, changing the weights slightly still leaves the chosen index at five; the index does not move to 5.0015. That's why it's not differentiable: there is no smooth, continuous transition between experts. If the weights change enough, the choice snaps from the expert at index five to the expert at index six, or one, or whatever, and that snapping has no continuous function behind it. So in our loss formula, the first part, where we select indices, has no gradients, and the second part, the probability distribution, is differentiable. Gemini explains this in detail, and I'll explain it too: differentiable means PyTorch can use the chain rule from calculus to work out how to change the weights so the probability for a given expert goes up; it cannot use the chain rule on the selection, because it's not differentiable.
It's not a continuous function. Check out my course, as I mentioned. So why do we need both? Why do we need the probability distribution over all experts for every token and also which experts were actually chosen, their indices, to calculate this loss? Because the router can cheat if you only look at probabilities. Say we have 100 tokens and two experts, and the router gives every single token 51% for expert A and 49% for expert B. The probabilities look very balanced, so this alone would give a low loss. But the problem is that we always pick the top-1 highest probability, so this would choose expert A every time and never expert B. If we only used probabilities to calculate the loss, it would look fine, but it isn't, so we also need to fold the selection into the loss. The main thing to note is that we are not sampling from these probabilities: we don't pick expert A 51% of the time and expert B 49% of the time, we just pick the highest-probability expert every time. Let's see how our formula applies to this example. Expert A's usage fraction f is 100%, which is 1, and its average probability is 0.51, so its term is 1 × 0.51 = 0.51, which is high. Expert B is never used, so its term is 0 times its probability, which is just zero. The final sum is 0.51.
Now let's expand the example to understand this better and look at three scenarios to see how the math actually plays out. But first let's code it. The auxiliary loss is the sum of tokens_per_expert times router_prob_mean, which I just explained, multiplied by self.num_experts; we'll see on the example why. Then we just return this auxiliary loss. Ah, sorry, times the load balancing weight: that's just how important this loss is, how much we want the neural network to focus on it. We can make it bigger or smaller; if it's bigger, the loss is bigger. Usually this number is small, so we don't put too much effort into the load balancing loss; the main effort goes into the main next-token prediction loss. But let's see some math examples to see what happens in different situations.
Let's say we have two experts and check our formula. The first example I already explained. We multiply that sum by two because we have two experts, and that's done to normalize the loss value: if you didn't multiply by the number of experts, the more experts you have, the smaller the loss would be, just because you have more experts, not because your model is improving. Let's investigate why. In a perfectly balanced system every expert gets exactly 1/N of the traffic, because the traffic is divided equally: the fraction of tokens an expert receives is 1/N, and the probability of that expert being selected is also 1/N, since each expert should get an equal probability and an equal number of tokens. So each expert contributes (1/N) × (1/N) = 1/N² to the sum, and summing over the N experts gives 1/N. These aren't the real numbers, because in practice f comes from actual token counts, but the point is that as the number of experts grows, the unscaled sum shrinks like 1/N, so we multiply by the number of experts to keep the balanced case at one. In our case, where expert A gets 51% probability and always gets selected, and expert B never gets selected, the sum is 0.51; we multiply by two, the number of experts, which I just explained is there to keep things normalized, and the final loss is 1.02.
Scenario two: say the average probability is 90% for expert A and 10% for B, and A always gets selected. Now the loss is even higher, 0.9 × 1 × 2 = 1.8, because not only does A always get selected, its probability is always high as well. You can see what the multiplication does: if both factors are high, the product is very high. Now the perfectly balanced case: probabilities are 50% for both and each gets selected 50% of the time, so each term is 0.5 × 0.5 = 0.25, the sum is 0.5, and times two that is 1. That's the perfect loss. It's a bit odd that the perfect loss isn't zero, since we usually think zero error, zero loss, but that's how this one works for now. DeepSeek, with its auxiliary-loss-free balancing, and other methods are trying to replace this auxiliary loss, so the neural network doesn't have to learn two things at once, the main loss and this auxiliary loss.
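Putting the two pieces together, here is a minimal sketch of the whole load balancing loss in the spirit of what we just walked through; the function and argument names, and the 0.01 default weight, are illustrative rather than the exact repo code.

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_probs, selected_experts, num_experts,
                        load_balancing_weight=0.01):
    # fraction of routing slots each expert actually received (top-k path, not differentiable)
    expert_mask = F.one_hot(selected_experts, num_experts).float()
    tokens_per_expert = expert_mask.sum(dim=(0, 1, 2)) / expert_mask.sum()

    # average router probability for each expert (softmax path, differentiable)
    router_prob_mean = router_probs.mean(dim=(0, 1))

    # perfectly balanced routing makes the sum 1/num_experts, so scale by num_experts
    aux_loss = (tokens_per_expert * router_prob_mean).sum() * num_experts
    return aux_loss * load_balancing_weight
```

With two perfectly balanced experts this returns 1 × load_balancing_weight, and with the 51%/always-A scenario it returns 1.02 × load_balancing_weight, matching the numbers above.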
Okay, congratulations. We finally wrote this file, components.py. I recommend you commit it to GitHub, and do that occasionally. I think the other files will be faster and easier to write and to understand. Let's go to layers.py. Here we will define the attention mechanism and the transformer block. I'm going to import torch, torch.nn.functional, and also RotaryPositionalEmbeddings from torchtune.modules. That needs to be installed, so let's create a new file, requirements.txt, and write down the libraries we need: datasets (from Hugging Face), transformers, torchtune (the one I just mentioned), torch, torchao (we'll see this later), and matplotlib for graphs. We'll use datasets to load the data and the tokenizer; I'm not even sure we use transformers in this one. You will need to set this up with a virtual environment and then pip install.
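For reference, the requirements.txt described here would look roughly like this; versions aren't given in the video, so they're left unpinned.

```python
# requirements.txt -- libraries mentioned in the video (unpinned; add versions if you want reproducibility)
# datasets
# transformers
# torch
# torchtune
# torchao
# matplotlib
```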
You can check my video "AI research setup: code locally, run on Google Colab or a GPU cloud, SSH, GitHub"; it covers how to install and set all of this up, link in the description. Rotary positional embeddings were invented by Jianlin Su; I always mention him on my channel. He has a blog and an X account you can follow if you're interested in reading about research, and he works at Moonshot AI. He came up with this after about 11 years of blogging, so you can't do it overnight; it took him 11 years and now this is used by everybody. In ten years people can become very good, successful researchers.
I'm also importing MixtureOfExperts from components, which we just coded; we're importing what we just wrote. Now let's create the rotary positional embeddings class. We initialize it with the dimension of the model; this is the token dimension, although in components I called it something slightly different, so maybe the names should match. It's just the token embedding size, the model dimension. Then the maximum sequence length, and we initialize the superclass. So we pass in the model dimension, the maximum sequence length, and the base used for calculating RoPE. I have two videos on this, one on rotary positional embeddings and rotation matrices and one on RoPE itself; you can watch both, and I'll leave both in the description, or watch other RoPE videos on YouTube. This is a bit complex and will take a few passes to understand, but you don't need to master it right now; you can leave it for later, because the way we code it here is very simple. Let me just show you: we simply pass our input through the rope module and that's it. We will not code it from scratch; you could, and the videos I mentioned show how, but the reason I'm not coding it is that this is a very fast implementation. It trains the model faster.
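A minimal sketch of what that wrapper can look like, assuming torchtune is installed; the class name Rotary and the argument names are my guesses at the file, not the exact repo code.

```python
import torch.nn as nn
from torchtune.modules import RotaryPositionalEmbeddings

class Rotary(nn.Module):
    """Thin wrapper around torchtune's fast RoPE implementation."""
    def __init__(self, dim: int, max_seq_len: int, base: int = 10_000):
        super().__init__()
        # dim here is the per-head dimension, not the full model dimension
        self.rope = RotaryPositionalEmbeddings(dim=dim, max_seq_len=max_seq_len, base=base)

    def forward(self, x):
        # torchtune expects (batch, seq_len, num_heads, head_dim)
        return self.rope(x)
```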
Sorry, I forgot to say: the main purpose of RoPE is to tell the model, when you have a sequence of tokens, the order of those tokens. "Dog is chasing a cat" versus "cat is chasing a dog": it's very important which word is where, because it indicates who is chasing whom, and RoPE is how you encode that. If you don't use RoPE (I had a bug where my RoPE wasn't applied), the model gets no information about which word is at which position; it just has a bag of words to figure out, and it struggles. RoPE works by rotating pairs of dimensions of the token embedding: it takes the token vector two dimensions at a time, pair after pair, treats each pair as a little 2D vector, and multiplies it by a rotation matrix. I'm getting into the math here; you can see it in more detail in the courses I mentioned, but the idea is that how much the pairs get rotated is how the model knows which token is where, how tokens sit relative to each other. This RoPE module has a specific requirement for the input shape: batch size, then sequence length (called t, for time, I guess), then the number of heads, then the head dimension.
We'll talk more about what the number of heads and the head dimension are in the attention mechanism. So let's go to multi-head attention. We initialize it with the dimension of the model, which is just the token embedding size, the number of heads, the max sequence length (the max context window), and a dropout of 10%. You know the architecture of the transformer: those lines mean the token is flowing through the transformer, and that token has some size, it's a vector of some size, and that's the model dimension, the main dimension of the model. I have good tutorials on the attention mechanism, which I think I mentioned at the beginning of the video: coding Llama 4 from scratch, coding DeepSeek-V3 from scratch, coding Qwen 3 from scratch, and the AI research course from scratch. You can check those or any other attention videos; it's been explained many times on YouTube, so I won't explain it fully now. All the links are below the video, and I'll assume you know what query, key, and value vectors are. When I say heads, what I mean is that the query, key, and value vectors are each divided into equally sized chunks, the heads, and each head does its computation separately: the first heads of the query, key, and value interact with each other, the second heads interact with each other, the third heads interact, and so on. I'll summarize how attention works later, but it will be more of a summary than a full explanation. Okay, let's go into __init__. I'm just going to save these arguments into local variables. d_k is, I think, the dimension of the key; we'll see. Yes, the model dimension divided by the number of heads. As I said, we make the query, key, and value the same size as the model, the token vector. It doesn't have to be the same size; a lot of the time it's smaller, but in a small large language model it can be the same size.
So let's create this query-key-value projection; it's going to be a linear layer. Let me quickly explain what query, key, and value are. Each token has a query, a key, and a value. When a token is looking at previous tokens, it takes its query and multiplies it with the key of every previous token. The query is like "what information am I looking for?". The goal of the attention mechanism is to get some information from the previous tokens into this token, enriching it with context: what information, and how do we get it from the previous tokens into the current token's vector, its embedding? Each token has a query, "what am I looking for", and each token also has a key, which is like a description of what information it contains. When you multiply the query of this token with the key of a previous token, that's a dot product (you can look up dot products). If the number is small, there isn't much affinity: those tokens aren't very relevant to each other, and it's not important to pull information from that token into this one. If the dot product between this query and some other key is high, that token is very relevant to this one, and later we'll take a lot of information from it. How do we take that information? We use the value for that. Every token also has a value, and the difference between key and value is that the key describes what the token has, what information it will give you, while the value is the actual information it gives. The description of a thing and the thing itself are different, separate things.
To summarize: the query is "what am I looking for?", the key is a description of what I contain, and the value is what I will actually give you, the thing itself. So for this token, when I blend in the information from the previous tokens, I add up every single value vector, but not into the token embedding directly: I add each previous value vector into this token's own value vector; I'm just adding value vectors together. Adding with a plus is how you combine information from two different vectors: if each vector represents some information and you add them, the information combines. It works in neural networks; I don't know if it works that way elsewhere in math, but in neural networks that's how you combine information. So I add every previous value to the token I'm currently processing, but each value is first multiplied by that single number, the dot product between query and key, which says how important this value is. If the dot product is close to zero, the value is multiplied by a number close to zero, so adding it barely changes anything. If the dot product is higher, the information isn't diminished and you effectively add more of it, because the value is the information. That was the summary. At the end, every token has a value that's enriched with the values of all the previous tokens, but not the future tokens; we don't want it to look at the future tokens because we're trying to make it predict them, so it only looks back. Now this value contains information about what the token itself is plus the previous context: say the context mentions a tree, a fish, the wind blowing hard, and this token is maybe a flag or a ship that's moving; it takes in the context that the wind is blowing.
So here I'm just generating these query, key, and value vectors, but here is the trick: you don't need to generate them in three separate operations, you can combine them into a single calculation. Within your GPU, moving information back and forth is slow; moving data around is often slower than the actual computation. So we compute everything at once instead of issuing separate query, key, and value projections. We go from d_model to d_model times three: times three because the one matmul produces the query, the key, and the value. As I mentioned, each of them is the same size as the token here, although it doesn't have to be; in huge LLMs I'm pretty sure the query, key, and value are a lot smaller, five or ten times smaller than the huge token vector, but in a small model like this we can make them the same size. We don't need a bias; I've heard of some people using one, but I believe it's more common without. We also need the output projection; it's similar. The output converts the value back into the token: remember, at the end of the attention mechanism we added up all the values, and now, for every token, we need to convert that value, enriched with the other tokens, back into the token representation itself. That's what this output layer does, going from the value back to the token, at the end. I'll also define the rotary module. By the way, I'm not processing anything in these lines of code; I'm just defining the modules. Later we'll have the forward pass that does the processing, and you'll see how it works.
Then we define rotary, which is the class we just made, and pass in the dimension of each head. This is a bit subtle: the model dimension here is the same as the query, key, and value dimension, and we divide it by the number of heads to get the dimension of each head. What heads are, basically, is that instead of computing the whole query-times-key and adding the whole value, you split everything into independent parts: the first head of the query multiplies with the first head of the key of the other token, and then the first head of the value gets added, and the same for the second head, and so on. Each head is separate, so the heads can learn different things. So this d_k is really the head dimension, which is a bit confusingly named; maybe it should be called head_dim or something, but it is what it is for right now. Okay, next: dropout. That's just passed in; it's the dropout probability.
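Here is a small sketch of what this __init__ amounts to, using the Rotary wrapper sketched above; the class and attribute names are my approximations of the file, not the exact repo code.

```python
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int, max_seq_len: int, dropout: float = 0.1):
        super().__init__()
        self.n_heads = n_heads
        self.d_k = d_model // n_heads                 # per-head dimension
        # one fused projection that produces Q, K and V in a single matmul
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
        # maps the concatenated head outputs back to the model dimension
        self.w_o = nn.Linear(d_model, d_model, bias=False)
        self.rotary = Rotary(self.d_k, max_seq_len)   # RoPE acts on the per-head dimension
        self.dropout = dropout
```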
Let's now define the forward pass. The input x is batch times sequence length times token embedding: all of the tokens and conversations. We need the shapes, so the first number is the batch size and the second is the sequence length. Here I think I had a mistake, so the commented line should be deleted. Anyway, I pass the input through the query-key-value linear layer, which generates query, key, and value together. Let me look at what this is: as you can see, it's a single tensor whose last dimension is three times the model dimension, and now I want to separate it into three equally sized tensors, query, key, and value. This is maybe a bit confusing, so let's go through it. We already know x has shape batch size, sequence length, token embedding dimension, and because we pass x through this linear layer, the last dimension becomes three times the token embedding dimension. Then we split this large vector that holds query, key, and value concatenated. First, keep in mind the first two dimensions stay the same: the number of different conversations and the order of tokens within each conversation don't change. We only do something with that long query-key-value vector: first we split it into three, query, key, and value, just like that, and then we split each of them into heads, as I said, and what remains is the dimension of each head.
So maybe there are eight heads, each with, I don't know, maybe 100 dimensions. Say the query has a dimension of 800: with eight heads, each head has 100 dimensions. Just to summarize: each token has three vectors, query, key, and value; each of those is divided into heads; each head is a vector of a certain size, and there is some number of them. The first heads of query, key, and value interact with each other, the second heads interact with each other, and so on, and the heads are independent: only first with first, only second with second, never first with second. Okay. We also need to do a permutation here. You see I'm rearranging the dimensions: these are dimensions 0, 1, 2, 3, 4, and the numbers look a bit confusing, but look at the indices. I'm taking the dimension that separates queries, keys, and values and making it the first dimension, so imagine all the query stuff together, all the key stuff together, all the value stuff together, with the batch size kept next. This is a bit confusing; you may need some practice to understand it. Then we separate by heads, that's the next separation, and then the sequence length. Imagine taking the first head of each token and lining them up as a sequence: the first head of the first token, the first head of the second token, the first head of the third token, and so on; that's what this sequence-length dimension is, just a list of tokens. And dimension four is the head dimension, which stays the same. This is a bit confusing; maybe copy-paste it into ChatGPT and try to visualize it, it takes some practice. Just know that we separate by query/key/value, then by heads, and within each head there is a sequence of tokens. And why do we do this permutation? First of all, we get the queries, keys, and values separated out of that one long vector, and then each of them has shape batch, head, and then within each head a sequence of tokens and the size of the head. It makes sense to separate keys, queries, and values; you need them separated.
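A sketch of what that reshuffle looks like inside the forward pass, under the shapes just described (names assumed, as above).

```python
# fragment of MultiHeadAttention.forward, assuming x has shape (B, T, d_model)
B, T, _ = x.shape
qkv = self.qkv(x)                                   # (B, T, 3 * d_model)
qkv = qkv.view(B, T, 3, self.n_heads, self.d_k)     # split into Q/K/V and into heads
qkv = qkv.permute(2, 0, 3, 1, 4)                    # (3, B, n_heads, T, d_k)
q, k, v = qkv[0], qkv[1], qkv[2]                    # each (B, n_heads, T, d_k)
```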
Now this line I think I messed up; I need to delete it as well. Then we apply the rotary embeddings, the rotary positional encoding, to the queries and keys only. We need to transpose, to swap two dimensions, before passing them through RoPE, because that's the shape RoPE expects: batch, sequence, head, head dimension. You can see it: batch, t (time, the sequence length), head, and then head dimension. So we transpose, pass through the rotary module, and then transpose back to restore the original layout. I had a bug here before; someone fixed it, and after that my large language model worked a lot better.
And finally, this line does the attention mechanism itself. We're not going to code attention from scratch here; you can watch Andrej Karpathy's video, or a million other videos on coding attention from scratch; it's probably the most-explained thing in AI research. I'm using this because it's very fast; usually you want to use these PyTorch-native functions, they're very fast and very good, the engineers who built them are very good, and you only want to build something from scratch if it doesn't exist in PyTorch. We pass in the query, key, and value; is_causal means tokens only look at previous tokens and cannot see future tokens, which is what we want in a large language model; and dropout if it exists. That's it, so easy: just call it and it calculates attention, very fast and very good.
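That call is PyTorch's built-in fused attention; a sketch of how it appears here:

```python
import torch.nn.functional as F

# fused attention kernel; is_causal masks out future tokens
attn_out = F.scaled_dot_product_attention(
    q, k, v,
    is_causal=True,
    dropout_p=self.dropout if self.training else 0.0,
)                                                   # (B, n_heads, T, d_k)
```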
Okay, and then for the attention output we transpose something; let's see what we're doing. The attention output ends up the same shape as our input, batch size, sequence length, and then the embedding dimension, except that conceptually the last dimension is now the value dimension: at the end we have, for each token in the sequence, its value rather than its token embedding. In our case the value size equals the token embedding size, so it's literally the same shape as the input x. As I said, for each token we now need to convert the value (and remember, this value contains the added values from the other tokens as information) back into the token embedding. We do that simply by passing this list of values through the output layer, which converts the list of values into a list of tokens, and the shape stays the same. Now let's understand the transpose and reshape. The transpose does the same kind of swap as before: it swaps the head dimension and the sequence-length dimension. The attention output originally has shape batch, head, sequence, head dimension; after the transpose we get batch, sequence, head, head dimension, and then we combine the number of heads and the head dimension using the reshape. It figures out the rest automatically: batch and sequence stay where they are, so they're unchanged, and it works out that it needs to merge number of heads times head dimension to get the final size. So all of the independent heads for this value get merged into a single vector, and we end up with shape batch, sequence, d_model: a list of values enriched with context, which we then pass through the output projection to get actual tokens again.
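And the last two steps, sketched in the same assumed notation:

```python
# (B, n_heads, T, d_k) -> (B, T, n_heads, d_k) -> (B, T, d_model): merge the heads back together
attn_out = attn_out.transpose(1, 2).reshape(B, T, -1)
return self.w_o(attn_out)                           # project the values back to the token dimension
```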
Next we have the simplest part of this whole file: the transformer block. Let's see the initializer. We have the model dimension, the number of heads, and the expert hidden dimension, the one that was four times bigger than the model dimension, if you remember the expert's feed forward. Then there's an argument I think is for RoPE; I'm not sure I use it, we'll see later. This lora_rank is for a different implementation, DeepSeek's multi-head latent attention, and I forgot to delete it, so you don't literally need these; sorry, I forgot whether I use them. Then max sequence length, which you do need, the number of experts, which you need, top_k for selecting the top two experts, and 10% dropout. We start by initializing the superclass, and by superclass I mean nn.Module; you want to initialize that inside the __init__ of this class. First we initialize our multi-head attention class and pass in the necessary parameters: d_model, number of heads, max sequence length, dropout. Then the mixture of experts, which I'm going to save as feed_forward (you could name it something else), passing in its parameters. Then normalization: we're going to use RMSNorm. You can watch my video "RMSNorm from scratch", link below, where I explain everything. Normalization keeps the numbers from becoming too large or too small, like a thousand, ten thousand, a million, or 0.00001; we want to keep numbers around one. That's what normalization does. We'll have two of them, and we'll see later why. And dropout as well.
Let's see the forward pass of the transformer block. We pass through attention, but first we normalize. If you look at the transformer architecture, in the original decoder this norm comes after the attention, but people figured out that putting the norm before works better, and before the feed forward we'll also have a norm. In a decoder-only transformer we don't have the cross-attention part, so it's just attention and then straight to the feed forward, with a norm before each. After we pass through attention there's a residual connection: the attention output first goes through dropout to regularize (regularize, I can barely say it), and then the residual connection adds the output of the attention to the input. The idea is that we want to preserve some information from the tokens themselves, not only the version processed through attention; as I said, when you add two vectors you add information from both, the original and the processed one. Then the same thing but through the mixture of experts: we take the x that just came out of that, pass it through the second norm and then through feed_forward, which is our mixture of experts and returns not only the output but also the auxiliary loss; then dropout for regularization and the residual connection to preserve some of the information from before the feed forward, and finally we return the processed x and the auxiliary loss.
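A compact sketch of the block, assuming a recent PyTorch with nn.RMSNorm (the repo may use its own RMSNorm class) and the MixtureOfExperts signature from earlier; argument names are approximations.

```python
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model, n_heads, d_ff, num_experts, top_k, max_seq_len, dropout=0.1):
        super().__init__()
        self.attention = MultiHeadAttention(d_model, n_heads, max_seq_len, dropout)
        self.feed_forward = MixtureOfExperts(d_model, d_ff, num_experts, top_k)
        self.norm1 = nn.RMSNorm(d_model)
        self.norm2 = nn.RMSNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # pre-norm attention with a residual connection
        x = x + self.dropout(self.attention(self.norm1(x)))
        # pre-norm mixture of experts, which also returns its auxiliary loss
        moe_out, aux_loss = self.feed_forward(self.norm2(x))
        x = x + self.dropout(moe_out)
        return x, aux_loss
```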
Next, let's combine all of this into the large language model, the whole model. This is also a short part; I think it's very simple. The most difficult part was the experts, because I'd never explained them in other videos, so I explained them here; the rest should be easy. My llm.py file disappeared for some reason, and I also have this __init__.py; I don't know how it got created, but you will actually need it and we'll fill it in later, so don't worry if you don't have it. Just make sure to have this llm.py file. Let's first import the good stuff: Optional from typing, ModelConfig from our configs module (we don't have that defined yet, we'll make it), and TransformerBlock from models.layers, which we just created. Now let's create our large language model. I named it MinimalLLM; maybe I could have given it a better name, something like a minimal mixture-of-experts LLM. We initialize it with a config of class ModelConfig, the class we'll define later, do the super initialization, and I'll show you what the config contains.
Okay, here: the token embedding is going to be an nn.Embedding. This is a big matrix of token embeddings, one for each token in the vocabulary, so let's say 200,000 possible tokens. If you don't know what tokens are, watch my Llama 4 from scratch course in the description below; tokens are these subwords, and the model predicts the next one, generating, generating, generating. So we have this big lookup table, like a dictionary (I'm not sure it's literally a dictionary, but it behaves like one): vocab size, so 100,000 or 200,000 tokens, and for each token it holds a vector embedding. For example, the token "car" has some vector embedding, and if it sits at position 70, we just index into this table at position 70 and take out the vector embedding for "car". This is commonly explained everywhere, so I'll go through it quickly. We also do a position dropout; we'll see that later on an example.
We just defined one transformer block, which has attention and a feed forward, but a large language model has multiple of these blocks stacked: tokens go through one, then the next, and it repeats and repeats. So we create a list of these blocks, each block having attention and feed forward. It's going to be a list, and in that list we put TransformerBlock, which we just defined, and pass in all the stuff it needs: the model dimension, the number of heads, the feed-forward hidden size, and, as I said, we'll see later whether we need the rest; I think we don't need this one. Then there's this other variable; I forgot what it is, which isn't good, you should know what your variables are, but I think it was for DeepSeek's latent attention, so we won't need it; you can see it's only defined once in this whole file. Then max sequence length and the number of experts. All of that is passed in through the configuration file, so when we create a large language model we keep the configuration separate and can have multiple different models without hardcoding any single one into these files; these are just components and layers. We do this for the number of layers. Big language models have, I think, 30 or 40 layers, smaller ones have 15 or 16, and the absolutely huge ones, like Grok 4, who knows how many they have. You don't need to change anything else here.
Next, let's define normalization and dropout for the output head. The output head is a separate thing from all of these layers; I never mentioned it until now, so let me pull up the architecture. These green blocks are the layers, and they get repeated many, many times. At the end, after all of the layers, we generate the next token; we're not generating it inside the attention and feed forward, there we're just processing the current tokens in the context window. After they've been processed many times, we take the last token in the context window and convert it into the next token using, let's say, these two things here: the linear and the softmax. This linear layer is the output head, that's what we call it. And it's interesting: it can be absolutely huge, it can be a third of the entire model. Think about it: you repeat all of those layers 16 times, and still this one single linear layer can be about a third of the whole model, and you'll see why, especially for small models. So, output dropout and output normalization, and then we'll tie the weights; this is the output head. What the output head does is take the last token, a vector of token size, and convert it into a vector whose length is the vocab size: how many tokens do we have, that's the length of this vector. That vector assigns a score to each token, and the score tells us how strongly that token should be the next token. It's not a probability; it's more like an affinity: the bigger the score, the more the output head wants that token to be next, but the scores don't have to add up to one. It's not a probability distribution yet.
Okay, let's summarize. The LM head, the output head, converts the last token, which is now highly processed, into scores for every single possible token in the vocabulary: how suitable each one is to be the next token. Again, we're not talking about likelihoods yet; these are just scores. Later we'll use softmax to convert these scores into a probability distribution that adds up to one, to 100%. And here is one trick people found: you actually want the weights of this LM head to be the same as your token embedding weights, the exact same weight matrix for this linear layer. So these weights are literally the token embeddings themselves. What we're doing is taking the last token in the sequence, the one enriched with all the context and processed through all of these modules, and multiplying it with every single token embedding in the vocabulary, all 200,000 of them, and that produces the logits. People found that tying the weights like this works well, so the weights of the output head are the same learned token embeddings.
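In code, weight tying is just two lines along these lines (names assumed):

```python
# weight tying: the output head reuses the token-embedding matrix
self.lm_head = nn.Linear(config.d_model, config.vocab_size, bias=False)
self.lm_head.weight = self.token_embedding.weight
```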
Then this self.apply(self._init_weights) will initialize all of the weights: the embedding layers, linear layers, RMSNorm, the transformer blocks, all according to the rule in _init_weights. That doesn't exist yet, so let's define it. We check which type of module we're initializing: if it's a linear layer, we initialize its weights from a normal distribution with mean zero and standard deviation 0.02, and the bias, if it exists, is set to all zeros. And if it's an embedding rather than a linear layer, we also initialize it from a normal distribution with mean zero and standard deviation 0.02.
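A minimal sketch of that rule; the method name matches what was just described, the rest follows standard PyTorch.

```python
def _init_weights(self, module):
    # normal(0, 0.02) for linear and embedding weights, zeros for biases
    if isinstance(module, nn.Linear):
        nn.init.normal_(module.weight, mean=0.0, std=0.02)
        if module.bias is not None:
            nn.init.zeros_(module.bias)
    elif isinstance(module, nn.Embedding):
        nn.init.normal_(module.weight, mean=0.0, std=0.02)

# inside __init__:  self.apply(self._init_weights)
```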
Next we define the forward pass through this large language model; Gemini has a good explanation here. x is our input of shape batch by sequence, and it's actually just a list of token IDs: a batch of sequences of token IDs. We'll see later how we get that list. If we put all of those IDs through the token embedding we defined just above, we exchange each ID for the embedding of that token. For example, if the token "cat" has ID 350, we first pass in 350 as the ID (we'll see later, we haven't coded that part yet), and by passing x through the embedding we get the whole embedding vector for "cat" instead of the number 350. Then we multiply by the square root of the model dimension. Now that I look at it, this scaling was used when you were adding positional encodings to the token embedding: if the positional vector's numbers are a lot larger than the embedding's numbers, they drown the embedding out and everything becomes position information, so you scale the embedding up by multiplying. But that was for added position vectors, and we're using RoPE, which rotates the token embedding itself, so maybe we don't need this; I'm not sure yet, let's leave it as is for now. This position dropout on x will zero out some dimensions within the token embedding vector. Imagine you have the tokens "hello" and "world": it randomly sets some of the dimensions of the "world" embedding to zero. Dropout regularizes, so the model doesn't become too dependent on the exact numbers in an embedding, and it improves generalization.
Next we collect the auxiliary losses from the MoE layers. We initialize an empty list, and for each block we pass the input through the block, which returns the processed input and an auxiliary loss; if that loss is not None and return_aux_loss is True, we keep it, because we want the LLM to return the auxiliary losses from each block. Then the output projection: after we've gone through all of the blocks, each block being attention plus feed forward, we generate the next token. We first normalize x and apply dropout, then pass it through the LM head. As I explained, the LM head converts the embedding of the last token in the context, in the sequence, in the conversation, into the logits: scores over the entire vocabulary, where each possible token gets a score for being the next token. Then we combine the auxiliary losses: we simply sum them all. We first collected them because we go block by block, and then we sum whatever isn't None; if we want to return them, we return the logits and the summed auxiliary loss. I suppose we could also sum them as we go instead of first putting them into a list; that would also work. Otherwise we just return the logits, which, again, are the scores for every token in the vocabulary to be the next token.
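Putting the forward pass together, roughly, under the attribute names used in the sketches above (all assumptions, not the exact repo code):

```python
import math

def forward(self, x, return_aux_loss=True):
    # x: (batch, seq_len) of token ids
    x = self.token_embedding(x) * math.sqrt(self.config.d_model)
    x = self.position_dropout(x)

    aux_losses = []
    for block in self.blocks:
        x, aux_loss = block(x)
        if aux_loss is not None:
            aux_losses.append(aux_loss)

    x = self.output_dropout(self.norm(x))
    logits = self.lm_head(x)                     # (batch, seq_len, vocab_size)

    if return_aux_loss:
        total_aux = sum(aux_losses) if aux_losses else None
        return logits, total_aux
    return logits
```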
We have maybe ten more lines of code left, because we finished these three big files. You need one more file: __init__.py. The purpose of this __init__.py is to give a clean API, a clean way to import things into different scripts. Without it we would be importing like from models.components import ..., from models.layers import ..., and so on. With this __init__.py we can just say from models import whatever we need, so we don't have to spell out the file as well; it's a lot cleaner, and we can also import all of the classes from models directly. Here we just want to export all of our classes so we can easily use them in other files; let me show you what I mean. First, from components we import Expert, TopKRouter, and MixtureOfExperts, and from layers we import all of our classes, although we don't have multi-head latent attention, I did not code that, so I'm going to delete it in a second. From llm we import MinimalLLM. Then we define the __all__ array with all of the classes I just imported, and that's it; this is how we'll use these classes in different files. And multi-head latent attention doesn't exist, so I'm deleting that.
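A sketch of what that models/__init__.py ends up looking like; the component and layer class names are from the video, the layer names I didn't see spelled out (like Rotary) are guesses.

```python
# models/__init__.py -- re-export the classes so other scripts can write
#   from models import MinimalLLM, MixtureOfExperts, ...
from .components import Expert, TopKRouter, MixtureOfExperts
from .layers import Rotary, MultiHeadAttention, TransformerBlock
from .llm import MinimalLLM

__all__ = [
    "Expert", "TopKRouter", "MixtureOfExperts",
    "Rotary", "MultiHeadAttention", "TransformerBlock",
    "MinimalLLM",
]
```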
Make sure to commit everything to GitHub.
Next, besides this models folder, let's create a new folder, optimizers. I'm separating it out because this is what we're doing research on, but you can structure it differently if you want. Inside we'll have __init__.py and muon.py. We're going to define the Muon optimizer ourselves, and Adam we'll just use from PyTorch; as I said, you want to use whatever already exists in PyTorch, those implementations are very fast, the best, unless you specifically want to code something yourself and change it. Now, I don't know if there's a single video on all of YouTube that explains the Muon optimizer really well. If you go to my channel and search for Muon, I have videos about it, but I don't think there's a full course on the Muon optimizer anywhere on YouTube; you can search "muon optimizer" yourself, and maybe this one is actually good, though it's a bit more complex if you want to watch it. I'll try to summarize Muon as well. In muon.py we import torch and functional as F, and we torch.compile whatever we define here. The function is zeropower_via_newtonschulz5; that's how Keller Jordan names it. G is the tensor we pass through this function, with five steps by default; we'll see what those steps are, and it returns a tensor. It's a Newton-Schulz iteration to compute the "zeroth power", the orthogonalization, of G. It sounds crazy. If you do a Google search for the Muon optimizer, my videos show up there as well.
The main idea behind the Muon optimizer: you have a matrix of weights in a neural network, with the weights for the first neuron, the weights for the second neuron, the third, and so on. When you multiply the input with these weights you get some hidden layer, some neurons, some activations. So, just explaining neural networks: the input times the first row of weights gives you the first neuron, the input times the second row gives you the second neuron; there may also be an activation function just before the neuron, but that's not so important. What's important is this: if you look at each row of this weight matrix separately, as a vector, and those vectors are perpendicular to each other, at 90 degrees, say the first row vector points this way and the second row vector points that way with 90 degrees between them, then neural networks learn faster with less data. Why? I don't know if anybody fully knows; maybe the inventors have some idea. I haven't read about it in depth; I'm trying to keep up with the Muon optimizer and I haven't yet understood why neural networks learn more from less data when this rule of perpendicular rows (or columns, it can be either) holds for the weight matrix. So with the Muon optimizer we're trying to keep the rows of the weight matrix perpendicular to each other. Viewing the rows as vectors, we keep the matrix orthonormal, which means the rows (or columns) are orthogonal to each other and normalized. The way we keep it like that is by orthonormalizing each of the update matrices. You know there is this update matrix, which comes from the gradients; this is classic backpropagation, and I showed you my course on backpropagation from scratch. When the neural network does backpropagation, there is this update matrix: for each weight, how much to add or subtract so the loss goes down. The update matrix gets multiplied by the learning rate and subtracted from the current weights of the model to adjust them; again, it's made from gradients, it tells you what to subtract from the weights to make the loss go down, and it has the same shape as the weight matrix. So we're going to make these update matrices orthonormal, and then, as you keep applying them, update after update, the weights themselves stay close to orthonormal, because all of the matrices that shaped them were orthonormal. But we'll only make these update matrices approximately orthonormal, because true orthonormalization would involve computing the singular values, a singular value decomposition, which is very expensive and slow. So we approximate the effect of the SVD with something that, to me, looks like a polynomial. Let's see. First of all, we apply Muon only to 2D matrices; it's made for 2D weight matrices, not for embedding vectors or normalization parameters, which aren't 2D. And as I said, we have these coefficients here, which is the first reason it reminds me of a polynomial; you'll see later that they are the coefficients in something that looks like a polynomial.
So we will convert um G to this half data type because this algorithm is stable enough to do this in FP16 and it
provides a lot of speed up as opposed to being FP32 for example.
Then we're going to check if the matrix is tall and so if it has more rows than columns.
So it's tall. If it's tall, we want to make it wide. We want to transpose it.
The algorithm works better or is standardized on matrices where rows number of rows are is less or equal than number of columns.
In the next row we have normalization.
So x is equal to x over and then x dotn norm along these um two dimensions
and then to avoid dividing by zero.
So Newton Schulz iterations which is our polomial approximation of this singular value the composition
and making the matrix. So this is what makes the matrix or norormal and it's going to diverge if it's spectral norm
largest singular value of the matrix is too high. So larger than this value.
too high. So larger than this value.
So we are dividing it uh by this norm to make sure that the matrix is small enough so it converges or the spectral
norm is small enough I should say. So
this minus2 and minus1 are rows and columns and it will make sure we calculate um the
forbid for binius whatever the norm is of each matrix individually if there are multiple matrices or if there is one
matrix just of that one matrix.
So if we have a single 2D matrix, the standard case, the norm is computed like this: we square all of the values, sum them, and take the square root. That's the norm, and the norm over these two dimensions gives the same thing. But if we have multiple matrices stacked together, a plain norm would add up the squared values of both matrices. By specifying the dimensions like this, it computes the norm over the rows and columns of each matrix separately.
We also have keepdim=True. Let's see what that does. Without it, the result is a 1D tensor: the first norm for the first matrix, the second norm for the second matrix. But our input had shape (2, 2, 2), and the result has shape (2,), and you can't just divide a (2, 2, 2) tensor by that; you would need to reshape it to (2, 1, 1). We don't need that reshape if we just preserve all three dimensions, so we say keepdim=True and the norm keeps one entry per matrix in place. Then PyTorch can broadcast the division: it divides the first matrix by the first norm and the second matrix by the
second norm. So after this normalization and stabilization, we come to the polynomial approximation I was talking about. I'll just show you the lines of code.
We run some number of iterations, and every iteration makes the matrix X, which is our update matrix, more and more orthonormal. With five iterations we make it fairly orthonormal; with ten, even more so. The question is: how many iterations is enough? We don't want to waste compute, and this is something I'm also testing in the paper we are writing here. So we're going to try different numbers of steps: 10 steps, 5 steps, 2 steps, and so on. The default is five steps.
Now look at this: A is X matrix-multiplied by X transposed. If X were perfectly orthogonal, A would be the identity matrix. Let me first show you the other lines of code and then we will discuss.
So here is why I say it looks like a polynomial. If you look at A, you have literally something like A squared, which in this case is just a matrix multiplication, times one coefficient, plus A to the first power times another coefficient. That gives us B. And then, a bit weirdly, we have B times X, but we also have just X times the third coefficient. So that's why it reminds me of a polynomial, and it literally is a polynomial: it computes the polynomial using the coefficients b and c, calculating terms based on X times X transposed and on (X times X transposed) squared.
So we're talking about this line, and then the next one combines the original X with the polynomial corrections. Look at this second part: you have X, the update matrix, and you are transforming it to be more orthogonal with this B. But you are also keeping some of the old X through a, because a is just a scalar coefficient. So it's a bit of the old X plus a bit of the new X, and after every iteration the result will be a more and more orthonormal matrix. And this is very fast to do. Look how fast it is: it's just additions and matrix multiplications, which are very fast on a GPU, as opposed to the slow singular value decomposition, which I'm pretty sure has to run on the CPU.
Okay, this part is a bit tough: this B constructs the coefficients for the X cubed and X to the fifth terms. I'll probably need to make a full video, or figure out a better way to explain Muon, and to understand it better myself as well. But as I said, what this does is make this gradient update matrix more and more orthonormal with each iteration. And if we transposed the matrix earlier, we turn it back here. Then we just return this matrix, which is now orthogonalized, or approximately orthogonalized.
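Here is a minimal sketch of the whole function we just walked through, assuming the standard quintic Newton-Schulz formulation. The a, b, c coefficients below are illustrative values from public Muon implementations, and the function name and signature are assumptions, not necessarily what's in my repository:

```python
import torch

def zeropower_via_newtonschulz5(G: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    """Approximately orthogonalize the update matrix G with Newton-Schulz iterations.
    Coefficients are illustrative values taken from public Muon implementations."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G.to(torch.bfloat16)                      # half precision is stable enough here and much faster
    transposed = G.size(-2) > G.size(-1)
    if transposed:                                # tall matrix: make it wide first
        X = X.mT
    # divide by the Frobenius norm so the spectral norm is small enough for the iteration to converge
    X = X / (X.norm(dim=(-2, -1), keepdim=True) + eps)
    for _ in range(steps):
        A = X @ X.mT                              # would be the identity matrix if X were orthogonal
        B = b * A + c * (A @ A)                   # coefficients for the X^3 and X^5 terms
        X = a * X + B @ X                         # a bit of the old X plus the polynomial correction
    if transposed:                                # undo the transpose before returning
        X = X.mT
    return X
```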
Then we define the class Muon, the new optimizer, inheriting from torch.optim.Optimizer. The name stands for MomentUm Orthogonalized by Newton-Schulz, and that's exactly what we are doing. In the init we take the parameters of the model and a learning rate that is quite high compared to Adam, whose typical learning rate is roughly 20 times smaller than this. We're also going to be checking this momentum value; I'll be doing ablations on it. I find 0.9 works very well in my experiments, we'll see that later. Then the number of Newton-Schulz iterations, five by default, and whether or not to use Nesterov momentum, which we'll also get to later.
Then this defaults variable is just a dictionary. It packages the hyperparameters; you can see we're just packing the hyperparameters into this defaults dictionary and initializing the superclass.
Next, @torch.no_grad, and then we define the step function. The reason we don't want torch to calculate gradients here is that we are not training any weights inside the optimizer itself. We are using this optimizer to train the neural network, but we are not going to be computing gradients for, or training, this function itself; it's just used to update other weights.
Then we loop through the parameters and check whether each parameter has a gradient. If it doesn't, we just continue; otherwise we grab the gradient. Then there's this state: for each parameter we get the momentum that's saved for it. If there is no momentum buffer in the state yet, we initialize it with zeros, with the same shape as the gradient of whatever parameter we are processing. Then we store this momentum buffer in buf, so buf is either zeros or whatever we had from previous steps.
For the next line, we have buf.lerp_, which is linear interpolation. We pass in the gradient and one minus the momentum, and it updates the buffer in place, because we're using the in-place .lerp_ here. It's like saying buffer equals buffer times momentum plus gradient times one minus momentum. Remember that our momentum is 0.95, so it's very close to one. That means a lot of the old buffer remains: the new buffer will be 95% old buffer and just 5% new gradient.
So the gradients won't change the buffer very much, which provides stability, because individual gradients can be a bit random. We don't want gradients to change things too abruptly; we want the buffer to change bit by bit. This creates an exponential moving average of the gradients.
The reason we need a moving average is this: imagine these gradients are, on average, pointing in some direction on the loss surface that reduces the loss, but each gradient individually also points in a somewhat random direction. On average, though, they point in a good direction, so averaging the gradients is a good idea. The buffer starts from zero, and then, gradient by gradient, it averages out into this exponential moving average of gradients.
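Here is a tiny sketch, just to show that .lerp_ with weight one minus momentum really is the exponential-moving-average update written above; the numbers are made up:

```python
import torch

momentum = 0.95
buf = torch.zeros(3)                          # momentum buffer starts at zero
grad = torch.tensor([1.0, -2.0, 0.5])         # current gradient (made-up values)

old_buf = buf.clone()
buf.lerp_(grad, 1 - momentum)                 # in place: buf = buf + 0.05 * (grad - buf)

# the same update written out: keep 95% of the old buffer, take 5% of the new gradient
explicit = old_buf * momentum + grad * (1 - momentum)
print(torch.allclose(buf, explicit))          # True
```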
Then we check whether we're using Nesterov momentum. If nesterov is false, we apply standard momentum, and the update is just the buffer itself: that buffer gets passed through Newton-Schulz and is literally the weight update matrix, which again is just the exponential moving average of the gradients.
But if we have Nesterov momentum, something a bit more complex happens. The difference is that standard momentum keeps only about 5% of the current slope, the current gradient, while Nesterov momentum keeps almost 10%. You can see that at this point we are linearly interpolating again, between the current gradient and the buffer: we take the buffer we just computed, weight it by 0.95, and give even more weight to the current gradient. So in the end, standard momentum carries only about 5% of the current gradient, while Nesterov momentum carries almost 10%.
This is what makes Nesterov momentum look ahead at a future spot: it's as if we take the momentum step first, evaluate the slope at that future position, and then adjust based on it. Because we go from roughly 5% current gradient to almost 10%, we are simulating the correction we would have made if we had already taken that next step. There is a video on Nesterov
momentum that I think is very good. Then, in the next line, we make this gradient update matrix more orthonormal with the zeropower Newton-Schulz function, and then there is just one more line: p.add_, with all this good stuff. It looks a little complicated, and that's because this final weight update is designed to behave differently depending on whether the update matrix is tall or wide, whether it has more rows or more columns.
This part checks whether our matrix is square, tall, or wide. If the matrix is tall, meaning it has more rows than columns, we want to scale the update up. If you look at it, this is rows over columns, and we take the maximum of one and rows over columns, so only if there are more rows than columns will this number be larger than one. Then we take the square root of that number, but that detail isn't so important; the point is that if we have more rows than columns, this factor will be larger than one, and otherwise it will be exactly one.
Tall update matrices make the neural network learn slower, due to what you could call energy dilution. So our shape correction for tall matrices is greater than one: we multiply the learning rate by this shape correction to make the effective learning rate larger for tall matrices, because they are diluted. If we didn't, they would make the network learn slower than an equivalent square or wide update matrix. Explaining exactly why tall update matrices slow learning down would take too long for this video, so I'll cover it in the future; for now you'd have to ask Gemini or something and study it yourself.
Looking at the final formula, we have our weights, which are P, the parameters. In the code we have p.add_, so we are adding something to P: minus the learning rate times the orthogonalized gradient. This is important: we are literally modifying P in place. When you pass a negative learning rate, the add becomes a subtraction. It multiplies G, our orthogonalized gradient, by minus the learning rate; we force G to have the same shape as P, so it's a 2D matrix, and the result gets added to, or in this case subtracted from, P itself. So the whole update looks something like the sketch below.
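A minimal sketch of the full Muon step we just described, assuming the zeropower_via_newtonschulz5 sketch from earlier is in scope; the default values and the exact parameter handling are assumptions, not a copy of the code in my repository:

```python
import torch

class Muon(torch.optim.Optimizer):
    """Sketch of the update described above: MomentUm Orthogonalized by Newton-Schulz."""

    def __init__(self, params, lr=0.02, momentum=0.95, nesterov=True, ns_steps=5):
        defaults = dict(lr=lr, momentum=momentum, nesterov=nesterov, ns_steps=ns_steps)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self):
        for group in self.param_groups:
            for p in group["params"]:
                if p.grad is None:
                    continue
                g = p.grad
                state = self.state[p]
                if "momentum_buffer" not in state:
                    state["momentum_buffer"] = torch.zeros_like(g)
                buf = state["momentum_buffer"]
                buf.lerp_(g, 1 - group["momentum"])           # exponential moving average of gradients
                # Nesterov: blend the current gradient with the buffer; otherwise use the buffer directly
                g = g.lerp_(buf, group["momentum"]) if group["nesterov"] else buf
                g = zeropower_via_newtonschulz5(g, steps=group["ns_steps"])  # orthogonalize the update
                # tall matrices (more rows than columns) get a larger step to compensate for dilution
                scale = max(1.0, p.size(-2) / p.size(-1)) ** 0.5
                p.add_(g.view_as(p).type_as(p), alpha=-group["lr"] * scale)
```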
So that's the Muon optimizer. I'm going to commit this to GitHub and we can move on to the next thing. I will also create an __init__.py here in the optimizers folder, and inside I'll import the two things from muon, the Muon class and the zeropower Newton-Schulz function, and re-export them. That way it's easy to import straight from this module: I can say from optimizers import Muon instead of from optimizers.muon import Muon. We can easily add more optimizers later and export everything from here, and it stays very clean; we'll import everything we need just from optimizers.
Then let's make a new file outside of everything, train_moe.py, or train.py, or whatever you want to call it. Here we will literally write the code for training the whole neural network; we will run python train.py when we want to train the large language model.
Let's start with some imports: time, os, torch, logging, and from torch.utils.data import DataLoader for loading the data. We also need to set this environment variable, because otherwise it prints a lot of warnings in the console while training about the tokenizing process being split across processes; setting it stops those warnings, or something like that, I'm not 100% sure. Maybe it only matters if you are tokenizing with multiple processes or GPUs.
Then, from configs.moe_config we import the MoE model config. We don't have this yet, but we will create it soon. From configs.dataset_config we import DataConfig. From data.loader we import prepare_lm_dataset, which we will also write later. From training.trainer we import train_model, from utils.helpers we import set_seed, and from utils.logger we import setup_logging.
Okay. We can print some system info, like whether it's using CUDA or the CPU; that's the device we print. If CUDA is available, you can also print which GPU it is and how much memory it has; this grabs the first GPU, which, if you only have one, is simply your GPU. I can also print the torch version. Then we set up the logger, which I'll point at a logs folder that gets created automatically, and log an info message that MoE training is starting, just some information. print_system_info is the function we just made; this all lives in main, which runs when training starts. Then set_seed(42): we want a consistent seed so we can reproduce experiments. Then model_config, which we import and will create soon; it gives us all of the configuration for our model.
Next we need the dataset, downloaded from Hugging Face. I'm going to use the HuggingFaceTB SmolLM corpus; it's good for training a small LLM, and this is its name. There are three datasets inside that repository, and this is the one we will use. We will also use the matching tokenizer, because it was made for this dataset.
Now we set the sequence length from the config, which we will create later, and the number of documents to take, also from the config: how many we want to download. But we will be streaming, downloading as we train; we will not download the entire dataset at once. Then we set a cache directory for the Hugging Face downloads. Make sure this cache directory is in your .gitignore; it isn't here, so I'll just add it, because you don't want cache directories committed to GitHub.
We split documents before tokenization to prevent data leakage, splitting between training and evaluation data. From datasets we import load_dataset; I told you we'd install the datasets library through requirements. For the raw dataset we call load_dataset and pass in the dataset path defined in the data config, the dataset name, and the split from the config. Do we have a split field? Maybe we'll add it later; I'm not sure why I don't have it here, we'll see. We also pass the cache directory and streaming=True, because we want it to download only what it needs; we don't want to pull down the whole dataset when we'll only train on a very small part of it.
Then we take number_of_samples, maybe 10,000 documents, from this raw dataset. Since it's streaming, it will stream those 10,000 documents and we load them into memory once they have all arrived. Then 10% of that number of documents becomes the validation set, and the rest becomes the training documents. From datasets we import Dataset, and raw_train is Dataset.from_list over the raw samples from zero up to the number of training documents; validation is the same, from the number of training documents to the end. Then we log that we split the data into training and validation, and how many documents each got.
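A minimal sketch of this streaming download and split, assuming the HuggingFaceTB SmolLM corpus mentioned above; the config name ("cosmopedia-v2"), the cache directory, and the variable names are assumptions:

```python
from datasets import load_dataset

raw_stream = load_dataset(
    "HuggingFaceTB/smollm-corpus",   # assumption: the dataset path used here
    "cosmopedia-v2",                  # assumption: one of the three configs in that repository
    split="train",
    streaming=True,                   # stream documents instead of downloading everything
    cache_dir=".hf_cache",            # keep this directory in .gitignore
)

num_samples = 10_000
raw_samples = [doc for _, doc in zip(range(num_samples), raw_stream)]  # materialize only what we need

num_val = int(0.1 * len(raw_samples))                 # 10% of documents for validation
num_train = len(raw_samples) - num_val
train_docs = raw_samples[:num_train]
val_docs = raw_samples[num_train:]
print(f"Split {len(raw_samples)} docs into {len(train_docs)} train / {len(val_docs)} val")
```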
From data.loader we import setup_tokenizer, tokenize_and_chunk, and finalize_dataset. The tokenizer comes from setup_tokenizer, a function we will write, and we pass in the data config; then config.vocab_size is set to the tokenizer's vocab size. For tokenizing the training set we call tokenize_and_chunk from our data.loader, which we will also define, passing in the documents, the tokenizer, and the config, and then finalize_dataset, another function we'll create later that does some processing on the dataset. Tokenizing the validation set works exactly the same. And now that we have the tokens themselves, we can log how many tokens each split has.
Next is data loading: data goes from RAM through the CPU, which processes it and turns it into tensors, and then it's sent to the GPU, and the GPU performs the matrix multiplications.
Batch size is how many of these chunks of data are sent and processed together; better GPUs can handle larger batches. num_workers is the number of background CPU processes that load data and prepare it for the GPU: while the GPU is busy with one batch, the next batch, or actually the next couple of batches, are already being converted and loaded by the CPU. persistent_workers controls what happens once a pass over the dataset finishes: if the workers are not persistent, those processes get torn down and have to be started again for the next epoch, the next pass over the same data. We don't want to keep killing and restarting them, since we go over the same dataset multiple times, so we keep them persistent. And pin_memory puts the data in page-locked memory for faster transfer to the GPU.
Then we create the train loader and the validation loader, which load the dataset, and we shuffle the documents. We don't want to train on some predictable order; maybe the first part of the dataset is all medical documents, so the model would only learn from medical and legal documents early on while the math and everything else sits untouched later in the dataset. So we shuffle everything to randomize the document order. A sketch of this setup is below.
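A minimal sketch of the two loaders with the settings just described; train_dataset and val_dataset stand for the TextTokenDataset objects we build later, and the batch size and worker count are assumptions:

```python
from torch.utils.data import DataLoader

train_loader = DataLoader(
    train_dataset,
    batch_size=24,            # how many sequences are processed together
    shuffle=True,             # randomize document order so one topic doesn't dominate early steps
    num_workers=2,            # background CPU processes preparing the next batches
    pin_memory=True,          # page-locked memory for faster CPU-to-GPU transfer
    persistent_workers=True,  # keep the workers alive between passes over the data
)
val_loader = DataLoader(val_dataset, batch_size=24, shuffle=False,
                        num_workers=2, pin_memory=True, persistent_workers=True)
```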
After that it's just printing information, the number of experts and so on, a bunch of prints. Then training starts: we can measure the time and call train_model, which we imported from training.trainer and still need to write; we haven't coded it yet. After we get back the trained model and the metrics we can measure the elapsed time, and then there's a bunch of prints for the results: validation perplexity, loss, accuracy.
This is also where we save the checkpoint. It will create a checkpoints folder, and make sure that folder is in the .gitignore; okay, it's not, so I'll add it, because we don't want to commit checkpoints to GitHub. Then torch.save: we structure all of this information so we can load it more easily later, and we print where the checkpoint path is. And this is the main guard: when we run this file with Python, it triggers main, which loads the dataset, trains the model, and that's it
for this file. Let's create a new folder, configs, with an __init__.py inside, and next to it dataset_config.py and moe_config.py.
Let's start with the dataset config. We import dataclass from dataclasses, Optional, Callable and Union from typing, and logging, then grab a logger with logging.getLogger, add the @dataclass decorator, and define DataConfig. A dataclass automatically generates some boilerplate for classes that store data, so you later get useful methods for working with that data for free.
I'll go through this quickly: dataset_path, which is our Hugging Face path, we already covered all of this; the dataset name; split, which is train; the tokenizer, I think we're going to use a different tokenizer, this is just the default and it will be changed; trust_remote_code=False, none of this is so important. Then we have some defaults for sequence length and stride. When you're training a large language model, you train on one window of tokens and then move on to the next, and you move by the whole sequence length to get the next tokens. I had a bug where I trained on a sequence, then moved by just one token, trained on that sequence, moved by one token again, and so on. That was a bug; my loss was 0.001, which is not good. That's how you know your loss is too low, and that's how you know you have a bug. So when you train, you train on one sequence, move forward by the entire sequence length, and train on the next sequence in the data. That's your stride, and it's equal to the sequence length.
Then number_of_samples; okay, that's not so important. Let me move quickly, it's very easy: what the rest of this does is just load the dataset. We have these optional parameters: streaming=True, which we already talked about, and save_to_disk and load_from_disk, if you want to save the processed dataset to disk so you don't have to reprocess it every time you run training; I believe we'll set that to true. Then we have some validation checks; maybe not all of them are necessary, and I think you can just copy them: dataset_path must not be empty, stride must be positive, the path must be a non-empty, non-whitespace string, rules like that. You can copy-paste this part from my GitHub if you want; I don't want to spend too much time here, and honestly we don't need this many checks, so you don't need to write them all if you don't want to, though maybe it's good practice. Okay, I think that's it: we just have this config dataclass, and that's it. Below is a rough sketch of what it ends up looking like.
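A rough sketch of the DataConfig dataclass described above; the field names, defaults, and the tokenizer checkpoint are assumptions rather than a copy of the real file:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DataConfig:
    dataset_path: str = "HuggingFaceTB/smollm-corpus"   # assumption: Hugging Face dataset path
    dataset_name: Optional[str] = "cosmopedia-v2"        # assumption: one of the three configs
    split: str = "train"
    tokenizer_name: str = "HuggingFaceTB/SmolLM-135M"    # assumption: tokenizer matching the corpus
    streaming: bool = True
    num_samples: int = 10_000
    seq_length: int = 512
    stride: Optional[int] = None       # defaults to seq_length, i.e. non-overlapping windows

    def __post_init__(self):
        if not self.dataset_path.strip():
            raise ValueError("dataset_path must not be empty")
        if self.stride is None:
            self.stride = self.seq_length
        if self.stride <= 0:
            raise ValueError("stride must be positive")
```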
Now let's go to moe_config.py. This is just the config for our large language model. We again import dataclass and the typing helpers and define the dataclass. First the dimension of the model: this is small enough to run on the free Google Colab GPU, I'm pretty sure, we'll check later. Then the number of heads, and the number of layers, meaning the transformer layers of the language model. The hidden dimension of the feed-forward block should be roughly four times the model dimension; is it exactly four times here? I think it's a bit more, but that's okay, we can experiment with different values, it doesn't have to be exactly four, so let's keep it.
Some of these fields we are not actually using in our language model, so maybe you don't need to write them; this one is for DeepSeek-style layers, and this one as well, so maybe you don't need these four, but you can leave them if you want. As I said, it's not great when I don't know what a variable does; I think these are also for DeepSeek, because that part was added by someone else who implemented latent attention, and that's why I'm not sure what this one is.
Batch size 24: I think you usually want to keep these numbers as powers of two, but sometimes you just don't have enough GPU memory, and maybe that applies to the batch size here as well.
Okay, maximum steps: a thousand is maybe low. If you are testing, start with something like 10 steps just to check for bugs, then maybe 200 to 500 steps to test a bunch of different experiments, and then 1,000 to 2,000 steps to test the best of those experiments more deeply. But you put 10 here just to check for bugs the first time you run the training.
Gradient accumulation simulates a larger batch size: if you cannot load many batches into memory at once, you run the forward pass multiple times without updating the weights, accumulating gradients, as if you had a larger number of samples or conversations. The Muon learning rate comes from the experiments I did. Momentum 0.9: remember the default is 0.95, but through the experiments we will do later in this research, 0.9 works best for our setup. The Adam learning rate is also from the experiments. Max sequence length is 512. Then the number of documents, the maximum tokens we want to load from the dataset, and evaluate every 100 steps; maybe you can increase this. Wait, I'm not sure what this other field is; I forgot, maybe we're not even using it. Then weight decay, which is just regularization; use automatic mixed precision, set to true, to let PyTorch speed up the training; and then the vocab size and logging milestones, which are just for logging.
Then the MoE-specific settings: the number of experts, choosing the top two experts, and the load-balancing weight, which we talked about in the first part of the video. And in the init we just need to check the head dimension: you want the token embedding dimension, which is also the dimension of the model, to be divisible by the number of heads, and that quotient is the head dimension of the queries, keys, and values. A rough sketch of the config is below.
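A rough sketch of what this config dataclass can look like; only the values explicitly mentioned above (batch size 24, max sequence length 512, a thousand max steps, momentum 0.9, top-2 experts, evaluate every 100 steps) come from the video, everything else is an assumed placeholder:

```python
from dataclasses import dataclass

@dataclass
class MoEModelConfig:
    d_model: int = 384            # assumed: token embedding dimension small enough for a free Colab GPU
    n_heads: int = 8              # assumed
    n_layers: int = 6             # assumed
    d_ff: int = 1536              # assumed: feed-forward hidden dimension, roughly 4x d_model
    batch_size: int = 24
    max_steps: int = 1000
    gradient_accumulation_steps: int = 4   # assumed
    muon_lr: float = 0.01          # assumed placeholder; the video only says it came from experiments
    muon_momentum: float = 0.9     # 0.95 is the usual default; 0.9 worked best in these experiments
    adam_lr: float = 0.001         # assumed placeholder
    max_seq_len: int = 512
    eval_every: int = 100
    num_experts: int = 8           # assumed
    top_k: int = 2                 # choose the top two experts
    load_balancing_weight: float = 0.01   # assumed

    def __post_init__(self):
        assert self.d_model % self.n_heads == 0, "d_model must be divisible by n_heads"
        self.head_dim = self.d_model // self.n_heads
```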
Now in the configs __init__.py we just export everything nicely: import the model config and the data config from the files we just defined and re-export them, so you can import all of this from configs directly instead of from the individual files. And that's it; we can commit this to GitHub, save and commit.
Let's make a new folder, data. Inside it we want dataset.py, loader.py, and streaming_dataset.py, as well as __init__.py.
Let's go to dataset.py and do some imports: from torch.utils.data import Dataset. Then we create the class TextTokenDataset, which inherits from Dataset. It's a token dataset with a configurable stride for creating training windows; the arguments are tokens, a list of token IDs, the sequence length, and the stride. Across all of the tokens we want to take this window, then the next window, and so on. The stride will usually be equal to the sequence length, but it doesn't have to be; here we make it a bit more adjustable.
As I said, by default the stride is the sequence length, so the windows don't overlap: you take one set of tokens, then the next set, then the next. If the stride is half the sequence length, that's 50% overlap between consecutive windows. And if the stride is one, that's maximum overlap, the window just slides one token at a time, which is actually the bug I had earlier, although depending on what you're doing you might even use that on purpose.
We initialize the class with the tokens, sequence length, and stride I just explained, and store them. The stride falls back to the sequence length if it isn't defined, so that's the default. Then we calculate the number of samples based on the stride: how many sequence-length windows we can cut out given the stride and the available tokens. We have __len__, which just returns that number of samples, and __getitem__: given an index, we take the window starting at index times self.stride.
And you see, when we have a sequence and we're training the large language model, we always want to predict the next word. X will be the input sequence, but within that X we go token by token and try to predict the next word at every position. Concretely, X runs from the starting index to the starting index plus the sequence length, not including the last token; that's the data we have. What we're trying to predict is y, and y runs from the second token, the starting index plus one, until the end. So imagine our input is the sequence without its last token, and the y we're trying to predict is the same sequence shifted by one token.
What that means is: given just the first token of X, we try to predict the first token of Y; given the first and second tokens of X, we try to predict the second token of Y; given the first three tokens, the third token of Y, and so on. So we're always trying to predict the next token, and that next token sits at the same index as the last token of X we've looked at: we look at everything so far and predict what comes next. And that's actually the entire dataset class we need; TextTokenDataset splits our token stream into these sequences.
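A minimal sketch of that dataset class, assuming the constructor arguments described above; the exact window-count formula in the real file may differ slightly:

```python
import torch
from torch.utils.data import Dataset

class TextTokenDataset(Dataset):
    """Slices one long token list into fixed-length windows and returns (x, y)
    with y shifted one token to the right, as described above."""

    def __init__(self, tokens, seq_length=512, stride=None):
        self.tokens = tokens
        self.seq_length = seq_length
        self.stride = stride if stride is not None else seq_length   # default: non-overlapping windows
        # number of full windows we can cut out of the token stream
        self.num_samples = max(0, (len(tokens) - seq_length - 1) // self.stride + 1)

    def __len__(self):
        return self.num_samples

    def __getitem__(self, idx):
        start = idx * self.stride
        chunk = self.tokens[start : start + self.seq_length + 1]
        x = torch.tensor(chunk[:-1], dtype=torch.long)   # everything except the last token
        y = torch.tensor(chunk[1:], dtype=torch.long)    # the same window shifted by one token
        return x, y
```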
Now let's go to loader.py, the data loader. We do some imports, like the annotations import, and I'll go through these quickly; you can just copy them from the repository, and we'll see later how each one is used. We get the logger, and then there's setup_tokenizer: we load the tokenizer, passing in the tokenizer name, so this just loads the tokenizer from Hugging Face, not GitHub, and we will usually use that SmolLM tokenizer. Then we set the pad token to be equal to the end-of-sequence token, which is something people commonly do, and return the tokenizer. A sketch of that helper is below.
Then load_raw_dataset. This will load the dataset either from disk, if we processed or downloaded it before and saved it, or, if it's streaming, it loads it a bit differently. I forgot to say: if we want to load from disk, we load from disk, and otherwise we can load the dataset from Hugging Face. We've already talked about this, so I'll go quickly: if we give a wrong name for the dataset, we may throw an error here, and then we just return the dataset.
Then apply_sampling_and_filters; this definition is a bit longer. is_streaming depends on whether the dataset is an IterableDataset; that's how we know it's streaming, because then it's just an iterator we can materialize. If config.num_samples is set and the dataset is streaming, we take the first num_samples from it, for example taking five samples from the dataset, and that materializes the iterator. Otherwise, if it's not streaming, we take the minimum of num_samples and the length of the entire dataset. If the dataset actually has fewer samples than the number we asked for, this logs a warning that there aren't enough samples, and we just take whatever is there. Then there are some preprocessing functions: we apply whatever preprocessing functions are defined in the config, and we can also apply filtering functions if any are defined.
Sorry guys, I made a mistake here: this file is unnecessarily complex. I worked on it and some other people worked on it too, so I think you should just copy-paste it; there's no point in explaining all of it, and it's not that useful for your research. Just copy this file, and next time I'll make it a lot simpler; this was my bad. So that's loader.py. Then, in the data folder's __init__.py, I just import the classes we defined and re-export them.
Now let's create a new folder, utils, and inside it a file helpers.py, and next to that logger.py. I'd suggest you just copy-paste my helpers, or code them yourself if you want, but I'm going to copy them from the GitHub into utils/helpers.py. It's just set_seed and count_parameters. Then let's go to logger.py and copy that one as well; it's not so important for your research, you can just copy it without needing to understand it, and maybe tweak it a little with AI, but it's not essential. Then let's create a new folder, training.
Inside it, let's create trainer.py and evaluation.py. Maybe this one could have been called evaluator, but I named it evaluation, so let's keep that name, and create an __init__.py as well.
Guys, let's be fast here, there is a lot of boilerplate code, so I honestly don't know whether to encourage you to code this step from scratch by hand or to copy-paste it. There's a bunch of imports, so many imports from everything we've defined, which you can probably just copy. This trainer is for training the LLM. EarlyStopping is a class: if training is going along and suddenly the evaluation loss starts to increase, we just stop the training, because it's a sign the model is starting to overfit on the data or something like that. There's a bunch of supporting code here which I don't think is so important for your research; it's the kind of class you would usually copy-paste, and I'm not sure I want to encourage you to spend time understanding it. What it does is track the best loss, and if the loss fails to improve several evaluations in a row, three times, or whatever the patience is, it decides the loss isn't improving and tells us to stop training, which we will use later. A sketch is below.
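A small sketch of an early-stopping helper matching that description; the class and method names are assumptions:

```python
class EarlyStopping:
    """Stop training when the validation loss hasn't improved for `patience` evaluations in a row."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.counter = 0

    def step(self, val_loss):
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss          # new best loss: reset the counter
            self.counter = 0
        else:
            self.counter += 1                  # no improvement at this evaluation
        return self.counter >= self.patience   # True means "stop training"
```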
So you have this counter and you check whether it's larger than the patience. Then setup_muon_optimizer: you know from when we coded Muon that we want to give the 2D matrices to Muon; we don't want normalization parameters and token embeddings in Muon, only the 2D weight matrices. Those become the Muon parameters, and the other ones become the Adam parameters. You can print the number of Adam and Muon parameters, and then we initialize the Muon optimizer we coded, plus AdamW, torch.optim.AdamW. You can check my video, Adam optimizer explained step by step, to learn the math and theory behind Adam; the link is below the video. Then we just return these two optimizers in a list, roughly like the sketch below, and then comes train_
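A minimal sketch of that parameter split, assuming the Muon class sketched earlier and a config with muon_lr, muon_momentum, and adam_lr fields; the exact rule for which parameters count as "2D weights" is an assumption:

```python
import torch

def setup_optimizers(model, config):
    """2D weight matrices go to Muon; embeddings, norms, and biases go to AdamW."""
    muon_params, adam_params = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        # only plain 2D weight matrices are orthogonalized; keep embeddings out of Muon
        if p.ndim == 2 and "embed" not in name:
            muon_params.append(p)
        else:
            adam_params.append(p)
    print(f"Muon params: {sum(p.numel() for p in muon_params):,}  "
          f"Adam params: {sum(p.numel() for p in adam_params):,}")
    muon = Muon(muon_params, lr=config.muon_lr, momentum=config.muon_momentum)
    adamw = torch.optim.AdamW(adam_params, lr=config.adam_lr, weight_decay=0.1)
    return [muon, adamw]
```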
model, with all of the parameters we talked about, plus scheduling for the learning rate. We want the learning rate to start low and then rise quickly, because at the very beginning the updates are a bit random since the model doesn't know anything yet; but the language model learns very quickly early on, so we want to reach the maximum learning rate early and then slowly reduce it for the rest of training. As the model trains more, we don't want to keep changing the weights so aggressively, because it already knows a lot, so toward the end of training we want to be more careful and conservative about how we change the weights.
I've already talked about all of this good stuff; the description here lists all of the inputs if you want to read them, and the output is the model, the final metrics, and the metrics history. I'm going to go through this quickly, since we've mostly covered it. This function trains the model; we can draw a progress bar in the console if we want, and we don't want early stopping here. By the way, I don't know if I should keep explaining everything step by step; maybe some people find it boring, maybe I should, maybe I shouldn't, tell me below; I don't know who's even watching at this point. Okay.
As long as we are still training, as long as we haven't reached the maximum number of steps, we keep feeding in data. We also have this attention mask; let me see, okay, we're just defining things here, and then we push these to the GPU: the tensors, which are the batches, our actual tokens and data.
And then the forward pass runs with automatic mixed precision. We have the cross-entropy loss and the label shifting; this is the X and Y business I was explaining earlier. First we make sure the logits are contiguous in memory, laid out next to each other; the logits are the scores over every token in the vocabulary, what the LLM thinks the next token should be. We also need to divide the total loss by the number of gradient accumulation steps, and if we have an auxiliary loss, the mixture-of-experts load-balancing loss, we add that to the total loss too. Then we call backward; there are two backward branches here, depending on whether we are using automatic mixed precision, and the auxiliary loss shows up in both. Make sure to scale the loss with the gradient scaler when you do backpropagation under automatic mixed precision. A sketch of one such step is below.
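A minimal sketch of one training micro-step with these pieces, assuming the model returns a (logits, aux_loss) pair; the function and variable names are assumptions, not the exact code in trainer.py:

```python
import torch
import torch.nn.functional as F

scaler = torch.cuda.amp.GradScaler()   # handles loss scaling for mixed-precision training

def training_step(model, x, y, accum_steps, device="cuda"):
    """One forward/backward pass with AMP, gradient accumulation, and the MoE auxiliary loss."""
    x, y = x.to(device), y.to(device)
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        logits, aux_loss = model(x)                          # logits: [batch, seq, vocab]
        loss = F.cross_entropy(
            logits.contiguous().view(-1, logits.size(-1)),   # flatten to [batch*seq, vocab]
            y.contiguous().view(-1),
        )
        loss = loss / accum_steps                            # spread the update over several micro-batches
        if aux_loss is not None:
            loss = loss + aux_loss / accum_steps             # add the expert load-balancing loss
    scaler.scale(loss).backward()                            # scale before backprop to avoid fp16 underflow
    return loss.item() * accum_steps
```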
Then the optimizer step. Let's see: we just use both the Muon and the Adam optimizer, and for each of them we unscale the gradients, take the step, and zero the gradients. Zeroing gradients is covered in my AI research from scratch course, and I also have the neural network from scratch video on it.
Then we log what happened, the loss and accuracy and so on. We can also evaluate, print whatever we evaluated, move to the next step, and update the progress bar in the console every 20th step, plus a final evaluation of the model after it has trained. Remember that we are doing cosine decay on the learning rate, so we need to fetch the current learning rate. Then the total time; we're just printing a bunch of things here, saving the metrics into the output directory, and plotting the training metrics, so that part is just about plotting and saving.
Let me go quickly: this next part is just about drawing plots, and I think you can copy-paste it because it's not something you need to do research with, it's just plotting. Then train_model itself: we initialize the model, set the seed, call everything we set up above, print some info, and set up the optimizers and schedulers. We have a warm-up for the learning rate, which is some percentage of all the steps, like 10% or 5%, and then we write the scheduling logic here; it's just the learning rate schedule, roughly like the sketch below.
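A small sketch of a warmup-plus-cosine-decay schedule matching that description; the 10% warmup fraction is the one mentioned above, the rest of the shape is an assumption:

```python
import math

def get_lr_scale(step, max_steps, warmup_frac=0.1):
    """Linear warmup for the first ~10% of steps, then cosine decay toward zero."""
    warmup_steps = int(warmup_frac * max_steps)
    if step < warmup_steps:
        return (step + 1) / warmup_steps                     # ramp up from ~0 to 1
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return 0.5 * (1 + math.cos(math.pi * progress))          # cosine decay from 1 down to 0

# usage: multiply each optimizer's base learning rate by this scale every step
# for group in optimizer.param_groups:
#     group["lr"] = base_lr * get_lr_scale(step, config.max_steps)
```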
Okay, and then we call train_model, and that's it. That's it for the trainer. Again, guys, tell me in the comments below whether I should explain everything; I can explain it all, but I don't know if it's useful or whether people care that much about every step being explained. I'm thinking I just need to explain the research parts, not this boilerplate.
Then, in evaluation.py, we just evaluate the model. There's a bunch of imports, we initialize a few things before evaluating, and we don't need gradients here. We just grab the data, autocast to CUDA so the float precision is managed automatically, and get the logits from the model, because the model generates the logits. And, oh wait, that's it, that's really it. In the __init__.py we just export all of this. Let me save everything and commit it all to GitHub.
And let's finally create the experiments folder. Inside it we can create a muon_vs_adam folder; you could also name it something like experiment_1_muon_vs_adam. Let's create a new file in that folder for the Adam learning rate sweep. The main idea here is to find the best learning rate for the Adam optimizer on our neural network.
Let me go through it quickly. I set up the different paths for the script and the project root for this folder, so we can import things like run_multiple_experiments from run_experiments; I import run_multiple_experiments, but we still have to define it. Then I print all of the learning rates we will experiment with, and these are the names of the experiments. I'll keep moving quickly, because this file is just running the experiments, printing out the results and a summary, comparing them, picking the winner, printing the winner, and running it all. That's it.
These names are used to look up the configs; we define the configs for the experiments somewhere else, inside this experiment_1_muon_vs_adam folder. So let's make a new file there, experiment_configs/experiment_configs.py, which also creates the new experiment_configs folder. Since this file is absolutely huge, I recommend you just copy it; it contains all of the configs, and look how many different experiments we have, so many, so I'm just going to copy-paste it here.
We have all of the names here, for example muon_step_decay: a name, a description, and the optimizer type, which we use to select the optimizer. Then maximum steps, the Muon learning rate, the Adam learning rate; all of the configs look like that. You can actually use AI to help you fill these in, but you still need to think about what you want to experiment with.
Now let's make a new file in the experiment_1_muon_vs_adam folder: experiment_training/experiment_trainer.py. At this point all of this code should look familiar, so I'm again just going to copy all of it, because we explained it throughout the course: learning rate scheduling, setting up parameters, setting up the Muon optimizer, setting up the Adam optimizer, plotting. It's a repeat of what we already covered; tell me below if you don't like me just copy-pasting it, or if you think I should walk through it, tell me in the comments.
In both of these new folders, experiment_configs and experiment_training, I'm going to add an __init__.py, but it will be empty, maybe for later use if we need it. You'll see that after you run all of these experiments you get a bunch of files which are just reports and plots of everything, but you don't necessarily need to commit those generated folders.
Next to the Adam learning rate sweep I'm going to create another file, the run-Adam-optimization-suite script, and it still loads the configs we defined and runs them, so I'll copy-paste again because it's basically the same: it loads things through run_multiple_experiments, these are the names of the experiment configs it will run, then it plots, and that's it. Then we make run_experiments.py; copy-paste this as well, because we've explained all of it: it can prepare data, run a single experiment, run multiple experiments, compare experiments, and all that good stuff, and then main has a bunch of arguments defined. We also have one more file, the run-optimal-Muon-suite script, where we run the same experiments for the Muon optimizer; everything else is the same. Now I will
commit everything, save all, and push to GitHub. Then I want to go to the extensions in my editor; I have Cursor, Antigravity, VS Code, whatever. If you have some VS Code-style editor, you can install this Google Colab extension. But that makes it kind of awkward to run, so instead I'm going to create a new Google Colab notebook and change the runtime type to GPU, save and connect. Then I'll clone the repository I just made, so I just git clone like that, then I need to cd into the folder, using %cd with the percentage sign, and then pip install -r requirements.txt. Let me install the requirements.
First we should be able to just run python train.py; let's see if that runs. This is the base of your large language model; now you want to do some research, and maybe fix some bugs first. I do recommend using a coding IDE; Antigravity by Google is actually free right now, so you can fix any bugs you find, or talk to it and have it help you. Now I'm going to debug this and see if there are any bugs. Because we have the main scaffolding, the main structure, it will be easy for you to use an AI code editor to fix bugs too, if there are any. And then this is your base: you can think about the research you want to do and go from here. Okay, it looks like it's actually training, so there are no bugs; this may be too many steps and take a long time, but it's training, so everything works.
Here in my repository, blueberry-lm, if you go to this experiment 9, Muon versus Adam, you can find some instructions on how to run these experiments and how to think about them. For example, I can run this command to launch the experiment, but I first need to cd into the proper folders. If you look at the path, previously it was trying to access this file straight from the wrong directory, and that doesn't exist, so I need to cd into experiments and then into this folder, or I think I can just add the folder to the path here as well, like this, so it knows where to look. And it looks like the experiments are running and it's training, so there are no bugs: step 10, step 20, no bugs.
For writing the paper itself you will go to overleaf.com and create a new paper. On one side you have the LaTeX source, and on the other side it's rendered. You do need to know the format, but I recommend you first write your paper in a markdown file and then convert it to LaTeX with the help of AI.
But be careful when writing the paper: no AI slop, no AI-generated filler. Be very careful that every sentence has its place and that every sentence is meaningful. Actually, I believe arXiv stopped accepting review papers in computer science because they were basically all AI-generated. So you need to think about your data, analyze it, see what's happening there, and understand it. You can upload images to a folder like this and link them using figures, where you specify the path; you can ask ChatGPT how to link an image, and I think it's not so difficult.
For the introduction you want to be short and concise. You don't want your paper to be too long; it's better to explain with as few pages and as little text as possible. You can add some background and related work if you want, and then explain your methodology. You can just look at what I wrote here and edit it based on your own experiments, and then show the results. Here I should have used different colors for the curves; right now they are the same color so you can't tell them apart, so I can fix that. I also explain the results, the ablations, and so on.
I think writing the research paper is pretty straightforward, although you will spend a lot of time thinking about related work. Don't get overwhelmed: don't read too many papers, and don't waste too much time trying to understand every paper; you need to balance it. All I'm trying to say is that related work can be overwhelming, so make sure you don't overwhelm yourself. But I think that's it. Writing the paper is quite simple; you've done this in school with presentations and homework. You analyze, you keep it concise, you explain, you keep it clear and professional. Don't write "I did something"; you can write "the experiment was done" instead of "I did the experiment", or you can write "we did the experiment" if you prefer. There isn't much else to say about writing the paper itself; it's mostly analyzing the results. The most difficult part is coming up with the research idea and executing on it, knowing which questions you want to answer. So I think that's it for this time. You can find the LaTeX source in the GitHub repository below, paper.tex, and the paper PDF as well.
Join my school to become an AI researcher. We have additional courses there that are not available on YouTube, covering math, PyTorch, neural networks, transformers, large language models from scratch, and more, and in the community you can ask questions and get support. The link is below: there's a 7-day free trial and then it's just $9 per month, and you can check the video that explains more about what's inside.