
Code, Write & Publish AI Research Paper - Full Course - LLM From Scratch - Muon vs Adam Optimizer

By Vuk Rosić

Summary

## Key takeaways

- **Hardest research: picking questions**: Usually the most difficult thing in research is figuring out what to even do, what questions to answer. Answering questions is easier than figuring out what questions to answer. [01:22], [01:29]
- **Publish research publicly always**: I recommend keeping all of your research public. This makes things so much better: people can learn about you, know about you, and you can build your network. [06:36], [06:38]
- **Router noise prevents expert collapse**: There are some issues when choosing an expert; the neural network can learn to always choose the same experts. Those experts improve while the others don't, and the router gets stuck choosing only a few experts. [17:02], [17:12]
- **Auxiliary loss balances expert usage**: The auxiliary loss tries to make the LLM give an equal number of tokens to each expert. If some experts are getting more tokens on average than others, this auxiliary loss will be higher. [31:19], [31:29]
- **Muon orthogonalizes gradient matrices**: The Muon optimizer keeps the rows of the weight matrix perpendicular to each other. Neural networks learn faster with less data if the weight matrix rows are perpendicular. [02:15:00], [02:15:39]

Topics Covered

  • Hardest Research Skill: Picking Questions
  • Research Demands Patient Thinking
  • Publish Research Publicly Always
  • Auxiliary Loss Balances Expert Usage
  • Muon Keeps Weight Matrices Orthogonal

Full Transcript

We will code and write this entire AI research paper, which I wrote just for this video, to show you how to write and code an AI research paper end to end. You can put your name on this, and it will be very easy for you to change it a little bit, do your own experiments, and do your own real research, with everything explained step by step in this course.

This research paper is very beginner friendly to create. You don't need any tough mathematics, and it will be very easy for you to create your own experiments and do your own research on top of what I teach right now.

This GitHub repository is the repository for this research. It contains the large language model that we will code from scratch and all of the experiments that we will do. In the experiments folder, I have the base I used for this research, and it will be very easy for you to add your own experiments.

My goal is to create an AI research lab, hopefully the best in the world. For every paper we make, we're going to make a full course on how we did it. Right now, it's me and my Discord community, but I'm going to be moving to different cities, meeting people, and hiring people soon. By making these videos, I can pay for the research, for salaries, and for compute. So just by watching this video, you are helping us advance science.

Usually the most difficult thing in research is figuring out what to even do, what questions to answer. Answering questions is easier than figuring out what questions to answer. This is also the problem with evaluating large language models on Humanity's Last Exam, where models are tested on whether they can answer these questions, with the idea that if they can, they will be good at doing research. In reality, coming up with the questions themselves is even more difficult. This skill of knowing what to work on is something you build over years, and even the most experienced researchers struggle with it. But by writing research papers, you will gradually become better and better at it.

So when doing research, you will spend a lot of time thinking; it's like reasoning. You will maybe be walking around the street or through the forest, thinking about your research; that's what a lot of people do. You cannot rush research and publish it in a week; you need to spend a lot of time thinking about it. Of course, in this case, because this is a course and tutorial and we already did a lot of the thinking, it will be a lot easier and faster for you. But if you are looking to create groundbreaking research, which is actually a goal for our company, to eventually start doing research that's accepted at the big conferences, then we will spend a lot more time on each paper.

Also, decide what your goal for doing research is. It should not be to get hired at some company like OpenAI or DeepMind. If you get hired there, there are some benefits, but spending the whole day, every day, just because you want to get hired doesn't make sense: there are easier ways to make money, you usually only get a high salary after 5 to 10 years of experience, and you can earn a lot in other fields even faster, for example by developing a product. So your purpose for doing research needs to be different from making money or getting hired, because research is a more difficult path to those than other options. Your purpose can instead be to contribute to science long term, or to find answers to questions you are curious about.

Also, there is no hidden mentor who, once you meet them, will discover your potential. I highly doubt that, and Robert Greene talks about this. It's all about you sitting down and figuring out your own life. It can be good to have friends in the AI research field, to talk to them, get their opinions, and see what everybody else is doing, but at the end of the day it will be you, sitting in your room, figuring out what you want to research and do.

Also, you can do what I do: make YouTube videos or publish blog posts and LinkedIn posts, and either create your own company or AI research lab, or join one. It's also very good to join OpenAI or DeepMind, but you need to understand why you are joining. Whatever your goal is, I do recommend you do your research in public: publish videos, blogs, or posts on social media. For me, I want to keep improving these research papers and courses; I want to publish everything and make courses so everybody can learn, because this will accelerate science, and I plan to keep improving our research papers. As I said, eventually we want to publish at conferences, hire people, and create a proper lab. But it took me a couple of years to figure out exactly what I want, so it will also take you some time to figure out what you want to do and why you want to do research.

So then, let's start with the course. You can join my school community to become an AI researcher. We have exclusive courses and a community there that are not available on YouTube, and it helps me fund my research and make better videos as well: everything from math, PyTorch, neural networks, attention, LLMs, and more. Click the link below and watch that video to learn more about what the school contains.

I'm going to assume you have some basics in doing AI research, large language models, machine learning, and PyTorch. If you don't have any basics, I recommend watching my "become an AI researcher from scratch" full course; I'm going to link it below in the description. It's also good to watch the "coding Llama 4 from scratch" video, which is also very beginner friendly for large language models, because in this paper and this course I'm going to assume you know at least a little bit about large language models. Or check out the "code DeepSeek from scratch" video; you don't need to watch the entire thing, maybe just a little bit to familiarize yourself.

All of the links are below the video. Now go to your GitHub and let's create a new repository. I'm giving it a short name standing for novel optimizers, or new optimizers; you can name it whatever you want. All of the code and my paper will be here. I'm going to add the description: "development of new neural network optimizers."

I'm going to make it public. I recommend you share your research and everything else publicly. This makes things so much better: people can learn about you, know about you, and you can build your network. Keeping it private is not going to work well for you, because you will be doing research that nobody knows about. The same philosophy applies to my channel: a lot of people have watched my videos, and that helps me a lot in building a network. So I recommend keeping all of your research public. Then I'm going to add a README so I don't need to add it manually later, add a .gitignore and select Python, choose the MIT license (the most permissive license), and create the repository.

This is the repository we created. I'm going to copy its URL so I can clone the repository locally to my computer. I'm using Google Antigravity right now, so I can click here, paste the URL, and then select where to clone it. After you clone it, it will offer to open the new repository in a new window or the same window, so you can open it just like I opened it here. If you don't know how to clone from GitHub, you can watch a Git and GitHub basics introduction on YouTube to learn what cloning is, or you can ask ChatGPT or another AI to help you. You need to be able to set up your environment like this; it's a basic Python setup, creating a Python environment and opening it in a code editor. You can watch YouTube videos or ask ChatGPT if you don't know how to do this. Anyway, once we open this, we're going to start creating some new folders here.

Let's create a new folder, models, and inside it a new file, layers.py, for Python. Then, next to that file, a new file components.py, and a new file llm.py. This is where our large language model will reside; it's a mixture-of-experts LLM, or you can name it whatever you want. Now let's open components.py.

Let's start writing some code. I'm going to import torch, then torch.nn as nn, torch.nn.functional as F, and, from typing, Tuple and Optional.
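For reference, the imports just described look like this (the typing imports are written as Tuple and Optional here, matching how they are used later in the walkthrough):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from typing import Tuple, Optional
```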

You may copy these imports from my code if you want, but I recommend typing all of the code we're going to write manually, so you understand how it works and how it's written. First, we will code a single expert.

Let's create a class Expert which inherits from nn.Module, and I'm going to add the comment "single expert network, essentially a feed-forward layer." So a single expert is just a feed-forward layer. I believe you understand how experts and mixture of experts work; if you don't, I listed videos and courses you should watch to understand how a large language model works. But I can also just tell you: in a mixture-of-experts layer you have expert one, two, three, and so on, each of them a classic feed-forward neural network (a multi-layer perceptron), plus a router that selects the expert based on the token that is being generated right now.

Then I'm going to initialize the expert. We pass in self, the dimension of the model, and the dimension of the feed-forward layer. Imagine this neural network as having a first input layer that then expands into a hidden layer. The d_model is the input: the size of the token embedding that this expert gets. Then it expands into the middle layer, which is going to process this token; this middle layer is usually four times bigger than the token embedding input size. Then we contract back, so it's input, hidden, and back to the token size. This hidden layer of neurons contains a lot of the facts, knowledge, and rules for processing the token. For example, if you are talking about some sport, the hidden layer would contain knowledge about the rules of the sport, famous players in that sport, and so on. So the token size is the first and the last dimension, and the feed-forward dimension is the middle one, usually four times bigger. Then dropout, which is good for preventing overfitting.

And we're going to call super().__init__(), which just initializes the nn.Module parent class. Then comes the first linear layer, our input layer: as I said, it goes from d_model, the token size, to the hidden layer, which is four times bigger. That's our first linear layer, and I'm going to show you on the image.

It looks similar to this; just imagine that the middle layer has eight neurons instead of three, because whatever input token this expert gets, the middle layer is four times bigger, and then we go back to the token size. This middle layer contains a lot of knowledge, and each expert is going to specialize in something: during training, the large language model will learn to assign specific knowledge to each expert, so this expert is for math, this one for sports, this one for history, and so on. I explained all of this in my courses, and you can watch more courses about mixture of experts on YouTube.

Then the second linear layer will go from the hidden size back to the token size, and we don't need a bias for these. We will also initialize dropout to prevent overfitting; some people don't use dropout, but I use it because it worked better in my experiments. Then let's define the forward method: the forward method is how we use the expert. In __init__ we just initialized this network, and forward is where we use it.

First, it's going to look like this; I'm just going to type it all out so you can see where it starts. The input gets passed through the first linear layer. After the first linear layer, it's passed through the activation function, SiLU in this case. Now imagine we have the values for the middle, hidden layer. Dropout means we ignore some of the activations by setting them to zero, zeroing out some of the neurons in the middle layer. Then the second linear layer multiplies the hidden layer with its weights to condense it back to the token size. That's how we process the token; this is the classic feed-forward, and this is just one expert.
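Putting those pieces together, here is a minimal sketch of the expert module as described above, assuming the imports from earlier. The class and argument names follow the narration (Expert, d_model, d_ff, dropout); the activation is written as SiLU here, which is the most likely reading of the audio, but the repository may use a different variant:

```python
class Expert(nn.Module):
    """Single expert network -- essentially a feed-forward layer."""

    def __init__(self, d_model: int, d_ff: int, dropout: float = 0.1):
        super().__init__()  # initialize the nn.Module parent class
        self.linear1 = nn.Linear(d_model, d_ff, bias=False)   # expand: token size -> hidden (usually 4x)
        self.linear2 = nn.Linear(d_ff, d_model, bias=False)   # contract: hidden -> token size
        self.dropout = nn.Dropout(dropout)                     # helps prevent overfitting

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.linear1(x)     # pass the token through the first linear layer
        x = F.silu(x)           # activation on the hidden layer (SiLU in this sketch)
        x = self.dropout(x)     # zero out some hidden activations
        return self.linear2(x)  # condense back to the token embedding size
```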

Then let's make sure we save all of this and commit it to GitHub. Assuming you have connected your GitHub, I'm just going to press generate, which generates a commit message, and then sync. You can learn more about GitHub in other YouTube tutorials if you don't know about it. By the way, I switched to Cursor because I just couldn't disable the AI-generated autocompletion suggestions in my editor, so let's continue with Cursor.

Let's now create a top-k router. The router will choose which expert (one or multiple) to use based on the current token. Create a class that inherits from nn.Module: a router that selects the top-k most relevant experts for each token.

First I'm going to initialize the top-k router. We pass in the dimension of the model (the token embedding size), the number of experts, and how many of those experts we should choose; the default is two. Then I call super().__init__() to initialize the parent class, and assign these values to the member variables top_k and num_experts.

Then we define the gate, which is just a linear layer that goes from the token embedding to the number of experts. It looks at the token and converts it into some numbers, and we use those numbers to assign a probability, or affinity, to each expert: how much the router wants to choose each expert in particular.

We also have noise. There are some issues when choosing an expert: the neural network can learn to always choose the same experts. Those experts learn and get good, and when the router then tries a different expert that hasn't learned yet, it gets punished, so it gets stuck choosing only a few experts. That's suboptimal, because if it instead balances its choices across all of the experts, it will perform better in the end and be able to store more knowledge across the experts.
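Here is a rough sketch of the router's constructor as described so far. The field names are my own shorthand, and passing the noise standard deviation as a constructor argument is an assumption (the narration only mentions the 0.1 value later, in forward):

```python
class TopKRouter(nn.Module):
    """Router that selects the top-k experts for each token."""

    def __init__(self, d_model: int, num_experts: int, top_k: int = 2,
                 noise_std: float = 0.1):  # noise_std as an argument is an assumption
        super().__init__()
        self.top_k = top_k
        self.num_experts = num_experts
        self.noise_std = noise_std
        # The gate: one affinity score (logit) per expert for each token embedding.
        self.gate = nn.Linear(d_model, num_experts)
```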

Now let's define forward for the top-k router. x is our tensor that contains all of the tokens, a three-dimensional tensor. All of the tokens are in a sequence: the sequence length, context window, or conversation is just a single conversation with all of its tokens. Then there is the batch size, because you can have multiple conversations, and all conversations are independent. So you usually have these three dimensions: the token embedding dimension, then multiple tokens in an array (the sequence of tokens for that conversation), and then multiple independent conversations. The forward method will return a Tuple of three torch.Tensors.

Gemini gave a good explanation of the three things this router will return. Let's start here. The top-k indices are the second thing: just the indices of the selected experts. Say expert two and expert five are selected, because we are selecting two by default. Each of those experts will have some weight, or importance, describing how much it contributes to the final token: maybe one will have 70% weight and the other 30%, and you will understand this better later. Those are the weights for the top-k experts, and the expert outputs get multiplied by them. Finally, the router probabilities are the probabilities across all of the experts; this is necessary for the auxiliary loss later. So the first two are only for the top selected experts, and the last one is the probabilities over all experts, but we will learn more about that later.

Then I will just add some comments. The input x is a tensor of shape (batch_size, sequence_length, d_model). You need to be able to understand what these dimensions mean intuitively; I explained that well in my AI research from scratch course. And what it returns: router weights, expert indices, and router probabilities.

The first thing we want to do is take the input tensor x and unpack its shape into three variables: batch size, sequence length, and dimension of the model. So: how many conversations, how many tokens per sequence (the maximum sequence length), and how many dimensions per token.

Then we want to compute the router logits, because now we want the router to select experts. I'm going to write router_logits equals... (don't worry, these words will turn a brighter blue color later once we start using them). We just pass x through the gate we defined earlier: a linear layer that goes from d_model to the number of experts. We can easily pass x even though it has this bigger shape; the batch and sequence dimensions are carried along unchanged, and only the last dimension goes from d_model to the number of experts.

So it will return a tensor of shape (batch_size, sequence_length, num_experts); those are our logits. The logits show affinity: the bigger the number, the bigger the affinity for that expert, and the more likely that expert will be selected. We're going to add some noise, during training only, for exploration. As I said, we randomize the logits a little so the router doesn't get stuck reinforcing the same few experts every time. If we are training and the noise we want to add is more than zero, we create the noise: random numbers with the same shape as router_logits, so there is a random value for each logit, multiplied by the standard deviation, here 0.1. Then router_logits becomes router_logits plus noise, which adds noise to all of the logits.

Next, get the full probability distribution for the load-balancing loss. We need this later; it's the third thing we return. For the router probabilities, we do a softmax over all of the logits. How softmax works: if you have some logit numbers like 1, 5, 6, 7, it will squish all of those numbers between zero and one and make them add up to one, so you can look at them as probabilities. We do this along the last dimension, the number-of-experts dimension, so for each token we take its expert logits and apply the softmax over the experts for that token separately. Then we select the top-k experts; these are the first and second things we return from this router.
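As a tiny illustration of that squishing, using the 1, 5, 6, 7 numbers from the example above:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([1.0, 5.0, 6.0, 7.0])
probs = F.softmax(logits, dim=-1)
print(probs)        # roughly [0.002, 0.090, 0.244, 0.664]
print(probs.sum())  # 1.0 -- everything is squished between 0 and 1 and adds up to 1
```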

There is this torch.topk function; we just pass in the logits, and the result gets assigned to the top-k values and indices. Because router_logits has that bigger shape, we select over the last dimension, so per token. This top_k means we are selecting the top two experts out of however many there are, which could be, for example, eight. Then, for the top-k weights, we use a softmax just like we did for the probabilities, so the weights also add up to one. As I said, we will multiply the output of each expert by its weight, so the weight says how important that expert was: if we have two selected experts and one gets 70% and the other 30%, the first will have 70% influence when multiplied and the other 30%. And then we return all three things that I mentioned.

This is also a bit difficult to explain, but don't worry: I'm going to repeat all of this as we code the following pieces, and because it all refers back to itself, you will understand it. To summarize, the top-k router will choose, for each token, for example two experts out of the eight (or however many) possible experts. It returns the indices of those experts, for example the second and the fifth; the weights of those experts, for example 70% and 30%, based on the logits, on how the neural network scores their importance, with the weights adding up to 100%; and the router probabilities over all of the experts for that token, which we need later to make sure expert selection stays balanced and the router doesn't just pick the same two experts all the time.
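Here is a condensed sketch of the router's forward pass as just summarized, assuming the constructor sketched earlier. The return order (weights, indices, probabilities) follows the narration; the exact names are my own shorthand:

```python
def forward(self, x: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
    # x: (batch_size, seq_len, d_model)
    batch_size, seq_len, d_model = x.shape

    # One logit (affinity) per expert for every token: (batch_size, seq_len, num_experts).
    router_logits = self.gate(x)

    # During training only, add a little noise so the router keeps exploring
    # different experts instead of reinforcing the same few.
    if self.training and self.noise_std > 0:
        noise = torch.randn_like(router_logits) * self.noise_std
        router_logits = router_logits + noise

    # Full probability distribution over all experts (needed for the auxiliary loss later).
    router_probs = F.softmax(router_logits, dim=-1)

    # Pick the top-k experts per token, and turn their logits into weights that sum to 1.
    top_k_logits, top_k_indices = torch.topk(router_logits, self.top_k, dim=-1)
    top_k_weights = F.softmax(top_k_logits, dim=-1)

    return top_k_weights, top_k_indices, router_probs
```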

The next class will be a lot simpler, I believe. It's the mixture-of-experts class, which also inherits from nn.Module; here we combine our router and experts. Let's first initialize it with: self; the dimension of the model, which is the token embedding size; the dimension of the feed-forward hidden layers, which is usually (though not always) four times larger than the model size; the number of experts, eight as the default; top-k, since we are selecting two experts out of eight for each token; dropout of 0.1, so 10%; and the load-balancing weight, which we'll talk about later. Its goal is to make sure the router doesn't always select the same experts, so it keeps selecting different ones. Then initialize super and assign all of these variables. Then create the experts: experts will be an nn.ModuleList, calling the Expert class with d_model, d_ff, and dropout for every expert. The number of experts is eight, so this creates eight experts in the module list. Then we create the router, and that's it.

That's all we need: for the top-k router that we defined, we just pass in d_model, the number of experts, and the top-k number.
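A sketch of that constructor, assuming the Expert and TopKRouter classes from earlier; the default value for the load-balancing weight is a placeholder, since the video does not state it:

```python
class MixtureOfExperts(nn.Module):
    """Combines the top-k router with a list of experts."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int = 8,
                 top_k: int = 2, dropout: float = 0.1,
                 load_balancing_weight: float = 0.01):  # default value is a placeholder
        super().__init__()
        self.num_experts = num_experts
        self.top_k = top_k
        self.load_balancing_weight = load_balancing_weight
        # One feed-forward expert per slot.
        self.experts = nn.ModuleList(
            [Expert(d_model, d_ff, dropout) for _ in range(num_experts)]
        )
        # Router that assigns each token to its top-k experts.
        self.router = TopKRouter(d_model, num_experts, top_k)
```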

Then let's define the forward method for the mixture-of-experts class: how does the data go through this class?

The input is x, a torch.Tensor, and the output is a Tuple of a torch.Tensor and an optional torch.Tensor; let's see what this is. The input has our classic (batch_size, sequence_length, d_model) shape, so it's just our list of conversations, and the method returns the output, which is going to be the same list of conversations, but processed.

What does the mixture of experts do? The mixture of experts comes after the attention mechanism; I'm just repeating the structure of an LLM now. For each token, it modifies it a little bit, processing it based on the context that other tokens have contributed. You know that in the attention mechanism, information from other tokens gets blended, added into every token; attention only looks at the previous tokens for every token, and it adds that information with a plus sign, just adding a bit of one vector to another. The feed-forward networks, or mixture of experts, take this just-added vector and process it somehow: refine it, inject more information into it. Imagine it as processing: attention is for adding information, and feed-forward is for processing information, adding more data, analyzing it.

The output is the same size and shape as the input; every token just gets processed. There is no next-token generation here; this is not generating the next token, that's only the output head at the end. This is just for processing information: for the LLM to understand what each token means and how it relates to other tokens, and for injecting new knowledge, analysis, and so on.

Then we have the auxiliary loss. This is just a number that the LLM wants to make as low as possible, and it shows how imbalanced expert selection is. To put it simply, with this we are trying to make the large language model choose every expert an equal number of times. We don't want it to choose some expert more frequently; we want each expert to get an equal number of tokens on average. If some experts are getting more tokens on average than others, the auxiliary loss will be a higher number, and this loss will later be added to the general large language model loss. Since the model is trying to minimize the general loss, it is trying to minimize this one as well, because it gets added in. In short: the auxiliary loss is trying to make the LLM give an equal number of tokens to each expert. In this example, you can see that our mixture-of-experts layer returns this auxiliary loss besides the output, the processed tokens I just explained, and the auxiliary loss gets added to the total loss at the end. Then we do backpropagation with the total loss, because we want to minimize that total loss, the total error.

So again, we're just going to unpack the shape into these variables, and we will pass x, our input, through the router to get our expert selection, router probabilities, and everything I described about the router. Then we initialize the output tensor: this is the output we will get after processing; it starts as all zeros, with the same shape as x.

Let's explain the next piece of code with Gemini; it's easier to just listen to the theory first, then we'll code it, and then I'll repeat and explain it again shortly. We are looping through the experts: for each expert, we go one by one and process the tokens that that expert got. In our code, I'm going to write "process each expert" as a comment, and then: for expert_index in range of all of the experts, find the tokens routed to this expert. This is how we will do it: expert_mask equals (expert_indices == expert_index), followed by .any() along the last dimension.

What does this mean? Remember, in the first iteration we are gathering all of the tokens routed to the first expert and processing them. Remember also that each token has two experts assigned. So here we are checking whether, for the current token, this expert appears in either of the two places: the first selected expert or the second selected expert. In other words, is the current expert among the two experts selected for this token? The expert_mask will then have shape (batch_size, sequence_length): for every token in our input, it holds True if that token needs to be processed by this current expert and False if it doesn't.

That's what this expert_mask holds. You can see that previously we got, for every token, two experts, two indices. So for this particular expert in the for loop, we are checking, in that list of tokens where each token has two experts, whether the current expert is contained within that pair, and we check this for all of the tokens at once.

This may be a bit confusing, but if you rewatch this or ask an AI to help you, eventually you will understand it. So expert_mask is (batch_size, sequence_length): for each token it has True or False, meaning this token does or does not need to be processed by this expert. Then "if expert_mask.any()" checks whether any of the values in the mask is True: if this expert needs to process even a single token, we proceed. Next, get the tokens for this expert.

expert_input is equal to x indexed with expert_mask. This part is easiest to show on an example. Say we have our x: a first conversation with a first, second, and third token, and a second conversation with three tokens as well. For expert zero, say the first token is sent to expert zero, the next two are not, and then it's no, no, and yes for the second conversation. Each of the tokens has its own True or False, and the shape of this expert_mask is (2, 3): two conversations, each having three tokens. I will show you this example in Python code. So this is our mask: True, False, True for the first conversation and False, False, True for the second. If we apply this mask to x, we get the first token, the third token, and the sixth token, since those are the ones with True. It's now just one sequence of tokens: the batch and sequence dimensions are flattened into a single dimension, and we are left with only the tokens that had True. So if x, the full input, has 100 tokens and only 10 tokens are selected, expert_input will be a smaller tensor of size 10.
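Here is a toy version of that masking step, with made-up numbers, just to show the flattening behaviour:

```python
import torch

# Two conversations, three tokens each, d_model = 4 (made-up numbers).
x = torch.arange(24, dtype=torch.float32).reshape(2, 3, 4)

# True wherever the current expert should process the token
# (the first, third and sixth token overall in this example).
expert_mask = torch.tensor([[True, False, True],
                            [False, False, True]])

expert_input = x[expert_mask]
print(expert_input.shape)  # torch.Size([3, 4]) -- batch and sequence are flattened away
```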

Then in the code we add: expert_output equals self.experts, choosing this particular expert (experts is just a module list of experts), and pass in expert_input, the list of tokens for this particular expert. That passes the selected tokens through our expert, which is a feed-forward network: linear, activation, linear. We defined this at the beginning of the video; in reality we are using SiLU rather than ReLU, since it's more complex and a better function. You would use ReLU only for some simpler things, for example in DeepSeek sparse attention and a few other places. And remember, we just need to pass in a single token when processing it: that token is not looking at the previous tokens, which happened in the attention mechanism. In this feed-forward network, we don't pass the token and then look at all previous tokens. That was actually confusing to me at first as well, but I realized that was the attention mechanism; here we just pass the token itself, the vector and nothing else, through this expert.

Then comes a slightly trickier part. I said each token is passed through two experts, and for each of those experts we need to multiply the output by some weight and then add the results. If the first expert for that token has weight 0.8, we multiply the output of that expert for that token by 0.8 and add it to 0.2 times the output of the second expert for that token. When we add those weighted outputs, that's our final processed token.

To understand the next piece of code, let's look at this scenario. We are selecting the top two experts as always, and say we are currently processing expert five, so all of the tokens for expert five. Say the current token has experts two and five, with a weight of 90% for expert two and 10% for expert five. That's the data we already have. Here we need to grab this 0.1, or 10%, and multiply expert five's output by it; we must not accidentally grab the 0.9 instead. Expert five's weight is 0.1. The first line of code we add is mask_for_expert equals (expert_indices == expert_index). Looking at this line: if expert_indices is (two, five) and expert_index is five, this yields False for the first one, and 5 == 5 gives True, so mask_for_expert will be [False, True] for this particular token.

Now let's add this in the code: I'm just going to write mask_for_expert equals the line of code I just explained. Then let's see the next line: positions equals mask_for_expert indexed with expert_mask. Let's see what this does. Remember, two experts are chosen for this token, each with its weight, so 0.9 and 0.1, and we have [False, True] because we are currently processing expert five. Now we need to turn this False/True into zero and one as indices; we will use the one to select the second weight, 0.1. You can see the weight at index zero and the weight at index one, and the first slot is False while the second is True. So we need to convert this [False, True] into just an index saying that we want to pick the entry at index one.

This may get a bit confusing, so looking at this line of code again: positions equals mask_for_expert indexed with expert_mask. Let's first remind ourselves what these two are. expert_mask is for this particular expert we are processing, expert five, showing True for every token it needs to process and False for every token it doesn't. And mask_for_expert: remember, if token zero has experts two and five, it will be [False, True] (False where this expert isn't contained), and if another token has experts five and three, it's [True, False] for expert five. So mask_for_expert says, per token, at which of its slots this particular expert sits, while expert_mask is for this particular expert, showing which tokens in the sequence it processes.

Now check this out; this is interesting. Both of these arrays describe a list of tokens: the first token, whether it uses this expert or not; the second token; the third; and the same in the other array. What we will do is notice that the second token is not using this expert, so we just remove it completely. That's one way to do it; you could also remove rows where both entries are False. But it's faster and easier to use the expert_mask array to remove the tokens that are not being used. So this line in our code removes those tokens, and we are left with only the tokens that are processed by this expert. Gemini also explains what I just described, and you can read it if you want. At the end, in this example, instead of having three tokens we will have just two, because the middle token is not processed by this current expert five. We will see later how we use these positions, which will also make it easier to understand.

Now we have .float() and .argmax over the last dimension. Let me continue with the example to explain the float and argmax. We now have the filtered tensor, the filtered list of just the two tokens that use expert five: it's the second selection in one token and the first selection in the other. When we call .float(), it converts True/False into one or zero: False becomes zero, True becomes one. Because we converted it to numbers, we can now use argmax, and -1 means along the last, innermost dimension. So we compare the zero and the one: which one is bigger? Argmax finds the bigger one, and its index is one. So the result of argmax for that token is one, because it looks at the index of the biggest number, and the result of argmax for the other row is zero; it's just the index of the biggest number there. You can read through Gemini's explanation as well; I just explained it. So positions will be [1, 0]: for the first token, the weight for this particular expert is at index one, and for the second token, the weight of expert five is at index zero. We will pick those weights to multiply the output of expert five later.
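A tiny reproduction of that scenario (expert five; one kept token chose experts (2, 5), the other (5, 3)):

```python
import torch

expert_index = 5
# The two tokens kept for expert five and their selected expert indices.
kept_indices = torch.tensor([[2, 5],
                             [5, 3]])

mask_for_expert = (kept_indices == expert_index)    # [[False, True], [True, False]]
positions = mask_for_expert.float().argmax(dim=-1)  # tensor([1, 0])
print(positions)  # expert five's weight sits at slot 1 for the first token, slot 0 for the second
```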

Next, gather the weights only for the relevant tokens. Again, let me do all of this; we also have a squeeze. Let's explain this part of the code, continuing with the same example. Token zero and token two are processed with expert five, and these are their weights; as I said, the index is one for the first and zero for the second, which is what we calculated previously. Token one is not processed, so it's filtered out. The positions we just calculated are the vector [1, 0], so index one and index zero, and router_weights is the full tensor of weights for all tokens. In the first step we compute the filtered weights: we grab the weights only for the tokens we are processing with this expert, using expert_mask, which, as you remember, shows which tokens are processed by this expert. That will be token zero and token two. So if these are the router weights, we ignore the middle token, because only token zero and token two are being processed. When we index router_weights with expert_mask, we get an array containing the weights for token zero and token two. Now this is a 2x2 matrix, or tensor. But if you look at the indices above, their shape is just two: one array of two values.

values. So uh we want to pick we want to pick

this guy and this guy and we can do that using this gather.

But to use this gather we need to match their shapes.

So if this is 2x two we will also make this 2x two by putting each of these into its own array.

So that's why we will just unsqueeze.

If I scroll down, we will unsqueeze the last dimension.

Which means it's going to look like this. Now it will each of them will be

this. Now it will each of them will be in its own array. And now we can use gather to pick

uh the index one. So whatever is at index one here and whatever is at index zero here.

So if I scroll down to show you dot gather minus one and then indices.

Indices is actually just our positions.

So position one and position zero.

And so that's how we select and minus one means along the last dimension or innermost dimension. So we process this one. Pick index one. We process

this one. Pick index one. We process

this one. So this is the innermost dimension. So I recommend you play

dimension. So I recommend you play around with this. You need to maybe get some feeling for like these dimensions if you don't have strong basics. You can

also check that in my AI from scratch course, AI research from scratch. Or you

can ask Chad GPT to give you more examples to give you some tasks that you can run in Google Collab or in Python in your environment.

That will help you get a feeling for squeezing, unsqueezing, indexing, and all this good stuff. At the end, we get a 2x1 matrix. (Did I earlier say that the indices were 2x2? I meant 2x1.) So it's a 2x1 matrix: for the first token, this is the weight we multiply expert five's output with, and for the second token, expert five's output is multiplied with the other weight. Finally we apply .squeeze() on the last dimension, which removes the innermost dimension. Our current tensor is 2x1, but we want to remove that innermost dimension so we just get an array of these numbers; now the shape is simply two. It was (2, 1), we removed the last dimension with squeeze, and now it's just (2,); you can see what it looks like when you squeeze the innermost, last dimension.
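Continuing the same toy example, the gather/unsqueeze/squeeze step looks like this (the 0.7/0.3 weights for the second token are made up for illustration):

```python
import torch

# Weights of the two kept tokens (top-k = 2).
filtered_weights = torch.tensor([[0.9, 0.1],
                                 [0.7, 0.3]])
positions = torch.tensor([1, 0])  # computed with argmax above

expert_weights = filtered_weights.gather(-1, positions.unsqueeze(-1)).squeeze(-1)
print(expert_weights)  # tensor([0.1000, 0.7000]) -- this expert's weight for each kept token
```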

Then, as I showed you earlier, we already have the output tensor we are accumulating into: when we process each expert individually, we just add its outputs to this tensor. The output tensor is the same one defined before the expert loop. Now, as we process the experts, we add each expert's output to it, but only for the tokens being processed by that expert (let's ignore the unsqueeze for now). We have expert_weights, which is just our array of weights, one per token, used to multiply that expert's output. I explained this already and I don't want to repeat too much; you can also ask ChatGPT to explain it again if you want. We need the unsqueeze so we can multiply properly, and then we add the result to the existing output; we are accumulating every expert's output in this total output. This works out because, for a particular token, the weighted output from one expert gets added, and then the weighted output of the other expert gets added for that same token. At the end, that token is, as I said, the weighted output of one expert plus the weighted output of the other expert; that's our token.

By the way, if this is not 100% clear, don't worry. If you are just beginning with this, it will be difficult after only a couple of passes; people who've been doing this for a few years will find it easy to understand, but don't get discouraged thinking "everyone else understands it and I don't," which is expected if you don't have much experience yet. So that's it for the experts.
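Pulling the whole loop together, here is a condensed sketch of the mixture-of-experts forward pass we just walked through; the variable names are my own shorthand and may not match the repository exactly:

```python
def forward(self, x: torch.Tensor) -> Tuple[torch.Tensor, Optional[torch.Tensor]]:
    # x: (batch_size, seq_len, d_model)
    batch_size, seq_len, d_model = x.shape

    # Ask the router which experts each token should use.
    top_k_weights, top_k_indices, router_probs = self.router(x)

    # Output starts as zeros and accumulates each expert's weighted contribution.
    output = torch.zeros_like(x)

    # Process each expert, one by one.
    for expert_idx in range(self.num_experts):
        # True for every token whose top-k selection includes this expert.
        expert_mask = (top_k_indices == expert_idx).any(dim=-1)     # (batch, seq)
        if not expert_mask.any():
            continue

        # Gather just those tokens and run them through the expert.
        expert_input = x[expert_mask]                               # (n_tokens, d_model)
        expert_output = self.experts[expert_idx](expert_input)

        # Find which of the token's k slots belongs to this expert,
        # then pull out the matching routing weight.
        mask_for_expert = (top_k_indices == expert_idx)             # (batch, seq, k)
        positions = mask_for_expert[expert_mask].float().argmax(dim=-1)
        expert_weights = top_k_weights[expert_mask].gather(
            -1, positions.unsqueeze(-1)).squeeze(-1)                # (n_tokens,)

        # Add this expert's weighted output back into the right token positions.
        output[expert_mask] += expert_weights.unsqueeze(-1) * expert_output

    # Auxiliary load-balancing loss, only computed during training.
    aux_loss = None
    if self.training:
        aux_loss = self.compute_load_balancing_loss(router_probs, top_k_indices)

    return output, aux_loss
```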

Now let's do the auxiliary loss. The goal, again, is to make sure that every expert gets an equal amount of tokens. If some experts are getting more tokens, the loss will be higher, and the neural network will learn to minimize the loss, so it will learn to divide tokens equally among the experts. Let's first define the starting value. If we are training, we say auxiliary_loss equals self.compute_load_balancing_loss (we will define this function later; we don't have it yet), and we pass it the router probabilities and the expert indices.

I will explain; for now, just note that we return the output and the auxiliary loss from this forward method of the mixture of experts: the processed tokens and the auxiliary loss. Remember that the router probabilities are, for each token, the probabilities over every expert for that token, so the shape is (batch_size, sequence_length, num_experts): all of the experts for every token in the sequence. For a particular token it looks like a list of probabilities, and we will use this in our auxiliary loss to make sure that every expert gets an equal amount of tokens.

This is also useful for backpropagation, because we are passing the full probability distribution, so PyTorch can adjust the weights to produce a better probability distribution. It answers: how strongly did the router prefer expert X on average? We don't want it to prefer any expert; we want it to be equal for each expert. And again, let's remind ourselves: expert_indices is a tensor containing the integer IDs of the top-k experts that were actually selected for each token. For each token it says, for example, two and five, meaning that token got experts two and five selected. So the loss is simply actual usage, which we get from the indices I just explained, times average confidence, which we get from the probabilities, the router probabilities I explained first. But don't worry, we will understand all of this better; let's see how it's coded.

In our code, let's define the function we just used: compute_load_balancing_loss. It takes self, then the router probabilities we mentioned, and the expert indices; we'll see later how these are used so you understand them better. It returns a torch.Tensor, which will actually be just a single number, the loss. Let me add this comment: compute the auxiliary loss to ensure balanced expert usage; this encourages the router to distribute tokens evenly across experts.

Let's code it first and then see what it does on an example. First, compute the fraction of tokens routed to each expert. F.one_hot: we are one-hot encoding, passing in the expert indices with num_classes equal to the number of experts, and converting the result to float. Let's see what this does with an example. Say that for a particular token, experts one and three are selected, and we are selecting two out of four experts. The output of this line will be two one-hot encoded vectors: expert zero is not selected, expert one is selected, and the remaining slots are zero; that's the first selected expert for this token. The second selected expert is at index three. So for every token we generate these two one-hot encoded vectors to show which of the experts are selected for that particular token. This line converts the shape from (batch_size, sequence_length, top_k), where top_k is two, into (batch_size, sequence_length, top_k, num_experts): each of the top-k selections now has a one-hot vector of size num_experts.

Let me summarize; I think it's quite easy. Imagine expert_indices for one token is just an array of two numbers, say one and five. Instead of one and five, we get one-hot encodings: with eight experts, the one becomes a vector with a 1 at index one and zeros elsewhere, and the five becomes a vector with a 1 at index five. That's going to be the expert_mask. Since we are selecting out of eight experts, these vectors have length eight, not four. Then we convert all of these integers into floating-point numbers using .float(), because later in the code we need to use this in a fraction, and Python needs it as a float. We will use these one-hot encoded vectors to count how many times each expert has been used.
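A quick example of that one-hot step for a single token that picked experts one and five out of eight:

```python
import torch
import torch.nn.functional as F

# A single token that selected experts 1 and 5 out of 8.
expert_indices = torch.tensor([1, 5])
expert_mask = F.one_hot(expert_indices, num_classes=8).float()
print(expert_mask)
# tensor([[0., 1., 0., 0., 0., 0., 0., 0.],
#         [0., 0., 0., 0., 0., 1., 0., 0.]])
```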

Next, tokens_per_expert equals expert_mask.sum over dims (0, 1, 2), which I will explain, divided by expert_mask.sum(). Let's understand it with an example. Say we have just one batch, meaning one conversation; that conversation has two tokens, and each token picks two experts. And say there are three possible experts in total, so each of the two tokens gets two experts out of three. The total number of slots is four: two tokens, each having two experts. Let's see what these slots are.

Say token zero gets experts zero and one, and token one gets experts one and two. The expert_mask is then the one-hot encoded vectors we just generated: one batch, which is one conversation, one sequence; that sequence has two tokens; each of the two tokens has two experts selected; and each of those experts is encoded as a one-hot vector. So for the first token we have the one-hot vectors for expert zero and expert one, and the second token has expert one and expert two.

Then .sum with dims (0, 1, 2): we sum across the batch, the sequence length, and the top-k dimension. What this does is add up how many times expert zero appears (just sum all of its numbers), how many times expert one appears, and how many times expert two appears. I know it's a bit confusing: the code says we are summing across the first three dimensions, but when you look at the result it can feel like we are summing over the experts themselves, which looks weird at first. You just need to develop intuition by looking at it a couple of times and thinking about it. At the end, after this sum, we collapse all of the information about token position and everything else, and we are left with just how many times each expert was selected: expert zero once, expert one twice, expert two once, so the resulting vector is [1, 2, 1]. That comes from this sum, which again feels odd: we specify the three dimensions that get collapsed, and then it looks like we are summing across the last one, but we are not. Summing across the last dimension would mean adding 1 + 0 + 0 within a single one-hot vector; instead, we keep the last dimension and add up everything else. That sum is the first part, and then we divide by expert_mask.sum(): we take this [1, 2, 1] array and divide by the total number of routing slots.

So you see, our goal is to convert this [1, 2, 1] into something that adds up to one, so we get probabilities, or percentages I should say: 25%, 50%, 25%. The first expert was selected 25% of the time, the second 50% of the time, and the third 25%. That's our goal. The way we get that is by dividing the count for every expert by the total number of these counts. There are four in total: one, two, three, four. That's how we get them to add up to one.

So we can look at them as percentages, and our load-balancing loss, or auxiliary loss, will try to push the weights of the gate so that this vector becomes roughly [0.33, 0.33, 0.33], equal for every expert.
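To make this concrete, here is a tiny standalone sketch of that counting step, using the example above (one batch, two tokens, top-2 out of three experts). The variable names and shapes here are my own illustration, not necessarily the exact code from the repo.

```python
import torch
import torch.nn.functional as F

# Token 0 picked experts 0 and 1; token 1 picked experts 1 and 2.
top_k_indices = torch.tensor([[[0, 1], [1, 2]]])               # (batch=1, seq=2, top_k=2)
expert_mask = F.one_hot(top_k_indices, num_classes=3).float()  # (1, 2, 2, 3)

# Collapse batch, sequence and top-k dims: how often was each expert picked?
tokens_per_expert = expert_mask.sum(dim=(0, 1, 2)) / expert_mask.sum()
print(tokens_per_expert)  # tensor([0.2500, 0.5000, 0.2500])  -> the [1, 2, 1] counts over 4 slots
```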

Let's see the next line of code: compute the average probability of routing to each expert.

So router_prob_mean is going to be the router probabilities, which we already passed in; let's remind ourselves. We take the mean across dimensions zero and one.

Remember that router probabilities is the output of the softmax before we pick the top k. So it's the raw probability distribution across all eight experts for each token, how likely each expert is for that token, and it sums up to one per token. Then we take the mean across batch and sequence length.

This tells us, on average, how strong the router's desire was to send tokens to expert X. Let's see an example. Let's

say we have a batch, or sequence, of two tokens and three experts. For the first token these are the probabilities (not logits, probabilities), and for the second token these are the probabilities.

So we average these columns vertically; that's what the mean across batch and sequence does. In the first spot the average is 0.1, in the second spot the average is 0.6, and in the last spot the average is 0.3.

So the result, router_prob_mean, is going to be this vector: for each expert, the first expert has this average probability, the second this one, and the third this one. So that's our router

probability mean.
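Again as a tiny illustration with made-up numbers matching the example, the mean over batch and sequence looks like this:

```python
import torch

# (batch=1, seq=2, num_experts=3): softmax output per token, each row sums to 1.
router_probs = torch.tensor([[[0.1, 0.7, 0.2],
                              [0.1, 0.5, 0.4]]])

# Average over batch and sequence: the router's mean "desire" per expert.
router_prob_mean = router_probs.mean(dim=(0, 1))
print(router_prob_mean)  # tensor([0.1000, 0.6000, 0.3000])
```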

Compare this to our tokens_per_expert, which we calculated earlier. tokens_per_expert goes through the top-k selection, so we cannot backpropagate through it; it's not differentiable. But PyTorch can backpropagate through the router probabilities, because we take all of the probabilities.

So when the router changes its weights slightly, the logits, and the probabilities created from those logits, might change for some expert from 0.8 to 0.81. So that path is differentiable. Check out my AI Researcher from Scratch course to learn more about derivatives and gradients. But when you are just selecting, if you change the weights slightly, the index of the chosen expert will still remain five; the index does not smoothly move to something like 5.0015.

That's why it's not differentiable: you just selected an expert, and there is no smooth, continuous transition between experts. By "not smooth" I mean not continuous. If the weights change enough, the selection snaps from the expert at index five to the expert at index six, or one, or whatever, and that snapping is what's not differentiable; there is no continuous function between them.

So in our loss formula, the first part, where we select indices, has no gradients, and the second part, which carries the probability distribution, is differentiable. Gemini explains it in detail here, and I will also explain it.

Non-differentiable versus differentiable means that for the differentiable part, PyTorch knows how to calculate how to change the weights so that, for example, the probability gets higher for a given expert; it can calculate that with the chain rule from calculus. But it cannot use the chain rule for the selected-index variable, because it's not differentiable.

It's not a continuous function. Check out my course, as I mentioned. So why do we need both? Why do we need the probability distribution over all experts for every token and the indices of the experts that were chosen to calculate this loss? Because the router can cheat if you only look at probabilities.

Let's say we have 100 tokens and two experts, and for every single token the router gives 51% to expert A and 49% to expert B. If we only looked at probabilities, this would look very balanced, so it would have a low loss. But the problem is we always choose the top one, the highest probability, so this would choose expert A every time and never expert B.

So if we just used probabilities to calculate the loss, it would seem like this is good, but it's actually not. We also need to include the selection in the loss.

The main thing to note is that we are not sampling based on these probabilities; we are not picking expert A 51% of the time and expert B 49% of the time. We just pick the highest-probability one, top-1, every time. Let's see how our formula applies to this example. The usage fraction f for expert A is 100%, which is 1, and its average probability is 0.51. So the term for A is 1 times 0.51, which is high, plus 0 times 0.49 for expert B, which is just zero.

So that sum is 0.51.

Now let's expand our example to understand this better and look at three scenarios to see how the math actually plays out. But first let's code it. The auxiliary loss is the sum of tokens_per_expert times router_prob_mean, which I just explained, times self.num_experts, and we'll see on the example why. Then this just returns the auxiliary loss. So this is it.

Ah, sorry, times the load-balancing weight as well. This is just how important this loss is, how much we want the neural network to focus on it. We can make it bigger or smaller: if this weight is bigger, it makes the loss bigger. Usually this number is small.

So we don't put too much effort into this load balancing loss. The main

effort is going to be on the main next token prediction loss. But let's just see some math examples to see what happens in different situations here.
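Before the scenarios, here is roughly how those pieces fit together as one function. This is a sketch in the spirit of what I describe above, not the exact code from the repo; names like load_balancing_weight follow the transcript, and the numbers below just replay the cases we are about to walk through.

```python
import torch

def load_balancing_loss(tokens_per_expert: torch.Tensor,
                        router_prob_mean: torch.Tensor,
                        num_experts: int,
                        load_balancing_weight: float = 0.01) -> torch.Tensor:
    # tokens_per_expert: fraction of routing slots each expert actually got (not differentiable)
    # router_prob_mean:  mean softmax probability per expert (differentiable)
    aux_loss = (tokens_per_expert * router_prob_mean).sum() * num_experts
    return aux_loss * load_balancing_weight

# The three two-expert scenarios (weight set to 1.0 so the raw values show):
t = lambda *v: torch.tensor(v)
print(load_balancing_loss(t(1.0, 0.0), t(0.51, 0.49), 2, 1.0))  # 1.02: 51/49 probs but A always picked
print(load_balancing_loss(t(1.0, 0.0), t(0.90, 0.10), 2, 1.0))  # 1.80: even more collapsed
print(load_balancing_loss(t(0.5, 0.5), t(0.50, 0.50), 2, 1.0))  # 1.00: perfectly balanced
```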

Let's say we have two experts and check our formula. The first example I already explained. Now we multiply that sum by two because we have two experts. That's done to normalize the loss value: if you didn't multiply by the number of experts, the more experts you have, the smaller your loss would be, just because you have more experts, not because your model is improving.

So let's further investigate why we multiply by the number of experts. In a perfectly balanced system, every expert gets exactly 1/N of the traffic, one over the number of experts, because the traffic is divided equally. So the fraction of tokens an expert gets is 1/N, and the average probability of that expert being selected is also 1/N, because each expert should have equal probability and an equal amount of tokens.

By that logic, each term in the sum is proportional to 1/N², so it shrinks quadratically as the number of experts grows. Our earlier example isn't exactly this idealized case, because the fraction there comes from actual counts rather than being exactly 1/N, but the point is the same: summing the N terms gives N times 1/N², which is 1/N, so without a correction the loss shrinks as you add experts just because you have more experts. That's why we multiply by the number of experts, to cancel that 1/N scaling.

So in our case, where expert A gets 51% average probability and always gets selected, we add expert B, which never gets selected. The sum is 0.51, we multiply by two, and that's our final loss, 1.02. Two is the number of experts, which I just explained we multiply by to keep the loss normalized.

Scenario two: expert A gets 90% average probability and B gets 10%, and A always gets selected. Now the loss is even higher (1 × 0.9 × 2 = 1.8), because not only does A always get selected, its probability is always high as well. You can see what the multiplication does: if both factors are high, the product is very high.

Now the perfectly balanced case: the probabilities are 50% for both, and each gets selected 50% of the time. Each term is 0.5 × 0.5 = 0.25, the two terms sum to 0.5, and times two that is 1. So that's the perfect loss. It is a bit weird that the perfect loss is not zero, since zero usually means zero error, but this is how this formula works for now. DeepSeek and others are trying to replace this auxiliary loss with different methods, so the neural network doesn't need to learn two things at once, the main loss and this auxiliary loss.

Okay, congratulations. We finally wrote this file, components.py. I recommend you commit it to GitHub, and do that occasionally.

I think the other files will be faster to write and easier to understand.

Let's go to layers.py. Here we will define the attention mechanism and the transformer block.

I'm going to import torch, torch.nn.functional, and also, from torchtune.modules, RotaryPositionalEmbeddings. This needs to be installed, so let's create a new file, requirements.txt, where we list the libraries we need: datasets (from Hugging Face), transformers, torchtune (the one I just mentioned), torch, torchao (we will see this later), and matplotlib for graphs. We will use datasets to load data and the tokenizer; transformers, I'm not even sure we use in this one. You will need to set this up with a virtual environment and then pip install.
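For reference, the requirements.txt ends up looking roughly like this (unpinned versions are just my placeholder; pin them if you want reproducibility):

```text
datasets
transformers
torch
torchtune
torchao
matplotlib
```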

You can check my video on my AI research setup (code locally, run on Google Colab or a GPU cloud, SSH, GitHub); it covers how to install and set all of this up, link in the description.

Rotary positional embeddings were invented by Jianlin Su. I always mention him on my channel. He has a blog and an X account you can follow if you're interested in reading about research, and he works at Moonshot AI. He invented this after about 11 years of blogging, so you cannot do it overnight; it took him 11 years, and now it's used by everybody. So in 10 years people can become very good, successful researchers.

I'm also doing from components import MixtureOfExperts, which we just coded; we are importing what we just wrote. And now let's create the rotary positional embeddings wrapper.

We initialize it with the dimension of the model. This is the token dimension; in what we just coded in components I called it the token dimension, so maybe I should use the same name. It's just the token embedding size, or model dimension, plus a maximum sequence length, and we initialize the superclass. Then we construct the module, passing in the dimension, the maximum sequence length, and the base used for calculating RoPE.

I have two videos on this, one on rotary positional embeddings and rotation matrices and one on RoPE itself; you can watch both. Maybe the first one is more important, and I'm going to leave both in the description so you can understand RoPE. You can also watch other RoPE videos on YouTube. It is a bit complex and will take you a few passes to understand, but you don't need to master it right now. You can watch it now or leave it for later, because the way we will code it is very simple.

So let me just show you here. We will just pass our input through the RoPE module and that's it. We will not code this from scratch. You could code it from scratch; the reason I'm not is that this is a very fast implementation, and it trains the model faster. The videos I showed you, and other videos, will show you how to code it from scratch.

Sorry, I forgot to say: the main purpose of RoPE is to tell the model, when you have a sequence of tokens, the order of the tokens. "Dog is chasing a cat" or "cat is chasing a dog": it's very important which word is where; it indicates who's chasing whom. You do that with RoPE. If you don't use RoPE (I had a bug where my RoPE was not applied), the model gets no information about which word is at which position.

So the model just had a bag of words it needed to figure out, and it struggled. RoPE works by rotating pairs of dimensions of the token embedding. You have a token embedding vector; it takes these two dimensions as a pair, then the next pair, then the next, and rotates each pair with a rotation matrix. I'm getting into math here: it looks at those two numbers as a 2D vector and multiplies it with a rotation matrix to rotate it. You can understand this in more detail in the courses I mentioned. How much each pair is rotated is how it encodes position; that's how the model knows which token is where, by how much tokens are rotated relative to each other.

But this RoPE implementation has a specific requirement for the input shape: it should be batch size, sequence length (that's T, I guess T is for time), number of heads, head dimension.
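As a rough sketch of how that wrapper gets used (based on my understanding of torchtune's RotaryPositionalEmbeddings; treat the exact argument names as assumptions and check the torchtune docs):

```python
import torch
from torchtune.modules import RotaryPositionalEmbeddings

head_dim, max_seq_len = 64, 512
rope = RotaryPositionalEmbeddings(dim=head_dim, max_seq_len=max_seq_len, base=10_000)

# torchtune's RoPE expects (batch, seq_len, num_heads, head_dim).
q = torch.randn(2, 16, 8, head_dim)   # batch=2, seq=16, heads=8
q_rotated = rope(q)                   # same shape, each pair of dims rotated by position
print(q_rotated.shape)                # torch.Size([2, 16, 8, 64])
```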

We'll talk more about this in the attention mechanism, like what the number of heads and the head dimension are. So let's go to multi-head attention. We initialize it with the dimension of the model, which is just the token embedding size, the number of heads, the max sequence length (the max context window), and a 10% dropout.

You know this is the architecture of the transformer, and these lines mean the token is going through the transformer. That token has some size, it's a vector of some size, and that's the model dimension. The size of the token is the model dimension; that's the main dimension for the model. So I have good

tutorials for the attention mechanism, which I think I mentioned at the beginning of the video: coding Llama 4 from scratch, coding DeepSeek-V3 from scratch, coding Qwen3 from scratch, and the AI Researcher from Scratch course. You can check those, and other attention mechanism videos; it's been explained many times on YouTube, so I will not explain it fully now.

All of the links are below the video. I will assume you know what query, key and value vectors are.

When I say heads, what I mean is that the query, key and value vectors are each divided into equally sized chunks, the heads, and each head does its calculation separately. So the first heads of the query, key and value interact with each other, the second heads interact, the third heads interact, and so on. I'll summarize how this attention works later, but it will be more of a summary than a full explanation.

Okay, let's go into __init__. Model dimension: I'm just going to save these variables into local attributes.

d_k, I think, is the dimension of the key; we'll see. Yes, it's the model dimension divided by the number of heads. As I said, we're going to make query, key and value the same size as the model, as the token vector. It doesn't have to be the same size; a lot of the time it's smaller, but in a small large language model it can be the same size

and so let's create this query-key-value projection, which is going to be a linear layer. Let me quickly explain what query, key and value are. Each token has a query, a key and a value, and when a token is looking at previous tokens, it takes its query and multiplies it with every key of the previous tokens. So the query is like: what information am I looking for?

The goal of the attention mechanism is to get some information from previous tokens into this token; it's like enriching it with context.

So what information, and how, are we going to get from previous tokens into this current token embedding? Each token has a query, which is like "what am I looking for", and each token also has a key, which is like a description of what information it contains.

When you multiply the query of this token with the key of a previous token, it's a dot product (you can look up what dot products are). If the number is small, there is not much affinity: these tokens are not so relevant to each other, and it's not important to take information from that token into this one. If the dot product between this query and some other key is high, it means that token is very relevant for this token, and later we will take a lot of information from it.

So how do we take that information? We use the value for that. Every token also has a value, and the difference between key and value is this: the key describes what the token has, what information it holds, what it will give you, and the value is the actual information it gives you. The description of the thing and the thing itself are separate.

To summarize: the query is "what am I looking for", the key is a description of the thing I contain, and the value is the actual thing I will give you.

So for this token, when I'm trying to blend in all of the information from previous tokens, I will add every single value vector, each one scaled first, to this token's value vector. Remember, I'm not adding value vectors to the token directly; I'm adding value vectors to its value vector. I'm just adding value vectors.

So I add every value vector with a plus. That's how you combine information from two different vectors: if you have two vectors, each representing some information, and you combine them with a plus, it adds that information. It works in neural networks; I don't know if it works like that in other fields, but in neural networks that's how you combine information.

So anyway, for the current token I'm processing, I'm going to add every previous value, but each value is multiplied by that single number, the dot product between query and key, which says how important this value is. If the dot product is a small number close to zero, the value gets multiplied by a number very close to zero, so adding it only adds tiny numbers; it doesn't change this token's value much. But if the dot product is higher, you basically add more information; the information is not diminished as much, and you take more from that token, because the value is the information.

That was kind of a summary. At the end, every token has a value that's enriched with the values of all of the previous tokens, not the future tokens. We don't want it to look at future tokens because we are trying to make it predict the future tokens; it doesn't know them. So it only looks at previous tokens, and now this value contains information about what the token itself is plus the previous context. For example, there is a tree, a fish, the wind is blowing hard, and then this token is maybe a flag or a ship that's moving, and so it takes in the context that the wind is blowing.
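That summary, written as a few lines of PyTorch (a conceptual sketch of single-head causal attention, not the fast fused version we actually call later):

```python
import torch
import torch.nn.functional as F

T, d_k = 4, 8                                 # 4 tokens, query/key/value size 8
q, k, v = torch.randn(T, d_k), torch.randn(T, d_k), torch.randn(T, d_k)

scores = (q @ k.T) / d_k**0.5                 # dot products: how relevant is each earlier token
causal = torch.tril(torch.ones(T, T)).bool()  # each token may only look at itself and the past
scores = scores.masked_fill(~causal, float("-inf"))
weights = F.softmax(scores, dim=-1)           # importance of each previous token
enriched = weights @ v                        # weighted sum of value vectors, one per token
print(enriched.shape)                         # torch.Size([4, 8])
```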

And so here I'm just generating these query, key and value vectors. But here is the trick: you don't need to generate them in three separate calls, you can combine it into a single calculation. Within your GPU you don't want to send information back and forth, which is slow; moving data is often slower than the computation itself. So we calculate it all at once instead of projecting query, key and value separately.

So the linear layer goes from d_model to d_model times three. It's times three because we produce all three of them, query, key and value, and as I mentioned each of them will be the same size as the token. It doesn't have to be: in huge LLMs I'm pretty sure the query, key and value are a lot smaller, five or ten times smaller than the huge token vector, but in a small model like this we can make them the same size. And we don't need a bias; I've heard of some people using a bias, but I believe it is more common without one.

And we also need this output projection, so it's

similar: the output projection converts the value back to the token. Remember, at the end of the attention mechanism we added up all the values, but now, for every token, we need to convert this value, which is enriched with information from other tokens, back into the token itself. That's what this output layer does, going from the value back to the token, and that happens at the end.

I will also define the rotary module. By the way, I'm not processing anything in these lines of code; I'm just defining things right now. Later we will have the forward pass to process this and you'll see how it works.

Then we define rotary, which is the class we made above. I just need to pass in the dimension of each head. Now this is a bit weird: we have the model dimension, which is going to be the same dimension as query, key and value, and we divide it by the number of heads to get the dimension of each head. What heads are is that, instead of calculating the whole query times key and adding the value, you separate it into independent parts. The first head of the query multiplies with the first head of the key of each token and then adds the first head of the value. Each head is separate, so heads can learn different things.

So I guess this d_k is really the dimension of a head, which is a bit confusingly named here; maybe it should be called head_dim or something, but it is what it is for right now.

Okay.

So let's go next: dropout. That's just passed in; it is the dropout probability. And let's now define the forward pass. This is our input, which is batch size times sequence length times token embedding, just all of the tokens and conversations, and we need these shapes: the first shape number will be the batch size, the second will be the sequence length.

Here I think I had a mistake, so I commented it; this should be deleted instead.

Anyway, I'm going to pass the input through the query-key-value linear layer to generate query, key and value. Let me see what this is. As you can see, this will be a single tensor whose last dimension is three times the model dimension, and now I want to separate it into three equally sized tensors: query, key and value. I'm going to do that here.

This is maybe a bit confusing. We already know that X has shape batch size, sequence length, token embedding dimension, but because we pass X through this linear layer, the last dimension becomes three times the token embedding dimension. Then we split this large vector, which contains query, key and value concatenated, into pieces. First, keep in mind that the first two dimensions remain the same: the number of different conversations and the order of tokens in each conversation remain the same. But now we do something with this long query-key-value vector. First we split it into three: query, key and value, split equally, just like that. Easy. And then each of them we split into heads, as I said.

So we split, let's say, the query into heads, and what remains is the dimension of each head. Maybe there are eight heads; let's say the query has a dimension of 800, so you have eight heads, each head with a size of 100. Just to summarize: each token will have three vectors, query, key and value; those are divided into heads; each head is a vector of a certain size, and there is some number of heads; and then the first heads of query, key and value interact with each other, the second heads interact with each other, and so on. Heads are independent from each other: the first heads only interact among themselves and the second among themselves, but there is no first head interacting with a second head.

Okay.

We need to do a permutation here. You see I'm putting the batch size as the second dimension. The dimensions are zero, one, two, three, four, and this is a bit confusing, but look at the indices: I'm taking the dimension that separates queries, keys and values and putting that as the first dimension. So now imagine all of the stuff about queries here, all of the stuff about keys and values there, and the batch size kept after that. This is a bit confusing; you may need some practice to understand it. Then we separate by heads, head is the next separation, and then sequence length.

Imagine we take the first head of each token and put them in a sequence: the first head of the first token, the first head of the second token, the first head of the third token, and so on. That's what this sequence length dimension is, just the list of tokens. And dimension number four is the head dimension, which stays the same. This is a bit confusing; maybe you can copy-paste it into ChatGPT and try to visualize the code. It will take some practice.

So just know that we separate by query/key/value, then by heads, and within each head there is a sequence of tokens. And why are we doing this permutation? Well, first of all we get queries, keys and values separated out of this one long vector, and each of them will then be shaped batch, head, sequence of tokens, size of the head. It makes sense to separate queries, keys and values; you need them separated (as the small shape walkthrough below shows).
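Here is a small shape-only walkthrough of that split and permutation, with random numbers and hypothetical sizes; the point is just to watch the shapes change:

```python
import torch

B, T, d_model, n_heads = 2, 5, 16, 4
head_dim = d_model // n_heads

x = torch.randn(B, T, d_model)
qkv_proj = torch.nn.Linear(d_model, 3 * d_model, bias=False)

qkv = qkv_proj(x)                              # (B, T, 3 * d_model)
qkv = qkv.reshape(B, T, 3, n_heads, head_dim)  # split into q/k/v and into heads
qkv = qkv.permute(2, 0, 3, 1, 4)               # (3, B, n_heads, T, head_dim)
q, k, v = qkv[0], qkv[1], qkv[2]               # each: (B, n_heads, T, head_dim)
print(q.shape)                                 # torch.Size([2, 4, 5, 4])
```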

Now this I think I messed up; I need to delete this as well. And we apply rotary positional embeddings to the queries and keys only. We need to transpose, to swap these two dimensions, before we pass them through RoPE, because that's what this RoPE module expects: batch, sequence, head, head dimension. You see: batch, T (time, the sequence length), head, and then head dimension. That's why we transpose, pass through rotary, and then transpose back to return it to how it was. I had a bug here before, someone fixed it, and then my large language model worked a lot better.

And finally, this will do the attention mechanism. We're not going to code attention from scratch here; you can watch Andrej Karpathy's video, or a million other videos, on coding the attention mechanism from scratch. It's the most-explained thing in AI research, I think. I'm using this because it's very fast. Usually you want to use these PyTorch-native functions; they're very fast and very good, and the engineers who build them are very good. You only want to build something from scratch if it doesn't exist in PyTorch.

So we pass query, key, value; is_causal means tokens only look at the previous tokens, they cannot see the future tokens, which is what we want in a language model. And dropout, if it exists. That's it, so easy: just calculate attention. It's very fast and very good.

Okay. And then for the attention output we're going to transpose something; let's see what we're doing here.

The attention output will be the same shape as our input, which was batch size, sequence length, token embedding dimension. Except that instead of the token embedding dimension we actually have the value dimension, because at the end we have, for the whole sequence of tokens, the values for each token, not the token embedding itself. But in this case the value size is the same as the token embedding size, so it will be literally the same size as the input X at the beginning.

Now, as I said, for each token we need to convert the value (and remember, this value also contains the added values from every other token as information) back into the token embedding. We do that by simply passing that list of values through the output layer, which converts the list of values into a list of tokens, and the shape stays the same.

Let's just understand this transpose and reshape. The transpose does the same thing as before: it swaps the head dimension and the sequence length dimension. Originally the attention output has shape batch, head, sequence, head dimension. The transpose swaps head and sequence, so we get batch, sequence, number of heads, head dimension, and then we combine number of heads and head dimension using the reshape. It figures that out automatically: sequence and batch stay the same because they are in the same place, and then it figures out that it needs to combine number of heads times head dimension to get this size. So all of the independent heads for this value get merged into a single vector, and now we have the shape batch, sequence, d_model. This is the list of values, enriched with context, and then we pass it through the output projection to get the list of actual tokens.
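Putting the whole attention module together, here is a compact sketch of how I understand it fitting together; the exact variable names, the RoPE wrapper, and the constructor signature are my reconstruction, not a copy of the repo's file.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchtune.modules import RotaryPositionalEmbeddings

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int, max_seq_len: int, dropout: float = 0.1):
        super().__init__()
        self.n_heads = n_heads
        self.d_k = d_model // n_heads                            # size of one head
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)   # fused q/k/v projection
        self.out = nn.Linear(d_model, d_model, bias=False)       # value space -> token space
        self.rotary = RotaryPositionalEmbeddings(self.d_k, max_seq_len)
        self.dropout = dropout

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, _ = x.shape
        qkv = self.qkv(x).reshape(B, T, 3, self.n_heads, self.d_k).permute(2, 0, 3, 1, 4)
        q, k, v = qkv[0], qkv[1], qkv[2]                         # each: (B, n_heads, T, d_k)

        # RoPE expects (B, T, n_heads, d_k), so swap, rotate, swap back.
        q = self.rotary(q.transpose(1, 2)).transpose(1, 2)
        k = self.rotary(k.transpose(1, 2)).transpose(1, 2)

        # Fast fused causal attention from PyTorch.
        attn = F.scaled_dot_product_attention(
            q, k, v, is_causal=True,
            dropout_p=self.dropout if self.training else 0.0)

        attn = attn.transpose(1, 2).reshape(B, T, -1)            # merge heads back together
        return self.out(attn)
```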

Next we have the simplest part of this whole file: the transformer block. Let's see what this is. We initialize with the model dimension and number of heads. Then the expert hidden dimension, for the mixture of experts; that was four times bigger than the dimension of the model, if you remember the feed-forward of the expert. Then there's a parameter for RoPE; I think I'm not using this, we'll see, I forgot if I use it later. This LoRA rank is for a different implementation, DeepSeek's multi-head latent attention; I forgot to delete it, so I don't think you literally need any of these; we'll see later. Then max sequence length, you need this; number of experts, you need this; top-k, selecting the top two experts; and 10% dropout.

Okay, start with the initialization of the superclass, and by superclass I mean the nn.Module; you want to initialize that within __init__ of this class. First initialize our multi-head attention class and pass in the necessary parameters: d_model, number of heads, max sequence length, dropout. Then the mixture of experts.

I'm going to save that as feed_forward; you could name it something else, but just pass in the parameters. Then normalization: we're going to use RMSNorm. You can watch my video "RMSNorm from Scratch", I explain everything there, link below. Normalization keeps the numbers from becoming too large or too small, like a thousand, ten thousand, a million, or 0.00001; we want to keep the numbers around one. That's what normalization does. We're going to have two of them, and we'll see later why. And dropout as well.

Let's see the forward pass of the transformer block. We pass the input X through attention, but first we normalize. If you remember, I'll show you the architecture of the transformer: in the original diagram, looking at the decoder and its attention mechanism, the norm goes after, but people figured out you should put the norm before, it works better. And for the feed forward we will also have a norm before the feed forward. In a decoder-only transformer we don't have the cross-attention part, so we just have attention and then go directly to the feed forward, with a norm before the feed forward and before the attention.

After we pass through attention there is a residual connection, which we'll see. The attention output first goes through dropout to regularize, and then there is that residual connection: we add the output of the attention plus the input. The idea is that we also want to preserve some information from the tokens themselves, not just process them through attention but preserve some of the initial information as well. As I said, when you add two vectors, it adds the information from both of them, from the original and from the processed.

Next, the same thing but through the mixture of experts. We take the X that just came out of the attention step, pass it through the second norm first and then through the feed forward, which is our mixture of experts, and that returns not only the output but also the auxiliary loss. Then dropout, and adding the residual connection to preserve some of the information from before the feed forward, and then we return the processed X and the auxiliary loss.
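Here is how I would sketch that block (pre-norm, residuals, MoE feed-forward returning an auxiliary loss); again a reconstruction under my assumptions, using the MultiHeadAttention sketched above and the MixtureOfExperts from components.py.

```python
import torch.nn as nn
# Assumes MultiHeadAttention (sketched above) and MixtureOfExperts (components.py) are importable.
# nn.RMSNorm needs a recent PyTorch; otherwise substitute your own RMSNorm module.

class TransformerBlock(nn.Module):
    def __init__(self, d_model, n_heads, expert_hidden, num_experts, top_k,
                 max_seq_len, dropout=0.1):
        super().__init__()
        self.attention = MultiHeadAttention(d_model, n_heads, max_seq_len, dropout)
        self.feed_forward = MixtureOfExperts(d_model, expert_hidden, num_experts, top_k)
        self.norm1 = nn.RMSNorm(d_model)   # pre-norm before attention
        self.norm2 = nn.RMSNorm(d_model)   # pre-norm before the MoE feed-forward
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Attention sub-layer: norm -> attention -> dropout -> residual add.
        x = x + self.dropout(self.attention(self.norm1(x)))
        # MoE sub-layer: norm -> experts -> dropout -> residual add, plus the aux loss.
        ff_out, aux_loss = self.feed_forward(self.norm2(x))
        x = x + self.dropout(ff_out)
        return x, aux_loss
```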

Next, let's combine all of this into a whole large language model. This is also a short part; I think it's very simple. The most difficult part was the experts, and because I never explained that in other videos I explained it here, so now this should be easy. My file llm.py disappeared for some reason. I also have this __init__.py; I don't know how it got created, but you will actually need it, and we're going to create it later. So don't worry if you don't have it, but make sure to have this llm.py file.

Let's first import the good stuff: from typing, Optional; from our configs module, ModelConfig (we don't have this defined yet, we'll make it); and, sorry, from models.layers, TransformerBlock, which we just created. So let's create our large language model. I named it MinimalLLM; maybe I could have given it a better name, it's like a small mixture-of-experts LLM. We initialize it with a config of class ModelConfig, the class we will define and import later, do the super initialization, and the config that we pass in, I'll show you what it is.

First, okay, here: the token embedding is going to be an nn.Embedding. This is a big matrix of token embeddings, one for each token in the vocabulary. So we have, let's say, 200,000 possible tokens. If you don't know what tokens are, watch my Llama 4 from Scratch course in the description below. Tokens are these subwords that the model is predicting, generating the next token again and again. So we have this big lookup table, like a dictionary (I'm not sure whether it's literally a dictionary or just dictionary-like), of vocab size, say 100,000 or 200,000 tokens, and for each token it holds a vector embedding. If, for example, the token "car" has some vector embedding and it's at position 70, we can just index into this table with position 70 and take out the vector embedding for the token "car". This is commonly explained everywhere, so I'll go through it quickly.

We're also going to do some dropout, a position dropout; we'll see it later on an example, I'm going to show it through an example.

We defined one transformer block that has attention and feed forward, but the large language model will have multiple of these blocks stacked. Tokens go through once, then again, and repeat and repeat. So we're going to create a list of these blocks, each block having attention and feed forward.

Okay, so it's going to be a list, and in that list you say TransformerBlock, which we just defined, and pass in all of the stuff it needs: model dimension, number of heads, feed-forward hidden dimension. As I said, we'll see later if we need these extra parameters or not; I think we don't need this one. What is this? I forgot; it's not good if you don't know what your variables are, but I think this one is used for DeepSeek's latent attention, so we will not need it, we don't use it anywhere, you can see it's only defined once in this whole file. Then max sequence length, number of experts: all of that will be passed in through the configuration file. When we are creating the large language model, we keep this configuration file separate so we can have multiple different large language models; we don't need to hardcode any single one into these files. These are just components, layers and so on.

We do this for the number of layers. Big language models I think have 30 or 40 layers, smaller ones have 15 or 16, and the absolutely huge ones, who knows how many they have, things like Grok 4.

Sorry, you don't need to change anything else here. Next, let's define normalization and dropout, and these are for the output head. Now, the output head is a separate thing from all of these layers; I never mentioned it until now. Let me pull up the architecture.

These are the layers, the green ones, and they get repeated many, many times. After all of the layers are repeated, at the end, we generate the next token. We're not generating the next token within this attention and feed forward; there we're just processing the current tokens in the context window. After they have been processed many times, we take the last token in the context window and convert it into the next token using these two things here, linear and softmax. This linear is the output head, that's what we call it, and interestingly it can be absolutely huge, it can be one third of the entire model. You're repeating all of these layers, say 16 times, and just this one single linear layer can be like a third of the entire model; you'll see why, especially for small models.

So, output dropout and output normalization, and now the weight tying. Okay, this is the output head. What the output head does is take the last token, of token size, and convert it into a vector of length vocab size. How many tokens do we have? That's the length of this vector. That vector assigns a score to each token, and the score tells us how much that token should be the next token. It's not a probability, it's more like an affinity; let's just say the bigger the score, the more this output head wants that token to be the next token, but the scores don't have to add up to one. It's not a probability distribution yet.

Okay, let's summarize. The LM head, the output head, converts the last token, which is now highly processed, into scores for every single possible token in the vocabulary, how strongly each should be the next token. But again, we're not talking about likelihood yet, it's just scores. We will use softmax later to convert these scores into a probability distribution that adds up to one, to 100%, to get a probability for each token.

This is one trick people found out: you actually want the LM head to have the same weights as your token embedding. It's going to be the exact same weights for this linear layer. So these weights are the token embeddings themselves: we take the last token in the context window and multiply that token embedding with the embedding of every single token in the vocabulary, like 200,000 of them. This last token, enriched with the context and processed through all of these modules, is multiplied with every single token embedding, and that creates the logits. People found out that tying the weights like this works well. So the weights of this output head are literally the same learned token embeddings.
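The tying itself is essentially one assignment; here is a tiny runnable illustration (the variable names are mine, just to show the mechanism):

```python
import torch.nn as nn

vocab_size, d_model = 200_000, 512
token_embedding = nn.Embedding(vocab_size, d_model)
lm_head = nn.Linear(d_model, vocab_size, bias=False)

# Tie the weights: the output head reuses the embedding matrix, so the last
# hidden state is effectively dot-producted with every token's embedding.
lm_head.weight = token_embedding.weight
print(lm_head.weight is token_embedding.weight)  # True
```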

And then self.apply with self._init_weights will initialize all of the weights: embedding layers, linear layers, RMSNorm, transformer blocks, all according to the rule in _init_weights, which doesn't exist here yet, so let's define it.

We check which type of module we are initializing. If it's a linear layer, we initialize its weights with a normal distribution with mean zero and standard deviation 0.02, and the bias will be all zeros, if a bias exists. And if it's an embedding rather than a linear layer, we also initialize it with a normal distribution with mean zero and standard deviation 0.02.
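That initialization rule, sketched as a standalone function with the same mean and standard deviation described above (treat it as an illustration, not the repo's exact method):

```python
import torch.nn as nn

def _init_weights(module: nn.Module) -> None:
    # Linear layers: weights ~ N(0, 0.02), bias zeroed if present.
    if isinstance(module, nn.Linear):
        nn.init.normal_(module.weight, mean=0.0, std=0.02)
        if module.bias is not None:
            nn.init.zeros_(module.bias)
    # Embeddings: same normal initialization.
    elif isinstance(module, nn.Embedding):
        nn.init.normal_(module.weight, mean=0.0, std=0.02)

# Usage: model.apply(_init_weights) walks every submodule and applies this rule.
```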

Next, let's define the forward pass through this large language model. Gemini has a good explanation here. X is just our input, which is batch by sequence, and it's actually just a list of token IDs, batch and sequence of token IDs. We will see later how we get this list of token IDs. If we put all of those IDs through the token embedding we defined just earlier, we exchange each ID for the token embedding of that token.

For example, if the token "cat" has ID 350, we first pass 350 as the ID (we will see later, we didn't code that yet), and then by passing X through the embedding we get, instead of 350, the whole token embedding for "cat". Then we multiply by the square root of the model dimension. Now that I'm looking at this, that trick was used when you are adding positional encodings to the token embedding, but we are using RoPE, so maybe we don't need it. I'll either leave it here or delete it, I'm not sure yet; let's leave it as is. It was meant to scale up the embedding vector, because when you add a position vector whose numbers are a lot larger than the embedding numbers, the position information kind of drowns out the token embedding, so you want to increase the size of the embedding numbers by multiplying. But that was for the case where you add position vectors; now we are using RoPE, rotating the token embedding itself, so maybe we don't need this. Let's leave it for now.

This position dropout on X will zero out some dimensions within the token embedding vector. Imagine you have the tokens "hello" and "world": it will randomly turn some of the dimensions within the "world" token to zero. Dropout is there to regularize, so the model doesn't get too dependent on the exact numbers of an embedding. It improves generalization.

Next, collect the auxiliary losses from the MoE layers. We initialize an empty list, and for each block we pass our input through the block, and the block returns the processed input and the auxiliary loss. If the loss is not None and return_aux_loss is true, we want the LLM to collect this auxiliary loss for each block.

Then the output projection. After we pass through all of the blocks, each block being attention plus feed forward, we generate the next token here: we first normalize X and apply dropout, and then pass through the LM head. As I explained, the LM head converts the token embedding of the last token in the sequence, in the conversation, into logits, scores over the entire vocabulary; each possible token gets a score for being the next token.

Then combine the auxiliary losses: we just sum all of them. We first collected them because we're going block by block, so we collected and then summed, if they're not None, and if we want to return them, we return the logits and the auxiliary loss. I guess we could also just sum them as we go, instead of first putting them into a list; I think that would also make sense. Or we just return the logits if we're not returning the auxiliary loss. And that's it. The logits are scores for every token in the vocabulary to be the next token.
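Pulling the forward pass together, roughly, as a method body; the attribute and config field names here are my assumptions based on what I describe above, not the repo's exact code:

```python
import math
import torch

def forward(self, x: torch.Tensor, return_aux_loss: bool = True):
    # x: (batch, seq) of token ids -> (batch, seq, d_model) embeddings
    h = self.token_embedding(x) * math.sqrt(self.config.d_model)
    h = self.position_dropout(h)

    aux_losses = []
    for block in self.blocks:
        h, aux_loss = block(h)                   # attention + MoE, plus that block's aux loss
        if aux_loss is not None and return_aux_loss:
            aux_losses.append(aux_loss)

    h = self.output_dropout(self.output_norm(h))
    logits = self.lm_head(h)                     # (batch, seq, vocab_size) scores

    if return_aux_loss:
        total_aux = sum(aux_losses) if aux_losses else None
        return logits, total_aux
    return logits
```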

We just have maybe 10 more lines of code in this file, because we finished the three big files. You need this file called __init__.py. The purpose of this __init__.py is to have a clean API, a clean way to import things into different scripts. Without it we would be importing like from models.components import ..., from models.layers import ..., and so on. But with this __init__.py we can just say from models import all of these; we don't need to spell out the dot and the file as well, so it's a lot cleaner. We can also say from models import everything, all of the classes. Here we just want to export all of our classes so we can easily use them in other files.

Let me show you what I mean. First, from components we import Expert, TopKRouter and MixtureOfExperts, and from layers we import all of our classes, although we don't have multi-head latent attention, I did not code that, so I'm going to delete it in just a second, and from llm I import MinimalLLM. Then we define this __all__ list, where we just put all of the classes that I imported, and that's it. This is how we will use these classes in different files. And multi-head latent attention doesn't exist, so I'm going to delete that.
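So models/__init__.py ends up being just a few re-exports, roughly like this (class names follow the transcript; double-check them against your own files):

```python
# models/__init__.py -- re-export the public classes for clean imports
from .components import Expert, TopKRouter, MixtureOfExperts
from .layers import RotaryPositionalEmbeddings, MultiHeadAttention, TransformerBlock
from .llm import MinimalLLM

__all__ = [
    "Expert", "TopKRouter", "MixtureOfExperts",
    "RotaryPositionalEmbeddings", "MultiHeadAttention", "TransformerBlock",
    "MinimalLLM",
]
```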

Make sure to commit everything to GitHub.

Next, besides this models folder, let's create a new folder, optimizers. I'm separating it because we're going to be doing research on this, but you can structure it differently as well. Inside we're going to have __init__.py and muon.py.

We're going to define the Muon optimizer ourselves, and Adam we will just use from PyTorch; as I said, you want to use everything you have in PyTorch, those are very fast, the best implementations, unless you want to code it yourself and change it. Now, I don't know if there is a single video on all of YouTube about the Muon optimizer that explains it well. If you go to my channel and search for Muon, I have videos about it, but I don't think there is a full course on the Muon optimizer anywhere on YouTube. You can also search YouTube for "Muon optimizer"; maybe this one is actually good, yes, this one is good, though it's maybe more complex if you want to watch it. I'm going to try to

summarize Muon as well. In muon.py we import torch and torch.nn.functional as F, and then we torch.compile whatever we define here. So zeropower_via_newtonschulz5, that's how Keller Jordan names it; this is the tensor G that we pass through this function, plus five steps, and we'll see what those steps are, and it returns a tensor, we'll see what it returns. It's the Newton-Schulz iteration to compute the zeroth-power orthogonalization of G. It's crazy; if you do a Google search for the Muon optimizer, my videos show up there as well.

The main idea behind the Muon optimizer: you have a matrix of weights in a neural network, weights for the first neuron, weights for the second neuron, the third, and so on. When you multiply the input with these weights, you get some hidden layer, some neurons, some activations. I'm just explaining neural networks here: the input times the first row of weights gives you the first neuron, the input times the second row gives you the second neuron; there is maybe also an activation function just before the neuron, but that's not so important. What's important is that if you look at each row of this weight matrix separately as a vector, and those vectors are perpendicular, 90 degrees apart, then neural networks learn faster with less data.

Why? I don't know if anybody knows; maybe the inventors have some idea, I'm not sure, I haven't read about it. I'm trying to keep up with the Muon optimizer, and I haven't understood why neural networks learn more with less data when this rule of perpendicular rows (or columns, it can be either) holds for the weight matrix. So with the Muon optimizer we're trying to keep the rows of the weight matrix perpendicular to each other. Looking at the rows as vectors, we keep the matrix orthonormal, which means that either the rows are orthogonal to each other or the columns are orthogonal to each other, and the rows or columns are normalized.

The way we keep it like that is by working on the update matrices. You know there is this update matrix, which comes from the gradients; this is classic backpropagation, and I showed you my course on backpropagation from scratch. When the neural network does backpropagation, there is this update matrix: for each weight, how much to add or subtract from that weight to make the loss go down. The update matrix gets multiplied by the learning rate and subtracted from the current weights of the model to adjust them. So again, the update matrix is made from gradients, it's how to change the weights, what to subtract from them to make the loss go down, and it has the same shape as the weight matrix.

So we're going to make these update matrices orthonormal, and then as you keep adding them, the weights themselves stay orthonormal, because all of the matrices that built the weights are orthonormal.

But we will only make these update matrices approximately orthonormal, because true orthonormalization would involve calculating the singular values, a singular value decomposition, which is very expensive and slow. So we approximate the singular value decomposition with something that, to me, looks like a polynomial. Let's see.

First of all, we apply Muon only to 2D matrices. It's made for 2D weight matrices, not for embedding vectors or normalization parameters, which are not 2D. And as I said, we have these coefficients here, and this is the first reason it reminds me of a polynomial; you'll see later that these are the coefficients in something that looks like a polynomial.

We convert G to a half-precision data type, because this algorithm is stable enough to run in 16-bit, and it provides a lot of speed-up compared to FP32, for example.

Then we check whether the matrix is tall, whether it has more rows than columns. If it's tall, we want to make it wide, so we transpose it; the algorithm works better, or is standardized, on matrices where the number of rows is less than or equal to the number of columns.

In the next line we have normalization: X is equal to X divided by X.norm over the last two dimensions, plus a small epsilon to avoid dividing by zero.

Then come the Newton-Schulz iterations, which are our polynomial approximation of the singular value decomposition. This is what makes the matrix orthonormal, and it will diverge if the spectral norm — the largest singular value of the matrix — is too high, i.e. larger than a certain value. So we divide by the norm first to make sure the matrix is small enough that the iteration converges, or rather that the spectral norm is small enough, I should say.

The dimensions -2 and -1 are the rows and columns, and they make sure we compute the Frobenius norm of each matrix individually: if there are multiple matrices, one norm per matrix; if there is just one matrix, the norm of that one matrix.

If we have a single 2D matrix, the standard case, the norm is computed by squaring all of the values, summing them, and taking the square root. The norm along these two dimensions gives the same result. But if we have multiple matrices, a plain norm would add up all of the numbers of both matrices. By specifying the dimensions like this, it computes the norm separately over the rows and columns of each matrix, so with two matrices we get two norms. We also pass keepdim=True — let's see what that does.

Without it, the result is a 1D tensor: the first norm for the first matrix, the second norm for the second matrix. But our input had shape (2, 2, 2), and the result is just of size 2. You cannot divide a (2, 2, 2) tensor by a tensor of size 2; you would need to reshape it to (2, 1, 1). We don't need to reshape it if we just preserve all three dimensions, so we say keepdim=True and the result keeps a shape like (2, 1, 1) — one norm per matrix. Now PyTorch can broadcast the division: the first matrix is divided by the first norm and the second matrix by the second norm.

After this normalization and stabilization, we come to the polynomial approximation I was talking about. I'll just show you the lines of code.

So we do some number of iterations, and every iteration we make the matrix — sorry, I should say X, the matrix X that is our update matrix — more and more orthonormal. With five iterations it's more orthonormal; with ten, even more. The question is how many iterations are enough? We don't want to waste compute, and this is something I'm also testing in the paper we are writing here: we're going to try different numbers of steps — 10 steps, five steps, two steps, etc. The default is five steps.

So look at this: A is X matrix-multiplied by X transposed. If X were perfectly orthogonal, A would be the identity matrix. Let me first show you the other lines of code and then we will discuss.

This is why I say it looks like a polynomial. If you look at A, you have something that resembles A squared (which is just a matrix multiply in this case) times one coefficient, and then A to the first power times another coefficient — that's B. And then, a bit unusually, we have B * X, but also just X times the third coefficient. That's why it reminds me of a polynomial, and it literally is one: it computes the polynomial using the coefficients b and c, calculating terms based on X·Xᵀ and (X·Xᵀ) squared.

So that's this line. In the next line it combines the original X with the polynomial corrections. Look at this second part: you have X, the update matrix, and you are transforming it to be more orthogonal with this B term, but you are also keeping some of the old X through the coefficient a, which is just a scalar. So it's a bit of the old X plus a bit of the new X, and after every iteration the matrix becomes more and more orthonormal.

This is very fast to do. Look how fast it is: just additions and matrix multiplications, which are very fast on the GPU, as opposed to a slow singular value decomposition, which, I'm pretty sure, has to run on the CPU.

Okay, this part is a bit tough: B constructs the coefficients for the X³ and X⁵ terms. I guess I need to make a full video, or figure out a better way to explain Muon and to understand it better myself, but as I said, what this does is make the gradient update matrix more orthonormal with each iteration.

And if we transposed the matrix earlier, we transpose it back here. Then we just return the matrix, which is now approximately orthogonalized.
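For reference, here is a minimal sketch of this Newton-Schulz orthogonalization routine. The function name and the coefficient values come from the public Muon reference implementation, not from this repo, and the video mentions FP16 while the reference uses bfloat16 — treat the details as assumptions:

```python
import torch

def zeropower_via_newtonschulz5(G: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    # Quintic Newton-Schulz coefficients (values from the public Muon reference)
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G.to(torch.bfloat16)                 # half precision is stable enough and much faster
    transposed = G.size(-2) > G.size(-1)
    if transposed:                           # make the matrix wide (rows <= columns)
        X = X.mT
    # Normalize so the spectral norm is small enough for the iteration to converge
    X = X / (X.norm(dim=(-2, -1), keepdim=True) + eps)
    for _ in range(steps):
        A = X @ X.mT                         # would be the identity if X were perfectly orthogonal
        B = b * A + c * (A @ A)              # polynomial in A = X X^T
        X = a * X + B @ X                    # a bit of the old X plus the polynomial correction
    if transposed:
        X = X.mT
    return X
```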

Then we define the class Muon, the new optimizer, inheriting from torch.optim.Optimizer — momentum orthogonalized by Newton-Schulz, that's what we are doing. In its init we take the parameters of the model and a learning rate that is quite high compared to Adam, which usually uses something around 0.001, roughly 20 times smaller than this.

We're going to be checking this momentum value; I'm going to be doing ablations on it. 0.9 I find works very well in the experiments — we'll see later. Then the number of Newton-Schulz iterations, five by default, and whether or not to use Nesterov momentum, which we'll also get to later.

Then defaults is a dictionary: it just packages the hyperparameters, and we initialize the superclass with it.
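As a rough sketch (argument names and default values are assumptions, not copied from the repo), the constructor looks something like this:

```python
import torch

class Muon(torch.optim.Optimizer):
    """Momentum orthogonalized by Newton-Schulz (minimal sketch)."""

    def __init__(self, params, lr=0.02, momentum=0.95, nesterov=True, ns_steps=5):
        # Package the hyperparameters and hand them to the base Optimizer
        defaults = dict(lr=lr, momentum=momentum, nesterov=nesterov, ns_steps=ns_steps)
        super().__init__(params, defaults)
```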

Next, @torch.no_grad(), and then we define the step method. The reason we don't want torch to track gradients here is that we are not training any weights inside the optimizer itself. We use this optimizer to train the neural network, but we're not computing gradients for, or training, the step function itself; it's just used to update the other weights.

Then we loop through the parameters and check whether each parameter has a gradient. If it doesn't, we just continue; otherwise we get the gradient. Here, the optimizer state for that parameter holds the momentum saved for it. If there is no momentum buffer in the state yet, we initialize it with zeros, with the same shape as the gradients of the parameter we are currently processing. Then we store this momentum buffer in buff, so it's either zeros or whatever we had previously.

For the next line we have buff.lerp_, linear interpolation: we pass in the gradients and 1 minus the momentum, and it updates the buffer in place. Because we use the in-place lerp_, it's like saying buffer = buffer * momentum + gradient * (1 - momentum). Remember that our momentum is 0.95, very close to one, so 95% of the old buffer remains: the new buffer is 95% old buffer and just 5% new gradients. The gradients therefore don't change the buffer too much, which provides stability, because individual gradients can be a bit random. We don't want the gradients to change things too much; we want the buffer to change bit by bit. This creates an exponential moving average of the gradients.

The reason we need a moving average is this: imagine the gradients are, on average, pointing in some direction on the loss surface that reduces the loss, but each individual gradient also points in a somewhat random direction. On average they point the right way, so averaging the gradients is helpful. The buffer starts from zero and then, gradient by gradient, settles into this exponential moving average of the gradients.

Then we check whether we're using Nesterov momentum. If nesterov is False, the update uses standard momentum: the gradient passed onward is simply the buffer itself, which then goes through Newton-Schulz. So that buffer is literally the weight update matrix — again, just the exponential moving average of the gradients.

If we do have Nesterov, something a bit more complex happens. The difference is that standard momentum carries only about 5% of the current gradient, while Nesterov momentum carries almost 10%. You can see that at this point we also linearly interpolate the gradient with the buffer: we take the buffer from the previous step, multiply it by 0.95, and give even more weight to the current gradient. So in the end, standard momentum has only about 5% of the current gradient and Nesterov has almost 10%. This makes Nesterov momentum "look ahead" at a future spot: by going from 5% of the current gradient to almost 10%, we are simulating the correction we would have made if we had already taken the next step and computed the slope there. There is a good video on Nesterov momentum if you want more detail.
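As a rough sketch of what the momentum and Nesterov logic inside the step looks like (variable names assumed; this follows the public Muon reference rather than this repo verbatim):

```python
# Inside Muon.step(), per parameter p with gradient g
state = self.state[p]
if "momentum_buffer" not in state:
    state["momentum_buffer"] = torch.zeros_like(g)
buf = state["momentum_buffer"]

# Exponential moving average: buf = momentum * buf + (1 - momentum) * g
buf.lerp_(g, 1 - momentum)

# Standard momentum passes the buffer on; Nesterov blends in extra current gradient
g = g.lerp_(buf, momentum) if nesterov else buf
```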

Then in the next line we make this gradient update matrix more orthonormal with zeropower_via_newtonschulz, and with just one more line, p.add_, we have the actual update of the weights. It looks a little complicated, because it's designed to work depending on whether the update matrix is tall or wide — whether it has more rows or more columns.

This part checks whether our matrix is square, tall, or wide. If the matrix is tall — more rows than columns — we want to scale the update up, so this number will be larger. If you look at it, it's rows over columns, and we take the maximum of one and rows over columns, so the ratio only exceeds one when there are more rows than columns. Then we take the square root of that number. The exact form isn't so important; I'm just trying to say that if we have more rows than columns this factor is larger than one, otherwise it is one.

Tall update matrices make the neural network learn slower, due to what you could call energy dilution. So our shape correction for tall matrices will be greater than one: we multiply the learning rate by the shape correction, making the effective learning rate larger for tall matrices because their updates are diluted. If we didn't, they would cause the network to learn slower than with a square or wide update matrix. Explaining exactly why tall update matrices slow down learning would take too much time for this video, so for now you would have to ask something like Gemini and study it yourself, but I'll explain it in the future.

Looking at our final formula, we have our weights, which are P, the parameters.

In our code we have p.add_: we are adding something to P, namely minus the learning rate times the orthogonalized gradient. This is important — we are literally modifying P in place. We add a negative learning rate times G, which means subtracting: G is our orthogonalized gradient, forced to the same 2D shape as P, and it gets multiplied by minus the learning rate and added to — in this case subtracted from — P itself.

So we will have something like this.
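Roughly like this, as a sketch (the exact scaling and variable names in the repo may differ):

```python
# Final weight update inside Muon.step()
g = zeropower_via_newtonschulz5(g, steps=ns_steps)       # orthogonalize the update matrix
# Boost the step for tall matrices to counteract the "energy dilution"
shape_correction = max(1.0, p.size(-2) / p.size(-1)) ** 0.5
p.add_(g.view_as(p).type_as(p), alpha=-lr * shape_correction)  # in-place: p -= lr * scale * g
```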

So that's the Muon optimizer. I'm going to commit this to GitHub and move on to the next thing. I'll also create an __init__.py here in the optimizers folder, and inside it I'll import these two classes from muon — Muon and the zeropower Newton-Schulz function — and export them, so it's easy to import straight from the optimizers module. I can say from optimizers import Muon instead of from optimizers.muon import Muon, and we can easily add more optimizers later and export everything from here. It keeps things very clean: we import everything we need just from optimizers.
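The export file is tiny; a sketch of it (the exact function name is assumed from the public Muon reference):

```python
# optimizers/__init__.py
from .muon import Muon, zeropower_via_newtonschulz5

__all__ = ["Muon", "zeropower_via_newtonschulz5"]
```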

Then let's make a new file outside of everything — train_e.py, or train.py, or whatever. Here we write the code for training the whole neural network, and we will run python train.py when we want to train the large language model.

Let's start with some imports: time, os, torch, logging, and from torch.utils.data import DataLoader for loading the data. We also need to set this environment variable, because otherwise the console fills with warnings about the tokenizing process being forked while we're training the large language model; setting it avoids the tokenizer-parallelism issue, or something along those lines — I'm not 100% sure. Maybe it matters when you're tokenizing with multiple processes or GPUs.
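If you're following along, the setting being referred to is most likely the Hugging Face tokenizers parallelism flag — an assumption on my part, since the variable isn't named in the video:

```python
import os

# Silence the "huggingface/tokenizers: the current process just got forked" warnings
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```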

Okay: from configs.moe_config — we don't have this yet, but we will — import the model config, which we'll create soon. From configs.dataset_config import the data config. From data.loader import prepare_lm_dataset, which we will also write and call here. From training.trainer import train_model, from utils.helpers import set_seed, and from utils.logger import setup_logging.

We can print some system info if you want: the device we're using (CUDA or CPU), and, if CUDA is available, which GPU it is and how much memory it has — this reads the first GPU, which, if you only have one, is your one GPU. I also print the torch version. Then I set up the logger to write into a logs folder, which it creates automatically, and log "starting MoE training" — just some information. print_system_info is the function we just made; this runs in main when training starts. Then set_seed(42), because we want consistent seeds so we can reproduce experiments.

Then the model config — this is what we import and will create soon; it gives us all of the configuration for our model. Next we need the dataset, downloaded from Hugging Face. I'm going to use the HuggingFaceTB SmolLM corpus, which is good for training a small LLM; there are three datasets in that repository, and this is the one we will use. We will also use the matching tokenizer, because it's made for this dataset.

Now we set the sequence length from the config that we'll create later, and the number of documents from the config as well — how many we want to download. But we will stream the download as we train; we will not download the entire dataset at once. Then there's a cache directory for the Hugging Face downloads. Make sure this cache directory is in your .gitignore — it isn't yet, so I'll add it — because you don't want cache directories committed to GitHub.

We split the documents before tokenization to prevent data leakage — we're splitting between training and evaluation data. So from datasets import load_dataset (I told you we would install datasets via requirements), and raw_dataset is just a call to load_dataset, passing the dataset path we defined in the data config, the dataset name, and the split from the config. Do we have that split field? Maybe we'll add it later — I'm not sure why I don't have it here. Then the cache directory and streaming=True: we want it to download as it needs, because we don't want to download the entire dataset when we will only train on a very small part of it.

Now we take a number of samples, maybe 10,000 documents, from this raw dataset. Since it's streaming, it streams those 10,000 documents and we load them into memory once they have all arrived. Then 10% of that number of documents becomes the validation dataset, and the remainder is the number of training documents. From datasets import Dataset, and the raw training set is built from the list of raw samples, from index zero up to the number of training documents; validation is the same, from the number of training documents until the end. And we log that we split the data into training and validation, along with the document counts.
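A minimal sketch of that streaming load and split, assuming the config field names (they may differ in the repo):

```python
from datasets import load_dataset, Dataset

# Stream the corpus and materialize only the documents we need
raw = load_dataset(data_config.dataset_path, data_config.dataset_name,
                   split="train", streaming=True, cache_dir=".cache")
raw_samples = list(raw.take(data_config.num_samples))   # e.g. 10,000 documents

num_val = int(0.1 * len(raw_samples))                    # 10% held out for validation
num_train = len(raw_samples) - num_val
raw_train = Dataset.from_list(raw_samples[:num_train])
raw_val = Dataset.from_list(raw_samples[num_train:])
```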

From data.loader import setup_tokenizer, tokenize_and_chunk, and finalize_dataset. The tokenizer is setup_tokenizer — a function we will write — called with the data config, and then config.vocab_size is set to the tokenizer's vocab size. Now we tokenize the training set with tokenize_and_chunk from data.loader, which we will also define, passing in the documents, the tokenizer, and the config, followed by finalize_dataset — we'll create that later too; it does some processing on the dataset. Tokenizing the validation set is the same. And now that we have the tokens themselves, we can log the number of tokens for both.

This part is about data loading: from RAM, the CPU processes the data, turns it into tensors, and sends it to the GPU, which performs the matrix multiplications. Batch size is how many of these chunks of data are sent and processed independently; better GPUs can take more. num_workers is the number of background CPU processes loading data and feeding it to the GPU: while the GPU is processing one batch, the next batch (or two, with two workers) is already being converted and loaded by the CPU. persistent_workers controls what happens once a worker has looped through the entire dataset: if workers are not persistent, those processes get torn down and have to be started again for the next epoch over the same dataset. We keep them alive — there's no reason to restart them when we're doing multiple passes over the same data. And pin_memory puts the data in pinned memory for faster access by the GPU.

Then the train loader and validation loader: these load the dataset, and we shuffle the documents so the model doesn't train on some predictable ordering. Maybe the first part of the dataset is all medical documents, so it would only learn medical and legal text while the rest of the dataset, with math and other material, comes much later. So we shuffle everything to randomize the document order.
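A sketch of the two loaders (batch size and worker counts here are illustrative, not the repo's values):

```python
from torch.utils.data import DataLoader

train_loader = DataLoader(train_dataset, batch_size=24, shuffle=True,
                          num_workers=2, persistent_workers=True, pin_memory=True)
val_loader = DataLoader(val_dataset, batch_size=24, shuffle=False,
                        num_workers=2, persistent_workers=True, pin_memory=True)
```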

This just prints information about the experts and so on — a bunch of prints. Then we start training: we can measure the time and call train_model, which we imported from training.trainer and still need to define; we haven't coded it yet. After we get back the trained model and metrics, we can measure the elapsed time.

Then "training complete" and the results — I'll go quickly through a bunch of prints here: validation perplexity, loss, accuracy. This is where we save the checkpoint. It will create a checkpoints folder, and make sure that folder is in the .gitignore — it isn't yet, so I'm adding it — because we don't want to commit checkpoints to GitHub. Then torch.save, where we structure all of this information so it's easier to load later, and we print where the checkpoint path is. And that's the main function: when we run this with Python, it triggers main, which loads the dataset, trains the model, and that's it for this file.

Let's create a new folder, configs, and inside it an __init__.py, and next to that dataset_config.py and moe_config.py. Let's start with the dataset config. We import dataclass from dataclasses, Optional, Callable, and Union from typing, and logging; we get a logger from the logging module and then use the @dataclass decorator to define DataConfig.

@dataclass automatically generates some boilerplate for classes that store data, so you can later use methods that are handy for working with the data. I'll go through this quickly: dataset_path, which is our Hugging Face path; dataset_name; split (train); tokenizer — I think we'll end up using a different tokenizer, but this is the default; trust_remote_code=False. None of this is so important.

Let's go quickly: we have some defaults for sequence length and stride. When you're training a large language model, you train on one window of tokens and then move on to the next — you move by the whole sequence length to get the next tokens. I had a bug where I trained on a window, then moved by just one token and trained on the next, overlapping window, and so on; my loss came out at 0.001, which is not good. That's how you know your loss is too low, and that you have some bug. So when you train: you train on one sequence, move by the entire sequence of tokens, and train on the next sequence in the data. That's your stride, and it's equal to the sequence length.

Then num_samples — not so important. Let me go through the rest quickly; it's very easy. There are optional parameters: streaming=True, which we already talked about, and save_to_disk / load_from_disk if you want to save the processed dataset to disk so you don't have to reprocess it every time you run training — I believe we'll set this to true. Then we have some validation checks; maybe not all of them are necessary, and I think you can just copy them from my GitHub: dataset_path must not be empty, stride must be positive, strings must be non-empty and not just whitespace, and so on. You don't need to write all of these if you don't want to, although maybe it's good. And that's it: we just have this config dataclass.
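A minimal sketch of such a config dataclass — field names, defaults, and the dataset/tokenizer strings here are illustrative placeholders, not copied from the repo:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DataConfig:
    dataset_path: str = "HuggingFaceTB/smollm-corpus"   # assumed Hugging Face path
    dataset_name: Optional[str] = None                  # one of the subsets in that repo
    split: str = "train"
    tokenizer_name: str = "HuggingFaceTB/SmolLM-135M"   # assumed SmolLM tokenizer
    seq_length: int = 512
    stride: Optional[int] = None                         # defaults to seq_length (non-overlapping)
    num_samples: Optional[int] = 10_000
    streaming: bool = True

    def __post_init__(self):
        if not self.dataset_path.strip():
            raise ValueError("dataset_path must not be empty")
        if self.stride is None:
            self.stride = self.seq_length
```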

Now let's go to moe_config.py. This is just the config for our large language model. Again we import dataclass and the typing helpers and define the dataclass: the model dimension, which is small enough to run on the free Google Colab GPU (I'm pretty sure; we'll check later); the number of heads; the number of layers, which are the large language model's transformer layers. The hidden dimension should be about four times the model dimension — is it four times here? I think it's more than four times, but that's okay; we can experiment with different values, it doesn't have to be exactly four. Let's keep it like that. Some of the fields here are not used in our large language model.

So maybe you don't need to write them at all — these are for DeepSeek, and this one as well, so you could drop these four, or leave them there if you want. As I said, it's not great when I don't know what a variable does; I think this one is also for DeepSeek, because that part was contributed by someone else who added DeepSeek-style latent attention, which is why I'm not sure what it is.

Batch size 24. You usually want to keep these numbers powers of two, but sometimes you just don't have enough GPU memory — maybe that applies to the batch size here too. Maximum steps of 1,000 is maybe on the low side, but it depends: you might do 10 steps just to see if you have any bugs, then maybe 200 to 500 to test a bunch of different experiments, and then 1,000 to 2,000 to test the best of those experiments more deeply. Put 10 here just to check for bugs when you're running the LLM training for the first time.

Gradient accumulation simulates a larger batch size: if you can't fit many batches into memory, you run the forward and backward pass several times and accumulate the gradients without doing the optimizer step, as if you had a larger number of samples. The Muon learning rate here is from the experiments I did; momentum 0.9 — remember the default is 0.95, but through the experiments we will do later in this research we find that for our setup 0.9 works best. The Adam learning rate is also from those experiments. Max sequence length 512, then the number of documents and maximum tokens we want to load from the dataset, and evaluate every 10 steps — maybe you can increase this.

Wait — evaluate every 100 steps? I'm not sure what this one is; I forgot about it, and we may not even be using it. Okay: weight decay, which is just regularization. Use automatic mixed precision: True — let PyTorch speed up the training. Then vocab size and the logging milestones, which are just for logging.

Then the MoE-specific settings: the number of experts, choosing the top-2 experts, and the load-balancing weight, which we talked about in the first part of the video. When we initialize, we just need a check on the head dimension: the model dimension (which is also the size of the token embedding vector) must be divisible by the number of heads, and that quotient needs to equal the head dimension.
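A sketch of what that model config and its divisibility check could look like — the field names and values are illustrative, not the repo's:

```python
from dataclasses import dataclass

@dataclass
class MoEModelConfig:
    d_model: int = 384
    num_heads: int = 8
    num_layers: int = 6
    hidden_dim: int = 1536            # roughly 4x d_model
    num_experts: int = 8
    top_k: int = 2                    # route each token to the top-2 experts
    load_balancing_weight: float = 0.01

    def __post_init__(self):
        assert self.d_model % self.num_heads == 0, "d_model must be divisible by num_heads"
        self.head_dim = self.d_model // self.num_heads
```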

Now in the __init__.py we just export things nicely: import the model config and the data config from the files we just defined and re-export them, so you can import everything from configs directly instead of from the individual files. And that's it — let's commit this to GitHub. Save and commit.

Let's make a new folder, data. Inside it we want dataset.py, loader.py, streaming_dataset.py, and __init__.py.

Let's go to dataset.py. We do some imports — from torch.utils.data import Dataset — and create the class TextTokenDataset, which inherits from Dataset: a token dataset with a configurable stride for creating training windows. Its arguments are tokens (a list of token ids), the sequence length, and the stride, which controls how we step through all of the tokens: we take this sequence, then this sequence, then the next. The stride will usually be equal to the sequence length, but it doesn't have to be; here we make it adjustable.

As I said, the stride is by default the sequence length, which gives non-overlapping windows: you take one set of tokens, then the next, then the next. If the stride is half the sequence length, there is 50% overlap between consecutive windows. If the stride is one, that's maximum overlap — you move one token at a time, which is exactly the bug I had, although depending on what you're doing you might actually want that. So we initialize this class with the tokens, sequence length, and stride I just explained and store them; the stride defaults to the sequence length if it's not given. Then we calculate the number of samples based on the stride: how many full sequence-length windows we can get from the available tokens.

Then __len__ just returns the number of samples, and __getitem__ takes an index and computes the starting position as index times self.stride. When we have a sequence and we're training the large language model, we always want to predict the next token. X is the window from the starting index up to starting index plus sequence length, not including the last token — that's the data we have. What we're trying to predict is Y, which runs from the second token (starting index plus one) until the end of the window. So imagine our data is this sequence minus its last token, and the Y we're trying to predict is the same sequence shifted by one token. That means: given just the first token in X we try to predict the first token in Y; given the first and second tokens in X we try to predict the second token in Y; given the first three, the third in Y. So you see, we're always trying to predict the next token, which sits at the same index as the last token in X that we can see: we look at all the tokens so far and predict the next one. And that's actually the entire dataset class we need — TextTokenDataset, which splits our tokens into sequences.
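Here is a minimal sketch of such a class, written to match the description above (the exact slicing and names in the repo may differ slightly):

```python
import torch
from torch.utils.data import Dataset

class TextTokenDataset(Dataset):
    """Token dataset with a configurable stride for building training windows."""

    def __init__(self, tokens, seq_length, stride=None):
        self.tokens = tokens
        self.seq_length = seq_length
        self.stride = stride if stride is not None else seq_length  # non-overlapping by default
        # How many full windows fit given the stride and the available tokens
        self.num_samples = max(0, (len(tokens) - seq_length) // self.stride + 1)

    def __len__(self):
        return self.num_samples

    def __getitem__(self, idx):
        start = idx * self.stride
        chunk = self.tokens[start : start + self.seq_length]
        x = torch.tensor(chunk[:-1], dtype=torch.long)   # input: the window minus its last token
        y = torch.tensor(chunk[1:], dtype=torch.long)    # target: the same window shifted by one
        return x, y
```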

Now let's go to loader.py. We do some imports — from __future__ import annotations and a few others; I'll go through them quickly, and you can just copy them from the repo. We get the logger and then write setup_tokenizer: it loads the tokenizer given the tokenizer name — from Hugging Face, not GitHub — and we'll usually use that SmolLM tokenizer. We also set the pad token to be equal to the end-of-sequence token, which is a common convention, and then return the tokenizer. Next, loading the raw dataset.
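A minimal sketch of that tokenizer setup (the config field name is an assumption):

```python
from transformers import AutoTokenizer

def setup_tokenizer(config):
    tokenizer = AutoTokenizer.from_pretrained(config.tokenizer_name)
    # Common convention: reuse the EOS token as the padding token if none is set
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    return tokenizer
```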

This function can load from disk — if we previously processed or downloaded the dataset and saved it — and if it's streaming, we load it a bit differently. So: if we want to load from disk, we load from disk; otherwise we can load the dataset from Hugging Face, which we already talked about, so I can go quickly here. If we give a wrong name for the dataset, it may raise an error. Then we just return that dataset.

Next: apply sampling and filters. The definition is a bit longer here. Whether it's streaming depends on whether we have an IterableDataset — that's how we know it's streaming, because it's just an iterator we can materialize. If config.num_samples is set and the dataset is streaming, we take the first num_samples with take — say, five samples — which materializes that iterator. Otherwise, if it's not streaming, we take the minimum of num_samples and the length of the entire dataset. If the dataset actually has fewer samples than we asked for, it logs a warning that there aren't enough samples and we just take that many. Then there are some preprocessing functions, which we apply if any are defined in the config, and likewise some filtering functions.

Sorry guys, I made a mistake here: this file is unnecessarily complex. I worked on it together with some other people, so I think you should just copy-paste it — there's no point in explaining all of it, and it's not that useful for your research. Next time I'll make it a lot simpler; that was my bad. That's loader.py. In the data __init__.py, I just import the classes we defined and export them the same way as before.

Now let's create a new folder utils, and inside it helpers.py and, next to that, logger.py. You can copy-paste my helpers, or code them yourself if you want, but I'm just going to copy them from the GitHub into utils/helpers.py — careful, that part goes into helpers. It's just set_seed and count_parameters. Then let's go back to the logger and copy that one as well. This isn't so important for your research; you can just copy it, maybe tweak it a little with AI, but you don't really need to understand it.
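For reference, a typical set_seed helper looks roughly like this (a sketch, not necessarily the exact code in the repo):

```python
import random
import numpy as np
import torch

def set_seed(seed: int = 42):
    # Seed every RNG we rely on so experiments are reproducible
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
```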

Then let's create a new folder, training.

Inside it, let's create trainer.py and evaluation.py — maybe it could be called evaluator, but I named it evaluation, so let's keep that name — and an __init__.py.

Guys, let's be fast here; there is a lot of boilerplate code, and I'm not sure whether to encourage you to write it by hand or to copy-paste it. There are a bunch of imports — so many, from everything we've defined — that you can just copy. This trainer is for training the LLM. EarlyStopping is a class: if the model keeps training and suddenly the evaluation loss starts to increase, we just stop the training, because it's a sign the model is starting to overfit on the data or something.

We'll just have a bunch of this kind of stuff. I don't think it's so important for your research; it's the sort of class you would usually copy-paste, and I don't know if I want to encourage you to spend time understanding it. What it does is track the best loss, and if the loss is worse — no new best loss — three times in a row (or whatever the patience is), it decides the loss is not improving and stops the training, which we will use later.
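A minimal sketch of that early-stopping logic (names assumed):

```python
class EarlyStopping:
    """Stop training when validation loss hasn't improved for `patience` evaluations."""

    def __init__(self, patience: int = 3):
        self.patience = patience
        self.best_loss = float("inf")
        self.counter = 0

    def step(self, val_loss: float) -> bool:
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.counter = 0
        else:
            self.counter += 1                     # no new best loss this time
        return self.counter >= self.patience      # True -> stop training
```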

So there's a counter, and we check if it's larger than the patience. Then we set up the Muon optimizer. You know from when we coded Muon that we want to give the 2D matrices to Muon; we don't want the normalization parameters or the token embeddings in Muon, only the 2D weight matrices. Those will be the Muon parameters, and the other ones will be the Adam parameters.

You can print how many parameters go to Adam and how many to Muon, and now we initialize the Muon optimizer we coded as well as Adam — torch.optim.AdamW. You can check my video "Adam optimizer explained step by step", linked below the video, to learn the math and theory behind Adam. Let's continue: we just return these two optimizers in a list.
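A sketch of that parameter split and optimizer setup — the name-based filter is my assumption about how the repo tells embeddings and norms apart from the 2D weight matrices, and the learning rates are illustrative:

```python
import torch

def setup_optimizers(model, muon_lr=0.02, adam_lr=0.001):
    # 2D weight matrices go to Muon; embeddings, norms, and 1D params go to AdamW
    muon_params, adam_params = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        if p.ndim == 2 and "embed" not in name and "norm" not in name:
            muon_params.append(p)
        else:
            adam_params.append(p)
    return [Muon(muon_params, lr=muon_lr, momentum=0.95),
            torch.optim.AdamW(adam_params, lr=adam_lr)]
```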

Then train_model takes all of the parameters we talked about, plus the scheduling for the learning rate. We want the learning rate to start low and then rise quickly: in the beginning the updates are a bit random because the model doesn't know anything yet, but the large language model learns very quickly early on, so we want to reach the maximum learning rate early. After that we slowly reduce the learning rate for the rest of training, because as the model trains more it already knows a lot, so we don't want to keep changing the weights heavily; we want to be increasingly careful and conservative about how we change them toward the end of training.

I've already talked about most of this; the docstring describes all of the inputs if you want to read it. The output is the model plus the final metrics and the metrics history. I'll go quickly through this since we've mostly covered it: this is what trains the model.

We can draw this progress bar in the console if we want, and we're not using early stopping here. By the way, I don't know if I should still explain everything step by step; maybe some people will find it boring. Maybe I should, maybe I shouldn't — tell me below. I don't know who's even watching at this point.

So, as long as we are still training — as long as we haven't reached the maximum number of steps — we keep feeding in data. We have the attention mask, and we define a few things here, then push the tensors — the batches containing our actual tokens and data — to the GPU. Then the forward pass runs under automatic mixed precision. We have the cross-entropy loss and the shifted labels — this is the X and Y I was explaining. We first make sure the logits are contiguous in memory; the logits are the scores the LLM assigns to every token in the vocabulary for which token it thinks should come next.

We also need to divide the total loss by the number of gradient accumulation steps, and if we have the auxiliary loss — the mixture-of-experts load-balancing loss — we add it to the total loss. Then we can call backward; there are two backward paths here, and which one runs depends not on the auxiliary loss but on whether we are using automatic mixed precision (the auxiliary loss is included in both). When using mixed precision, make sure the loss goes through the gradient scaler when you do backpropagation.
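A rough sketch of one micro-step of this loop, assuming the model returns logits plus the MoE auxiliary loss (variable names and the model's return signature are assumptions):

```python
import torch
import torch.nn.functional as F

scaler = torch.cuda.amp.GradScaler()

with torch.autocast(device_type="cuda"):
    logits, aux_loss = model(input_ids)                  # aux_loss = MoE load-balancing loss
    loss = F.cross_entropy(
        logits.contiguous().view(-1, logits.size(-1)),   # scores over the vocabulary
        labels.contiguous().view(-1),                    # targets shifted by one token
    )
    loss = loss / grad_accum_steps                       # average over accumulation steps
    total_loss = loss + aux_loss

scaler.scale(total_loss).backward()                      # scaled backward pass under AMP
```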

Then the optimizer step: we use both the Muon and the Adam optimizer, so for each of the optimizers we unscale the gradients, do the step, and zero the gradients. Zeroing gradients is covered in my AI research from scratch course and in the neural network from scratch video.
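Sketched out, that dual-optimizer step could look like this (assuming the AMP GradScaler from the previous snippet; the exact ordering in the repo may differ):

```python
# Step both optimizers, then clear their gradients
for opt in optimizers:                 # e.g. [muon_optimizer, adamw_optimizer]
    scaler.unscale_(opt)               # undo the AMP loss scaling before the update
    scaler.step(opt)
    opt.zero_grad(set_to_none=True)
scaler.update()
```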

Then we log what happened — loss, accuracy and so on — and we can evaluate as well, print whatever we evaluated, move on to the next step, and update the progress bar in the console every 20th step. After training there is a final evaluation of the model. Remember that we are doing cosine decay on the learning rate, so we need to read out the current learning rate, and the total time; we just print a bunch of stuff here, save the metrics into the output directory, and call plot_training_metrics, which is just about plotting and saving. Let me go quickly: this is just drawing plots, and I think you can copy-paste it, because you don't need it for the research. Then train_model.

We initialize the model, set the seed, and call everything we set up above. We print the setup, create the optimizers and schedulers — there is a warm-up for the learning rate, some percentage like 5% or 10% of all of the steps — and we write the scheduling logic here. This is just the learning-rate schedule logic.
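A minimal sketch of a warmup-plus-cosine-decay schedule like the one described (the fractions are illustrative):

```python
import math

def get_lr(step, max_steps, max_lr, warmup_frac=0.1, min_lr_ratio=0.1):
    """Linear warmup to max_lr, then cosine decay for the rest of training."""
    warmup_steps = int(warmup_frac * max_steps)
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps          # ramp up quickly at the start
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    cosine = 0.5 * (1 + math.cos(math.pi * progress))      # goes from 1 to 0 over training
    return max_lr * (min_lr_ratio + (1 - min_lr_ratio) * cosine)
```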

Then we call train_model, and that's it.

That's it for the trainer. Again, guys, tell me in the comments below whether I should explain everything. I can, but I don't know if it's useful or if people care that much about every step; I'm thinking I should just explain the research parts and not this boilerplate.

Then in evaluation.py we just evaluate the model. There are a bunch of imports; we initialize a few things before we evaluate, and we don't need gradients. We get the data, use autocast on CUDA so the float precision is managed automatically, and get the logits from the model, since the model generates the logits. And that's it.
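A sketch of what such an evaluation function typically looks like — the model's return signature and the metric names are assumptions:

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def evaluate_model(model, val_loader, device="cuda"):
    model.eval()
    total_loss, total_tokens = 0.0, 0
    for x, y in val_loader:
        x, y = x.to(device), y.to(device)
        with torch.autocast(device_type="cuda"):
            logits, _ = model(x)                      # ignore the auxiliary loss during eval
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))
        total_loss += loss.item() * y.numel()
        total_tokens += y.numel()
    avg_loss = total_loss / total_tokens
    return {"val_loss": avg_loss, "perplexity": math.exp(avg_loss)}
```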

In __init__.py we just export all of this. Let me save everything and commit it to GitHub.

And finally, let's create the experiments folder. Inside it we can create a muon_vs_adam folder — you could also name it something like experiment_1_muon_vs_adam. Let's create a new file in that new folder: the script that runs the Adam learning-rate sweep.

The main idea here is to find the best learning rate for the Adam optimizer on our neural network. Let me go through it quickly. I set up the paths for the script and the project root for this folder, so we can import things like run_multiple_experiments — I import it here, but we will define it later. Then I print all of the learning rates we will experiment with, and these are the names of the experiments. I'll go quickly, because this just runs the experiments, prints the results and a summary, compares them, picks and prints the winner, and then runs it all. That's it.

It will use these names to look up the configs, which we define somewhere else: inside this experiment-1 muon-versus-Adam folder, let's make a new file at experiment_configs/experiment_configs.py.

That creates the new experiment_configs folder as well. Since this file is absolutely huge, I recommend you copy it; it contains all of the configs. Look how many different experiments we have — so many — so I'm just going to copy-paste it here. All of the names are here; for example, muon_step_decay has a name, a description, and an optimizer type, which we use to select the optimizer, plus maximum steps, the Muon learning rate, and the Adam learning rate. All of the configs look like that. You can actually use AI to help you write these, but you need to think about what you want to experiment with.
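For orientation, the entries in that configs file look roughly like this — field names and values are reconstructed from the description, not copied from the repo:

```python
EXPERIMENT_CONFIGS = {
    "muon_step_decay": {
        "description": "Muon with a step-decay learning rate",
        "optimizer_type": "muon",
        "max_steps": 1000,
        "muon_lr": 0.02,
        "adam_lr": 0.001,
    },
    "adam_baseline": {
        "description": "AdamW baseline",
        "optimizer_type": "adam",
        "max_steps": 1000,
        "adam_lr": 0.001,
    },
}
```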

Now let's make another new file in the experiment-1 muon-versus-Adam folder: experiment_training/experiment_trainer.py. At this point all of this code should be familiar.

I'm again just going to copy all of the code, because we explained all of it throughout the course: learning-rate scheduling, setting up the Muon optimizer, setting up the Adam optimizer, plotting. It just repeats what we already covered. Tell me below if you don't like me copy-pasting it and think I should go through it. In both of these new folders, experiment_configs and experiment_training, I'll add an __init__.py, empty for now, maybe for later use. You'll see that after you run all of these experiments you end up with a bunch of files that are just reports and plots of everything, but you don't necessarily need to generate those folders, I should say.

Next to the Adam learning-rate sweep script I'm going to create another file, a run-Adam-optimization-suite script ("suite", however you say it), which still loads and runs the configs we have. I'm going to copy-paste again because it's basically the same: it loads the configs and calls run_multiple_experiments, with the names of the experiment configs it should run, and then plots — that's it. Then we make run_experiments.py; copy-paste this as well, since we've explained all of it. It can prepare data, run a single experiment, run multiple experiments, compare experiments, and all that good stuff, and in main there are a bunch of command-line arguments defined. We also have a new file, a run-optimal-Muon-suite script, which just runs these experiments for the Muon optimizer; everything else is the same. Now I'll commit everything — save all and commit to GitHub.

Then I want to go to the extensions panel in my VS Code-type editor — I have Cursor, Antigravity, VS Code, whatever; if you have some VS Code-type editor you can install the Google Colab extension. But that makes things kind of difficult to run, so instead I'm going to create a new Google Colab notebook and change the runtime type to GPU, then save and connect. Then I clone the repository I just made with git clone, cd into the folder (using the %cd magic with the percentage sign), and run pip install -r requirements.txt. Let me install the requirements.

First we should be able to just run python train.py — let's see if that runs. This is the base of your large language model; from here you want to do some research, and maybe fix some bugs first. I do recommend using a coding IDE — Antigravity by Google is actually free right now — so you can fix any bugs you find or talk to it for help. Now I'm going to debug this and see if there are any bugs, but because we have the main scaffolding, the main structure, it will be easy for you to use an AI editor to fix bugs as well if there are any. Then this is your base: you can think about the research you want to do and go from there. Okay, it looks like it's actually training, so there are no bugs — this may be too many steps and take a long time, but it's training, so everything works.

In my repository, blueberry-lm, if you go to this muon-versus-Adam experiment folder, you can find some instructions on how to run these experiments and how to think about them. For example, I can run this command to launch an experiment, but I first need to cd into the proper folders: if you look at the path, previously it tried to access the file directly from where I was, which doesn't exist there, so I need to cd into experiments and then into this folder — or I can just add the folder to the path here, so it knows where to look. And it looks like the experiments are running and it's training: step 10, step 20, no bugs.

For writing the paper itself you go to overleaf.com and create a new project. On one side you have the LaTeX source, and it's rendered on the other. You do need to know the format, but I recommend first writing your paper in a markdown file and then converting it to LaTeX with the help of AI.

But be careful when writing the paper: no AI slop, no AI-generated filler. Be very careful that every sentence has its place and is meaningful. Actually, arXiv stopped accepting review articles in computer science, I believe, because so many were AI-generated. So you need to think about your data, analyze it, and understand what's happening there. You can upload images to a folder and then link them as figures, specifying the path; you can ask ChatGPT how to link an image, and I think it's not so difficult.

In the introduction you want to be short and concise. You don't want your paper to be too long; it's better to explain with as few pages and as little text as possible. You can add some background and related work if you want, and then explain your methodology. You can look at what I wrote here and edit it based on your experiments, and you can also show the results. Here I should have used different colors — right now the lines are the same color so you can't tell them apart; I can fix that. I also explain the results and the ablations here.

I think writing the research paper itself is pretty straightforward, although you will spend a lot of time thinking about related work. Don't get overwhelmed: don't read too many papers or waste too much time trying to understand every one of them. You need to balance it — all I'm trying to say is that related work can be overwhelming, so make sure you don't overwhelm yourself. But I think that's it. Writing the paper is quite simple; you've done this in school with presentations and homework: you analyze, keep it concise, explain, and keep it clear and professional.

Don't write "I did something": say "the experiment was done" rather than "I did the experiment", or use "we did the experiment" if you prefer. Beyond that, there isn't much more to say about writing the paper itself — it's mostly analyzing the results. The most difficult part is coming up with the research idea and executing on it: knowing which questions you want to answer. So I think that's it for this time. You can find the LaTeX code in the GitHub repository below, as paper.tex, and the paper PDF as well.

Join my school to become an AI researcher. We have additional courses there that are not available on YouTube — math, PyTorch, neural networks, transformers, large language models from scratch, etc. — and in the community you can ask questions and get support. The link is below: a 7-day free trial and then just $9 per month. You can check the video that explains more about what's included.
