TLDW

explaining backpropagation until i go bananas

By Elliotcodes

Summary

Key takeaways

  • Backprop powers LLMs like GPT-4: Models like ChatGPT, GPT-4, Claude 3, and Claude 3.5 Sonnet learn under the hood using backpropagation: feed text through the network, compute how close the output is to the expected result, backpropagate gradients, and update weights via gradient descent, improving ~0.1% per iteration. [00:00], [00:33]
  • Loss amplifies early gradients: A big loss from high error (like 40% confidence on a hot dog image) means amplified gradients, causing fast initial adjustments in the network; over time, the loss drops with some randomness, then plateaus and slowly decreases. [03:37], [05:07]
  • Neurons: dot product plus activation: A neuron computes X1*W1 + X2*W2 + X3*W3 + X4*W4 for a scalar output Y, then passes it through a nonlinearity like tanh, which squashes values between -1 and 1. [06:50], [08:17]
  • Tensors via matrix-vector multiply: Abstract to tensors with vector X (1x3) times matrix W (3x2): the inner dimensions match (3=3), and the output shape is the outer dimensions (1x2), with Y1 and Y2 laid out horizontally. [09:16], [11:26]
  • Backprop starts at the loss derivative: Start backward from the MSE loss derivative, (predicted - expected); e.g., 1.0 vs 0.4 gives 0.6. The chain rule flows this gradient back, multiplying for ops like multiply and passing it straight through for add. [13:29], [15:24]
  • Weight gradients: X transpose dot Y_grad: Compute weight gradients by transposing the input X and dotting it with the output Y gradients; the shapes work out (3x1 dot 1x2 gives 3x2). Subtract the learning rate times these from the weights in gradient descent. [19:20], [22:27]

Topics Covered

  • Tensors Abstract Scalar Backprop
  • Backprop Starts at Loss Derivatives
  • Transpose Computes Weight Gradients
  • Batches Maximize GPU Efficiency

Full Transcript

Hi everyone, Elliot here. I created this video to help break down some of the intuition behind backpropagation. Models like ChatGPT, GPT-4, the recent Claude 3, the Claude 3.5 Sonnet model: this is how they learn under the hood. They use backpropagation. They feed some text through the network, and at the output they ask how close we were to the expected result, then they backpropagate through it, update all of these weights (or their gradients, rather), and then they do gradient descent, and the network gets a little bit better, maybe 0.1% or so, each iteration.

And this is essentially what I'm going to be breaking down. We're going to use this whiteboard, maybe some examples on the screen, but by the end of this you should understand how this thing works under the hood at the level of scalars and tensors.

As a quick note, I made this mainly because I couldn't quite conceptualize backprop completely with Karpathy's micrograd lectures or 3Blue1Brown's lecture on YouTube, not because they weren't well explained, but because they didn't abstract things up to the level of tensors. It's very easy to cover something like the calculus behind it and some of the simple operations like a neuron, but when you abstract it up to tensors that are two, three, four dimensional, it starts to get really hard to understand how things are flowing under the hood. When you're designing neural network architectures you have to know how these gradients are flowing; you have to understand how this thing is being optimized under the hood. So when you jump up in dimensionality and size, it helps when you can understand at the level of tensors, and that's also what this attempts to do. I have a Discord server where you can come in, ask questions, and chat with the community; the link for that will be in the description.

Let's pop over to the whiteboard, just to illustrate from the very beginning. You have a neural network composed of a bunch of knobs, or neurons, or however you want to think of them. I'm going to draw a simple diagram to illustrate this. You have maybe an image, X (it's a terrible X), and you pass this through a network, and then you get an output, and then you find a loss. So you do this forward pass through the network, you get your loss, and then you go backward and you update all the weights inside of the network, all these little parameters, these little knobs: you fix them, you optimize them, and then the model does better.

So let's actually go a little bit deeper and see what this is. Think about X as maybe some image. Maybe we're trying to predict if an image is a picture of a hot dog or not a hot dog: one if it's a hot dog, zero if it's not. What we do is pass the image through the network and get an output. Say we pass it an image that is a hot dog, and it says: I think there's a 40% chance, or 0.4, that it's a hot dog. That's a lot of error; 40% is really far off, and we don't want that. So we use this loss function to calculate the difference, the error, and this loss function grows as that error increases. If the prediction is 0.1 against an expected 1, that loss is going to be massive, whereas if it's 0.99 against a 1, it's more like: you should update a little bit, but you're kind of okay. That's what the loss does. Then we use this loss.

Let's say the loss is big. This means that as we traverse this network, all of the gradients are going to be amplified. These things called gradients, you can think of them as how much each little knob in the network should move: they're going to be bigger and amped up because we have such a big error. So that means there's going to be a lot of adjustment at first; the network is going to do a lot of fast moving and fast adjustments, and then it converges: through a bit of random movement it figures out which knobs are the best ones to move and by how much. And you'll see this loss, with time on this axis and loss on this one, drop: maybe a little randomness while it's figuring stuff out, and then it kind of plateaus and slowly decreases. That is this thing right here; this is called gradient descent. You're essentially just optimizing these little values in the network and making it better.

understand on on a high level like what the network is doing um let's go into like what it looks like on a on a more technical level so

up here you'll see I have a uh a neuron so you have some uh we're not going to use an image for this instance because

uh images are a little hard to work with but uh say we just have four X's right and we're we're we're entering these x's

in like Arrow Arrow arrow and arrow me this is like X1 all the way to X4

um and then we have a little Nur so they're going to All Connect like this you're going to have all all four

X's so you might have another neuron down here and another down here we're just going to worry about this one for now um all that's going to happen is

you're going to have your your X1 multiply the um W one so it's a we it's going to be structured like like this

W1 W2 W3 and W4 right um and it's essentially just going

to dot product with the uh the input so X1 all the way to X4 so what I mean by dotproduct is each

of these are just going to multiply so W1 * X1 W2 * X2 W3 * X3 and Etc and they're all just that those are going to

add together so it ends up looking like it ends up looking like this you have you know all these weights they multiply um they multiply with an x value and then they add together and that gives

you your output so it's this single scalar value that's the output of this neuron right um and then typically after you get that output you might pass it through something like a like an

activation function an activation just essentially is like a nonlinear change in the output so um for example you could use like s goes

like this or you could use tan H which is more common t h is like a squashing function like

this so it'll Plateau out at one and to a negative one it just squashes it so it's like a nice a nice one between one and negative one uh then you have like

sigmoid which is going to squash between a one and one and zero doesn't look like an s um between one and uh zero so you have a

bunch of these activation functions right and so what'll happen is you'll you'll pass uh you you'll get an output of neuron these will these will dot product together you'll get this output

and you'll pass it through say uh 10 H which is uh this guy here

so that's that's like on a low level how these IND idual okay so X1 * W1 plus X2

* W2 plus X3 * W3 plus X4 * W4 is equal to Y that's that's a neural pretty simple right um it's just add and multiplies
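The neuron just described can be sketched in a few lines of NumPy; the input and weight values here are made up for illustration:

```python
import numpy as np

# One neuron: dot product of four inputs with four weights, then tanh.
# The x and w values are made up for illustration.
x = np.array([0.5, -1.0, 2.0, 0.25])
w = np.array([0.1, 0.4, -0.2, 0.8])

y = np.dot(x, w)     # x1*w1 + x2*w2 + x3*w3 + x4*w4, a single scalar
out = np.tanh(y)     # squashed to lie between -1 and 1
```

The output of `np.tanh` always lands in (-1, 1), no matter how large the raw dot product gets.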

So how do you abstract this up to the level of tensors, you might ask? I can answer that fairly simply. I already drew this out preemptively, so it's easy to go faster. Take this as an example: you have inputs X1, X2, X3 with values 1, 2, and 3, and then you have a weight matrix. The input is a vector; it's one-dimensional, a line of numbers. The weights are 2D, so it's a matrix, which makes this a matrix-vector multiplication. You're essentially doing the same dot product thing, but adding more dimensions to your number structure.

You have this first neuron, and it has to interact with all the inputs, so you multiply these together. You take the vector first, then the matrix, and you take W11, W12, W13 (the first neuron's column) and rotate it. This is how you multiply vectors and matrices: you flip the column over this way and then do your iterative dot product, and that gives you your y output. Notice how we do this one times this one, this times this, and this and this, which is all written out here. That gives you y1, and then the next column flips over and interacts with the X values to give you y2.

The way we know this works is shapes. This is something you learn in a linear algebra class, and a lot of it is assumed knowledge of course, but the check is extremely simple: you look at the inner numbers. The input is 1 by 3, horizontal, and the matrix is 3 by 2. You look at your inner numbers: you have a three and a three, those are the same, check. Then you look at the outer ones: you have a one and a two, and that means you're going to get an output shape of 1 by 2. As long as those inner numbers match up, the outer ones give you the output shape, and we'll notice that's precisely what we get. So it's not 2 by 1 with y1 stacked over y2; it's 1 by 2, so it's horizontal: y1, y2 horizontally.
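The shape bookkeeping above can be checked in a couple of lines; the specific numbers in the weight matrix here are made up:

```python
import numpy as np

# A 1x3 input vector times a 3x2 weight matrix (values made up).
X = np.array([[1.0, 2.0, 3.0]])          # shape (1, 3)
W = np.array([[0.1, 0.2],
              [0.3, 0.4],
              [0.5, 0.6]])               # shape (3, 2)

# Inner dimensions match (3 == 3); the output takes the outer ones: (1, 2).
Y = X @ W
assert Y.shape == (1, 2)                 # y1 and y2, laid out horizontally
```

Each output is one column's dot product with the input: y1 = 1*0.1 + 2*0.3 + 3*0.5, y2 = 1*0.2 + 2*0.4 + 3*0.6.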

And that's pretty much the multi-layer perceptron. This thing we were talking about before, this neuron: you essentially take it up to the level of many, many neurons, meaning maybe billions; you have really big networks. You have your neurons, then a nonlinearity like tanh, then another bunch of neurons, then tanh again, and you just make a really deep network that way. It'll essentially look like this: you have your inputs, maybe like this, and you can think of each of these as just little weight columns. That's essentially all they are.

So you might ask: you understand how to go forward through this network, but how do you actually backpropagate through it? I drew some stuff to help illustrate what's going on and make my job a little easier when writing things out. When you're backpropagating through the network, you essentially have to start with the last node. You have an X, you have the network, and then you go to the loss L. This is where you start from: you start from the loss.

Let's say our loss function is MSE loss; that's what this is right here. You take your predicted and your expected values, you square the difference, you add those together, and you divide by the number of results there are. So essentially you subtract, you square, and then you average; that's what MSE loss is. And for the derivative, say I have two elements: you have 1/2 times the sum of these squared terms, and by the power rule from calculus, if you remember, this two comes down in front and cancels, so you end up with just this part, predicted minus expected. So if my predicted was 1.0 and my expected was 0.4, then 0.6 is the derivative of the loss, and you would have that as your first node: 0.6. And then you would distribute that accordingly.
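Here's a small sketch of MSE and its derivative with the 1.0-versus-0.4 example from above; the second element is a made-up filler so there are two results to average:

```python
import numpy as np

# MSE loss and its derivative, with the 1.0 vs 0.4 example from the video.
# The second element is a made-up filler so there are n = 2 results to average.
predicted = np.array([1.0, 0.2])
expected  = np.array([0.4, 0.2])

n = predicted.size
loss = np.mean((predicted - expected) ** 2)

# Power rule: the exponent 2 comes down and cancels the 1/2 from averaging
# over two elements, leaving just (predicted - expected).
grad = (2.0 / n) * (predicted - expected)
```

For the first element, `grad` comes out to 1.0 - 0.4 = 0.6, exactly the starting node described above.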

So if you had, say, two numbers, and they combined through an add operation to make this node, you just jump back to differentiation in calculus. For example, if I'm multiplying, like y = 2x, the derivative of that is two; the rate of change is two, the slope is two, so I write the slope as a function, and it's constant. But if you're just doing an add, there's nothing much to differentiate: the rate of change with respect to each input is one, so this 0.6 just flows to both of them. There's no multiplicative action happening; it just flows through. And then you can go right into backpropagation. It's really important to cover this part, because you have to start with the loss; you ask, what is the first node of this backward pass?
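Those two local rules, add passes the gradient straight through and multiply scales it, look like this with the 0.6 value from the loss:

```python
# Chain rule through the two ops mentioned above: an add passes the incoming
# gradient straight through; a multiply scales it by the constant factor.
upstream = 0.6            # the gradient arriving from the loss

# y = a + b: dy/da = dy/db = 1, so 0.6 flows unchanged to both inputs
grad_a = upstream * 1.0
grad_b = upstream * 1.0

# y = 2 * x: dy/dx = 2, so the gradient is scaled on the way back
grad_x = upstream * 2.0
```

This is the whole chain rule in miniature: the upstream gradient times the local derivative of each op.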

Then you go and do your matrix multiplication math, your matrix-vector multiplication, which is what this is. PyTorch is really, really fast at doing this, blazingly fast. There are billions of dollars that go into research on how matrix multiplication can be optimized under the hood on different GPUs, and the matrices they're multiplying don't look like this; they're massive, with millions of numbers in them. It's nothing you can store in your brain, but a computer can store it on chip and compute it really fast. That's like the GPT architecture in ChatGPT: a lot of the operations in it, like attention and the multi-layer perceptron, are matrix multiplication, so they purposely optimize for that. That's why so much money has gone into this algorithm and making it faster on hardware: it literally makes the coolest tools in the world faster. So that's a little background on the whole matrix multiplication thing. You might hear the word matmul; that's just short for matrix multiplication, essentially nerd speak for making this algorithm faster. You'll see it everywhere: matmul.

Search it up. But going back to backprop: you get these 0.6 values, and you can think of them as the gradients for your output, or maybe your second-to-last nodes. You have a data attribute: as you're going forward through the network, you eventually get these data outputs, and you feed them into the loss function. But when you're going backward, the point is to compute all the derivatives, the local gradients, of all of these different nodes, and you traverse backward because each node relies on the previous layer's gradients in order to work. This is just the chain rule from calculus, if you recollect that; I'll show you a really good video after this part for recollecting a lot of this stuff and understanding it intuitively. This video is more about showing that this matrix multiplication is fast and how you can do it without getting too confused.

So you have this data attribute going forward, and then you have the grad attribute going backward, and that's what we're computing here. You have these 0.6 values, and those are gradients. What you can then do is say: okay, I need to figure out how much I need to update the last layer's weights. In order to do this, you're going to need both these output gradients and the input, the x.data attribute, the forward part. For the inputs themselves, you don't really need gradients by the time you get to the end, but you compute these input gradients because you need them to traverse backwards. For some context, it's just the same thing as the forward pass, except you're flipping some things around.

What you're going to do is literally, as I say here, dot product the transpose of X with the Y gradients. So you have your y gradient; I'll just use these shapes for instance. If we jump back to here, we have y1 and y2, computed from the multiplication we covered earlier, and you have this horizontal row of X numbers. What you do is transpose X. Here's what transpose means: if you had, say, a 2x3 matrix, the transpose of that, the .T, would be a 3x2. So if I have a thing I populate with zeros, say 3x2, three high and two long, its transpose is going to be 2x3. You're just flipping it: instead of X1, X2, X3 laid out horizontally, you have X1, X2, X3 stacked vertically, and it keeps that indexed order.

And so you essentially just multiply these: you have a 3x1 and a 1x2, the inner numbers match up, and your output is going to be 3x2. And remember, going back to this, our weight matrix, populated with all these weight.data values, is also shaped 3x2. So what you can do is say: okay, I have these gradients now; I have pretty much how these weights are affecting the final output.
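The transpose trick above, with the same shapes as the whiteboard example, looks like this (the 0.6 gradient values carry over from earlier):

```python
import numpy as np

# Weight gradients via the transpose trick: X.T dot the output gradients.
X = np.array([[1.0, 2.0, 3.0]])       # input, shape (1, 3)
y_grad = np.array([[0.6, 0.6]])       # output gradients, shape (1, 2)

# Transposing X gives (3, 1); (3, 1) @ (1, 2) -> (3, 2),
# exactly the shape of the weight matrix it will update.
w_grad = X.T @ y_grad
assert w_grad.shape == (3, 2)
```

Because the result has the same shape as the weight matrix, it can be subtracted from it element-wise in the gradient descent step.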

That's the amount of error each weight is contributing. So you take the .data attributes and you essentially subtract (or change, or even add, depending how you look at it). You could just subtract 0.1 times the grad attribute, which is how much error that weight is contributing. If you subtracted all of the error at once, the weights would change by a massive amount; if you only adjust a little bit, and you do it over a lot of iterations, meaning hundreds or thousands or hundreds of thousands or millions, you'll just end up in a really optimal spot. So these weights, I update them based on their error, and then they improve a little bit. This is literally what gradient descent is: when I was talking earlier about your loss over training steps dropping and then plateauing out, this gradient descent part is exactly that.

Say you have your data, a, b, c, d, e, f, and you subtract 0.1 times the gradients. Say the weights are all twos and the gradients are all fours. Then 0.1 times each gradient gives you 0.4, and you element-wise subtract: each weight becomes 2 minus 0.4, which is 1.6, and your weight matrix is updated accordingly.
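That arithmetic is the whole update rule; with the twos-and-fours numbers from above:

```python
import numpy as np

# The gradient descent step from the example: weights of 2, gradients of 4,
# learning rate 0.1, so each weight becomes 2 - 0.1*4 = 1.6.
W = np.full((3, 2), 2.0)
W_grad = np.full((3, 2), 4.0)
lr = 0.1

W -= lr * W_grad          # element-wise update
```

Repeating this small step over many iterations is what slowly walks the loss curve down toward its plateau.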

So literally all the forward pass is: you have your input X, and you do essentially Y = X times W, and you can optionally add a bias in there too. Bias is just a common term in this kind of linear layer, if you will; it's an extra thing you add on, and it can provide benefits. Bias actually makes models more expressive, because the layer can lean more a certain way. But that's beside the point. You do this matrix multiplication, or matrix-vector multiplication, to go forward, and then when you're doing the backward pass, you just transpose the X, you do that flipping on it, going from 1, 2, 3 horizontal to 1, 2, 3 vertical, and then you do a matrix multiplication with that transposed X and the Y gradients, and that gives you your weight gradients. And then you do this last part, gradient descent. So that's pretty much how backpropagation works under the hood at the level of tensors.

Of course, you can abstract it to the level of batches. Right now I'm just working with two dimensions, like 3x2, but instead of just a single window pane, you can make it 3D, a cube, and each of those panes is going to be a batch. On these GPUs, these NVIDIA GPUs you can run models really fast on, you have so many cores that you can actually run a model more efficiently if you use batches. If I just pass in these small window panes, only a certain number of the cores get used, whereas if I do it at the level of batches, it can use all the cores, and it can capture more information about the data being passed through: if there's more error, or more to the story across more samples of data, you can actually learn more and optimize better at that level. It's also just the parallel architecture of GPUs that allows massive throughput like that. So that's a side story; I love ranting about that stuff.
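The batch idea, going from a single window pane to a cube of them, can be sketched with NumPy's batched matmul; the batch size and random values here are made up:

```python
import numpy as np

rng = np.random.default_rng(0)

batch = 8
X = rng.standard_normal((batch, 1, 3))   # a batch of 1x3 inputs: the "cube"
W = rng.standard_normal((3, 2))          # one shared 3x2 weight matrix

# NumPy broadcasts the matmul over the leading batch dimension, which is
# the same shape trick a GPU exploits to run all the samples in parallel.
Y = X @ W
assert Y.shape == (batch, 1, 2)
```

Each pane of `Y` is exactly what you'd get by multiplying that sample's 1x3 input by W on its own.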

But let me showcase some other videos that I found extremely useful, but that didn't quite go to the level of matrices, tensors, and batch processing. So here it is: here's Andrej Karpathy's lecture. He did some work at Tesla and OpenAI, he's a really smart guy, and he has a whole lecture series on this stuff. I highly suggest you watch his channel; he does really good content, some of the best I've seen. A lot of what I covered today on the whiteboard he actually covers in code, and he does everything from scratch, including that whole backwards traversal, but without matrix multiplications; he does it at a more scalar level. That was kind of the point of this video, if you missed it: how to abstract all of that up to matrices and tensors. But he does an extremely good job, two hours and thirty minutes of it, building neural networks and backprop from scratch. And then 3Blue1Brown has some stuff as well: "What is backpropagation really doing?" He has some good videos on this. So yeah, hopefully you found that useful.

Before I leave, there are actually two other things I wanted to cover. One thing I completely missed during this whole part was the activation functions and differentiating them. This is an extremely simple part; I probably left it out because I thought it was too easy and trivial. Pretty much all it says is: you have an activation function like tanh, you find the derivative of that function, and you use that in backpropagation. So if I have some outputs, say y1 and y2, the point is you find the derivative of tanh and you just pass these values through it, and you get the next node from that. That's a little something I thought I should add, because these activations, or nonlinearities (they're curves, after all), are absolutely essential in backprop.
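Differentiating tanh in the backward pass looks like this; the pre-activation values and the 0.6 upstream gradients are made up for illustration:

```python
import numpy as np

# Backprop through tanh: multiply the incoming gradient element-wise by the
# derivative 1 - tanh(z)^2. The pre-activation values here are made up.
z = np.array([0.5, -1.0])
out = np.tanh(z)

upstream = np.array([0.6, 0.6])      # gradients arriving from the layer above
grad = upstream * (1.0 - out ** 2)   # chain rule through the nonlinearity
```

Since 1 - tanh(z)^2 is always between 0 and 1, the gradient can only shrink (or stay the same) as it passes back through tanh.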

And then the last one: if I just zoom out here again, this is something worth looking into, PyTorch autograd. It's a little more complicated to read, but based on the background I've given you, the Andrej Karpathy lectures, 3Blue1Brown, and the content I've covered here today, it shouldn't be too hard. So: gradient descent, which I talked about; differentiation and autograd, just a recollection of calculus; and then the computational graph. You have all these different nodes, like PowBackward or MulBackward or SubBackward, and it goes into the details behind that. There's an introduction to autograd and the fundamentals of autograd; they have a YouTube video on that, and there's more stuff they talk about. This one is especially good; I'm going to put it in the description, actually, because of how good it is. They go into quite good depth on how autograd works under the hood in a framework like PyTorch.

You might be familiar with PyTorch; it's essentially how companies like OpenAI do really fast AI research. They write their neural networks, their data loaders and whatever else, and all their hyperparameters, optimize those, and experiment rapidly. PyTorch runs extremely fast on GPUs and has a bunch of tools; I don't even know all the tools PyTorch has yet, because it's so expansive. But I highly recommend looking through this. They go through the steps in code, they give you the outputs, and it's very intuitive to look through. So, that's it.
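As a closing sketch, the whole forward-loss-backward loop from this video fits in a few lines of PyTorch autograd; the 0.5 weights and the target values are made up, with shapes matching the 1x3-times-3x2 whiteboard example:

```python
import torch

# Forward, loss, backward in PyTorch autograd. The 0.5 weights and the
# target values are made up; shapes match the whiteboard example.
X = torch.tensor([[1.0, 2.0, 3.0]])
W = torch.full((3, 2), 0.5, requires_grad=True)

Y = X @ W                                   # forward: (1,3) @ (3,2) -> (1,2)
target = torch.tensor([[1.0, 1.0]])
loss = ((Y - target) ** 2).mean()           # MSE loss
loss.backward()                             # autograd fills in W.grad

# W.grad has the same 3x2 shape as W, ready for a gradient descent step.
assert W.grad.shape == W.shape
```

Autograd records the computational graph during the forward pass and replays it backward, which is exactly the manual transpose-and-matmul traversal described on the whiteboard.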
