explaining backpropagation until i go bananas
By Elliotcodes
Summary
## Key takeaways

- **Backprop powers LLMs like GPT-4**: Models like ChatGPT, GPT-4, Claude 3, and Claude 3.5 Sonnet learn under the hood using backpropagation: feed text through the network, compute how close the output is to the expected result, backpropagate gradients, and update weights via gradient descent, improving ~0.1% per iteration. [00:00], [00:33]
- **Loss amplifies early gradients**: A big loss from high error, like 40% confidence on a hot dog image, means amplified gradients and fast initial adjustments in the network; over time, the loss drops with some randomness, then plateaus and slowly decreases. [03:37], [05:07]
- **Neurons: dot product plus activation**: A neuron computes x1*w1 + x2*w2 + x3*w3 + x4*w4 to get a scalar output y, then passes it through a nonlinearity like tanh, which squashes values between -1 and 1. [06:50], [08:17]
- **Tensors via matrix-vector multiply**: Abstract to tensors with a vector X (1x3) times a matrix W (3x2): the inner dimensions match (3 = 3), and the output shape is the outer dimensions (1x2), i.e. y1 and y2 laid out horizontally. [09:16], [11:26]
- **Backprop starts at the loss derivative**: Start backward from the MSE loss derivative, (predicted - expected); e.g., 1.0 vs 0.4 gives 0.6. The chain rule flows this gradient back, scaling it through multiply ops and passing it straight through adds. [13:29], [15:24]
- **Weight gradients: X transpose dot Y_grad**: Compute weight gradients by transposing the input X and taking the dot product with the output gradients, matching shapes (3x1 dot 1x2 gives 3x2); in gradient descent, subtract the learning rate times these gradients from the weights. [19:20], [22:27]
Topics Covered
- Tensors Abstract Scalar Backprop
- Backprop Starts at Loss Derivatives
- Transpose Computes Weight Gradients
- Batches Maximize GPU Efficiency
Full Transcript
Hi everyone, Elliot here. I created this video to help break down some of the intuition behind backpropagation. Models like ChatGPT, GPT-4, the recent Claude 3, the Claude 3.5 Sonnet model: this is how they learn under the hood. They use backpropagation: they feed some text through the network, and at the output they ask how close they were to the expected result. Then they backpropagate through it, compute gradients for all of these weights, do gradient descent, and the network gets a little bit better, maybe 0.1% or so, each iteration. That's essentially what I'm going to be breaking down. We're going to use this whiteboard, maybe some examples on the screen, but by the end of this you should understand how this thing works under the hood at the level of scalars and tensors.
As a quick note, I made this mainly because I couldn't quite conceptualize backprop completely from Karpathy's micrograd lecture or 3Blue1Brown's lectures on YouTube. Not because they weren't well explained, but because they didn't abstract things up to the level of tensors. It's very easy to cover something like the calculus behind it and some of the simple operations like a neuron, but when you abstract it up to tensors that are two, three, four dimensional, it starts to get really hard to understand how things are flowing under the hood. When you're designing neural network architectures you have to know how these gradients are flowing; you have to understand, under the hood, how this thing is being optimized. So when you jump up in terms of dimensionality and size, it helps if you can understand it at the level of tensors, and that's also what this attempts to do. I also have a Discord server where you can come in, ask questions, and chat with the community; the link for that will be in the description.
Let's pop over to the whiteboard here, just to illustrate from the very beginning. You have a neural network composed of a bunch of knobs, or neurons, however you want to think of it. I'm going to draw a simple diagram to illustrate this. You have maybe an image, X (it's a terrible X), and you pass it through a network, you get an output, and then you find a loss. So you do this forward pass through the network, you get your loss, and then you go backward and update all the weights inside the network, all these little parameters, these little knobs. You fix them, you optimize them, and the model does better. So let's go a little deeper and see what this actually is.
So think about X, maybe some image. Say we're trying to predict whether an image is a picture of a hot dog or not: one means hot dog, zero means not a hot dog. We pass the image through the network and get an output. Say we pass it an image that really is a hot dog, and the network says, "I think there's a 40% chance, 0.4, that it's a hot dog." That's a lot of error; 40% is really far off, and we don't want that. So we use a loss function to calculate the difference, the error there, and this loss function grows as that error increases. If the output is 0.1 when the answer is 1, that loss is going to be massive, whereas if it's 0.99 versus 1, the loss basically says, "update a little bit, but you're kind of okay." That's what the loss does.
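To make that concrete, here's a minimal sketch of how a squared-error style loss grows with the error; the numbers are mine, the video doesn't show code at this point:

```python
# Squared error for a single prediction against the "hot dog" target of 1.0.
def loss(predicted, expected=1.0):
    return (predicted - expected) ** 2

print(loss(0.1))    # 0.81   -> way off, so the loss is massive
print(loss(0.99))   # 0.0001 -> nearly right, so the loss is tiny
```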
Now let's say the loss is big. That means that as we traverse this network, all of the gradients are going to be amplified. These things called gradients, which you can think of as the adjustment signals for the knobs in the network, are going to be bigger, amped up, because we have such a big error. So there's going to be a lot of adjustment at first: the network is going to do a lot of fast moving and fast adjustments, and then it converges. Through somewhat random-looking movement it quickly figures out which knobs are the best ones to move and by how much. If you plot loss against time, you'll see it drop: maybe a little randomness at first while it figures things out, then it kind of plateaus and slowly decreases. That right there is called gradient descent: you're essentially just optimizing these little values in the network and making the model better.
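Here's a toy sketch of that loop, just to show the shape of the curve; the model, data, and learning rate are all made up for illustration, not from the video:

```python
# Fit y = w * x with gradient descent on one sample where the true w is 3.
w = 0.0                     # start from a bad weight
x, target = 2.0, 6.0        # one training sample (3 * 2 = 6)
lr = 0.05                   # learning rate

for step in range(10):
    pred = w * x
    loss = (pred - target) ** 2
    grad = 2 * (pred - target) * x   # d(loss)/dw via the chain rule
    w -= lr * grad                   # the gradient descent update
    print(step, round(loss, 4))      # drops quickly, then flattens out
```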
Now that we understand at a high level what the network is doing, let's look at what it looks like on a more technical level.
Up here you'll see I have a neuron. We're not going to use an image for this, because images are a little hard to work with; say we just have four x values coming in, x1 through x4. They all connect into a little neuron (you might have another neuron down here and another down here, but we're just going to worry about this one for now). All that happens is each x multiplies with a weight; the weights are structured as w1, w2, w3, w4, and they essentially dot product with the input x1 through x4. What I mean by dot product is that each pair just multiplies, w1*x1, w2*x2, w3*x3, and so on, and those products all add together. So it ends up looking like this: all these weights multiply with an x value and then add together, and that gives you your output, a single scalar value, the output of this neuron. Then typically, after you get that output, you pass it through something like an activation function.
An activation function is essentially just a nonlinear change to the output. For example, you could use one that curves like this, or you could use tanh, which is more common. Tanh is a squashing function: it plateaus out at one and at negative one, so it squashes the value into a nice range between -1 and 1. Then you have sigmoid, which squashes between zero and one. There are a bunch of these activation functions. So what happens is the inputs and weights dot product together, you get the neuron's output, and you pass it through, say, tanh, which is this guy here.
So that's, at a low level, how these individual neurons work: x1*w1 + x2*w2 + x3*w3 + x4*w4 = y. That's a neuron. Pretty simple, right? It's just adds and multiplies.
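As a minimal sketch of that neuron in code (the input and weight values here are made up):

```python
import math

x = [0.5, -1.0, 2.0, 0.1]    # four inputs x1..x4
w = [0.2, 0.4, -0.3, 1.5]    # four weights w1..w4

y = sum(xi * wi for xi, wi in zip(x, w))   # dot product: x1*w1 + ... + x4*w4
out = math.tanh(y)                          # squash the scalar into (-1, 1)
print(y, out)
```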
So how do you abstract this up to the level of tensors, you might ask? I can answer that fairly simply. I already drew this out pre-emptively to make it easier to go faster, so take this as an example. You have inputs x1, x2, x3 with values 1, 2, and 3, and then you have a weight matrix. The input is a vector, it's one-dimensional, just a line of numbers, while the weight matrix is 2D, so this is a matrix-vector multiplication. How this goes is you're essentially doing the same dot product thing, but adding more dimensions to your number structure. The first neuron has to interact with all the inputs, so you take the vector first, then the matrix, and you take w11, w12, w13 (that's the first neuron's weights) and rotate it. That's how you multiply vectors and matrices: you flip it this way and do your iterative dot product, and that gives you your y output. Notice how we do this one times this one, this one times this one, and so on; it's all written out here. That gives you y1, and then the next set of weights flips over and interacts with the x values to give you y2.
And the way we know this works is just because of shapes. This is something you learn in a linear algebra class, so a lot of it is assumed knowledge, but the rule is extremely simple. Look at the shapes: the input is 1 by 3 (horizontal), and the weight matrix is 3 by 2. Check the inner numbers: you have a 3 and a 3, those match, check. Then look at the outer two numbers: you have a 1 and a 2, which means you're going to get an output shape of 1 by 2. As long as the inner numbers match up, your output shape is the outer numbers, and that's exactly what we get. Notice it's not 2 by 1 with y1 and y2 stacked; it's 1 by 2, so y1 and y2 sit horizontally.
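Here's that shape check as a quick sketch (the weight values are arbitrary):

```python
import numpy as np

X = np.array([[1.0, 2.0, 3.0]])     # shape (1, 3): x1, x2, x3
W = np.array([[0.1, 0.2],
              [0.3, 0.4],
              [0.5, 0.6]])          # shape (3, 2)

Y = X @ W                            # inner dims match: 3 == 3
print(Y.shape)                       # (1, 2): y1 and y2, laid out horizontally
```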
And that's pretty much the multi-layer perceptron. This thing we were talking about before, this neuron, you essentially take it up to the level of many, many neurons, maybe billions, in really big networks. You have your neurons, then a nonlinearity like tanh, then another bunch of neurons, then tanh again, and you just keep going; you make a really deep network that way. It'll essentially look like this: you have your inputs, and you can think of each of these layers as just these little weight columns; that's essentially all they are.
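A tiny sketch of that stacking idea (the layer sizes here are made up):

```python
import numpy as np

def layer(X, W):
    return np.tanh(X @ W)            # matmul, then the nonlinearity

X = np.random.randn(1, 3)            # one input with 3 features
W1 = np.random.randn(3, 8)           # first weight matrix: 3 -> 8
W2 = np.random.randn(8, 2)           # second weight matrix: 8 -> 2

out = layer(layer(X, W1), W2)        # neurons, tanh, neurons, tanh
print(out.shape)                      # (1, 2)
```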
So you might ask: you understand how to go forward through this network, but how do you actually backpropagate through it? I drew some stuff to help illustrate what's going on and make my job a little easier when writing things out. When you're backpropagating through the network, you have to start with the last node. You have an X, it goes through the net, and then you get to the loss; that's where you start from, the loss. Let's say our loss function is MSE loss, which is what this is right here: you take your predicted and your expected, you square that difference, you add those squared differences together, and you divide by the number of results there are. So MSE loss is essentially: subtract, square, then average. Now for the derivative: say I have two elements, so there's a 1/2 out front of the sum of squared differences. From the power rule in calculus, the 2 in the exponent comes down and cancels that 1/2, and you end up with just (predicted - expected). So if my predicted was 1.0 and my expected was 0.4, the derivative of the loss is 0.6, and that 0.6 is your first node.
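In code, that starting gradient is just a subtraction (using the video's numbers):

```python
# With two elements, the 1/2 in the MSE cancels the power-rule 2,
# so d(loss)/d(predicted) is simply (predicted - expected).
predicted, expected = 1.0, 0.4
loss_grad = predicted - expected
print(loss_grad)   # 0.6 -- the first gradient node we backpropagate from
```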
Then you distribute that accordingly. Say you had two numbers that were combined by an add operation to make this output. You just jump back to differentiation from calculus. For example, for a multiply like y = 2x, the derivative is 2: the rate of change, the slope, is 2, so the gradient gets scaled by that constant as it flows through. But if you're just doing an add, there's nothing to really differentiate; each input contributes one-for-one, so the 0.6 just flows straight through to both of them, with no multiplicative action happening. From there you can go right on with backpropagation.
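Here's that local-derivative idea as a tiny sketch, where 0.6 is the upstream gradient coming back from the loss:

```python
upstream = 0.6

# add node: c = a + b  ->  dc/da = 1 and dc/db = 1, so the gradient passes straight through
grad_a = upstream * 1.0
grad_b = upstream * 1.0

# multiply node: y = 2 * x  ->  dy/dx = 2, so the gradient gets scaled by the constant
grad_x = upstream * 2.0

print(grad_a, grad_b, grad_x)   # 0.6 0.6 1.2
```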
And it's really important to cover that starting point: you have to start with the loss and ask what the first node of the backward pass is. From there you go and do your matrix multiplication math, or rather matrix-vector multiplication; that's what this is. PyTorch is really, really fast at doing this, blazingly fast. There are billions of dollars that go into research on how matrix multiplication can be optimized under the hood on different GPUs, and the matrices they're multiplying don't look like this little example; they're massive, with millions of numbers in them. It's nothing you can hold in your brain, but a computer can store it on chip and compute it really fast. That's the GPT architecture in ChatGPT: a lot of the operations in it, like attention and the multi-layer perceptron, are matrix multiplications, so they purposely optimize for that. That's why so much money has gone into this algorithm and making it faster on hardware: it literally makes the coolest tools in the world faster. So that's a little background on the whole matrix multiplication thing. You might hear the word "matmul"; that's just short for matrix multiplication, essentially nerd speak for making this algorithm faster, and you'll see it everywhere, so search it up.
But going back to backprop: you get these 0.6 values, and you can think of them as the gradients for your output, your second-to-last nodes. Each node has a data attribute: as you go forward through the network, you eventually get these data outputs and feed them into the loss function. When you go backwards, the point is to compute all the derivatives, the local gradients, of all of these different nodes, and you traverse it backward because each node's gradient depends on the gradients of the layer after it, the ones you've already computed coming back from the loss. That's just the chain rule from calculus, if you recollect it. I'll show you a really good video after this part for recollecting a lot of this stuff and understanding it intuitively; this video is more about saying that this matrix multiplication is fast and showing how you can do it without getting too confused. So alongside the data attribute going forward, you have the grad attribute going backward, and that's what we're computing here.
you have these 0.6 values and see these are gradients right so what you can then do is you can say okay I need to figure out how much I need to up those weights
I need to figure out how much I need to update the last layers you know whatever weights whatever you want to call them
um and in order to do this you're going to need both these these output gradients as well as the inut so the x.
data attribute right the forward forward part I mean the inputs you don't really need uh gradients for those by the time you get to the end but you you do it you do it because you need to compute all of
these um you need to you need to Traverse backwards that's why you compute these these input
gradients um so for some context it's just the same thing as this except you're flipping some things around
What you're going to do is literally what I've written here: dot product the transpose of X with the Y gradients. Use the shapes from before: if we jump back, we have y1 and y2 computed from that matrix-vector multiply we covered earlier, and we have this horizontal row of X numbers. What you do is transpose the X. What transposing means: if you had, say, a 2 by 3 matrix, its transpose (the .T) would be 3 by 2. If I have a matrix that I populate with zeros, three high and two long, its transpose flips those dimensions; instead of x1, x2, x3 going across, you get x1, x2, x3 going down, and it keeps that indexed order. So you essentially just multiply these: a 3 by 1 (the transposed X) and a 1 by 2 (the Y gradients). The inner numbers match up, and the output shape is the outer ones, so you get a 3 by 2 output. And remember, going back to our weight matrix populated with all these weight.data values: that is also shaped 3 by 2, which is exactly the shape we need.
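That transpose-and-matmul step, as a sketch with the shapes from this example:

```python
import numpy as np

X = np.array([[1.0, 2.0, 3.0]])     # input from the forward pass, shape (1, 3)
Y_grad = np.array([[0.6, 0.6]])     # gradients at the output, shape (1, 2)

W_grad = X.T @ Y_grad                # (3, 1) @ (1, 2) -> (3, 2)
print(W_grad.shape)                  # (3, 2), the same shape as the weight matrix
```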
So now you can say: okay, I have these gradients; I have pretty much how these weights are affecting the final output, the amount of error each one is contributing. You take the weight .data attributes and you adjust them, subtract (or add, depending on how you look at it). You could just subtract 0.1 times the .grad attribute, which measures how much error that weight is causing. If you subtracted all of the error at once, the weights would change by a massive amount; if you only adjust a little bit each time and do it over a lot of iterations, meaning hundreds, thousands, hundreds of thousands, or millions, you'll end up in a really optimal spot. So these weights get updated based on their error, and they improve a little bit each time. This is literally what gradient descent is: what I was talking about earlier, where you plot loss over training steps and it drops and then plateaus out. The gradient descent part is literally just the following.
Say your weight data is a little matrix of values, a, b, c, d, e, f, and to keep it simple, say they're all 2. And say the gradients are all 4. Then 0.1 times each gradient gives you 0.4, and you element-wise subtract, so every weight goes from 2 to 1.6, and your weight matrix has been updated by exactly that amount.
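That update as a sketch, with those same numbers:

```python
import numpy as np

W_data = np.full((3, 2), 2.0)    # weights, all 2s
W_grad = np.full((3, 2), 4.0)    # their gradients, all 4s
lr = 0.1                          # learning rate

W_data -= lr * W_grad             # element-wise: 2 - 0.1 * 4 = 1.6
print(W_data)                     # every entry is now 1.6
```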
So literally, the whole forward pass is: you have your input X and you do y = x * W, and you can optionally add a bias in there too. Bias is just a common term for an extra thing you add onto this group, a linear layer if you will, and it can provide benefits; bias can actually make language models more expressive because it lets things lean a certain way. But that's beside the point. You do this matrix multiplication, or matrix-vector multiplication, to go forward. Then for the backward pass, you transpose the X, you do that flipping on it, going from 1 by 3 to 3 by 1, and you matrix multiply that transposed X with the Y gradients, and that gives you your weight gradients. Then you do the gradient descent part. That's pretty much how backpropagation works under the hood at the level of tensors.
Of course, you can abstract this up to the level of batches. Right now I'm just working with two dimensions, like 3 by 2, but instead of a single window pane you can make it 3D, a cube, where each pane is one element of the batch. On these NVIDIA GPUs that we run models on really fast, there are so many cores that you can actually run a model more efficiently if you use batches. If I just pass in these small window panes one at a time, the GPU can only use a certain fraction of its cores, whereas if I work at the level of batches, it can use all of them. It also captures more information about the data being passed through: if there's more error, or more to the story across more samples of data, you can learn more and optimize better at that level. It's also just the parallel architecture of GPUs that gives you massive throughput like that. That's a side story; I love ranting about that stuff.
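A quick sketch of what batching the earlier example might look like (the batch size is arbitrary):

```python
import numpy as np

batch = 32
X = np.random.randn(batch, 1, 3)     # 32 of the little (1, 3) "window panes" stacked into a cube
W = np.random.randn(3, 2)            # one shared weight matrix

Y = X @ W                             # one matmul over the whole batch -> shape (32, 1, 2)
print(Y.shape)
```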
But I'll just showcase some other videos that I found extremely useful, even though they didn't quite go to the level of matrices, tensors, and batch processing and all that stuff. So here it is: here's Andrej Karpathy's lecture. He did some work at Tesla and OpenAI; he's a really smart guy, and he has a whole lecture series on this stuff. I highly suggest you watch his channel; he makes really good content, some of the best I've seen. A lot of what I covered today on the whiteboard he actually covers in code: he does everything from scratch and does that whole backwards traversal thing, but without matrix multiplications, at a more scalar level. That was kind of the point of this video, if you missed it: how to abstract that up to matrices and tensors. But he does an extremely good job, two hours and thirty minutes on neural networks and backpropagation, so that's pretty much it for that one. And I know 3Blue1Brown has some stuff as well: "What is backpropagation really doing?" He has some good videos on this. So yeah, hopefully you found that useful.
Before I leave, there are actually two other things I wanted to cover. One thing I completely missed during this whole part was the activation functions and differentiating those. This is an extremely simple part; I probably left it out because I thought it was too easy and trivial. Pretty much all it comes down to is: you have an activation function like tanh, you find the derivative of that function, and you use that in backpropagation. So if I have a matrix of outputs, say y1 and y2 (I won't even give them values), the point is you find the derivative of tanh, pass these values through that derivative, and that gives you the next node. That's a little something I thought I should add, because these activations, or nonlinearities (since it's a curve), are absolutely essential in backprop.
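Here's that step as a sketch, using the fact that the derivative of tanh(z) is 1 - tanh(z)^2 (the numbers are made up):

```python
import numpy as np

z = np.array([[0.5, -1.2]])              # pre-activation values going into tanh
y = np.tanh(z)                            # forward pass through the nonlinearity

upstream_grad = np.array([[0.6, 0.6]])    # gradients arriving from the next node back
z_grad = upstream_grad * (1.0 - y ** 2)   # chain rule: scale by tanh's local derivative
print(z_grad)
```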
And the last thing, if I zoom out here again: this is something worth looking into, PyTorch autograd. It's a little more complicated to read, but based on the background I've given you, the Andrej Karpathy lectures, 3Blue1Brown, and the content I've covered here today, it shouldn't be too hard. It covers gradient descent, which I talked about, differentiation and autograd (just a recollection of calculus), and then the computational graph. You have all these different nodes, like PowBackward or MulBackward or SubBackward, and it pretty much goes into the details behind all of that.
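For a tiny taste of what those backward nodes look like in PyTorch (a minimal sketch, not from the video):

```python
import torch

x = torch.tensor([2.0, 3.0], requires_grad=True)
p = x ** 2                 # autograd records a PowBackward node for this op
y = p.sum()

print(p.grad_fn)           # e.g. <PowBackward0 ...>
y.backward()               # traverse the recorded graph backwards
print(x.grad)              # tensor([4., 6.]) -- d(x^2)/dx = 2x
```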
There's an introduction to autograd and the fundamentals of autograd; they have a YouTube video on that, and there's more stuff they talk about. This one is especially good; I'm actually going to put it in the description because of how good it is. They go into quite good depth on how autograd works under the hood in a framework like PyTorch. You might be familiar with PyTorch: it's essentially how companies like OpenAI do really fast AI research. They write their neural networks, their data loaders, their hyperparameters, optimize those, and experiment rapidly. PyTorch runs extremely fast on GPUs and has a bunch of tools; I don't even know all the tools PyTorch has yet because it's so expansive. But I highly recommend looking through this; they go through the steps in code, give you the outputs, and it's very intuitive to follow. So, that's it.