Stanford CS224N: NLP with Deep Learning | Spring 2024 | Lecture 5 - Recurrent Neural Networks
By Stanford Online
Summary
## Key takeaways
- **Deep nets stuck for 15 years**: Although people had neural networks and backpropagation, for a very long period of time, about 15 years, things seemed completely stuck; you couldn't get deep neural networks to work in practice, even though in theory they seemed promising. [06:03], [06:26]
- **Modern overfitting embraced**: With modern large neural networks, we train our models so that they overfit to the training data almost completely, getting near-zero loss, but provided you've done regularization well, the model will also generalize well to different data. [12:24], [13:05]
- **Dropout forces robustness**: At training time with dropout, you randomly throw away some inputs inside the middle layers using a random mask, training the model to be robust and to make as much use of every input as it can, since any component might randomly disappear. [14:15], [15:10]
- **N-gram sparsity hack: add delta**: For unseen n-grams causing zero-probability counts, people hacked the counts by adding a little delta like 0.25, so things never seen got a count of 0.25, ensuring there are no zeros and everything is possible. [35:47], [36:08]
- **RNNs reuse weights over time**: Recurrent neural networks use one set of weights applied through successive moments in time, updating a hidden state at each time step that stores a memory of everything seen so far, processing any length of context without exponential parameter growth. [51:44], [53:43]
- **Teacher forcing trains RNNs**: In RNN language model training, at each step you generate the probability distribution for next words using the actual prefix from the text, compute the negative log likelihood loss for the actual next word, then feed in the actual word as input for the next step. [01:01:51], [01:02:21]
Full Transcript
Okay, let me get started for today. First of all, I'm going to spend a few minutes talking about a couple more neural net concepts, including a couple of the concepts that turn up in assignment two. The bulk of today is then going to be introducing what language models are, and after that we're going to introduce a new kind of neural network that is one way to build language models: recurrent neural networks. They're an important thing to know about, and we use them in assignment three, but they're certainly not the only way to build language models; in fact, probably a lot of you already know that there's this other kind of neural network called Transformers, and we'll get on to those after we've done recurrent neural nets. I'll talk a bit about problems with recurrent neural networks, and if I have time I'll get onto the recap.

Before getting into the content of the class, I thought I'd spend a minute giving you the stats of who is in CS224N.
Who's in CS224N kind of looks like the pie charts they show in CS106A these days, except with more grad students, I guess. The four big groups are the computer science undergrads, the computer science grads, the undeclared undergraduates, and the NDO grads, which includes a large portion of the SCPD students, though some of them are under computer science grads. That makes up about 60% of the audience, and if you're not in one of those four big groups, you're in the other 40%, and everybody is somewhere, so there are lots of other interesting groups down here. The bright orange down here, that's where the math and physics PhDs are. Up here, interestingly, we now have more statistics grad students than SymSys undergrads; it didn't used to be that way around in NLP classes. And one of my favorite groups, the little magenta group down here: these are the humanities undergrads. Yay, humanities undergrads. In terms of years it breaks down like this: first-year grad students are the biggest group, tons of juniors and seniors, and a couple of brave freshmen. Are any brave freshmen here today? [Laughter]
Yeah, okay, welcome. So, modern neural networks, especially language models, are enormous. This chart is somewhat out of date because it only goes up to 2022, and it's actually hard to make an accurate chart for 2024, because in the last couple of years the biggest language model makers have in general stopped saying how large their language models are in terms of parameters. But at any rate, these are clearly huge models with over 100 billion parameters. Large, and also deep in terms of very many layers, neural nets are a cornerstone of modern NLP systems. We're going to be pretty quickly working our way up to look at those kinds of deep models, but to start off with something simpler, I just wanted to key you in for a few minutes to a little bit of history.

The last time neural nets were popular was in the 80s and 90s, and that was when people worked out the backpropagation algorithm. Geoff Hinton and colleagues made famous the backpropagation algorithm that we've looked at, and that allowed the training of neural nets with hidden layers. But in those days, pretty much all the neural nets with hidden layers that were trained were trained with one hidden layer: you had the input, the hidden layer, and the output, and that's all there was. The reason for that was that for a very, very long time people couldn't really get things to work with more hidden layers. That only started to change with the resurgence of what often got called deep learning, which started around 2006, and one of the influential papers at the time was "Greedy Layer-Wise Training of Deep Networks" by Yoshua Bengio and colleagues. Right at the beginning of that paper they observed the problem: "However, until recently it was believed too difficult to train deep multi-layer neural networks. Empirically, deep networks were generally found to be not better, and often worse, than neural networks with one or two hidden layers. ... As this is a negative result, it has not been much reported in the machine learning literature." (Gerry Tesauro, whom they cite there, worked very early on applications of neural networks, including autonomous driving.)

So really, although people had neural networks and backpropagation, and recurrent neural networks, which we're going to talk about today, for a very long period of time, 15 years or so, things seemed completely stuck: although in theory it seemed like deep neural networks should be promising, in practice they didn't work. It then took some new developments, in the late 2000s and then more profoundly in the 2010s, to figure out how we could have deep neural networks that actually worked, working far better than the shallow neural networks and leading into the networks that we have today. We're going to be starting to talk about some of those things in this class and in coming classes. I think the tendency when you see the things that got neural networks to work much better is to shrug and be underwhelmed and think, "oh, is this all there is to it? This doesn't exactly seem like difficult science." In some sense that's true: they're fairly small introductions of new ideas and tweaks of things. But nevertheless, a handful of little ideas and tweaks turned things around, from a field that was stuck for 15 years going nowhere, and which nearly everyone had abandoned because of that, to suddenly being able to train these deeper neural networks, which then behaved amazingly better as machine learning systems than the things that had preceded them and dominated in the intervening time. So what are these things?
One of them, which you can greet with a bit of a yawn in some sense, is doing better regularization of neural nets. Regularization is the idea that, beyond just having a loss that we want to minimize in terms of describing the data, we want to in some other way manipulate what parameters we learn so that our models work better. So normally we have a more complex loss function that does some regularization. The most common way of doing this is L2 regularization, where you add on a parameter-squared term at the end, and this regularization says it would be good to find a model with small parameter weights: you should be finding the smallest parameter weights that will explain your data well. There's a lot you can say about regularization and these kinds of losses; they get talked about a lot more in other classes like CS229 Machine Learning and the machine learning theory class, so I'm not going to say very much about it. But I do just want to put in one note that's very relevant to what's happened in recent neural network work.
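For reference, the usual shape of an L2-regularized training objective (a standard formulation, written in my own notation rather than copied from the slides) is:

$$J(\theta) \;=\; \frac{1}{N}\sum_{i=1}^{N} \ell\big(f_\theta(x_i),\, y_i\big) \;+\; \lambda \sum_{j} \theta_j^2$$

where the first term is the average per-example loss describing the data and the second term penalizes large parameter values, with $\lambda$ controlling how strongly small weights are preferred.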
The classic view of regularization was that we needed it to prevent our networks from overfitting, meaning that they would do a very good job of modeling the training data but would then generalize badly to new data. The picture you got shown was this: as you train on some training data, your error necessarily goes down, but after some point you start learning specific properties of things that happen to turn up in those training examples, things that are only good for the training examples and so won't generalize well to different pieces of data you see at test time. So if you have a separate validation set or a final test set and you trace out the error or loss on it, after some point it starts to go up again (there's a quirk in my PowerPoint; it's just meant to go up). The fact that it goes up means you have overfit your training data, and making the parameters numerically small is meant to lessen the extent to which you overfit.

This is not a picture that modern neural network people believe at all. Instead, the picture has changed: we don't believe that overfitting, in that classical sense, is the thing to worry about anymore, but we are still concerned about models that generalize well to different data. In classical statistics, the idea that you could train billions of parameters, like large neural nets now have, would be seen as ridiculous, because you could not possibly estimate all those parameters well; you'd just have a noisy mess. What's actually been found is that, yes, you can't estimate the individual numbers well, but what you get is a kind of interesting averaging function from all these myriad numbers, and if you do it right, then as you go on training, for a while it might look like you're starting to overfit, but if you keep on training a huge network, not only does your training loss continue to go down very infinitesimally, your validation loss goes down as well. So with huge networks these days, we train our models so that they overfit to the training data almost completely: if you train a huge network now on a training set, you can essentially train it to get zero loss (maybe 0.007 loss or something), because you've got such rich models that you can perfectly fit, memorize, the entire training set. Classically, that would have been seen as a disaster because you've overfit the training data; with modern large neural networks it's not seen as a disaster, because provided you've done regularization well, your model will also generalize well to different data.

However, the flip side of that is that L2 regularization, or similar methods like L1 regularization, usually aren't strong enough regularization to achieve that effect, and so neural network people have turned to other methods of regularization, of which everyone's favorite is dropout.
Dropout is one of the things that's on the assignment, and at this point I should apologize a little, because the way dropout is presented here is the original formulation, while the way it's presented on the assignment is the way it's now normally done in deep learning packages, so there are a couple of details that vary a bit. Let me just present the main idea here and not worry too much about the details of the math. The idea of dropout is that at training time, every time you do a piece of training with an example, inside the middle layers of the neural network you just throw away some of the inputs. Technically, the way you do this is that you sample a random mask of zeros and ones each time and take the Hadamard (elementwise) product of that with the data, so some of the data items go to zero, and you have a different mask each time; for the next example, something different gets masked out. So you're randomly throwing away inputs, and the effect of this is that you're training the model so that it has to be robust and make as much use of every input as it can. It can't decide to be extremely reliant on, say, component 17 of the vector, because sometimes that's just going to randomly disappear; if there are other features you could use instead that would let you work out what to do next, you should also know how to make use of those features. So at training time you randomly delete things; at test time, for efficiency but also for quality of the answer, you don't delete anything, you keep all of your weights, and you just rescale things to make up for the fact that you used to be dropping things.
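As a rough sketch of the idea (this is the "inverted dropout" variant most modern packages use, which rescales at training time instead of test time; the assignment's exact formulation may differ):

```python
import numpy as np

def dropout_train(h, p_drop=0.5, rng=None):
    """Apply dropout to activations h at training time (inverted dropout)."""
    rng = rng or np.random.default_rng()
    mask = (rng.random(h.shape) >= p_drop).astype(h.dtype)  # random 0/1 mask, resampled every time
    return h * mask / (1.0 - p_drop)  # Hadamard product, rescaled so the expected output matches h

def dropout_test(h):
    """At test time nothing is dropped; with inverted dropout no extra rescaling is needed."""
    return h

h = np.random.randn(4, 8)          # a batch of hidden-layer activations
h_train = dropout_train(h, 0.5)    # roughly half the entries zeroed, the rest scaled by 2
```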
There are several ways you can think of explaining this. One motivation that's often given is that it prevents feature co-adaptation: rather than the model learning complex functions like "features 7, 8, and 11 together help me predict this", it knows that some of the features might be missing, so it has to make use of things in a more flexible way. Another way of thinking of it is via model ensembles, where you mix together different models and improve your results; if you're training with dropout, it's as if you're training a huge model ensemble, because you're training the ensemble over the power set, the exponential number of every possible dropout of features, all at once, and that gives you a very good model. If you've seen Naive Bayes and logistic regression models before, I think a nice way to think of it is that dropout gives a middle ground between the two: for Naive Bayes models you're weighting each feature independently, just based on the data statistics, regardless of what other features are there; in logistic regression, weights are set in the context of all the other features; and with dropout you're somewhere in between, setting the weights in the context of some of the other features, with different ones disappearing at different times. But following work done at Stanford by Stefan Wager and others, these days people generally regard dropout as a form of feature-dependent regularization, and that work shows some theoretical results as to why to think of it that way.
Okay, I think we've implicitly seen this one, but: vectorization. The idea is no for loops; always use vectors, matrices, and tensors. The entire success and speed of deep learning comes from the fact that we can do things with vectors, matrices, and tensors. If you're writing for loops in any language, but especially in Python, things run really slowly; if you do things with vectors and matrices, even on a CPU, things run at least an order of magnitude faster. And what everyone really wants to do in deep learning is run things on GPUs, or sometimes now neural processing units, and then you're getting two or three orders of magnitude of speedup. So do always think: I should be doing things with vectors and matrices, and if I'm writing a for loop for anything that isn't some very superficial bit of input processing, I've almost certainly made a mistake and should work out how to do it with vectors and matrices. It's the same with something like dropout: you don't want to write a for loop that goes through all the positions and sets some of them to zero, you want to use a vectorized operation with your mask.
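For instance, here's a toy comparison of the two styles (NumPy, purely to illustrate the point about avoiding Python-level loops; the exact speedup will depend on your machine):

```python
import numpy as np
import time

W = np.random.randn(500, 500)
X = np.random.randn(500, 1000)

# Loop version: multiply W by each column of X one at a time.
start = time.time()
out_loop = np.empty((500, 1000))
for j in range(X.shape[1]):
    out_loop[:, j] = W @ X[:, j]
loop_time = time.time() - start

# Vectorized version: one matrix-matrix multiply handles all columns at once.
start = time.time()
out_vec = W @ X
vec_time = time.time() - start

assert np.allclose(out_loop, out_vec)
print(f"loop: {loop_time:.4f}s  vectorized: {vec_time:.4f}s")
```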
Two more. Parameter initialization: this one might not be obvious, but when we start training our neural networks, in almost all cases it's vital that we initialize the parameters of our matrices to small random numbers. The reason is that if we just start with our matrices all zero, or some other constant, we normally have symmetry. It's like starting on the saddle point in this picture: it's symmetric left and right, and forward and backward, so you don't know where to go and you might just get stuck in one place. Another way to think about it is that the operations you're doing to all the elements in the matrix are the same, so if all of your features have the same value initially, it's as if you only have one feature and a lot of copies of it, rather than a whole vector of features. So to get learning going and have things work well, we almost always want to set all the weights to very small random numbers. And when I say very small, we want them in a range where they don't disappear to zero, and also don't start blowing up into huge numbers when we multiply things by them. Doing this initialization at the right scale used to be seen as something pretty important, and there were particular methods, based on thinking about what happens when you do matrix multiplies, that people worked out and often used. One of these is Xavier initialization, which works out what the variance of your uniform distribution should be based on the number of inputs and outputs of a layer. We still use that to initialize things in assignment two, but we'll see later that these concerns partly go away, because people have come up with clever methods, in particular layer normalization, which largely obviates the need to be so careful about the initialization. But you still need to initialize things to something.
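A minimal sketch of the Xavier (Glorot) uniform scheme, assuming the common form that sets the bound from a layer's fan-in and fan-out:

```python
import numpy as np

def xavier_uniform(n_in, n_out, rng=None):
    """Sample a weight matrix from U(-a, a) with a = sqrt(6 / (n_in + n_out))."""
    rng = rng or np.random.default_rng()
    a = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-a, a, size=(n_out, n_in))

W1 = xavier_uniform(300, 200)   # e.g. a hidden layer mapping 300-dim inputs to 200 dims
b1 = np.zeros(200)              # biases are typically just initialized to zero
```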
Then the final one, which is also something that appears in the second assignment, is optimizers. We talked in class about stochastic gradient descent and did the basic equations for it, and to a first approximation there's nothing wrong with stochastic gradient descent: if you fiddle around enough you can usually get it to work well for almost any problem. But getting it to work well is very dependent on getting the scales of things right, on having the right step size, and often you have to have a learning-rate schedule with decreasing step sizes and various other complications. So people have come up with more sophisticated optimizers for neural networks, and for complex nets these sometimes seem necessary to get them to learn well; at any rate, they give you a lot of margin of safety, since they're much less dependent on you setting the hyperparameters just right. The idea behind all of the most commonly used methods is that, for each parameter, they accumulate a measure of what the gradient has been in the past, so they have some idea of the scale of the gradient, the slope, for that particular parameter, and they use that to decide how much to move at each time step. The simplest method along these lines was one called Adagrad (John Duchi was one of the co-inventors of it); it's simple and nice enough, but it tends to stall early. Then people came up with different methods: Adam is the one that's on assignment two, and it's a really good, safe place to start. Our word vectors have a special property because of their sparsity: you update them very sparsely, because particular words only turn up occasionally, so people have come up with optimizers that have special properties for things like word vectors, and the ones with a "W" at the end can sometimes be good to try. And then there's a whole family of extra ideas that people have used to improve optimizers; if you want to learn about those you can go off and do an optimization class like convex optimization, but there are ideas like momentum and Nesterov acceleration, and people variously try to use all of those things as well. But Adam is a good name to remember, if you remember nothing else.
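In PyTorch, which the assignments use, switching optimizers is mostly a matter of which `torch.optim` class you construct; a minimal, illustrative pattern (the toy model and learning rates here are just placeholders):

```python
import torch

model = torch.nn.Linear(100, 10)          # stand-in for whatever model you're training

# Plain SGD: works, but is sensitive to the learning rate / schedule you pick.
opt = torch.optim.SGD(model.parameters(), lr=0.1)

# Adam: per-parameter adaptive step sizes; a safe default in practice.
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

x, y = torch.randn(32, 100), torch.randn(32, 10)
loss = torch.nn.functional.mse_loss(model(x), y)
opt.zero_grad()
loss.backward()
opt.step()                                 # one update, whichever optimizer was chosen
```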
Okay, that took longer than I hoped, but I'll get on now to language models. In some sense "language model" is just two English words, but when we say language model in NLP we mean it as a technical term with a particular meaning. The idea of a language model is something that can predict what word is going to come next, or, more precisely, put a probability distribution over what words come next. "The students opened their ___": what words are likely to come next? Bags, laptops, notebooks... yeah, I have some of those at least. So these are likely words, and if on top of those we put a probability on each one, then we have a language model. Formally, we've got a context of preceding items and we're putting a probability distribution over the next item, which means that the sum of these estimates over the items in the vocabulary is one; if we've defined a P like this that predicts probabilities of next words, that is called a language model.

As it says here, an alternative way you can think of a language model is as a system that assigns a probability to a piece of text. We can say that a language model can take any piece of text and give it a probability, and the reason we can do that is the chain rule: if I want the probability of any stretch of text, then given my previous definition of a language model I can compute the probability of x1 with a null preceding context, times the probability of x2 given x1, and so on. I can do this chain-rule decomposition, and the terms of that decomposition are precisely what the language model, as I defined it previously, provides.
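Written out, that decomposition is:

$$P(x_1, x_2, \ldots, x_T) \;=\; P(x_1)\,P(x_2 \mid x_1)\,P(x_3 \mid x_1, x_2)\cdots \;=\; \prod_{t=1}^{T} P(x_t \mid x_1, \ldots, x_{t-1})$$

where each conditional factor is exactly the next-word distribution that a language model provides.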
So language models are an essential technology for NLP; just about everywhere people do things with human language and computers, people use language models. In particular, they weren't something that got invented in 2022 with ChatGPT: language models have been central to NLP at least since the 80s, and the idea of them goes back to at least the 50s. Any time you're typing on your phone and it's making suggestions of next words, regardless of whether you like those suggestions or not, those suggestions are being generated by a language model, traditionally a compact, not very good language model so that it can run quickly with very little memory in your keyboard application. If you go on Google and start typing something and it suggests what could come after it to complete your query, again, that's being generated by a language model.

So how can you build a language model? Before getting into neural language models, I've got just a few slides to tell you about the old days of language modeling: this is how language models were built from 1975 until, effectively, around about 2012.
We want to put probabilities on these sequences, and the way we're going to do it is to build what's called an n-gram language model. This means we're going to look at short word subsequences and use them to predict, and n is a variable describing how short the word sequences are that we're going to use. If we just look at probabilities of individual words, we have a unigram language model; probabilities of pairs of words, a bigram language model; probabilities of three words, trigram language models; and probabilities of more than three words get called 4-gram, 5-gram, 6-gram language models. For people with a classics education this is horrific, of course; in particular, not even the small ones are correct, because "gram" is a Greek root, so it should really have Greek numbers in front, and you should have monograms and digrams. Actually, the first person to introduce the idea of n-gram models was Claude Shannon, when he was working out information theory, the same guy who did cross-entropy and all of that, and if you look at his 1951 paper he uses "digrams". But the idea died about there, and this is what everyone else says in practice. It's kind of cute; I like it as a nice, practical notation.

To build these models, the idea is that we're just going to count how often different n-grams appear in text and use those counts to build our probability estimates. In particular, our trick is that we make a Markov assumption: if we're predicting the next word based on a long context, we say, tell you what, we're not going to use all of it; we're only going to use the most recent n-1 words. So we have this big context and we throw most of it away, and if we're predicting word x(t+1) based simply on the preceding n-1 words, then we can make the prediction using n-gram counts. If we use n = 3, we'd have a trigram count in the numerator, normalized by a bigram count in the denominator, and that gives us relative frequencies of the different terms. We can do that simply by counting how often n-grams occur in a large amount of text and dividing through by the counts, and that gives us a relative frequency estimate of the probability of different continuations. Does that make sense? Yeah, that's a way to do it.
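In symbols, the Markov assumption plus counting gives the standard estimate:

$$P(x_{t+1} \mid x_1, \ldots, x_t) \;\approx\; P(x_{t+1} \mid x_{t-n+2}, \ldots, x_t) \;=\; \frac{\operatorname{count}(x_{t-n+2}, \ldots, x_t, x_{t+1})}{\operatorname{count}(x_{t-n+2}, \ldots, x_t)}$$

so for a trigram model (n = 3) the numerator is a trigram count and the denominator is the count of the preceding bigram.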
Okay, so suppose we're learning a 4-gram language model, and we've got a piece of text: "as the proctor started the clock, the students opened their ___". To estimate things, we're going to throw away all but the preceding three words, so we'll estimate based on "students opened their", and we'll work out the probabilities by looking at counts of "students opened their w" and counts of "students opened their". So we might have in a corpus that "students opened their" occurred a thousand times and "students opened their books" occurred 400 times, so we'd say the probability estimate for "books" is simply 0.4; if "exams" occurred 100 times, the probability estimate for "exams" is 0.1. You can sort of see that this is bad. It's not terrible, because if you're going to try to predict the next word in a simple way, the immediately prior words are the most helpful words to look at, but it's clearly primitive: if you knew the prior text was "as the proctor started the clock", that makes it sound likely that the word should be "exams", whereas estimating just based on "students opened their", you'd be more likely to choose "books" because it's more common. So it's a crude estimate, but a decent enough place to start. It's also a crude estimate that can be problematic in other ways. Why else might we get in trouble using this as our probability estimate?
[Student]: There are a lot of n-grams. Yeah, there are a lot of words, and therefore there are a lot of n-grams; that's a problem we'll come to in a moment. Anything else? Maybe up the back. [Student]: The word w might not even show up in the training data, so you might just have a count of zero for it. Yeah, so if we're counting over any reasonable-size corpus, there are lots of words that we just won't have seen; they never happen to occur in the text we counted over. If you start thinking about "students opened their ___", there are lots of things you could put there: "students opened their accounts", or, if the students are doing dissections in a biology class, maybe "students opened their frogs", I don't know. There are lots of words that in some context would actually be possible, and lots of them we won't have seen, so we'd give them a probability estimate of zero. That tends to be an especially bad thing to do with probabilities, because once we have a probability estimate of zero, any computation we do that involves it instantly goes to zero. So we have to deal with some of these problems.

For that sparsity problem, where the word never occurred in the numerator and so we'd simply get a probability estimate of zero, the way it was dealt with was that people just hacked the counts a little to make them non-zero. Lots of ways were explored, but the easiest is that you just add a little delta, like 0.25, to the counts, so things that you never saw get a count of 0.25 and things you saw once get a count of 1.25, and then there are no zeros anymore and everything is possible. Then there's a second problem: you might never have seen "students opened their" before at all, which means your denominator is just undefined, and you don't have any counts in the numerator either, so you need to do something different. The standard trick used then was backoff: if you couldn't estimate words coming after "students opened their", you worked out estimates for words coming after "opened their", and if you couldn't estimate that, you just used the estimate of words coming after "their". So you used less and less context until you could get an estimate you could use.
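Here's a toy sketch of those two hacks, add-delta smoothing and backoff, for a trigram model (illustrative only; real systems of that era used more careful smoothing schemes):

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def prob_add_delta(word, context, counts, vocab, delta=0.25):
    """P(word | context) with delta added to every count, so nothing gets probability zero."""
    num = counts[len(context) + 1][context + (word,)] + delta
    den = counts[len(context)][context] + delta * len(vocab)
    return num / den

def prob_backoff(word, context, counts, vocab, delta=0.25):
    """Back off to a shorter context whenever the current context was never seen."""
    while context and counts[len(context)][context] == 0:
        context = context[1:]                       # drop the earliest context word
    return prob_add_delta(word, context, counts, vocab, delta)

tokens = "the students opened their books . the students opened their exams .".split()
vocab = set(tokens)
counts = {n: ngram_counts(tokens, n) for n in range(1, 4)}   # unigram..trigram counts
counts[0] = Counter({(): len(tokens)})                       # "empty context" count for full backoff

print(prob_backoff("books", ("opened", "their"), counts, vocab))   # seen context
print(prob_backoff("books", ("never", "seen"), counts, vocab))     # falls back to shorter contexts
```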
But something to note is that we've now got conflicting pressures. On the one hand, if you want to come up with a better estimate, you'd like to use more context, i.e. a larger n-gram. On the other hand, as you use more conditioning words, the storage-size problem someone mentioned gets worse and worse, because the number of n-grams you have to know about goes up exponentially with the size of the context, and your sparsity problems also get way, way worse, so you're almost necessarily going to end up seeing zeros. Because of that, in practice things tended to max out at five; occasionally people used 6-grams and 7-grams, but most of the time, between the sparsity and the cost of storage, 5-grams were the largest thing people dealt with. A famous resource from back in the 2000s that Google released was Google N-grams, which was built on a trillion-word web corpus and gave counts of n-grams up to n = 5, and that's where they stopped. We've sort of already said the storage problem: to do this you need to store these counts, and the number of counts goes up exponentially in the context size.
But what's good about n-gram language models is that they're really easy to build. You can build one yourself in a few minutes when you want to have a bit of fun on the weekend: all you have to do is start storing counts of n-grams, and you can use them to predict things. At least if you do it over a small corpus, like a couple of million words of text, you can build an n-gram language model in seconds on your laptop. Well, you do have to write the software, so a few minutes to write the software, but building the model takes seconds, because there's no training of a neural network; all you do is count how often n-grams occur.

Once you've done that, you can run an n-gram language model to generate text; we could do text generation before ChatGPT. So if I have a trigram language model, I can start off with some words, "today the", look at my stored n-grams, and get a probability distribution over next words, and here they are. Note the strong patterning of these probabilities, because remember, they're all derived from counts that are being normalized: really, these are words that occurred once, these are words that occurred twice, these are words that occurred four times in this context, so they're in some sense crude when you look at them more carefully. But what we can do at this point is roll a die, get a random number between 0 and 1, and use it to sample from this distribution. So if we generate as our random number something like 0.35 and go down from the top, we'd say okay, we've sampled the word "price": "today the price". Then we repeat: we condition on that, get a probability distribution over the next word, generate a random number and use it to sample from the distribution, say we generate 0.2, and so we choose "of". We condition on that, get a probability distribution, generate a random number, which is 0.5 or something, and we get "gold" coming out: "today the price of gold". And we can keep on doing this and generate some text.
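A minimal sketch of that sampling loop over trigram counts (the tiny corpus and function names here are my own illustration of the mechanics of rolling the die, not the actual system behind the slide):

```python
import random
from collections import Counter

def sample_next(context, trigram_counts):
    """Sample a next word in proportion to the trigram counts for this 2-word context."""
    candidates = {g[-1]: c for g, c in trigram_counts.items() if g[:2] == context}
    if not candidates:
        return None                                   # unseen context; a real system would back off
    r = random.random() * sum(candidates.values())    # roll the die
    for word, c in candidates.items():
        r -= c
        if r <= 0:
            return word

def generate(seed, trigram_counts, max_len=20):
    words = list(seed)
    for _ in range(max_len):
        nxt = sample_next(tuple(words[-2:]), trigram_counts)
        if nxt is None:
            break
        words.append(nxt)
    return " ".join(words)

tokens = "today the price of gold rose . today the price of shoes fell .".split()
trigram_counts = Counter(tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2))
print(generate(("today", "the"), trigram_counts))
```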
Here's some text generated from two million words of training data using a trigram language model: "today the price of gold per ton, while production of shoe lasts and shoe industry, the bank intervened just after it considered and rejected an IMF demand to rebuild depleted European stocks September 3rd in primary 76 cents a share". Okay, that text isn't great, but I actually want people to be in a positive mood today, and actually it's not so bad: it's surprisingly grammatical. In particular, I lowercased everything, so this is the IMF, the International Monetary Fund, which should be capitalized. There are big pieces of this that even make sense: "the bank intervened just after it considered and rejected an IMF demand" is pretty much making sense as a piece of text. So it's mostly grammatical, it looks like English text, but it makes no sense overall; it's really incoherent. So there's work to do, but what you could already see is that even with these simple n-gram models, you could, from a very low level, approach from below what text and human language look like. And I could easily make this better even with an n-gram language model: rather than two million words of text, if I trained on ten million words of text it would be better, and if I then went from a trigram model to a 4-gram model it would get better, and you'd start getting better and better approximations of text.

This is essentially what people did until about 2012, and really, the same story that people tell today, that scale will solve everything, is exactly the same story people used to tell in the early 2010s with these n-gram language models. If you weren't getting good enough results with your ten million words of text and a trigram language model, the answer was that with a hundred million words of text and a 4-gram language model you'd do better, and with a trillion words of text and a 5-gram language model you'd do better, and gee, wouldn't it be good if we could collect ten trillion words of text so we could train an even better n-gram language model. Same strategy. But it turns out that sometimes you can do better with better models as well as simply with scale, and so things got reinvented, starting again with building neural language models.
So how can we build a neural language model? We've got the same task: a sequence of words, and we want to put a probability estimate over what word comes next. The simplest way you could do that, which hopefully you'll all have thought of because it connects to what we did in earlier classes: we already had the idea that we could represent a context by the concatenation of some word vectors, put that into a neural network, and use it to predict something. In the example I did in the last couple of classes, what we predicted was whether the center word is a location or not, just a binary choice. But that's not the only thing we could predict; we could have predicted lots of things with this neural network: whether the piece of text was positive or negative, whether it was written in English or Japanese. So one thing we could choose to predict is what word is going to come next after this window of text. We'd have a model just like that one, except that at the top, instead of doing a binary classification, we do a many-way classification over what the next word in the piece of text is going to be, and that would give us a neural language model, in particular a fixed-window neural language model, where we do the same Markov-assumption trick of throwing away the further-back context. So for the fixed window, we use word embeddings, which we concatenate; we put that through a hidden layer; then we take the output of that hidden layer, multiply it by another weight matrix, put that through a softmax, and get an output distribution. This gives us a fixed-window neural language model, and apart from the fact that we're now doing a classification over many, many classes, it is exactly like what we did last week, so it should look kind of familiar; it's also kind of like what you're doing for assignment two.
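A compact sketch of such a fixed-window model in PyTorch (my own illustrative layer names and sizes, not the assignment's code):

```python
import torch
import torch.nn as nn

class FixedWindowLM(nn.Module):
    def __init__(self, vocab_size, embed_dim=100, window=4, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.hidden = nn.Linear(window * embed_dim, hidden_dim)   # concatenated window -> hidden
        self.output = nn.Linear(hidden_dim, vocab_size)           # hidden -> scores over the vocab

    def forward(self, window_ids):                  # window_ids: (batch, window) word indices
        e = self.embed(window_ids)                  # (batch, window, embed_dim)
        e = e.flatten(start_dim=1)                  # concatenate the window's embeddings
        h = torch.tanh(self.hidden(e))
        return self.output(h)                       # logits; softmax gives P(next word | window)

model = FixedWindowLM(vocab_size=10_000)
logits = model(torch.randint(0, 10_000, (8, 4)))    # a batch of 8 four-word windows
probs = torch.softmax(logits, dim=-1)               # next-word distributions, shape (8, 10000)
```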
This is essentially the first kind of neural language model that was proposed. In particular, Yoshua Bengio, right at the beginning of the 21st century, suggested that rather than using an n-gram language model you could use a fixed-window neural language model, and even at that point he and colleagues were able to get some positive results from this model. But at the time it wasn't widely noticed and didn't really take off that much, for a combination of reasons. When it was only a fixed window, it was not that different from n-grams in some sense, even though it could be argued that the neural network gave better generalization than using counts. In practice, neural nets were still hard to run without GPUs, and people felt, and I think in general this was the case, that you could get more oomph by doing the scale story and collecting your n-gram counts on hundreds of billions of words of text rather than trying to make a neural network out of it. So it didn't especially take off at that time. But in principle it seemed a nice thing: it got rid of the sparsity problem, and it got rid of the storage cost, since you no longer have to store all observed n-grams, only the parameters of your neural network.

Still, it didn't solve all the problems we'd like to solve. In particular, we still have the Markov assumption that we're using only a small fixed context to predict from; there are some disadvantages to enlarging that window; and no fixed window is ever big enough. There's another thing that, if you look technically at this model, might make you suspicious of it: when we have words in different positions, those words are treated by completely different sub-parts of this matrix W. You might think that, for predicting that "books" comes next, the fact that there's a student here is important, but it doesn't matter so much exactly where the word "students" occurs. The context could have been "the students slowly opened their"; it's still the same students, we've just got slightly different linguistic structure, yet this W matrix would be using completely separate parameters to learn about "students" here versus "students" in that position. That seems inefficient and wrong, and it suggested that we need a different kind of neural architecture, one that can process any length of input and can use the same parameters to say: hey, I saw the word "students", that's evidence that things like books, exams, homework will be turning up, regardless of where it occurs. And that led to the exploration of a different neural network architecture called recurrent neural networks, which is what I'll go on to next. But before I do, is everyone basically okay with what a language model is? Yeah? No questions?
Okay: recurrent neural networks. Recurrent neural networks are a different family of neural networks. Effectively, in this class we see several neural network architectures. In some sense the first architecture we saw was word2vec, which is a very simple encoder-decoder architecture; the second family was feed-forward networks, or fully connected layers, classic neural networks; the third family, which we're going to see now, is recurrent neural networks, which come in different kinds; and then we'll go on to Transformer models. The idea of a recurrent neural network is that you've got one set of weights that is applied at successive moments in time, i.e. successive positions in the text, and as you do that you update a hidden state as you go. We'll go through this in quite a bit of detail, but here's the idea of it.
We've got "the students opened their" and we want to predict from that. (I've still got four words in my example so I can fit things down the left side of the slide, but there could have been 24 words, because recurrent neural networks can deal with any length of context.) As before, our words start off as just words, or one-hot vectors, and we can look up their word embeddings just like before. But now, to compute probabilities for the next word, we're going to do something different: our hidden layer is going to be recurrent, and by recurrent we mean that we change a hidden state at each time step as we proceed through the text from left to right. We start off with h_0, the initial hidden state, which can actually just be all zeros. Then at each time step, we multiply the previous hidden state by a weight matrix, we take the word embedding and multiply it by another weight matrix, and we sum the results of those two things, and that gives us a new hidden state. That hidden state then stores a memory of everything that's been seen so far. Then we continue along: we multiply the next word vector by the same weight matrix W_e, we multiply the previous hidden state by the same weight matrix W_h, and we add them together and get a new representation. I've left a bit out: commonly there are two other things you're doing. You add on a bias term, because we usually separate out a bias term, and you put things through a nonlinearity; for recurrent neural networks, most commonly this nonlinearity has been the tanh function, so it's balanced on the positive and negative sides. So I should make sure I mention that. You keep on doing that at each step, and the idea is that once we've gotten to here, this h_4 hidden state has, in some sense, read the text up until now: it's seen all of "the students opened their", and if the word "students" occurred in any of these positions, it will have been multiplied by the same W_e matrix and added into the hidden state, so we've got a cleaner, low-parameter way of incorporating the information that's been seen. Now I want to predict the next word, and to do that I do, based on the final hidden state, the same kind of thing I did before: I multiply that hidden state by a matrix, add another bias, and stick that through a softmax; the softmax gives me a language model probability distribution over all next words, and I can sample from it to generate the next word. Does that make sense?
Okay, so with recurrent neural networks we can now process any length of preceding context; we just put more and more stuff into our hidden state. Our computation can use information from many steps back, and our model size doesn't increase for a longer context: we do have to do more computation for a long context, but our representation of that long context remains this fixed-size hidden vector h of whatever dimension it is, so there's no exponential blowup anymore. The same weights are applied at every time step, so there's a symmetry in how inputs are processed. There are some catches, though. The biggest catch in practice is that recurrent computation is slow: for the feed-forward layer we just had our input vector, we multiply it by a matrix, multiply by another matrix however many times, and then we're done, whereas here we're stuck with this sequentiality where you have to compute one hidden vector at a time. In fact, this goes against what I said at the beginning of class, because essentially here you're doing a for loop: you're going through for t equals 1 to T and generating in turn each hidden vector, and that's one of the big problems with RNNs that has led them to fall out of favor. There's another problem that we'll look at more: in theory this is perfect, you're just incorporating all of the past context into your hidden vector; in practice it tends not to work perfectly, because although stuff you saw back here is in some sense still alive in the hidden vector as you come across here, your memory of it gets more and more distant, and it's the words you saw recently that dominate the hidden state. In some sense that's right, because the recent stuff is the most important, it's freshest in your mind; it's the same with human beings, they tend to forget stuff from further back as well. But RNNs, especially in the simple form I've just explained, forget stuff from further back rather too quickly, and we'll come back to that in Thursday's class.
Okay, so for training an RNN language model, the starting point is that we get a big corpus of text again, and then, for each time step, we compute a prediction of the probability of next words. There's an actual next word, and we use that as the basis of our loss. Our loss function is the cross-entropy between the predicted probability distribution and the actual next word we saw, which, as in the example I showed before, is just the negative log likelihood of the actual next word. Ideally you'd like to predict the actual next word with probability one, in which case the negative log of one would be zero and there'd be no loss; in practice, if you give it an estimate of 0.5 there's only a little bit of loss, and so on. To get our overall objective function, we work out the average loss, the average negative log likelihood of predicting each word in turn.
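Written out, the per-step and overall objectives are:

$$J^{(t)}(\theta) = -\log \hat{y}_t\big[x_{t+1}\big], \qquad J(\theta) = \frac{1}{T}\sum_{t=1}^{T} J^{(t)}(\theta)$$

where $\hat{y}_t[x_{t+1}]$ is the probability the model assigned at step $t$ to the word that actually came next.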
Showing that as pictures: if our corpus is "the students opened their exams", we first try to predict what comes after "the". We predict some words with different probabilities, and then we say, okay, the actual next word is "students"; you gave that a probability of, say, 0.05, because all you knew was that the first word was "the". So there's a loss for that, the negative log probability given to "students". We then go on and generate the probability estimate over the next words, and ask: the actual word is "opened", what probability estimate did you give to that? We get a negative log probability loss. We keep running this along, then we sum all of those losses and average them per word, and that's our average per-word loss, which we want to make as small as possible. That's our training mechanism.

It's important to note that for generating this loss we're not doing free generation; we're not just saying to the model, go off and generate a sentence. What we're actually doing at each step is effectively saying: okay, the prefix is "the students opened", what probability distribution do you put on the next word after that? We generate it with our recurrent neural network, then ask, for the actual next word: what probability estimate did you give to "their"? That's our loss. But then what we do is stick "their", the right answer, into our recurrent neural network; we always go back to the right answer, generate a probability distribution for next words, and ask what probability it gave to the actual next word, "exams", and then again we use the actual next word. So we do one step of generation, then pull it back to what was actually in the text, then ask it for guesses over the next word, and repeat. The fact that we don't do free generation but pull it back to the actual piece of text each time makes things simple, because we know what an actual author used for the next word. That process is called teacher forcing, and the most common way to train language models is with this kind of teacher-forcing method. It's not perfect in all respects, because we're not actually exploring different things the model might want to generate on its own and seeing what comes after them; we're only doing "tell me the next word" from some human-generated piece of text.
Okay, so that's how we get losses, and then, as before, we want to use these losses to update the parameters of the neural network. How do we do that? In principle, we just have all of the text we've collected, which you could think of as one really long sequence: we've got a billion words of text, here it is. So in theory you could just run your recurrent neural network over your billion words of text, updating the hidden state as you go, but that would make it very difficult to train a model, because you'd be accumulating losses for a billion steps, you'd have to store them, and you'd have to store hidden states so you could update parameters; it just wouldn't work. So what we actually do is cut our training data into segments of a reasonable length, run our recurrent neural network on those segments, compute a loss for each segment, and then update the parameters of the recurrent neural network based on the losses we found for that segment. I describe the segments here as sentences or documents, which seems like the linguistically nice thing to do; it turns out that in recent practice, when you want to scale most efficiently on GPUs, people don't bother with those linguistic niceties, they just say a segment is 100 words and cut every 100 words. The reason that's really convenient is that you can then create a batch of segments, all of which are 100 words long, stick those in a matrix, and do vectorized training more efficiently, and things go great for you.
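Here's a rough PyTorch sketch of that training setup, teacher forcing on fixed-length segments; the `SimpleRNNLM` class, dimensions, and fake data are my own illustration, not the assignment's code:

```python
import torch
import torch.nn as nn

class SimpleRNNLM(nn.Module):
    def __init__(self, vocab_size, embed_dim=100, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.RNN(embed_dim, hidden_dim, nonlinearity="tanh", batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, ids, h=None):              # ids: (batch, seq_len)
        e = self.embed(ids)
        hs, h = self.rnn(e, h)                   # hs: the hidden state at every time step
        return self.out(hs), h                   # logits over the vocabulary at every step

vocab_size, seg_len = 10_000, 100
model = SimpleRNNLM(vocab_size)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()                  # cross-entropy = average negative log likelihood

corpus = torch.randint(0, vocab_size, (32, 5 * seg_len))   # fake data: a batch of long sequences
h = None
for start in range(0, corpus.size(1) - 1, seg_len):        # cut the text into 100-word segments
    seg = corpus[:, start:start + seg_len + 1]
    inputs, targets = seg[:, :-1], seg[:, 1:]               # teacher forcing: the real prefix goes in,
    logits, h = model(inputs, h)                            # the actual next word is the target
    loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
    h = h.detach()   # carry the hidden state across segments, but don't backprop into the past
```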
But there are still a few more things we need to know to make that work; let me try to get a bit more through this before today ends. We need to know how to work out the derivative of our loss with respect to the parameters of our recurrent neural network, and the interesting case is that these W_h parameters are used everywhere through the network, at each stage, as are the W_e ones; they appear in many places in the network. So how do we work out the partial derivatives of the loss with respect to the repeated weight matrices? The answer is: it's really simple. You can just pretend that the W_h in each position is a different matrix, work out the partials with respect to it at that one position, and then, to get the partials with respect to W_h, sum whatever you found at the different positions. So the gradient with respect to a repeated weight is the sum of the gradients with respect to each time it appears. The reason for that follows from what I talked about in lecture three (you can also think about it in terms of what you might remember about the multivariable chain rule). The way I introduced it in lecture three is that gradients sum at outward branches, and the way to think about it in a case like this is that you've got the W_h matrix, which is being copied by identity to W_h at time 1, W_h at time 2, W_h at time 3, W_h at time 4, and so on, at each time step. Since those are identity copies, their partial derivatives with respect to each other are one, and so we apply the multivariable chain rule to these copies: we've got an outward-branching node, and we just sum the gradients to get the total gradient for the matrix across each time it appears.
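In symbols:

$$\frac{\partial J^{(t)}}{\partial W_h} \;=\; \sum_{i=1}^{t} \left.\frac{\partial J^{(t)}}{\partial W_h}\right|_{(i)}$$

where $\left.\cdot\right|_{(i)}$ denotes the gradient flowing through the copy of $W_h$ used at time step $i$, computed as if that copy were an independent matrix.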
There's one other trick that's perhaps worth knowing. If you've got segments that are 100 long, a common speedup is to say: maybe we don't actually have to run backpropagation back through 100 time steps; maybe we could just run it for 20 time steps and stop. That's referred to as truncated backpropagation through time, and in practice it tends to be sufficient. Note in particular that on the forward pass you're still updating your hidden state using your full context; it's only in the backpropagation that you cut things short, to speed up training.
Okay, so just as I did before with an n-gram language model, we can use an RNN language model to generate text, and it's pretty much the same idea, except that now, rather than using counts of n-grams, we're using the hidden state of our neural network to give us the input to a probability distribution that we can then sample from. I start with the initial hidden state, and I can use a start-of-sequence symbol. In the example I had before, I started immediately with "the", hoping that was less confusing the first time, but what you should have asked is: wait a minute, where did the "the" come from? Normally what we actually do is use a special start-of-sequence symbol, like this angle-bracketed <s>, and we feed it in as a pseudo-word that has a word embedding; then, based on this, we'll generate the first word of the text. So we end up with some representation from which we can sample and get the first word. Now we don't have any actual text, so what we do is take that word we generated and copy it down as the next input, run the next stage of the neural network, sample from the probability distribution the next word, "favorite", copy it down as the next word of the input, and keep on generating. This is referred to as a rollout: you keep rolling the dice and generating forward, producing a piece of text. Normally you want to stop at some point, and the way we can stop is to have a second special symbol, the angle-bracketed </s>, which means end of sequence; we can generate an end-of-sequence symbol and then stop. Using this, we can generate pieces of text, and essentially this is exactly what's happening if you use something like ChatGPT. The model is a more complicated model that we haven't yet gotten to, but it's generating the response to you by doing this kind of process: generating a word at a time, treating it as an input, and generating the next word, generating this sort of rollout. And it's done probabilistically, so if you do it multiple times you can get different answers.
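Continuing the earlier sketch, generation with a trained `SimpleRNNLM` might look roughly like this (my own illustrative helper, assuming token ids `bos_id` and `eos_id` exist in the vocabulary for the <s> and </s> symbols):

```python
import torch

@torch.no_grad()
def rollout(model, bos_id, eos_id, max_len=50):
    """Sample a sequence one token at a time, feeding each sample back in as the next input."""
    ids = [bos_id]
    h = None
    for _ in range(max_len):
        inp = torch.tensor([[ids[-1]]])             # shape (batch=1, seq_len=1): just the last token
        logits, h = model(inp, h)                   # reuse the hidden state, so the context carries over
        probs = torch.softmax(logits[0, -1], dim=-1)
        nxt = torch.multinomial(probs, 1).item()    # roll the dice on the next-word distribution
        if nxt == eos_id:                           # generated </s>: stop
            break
        ids.append(nxt)
    return ids[1:]                                  # drop the <s> symbol

# e.g. rollout(model, bos_id=1, eos_id=2) with the SimpleRNNLM trained above
```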
We haven't yet gotten to ChatGPT, but we can have a little bit of fun. You can take this simple recurrent neural network that we've just built, train it on any piece of text, and get it to generate stuff. For example, I can train it on Barack Obama's speeches. That's a small corpus: he didn't talk that much, I've only got a few hundred thousand words of text. I'll just show this and then I can answer the question. I can generate from it and get something like: "The United States will step up to the cost of a new challenges of the American people that will share the fact that we created the problem. They were attacked and so that they have to say that all the task of the final days of war that I will not be able to get this done." Well, maybe that's slightly better than my n-gram language model; still not perfect, you might say, but somewhat better, maybe. Did you have a question?

[Student]: Since we're training the model on truncated segments of the corpus, does that impose some kind of limitation on how much we can produce and still have some coherency?

Yes, so I suggested we're going to chunk the text into 100-word units, so that's the limit of the amount of prior context we're going to use. That's a fair amount; 100 words is typically several sentences. But to the extent that you wanted to know even more about the further-back context, you wouldn't be able to, and certainly that's one of the ways in which modern large language models differ: they're now using far bigger contexts than that, thousands of words of prior context. So yes, absolutely, it's a limit on how much further-back context you can use. In some sense, even though in theory a recurrent neural network can take an arbitrary-length context, as soon as I say, oh, practically we cut it into segments, that actually means we're making a Markov assumption again and saying the further-back context doesn't matter.

Okay, a couple more examples. Instead of Barack Obama I can feed in Harry Potter, which is actually a somewhat bigger corpus of text, and generate from that: "'Sorry,' Harry shouted, panicking, 'I'll leave those brooms in London, are they?' 'No idea,' said Nearly Headless Nick, casting low close by Cedric, carrying the last bit of trial charms from Harry's shoulder, and to answer him the common room perched upon it, forearms held a shining knob from when the spider hadn't felt it seemed he reached the teams too." Well, there you are. You can do other things as well: you can train it on recipes and generate a recipe. This one's a recipe I don't suggest you try to cook, but it looks sort of like a recipe if you don't look very hard: "Chocolate ranch barbecue. Categories: game, casseroles, cookies, cookies. Yield: six servings. Two tablespoons of Parmesan cheese, chopped; one cup of coconut milk; three eggs, beaten. Place each pasture over layers of lumps. Shape mixture into the moderate oven and simmer until firm. Serve hot and bodied fresh mustard, orange and cheese. Combine the cheese and salt together, the dough in a large skillet and the ingredients and stir in the chocolate and pepper." Yeah, it's not exactly a very consistent recipe when it comes down to it; it has the language of a recipe, but it's definitely not using the ingredients. Maybe if I had scaled it more and had a bigger corpus, it would have done a bit better.
Let's see, it's almost time for today, so maybe all I can do is one more fun example, and then I'll do that bit at the start next time. As a variant on building RNN language models: so far we've been building them over words, so the tokens, the time steps over which you build the model, are words. But you can actually use the idea of recurrent neural networks over units of any other size, and people have used them for other things: in bioinformatics, for things like DNA, for gene sequencing or protein sequencing and so on. But even staying with language, instead of building them over words, you can build them over characters, so that I'm generating a letter at a time rather than a word at a time. That can sometimes be useful because it allows us to generate things that look like words and perhaps have the structure of English words. Similarly, there are other things you can do. Before, when I initialized the hidden state, I said you just have an initial hidden state and you can make it zeros if you want; well, sometimes we build a contextual RNN, where we initialize the hidden state with something else. In particular, I can initialize the hidden state with the RGB values of a color and generate, a character at a time, the names of paint colors. I can train a model based on a paint company's catalog of color names and their RGB values, and then I can give it different paint colors and it'll come up with names for them.
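A tiny sketch of that conditioning idea, projecting side information (here an RGB triple) into the initial hidden state of a character-level RNN (all names and sizes here are my own illustration, not the actual model behind the slide):

```python
import torch
import torch.nn as nn

class ColorNameRNN(nn.Module):
    def __init__(self, n_chars, embed_dim=32, hidden_dim=128):
        super().__init__()
        self.init_from_rgb = nn.Linear(3, hidden_dim)     # RGB triple -> initial hidden state
        self.embed = nn.Embedding(n_chars, embed_dim)
        self.rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, n_chars)         # predict the next character

    def forward(self, char_ids, rgb):                     # char_ids: (batch, len), rgb: (batch, 3)
        h0 = torch.tanh(self.init_from_rgb(rgb)).unsqueeze(0)   # shape (1, batch, hidden_dim)
        hs, _ = self.rnn(self.embed(char_ids), h0)
        return self.out(hs)

model = ColorNameRNN(n_chars=60)
logits = model(torch.randint(0, 60, (4, 12)), torch.rand(4, 3))   # 4 names, 12 characters each
```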
And it actually does an excellent job; this one worked really well, look at this. This one here is "ghasty pink"; "power gray"; "navel tan"; "bock coe white"; "horble gray"; "homestar brown". Now, couldn't you just imagine finding all of these in a paint catalog? There are some really good ones over here in the bottom right: this color here is "dope", and then this "stoner blue", "burble simp", "stanky bean", and "turdly". I think I've got a real business opportunity here in the paint company market for my recurrent neural network. Okay, I'll stop there for today and do more of the science of neural networks next time.