
Stanford CS224N: NLP with Deep Learning | Spring 2024 | Lecture 5 - Recurrent Neural Networks

By Stanford Online

Summary

## Key takeaways

- **Deep Nets Stuck 15 Years**: Although people had neural networks and backpropagation, for a very long period of time, about 15 years, things seemed completely stuck; you couldn't get deep neural networks to work in practice, even though in theory they seemed promising. [06:03], [06:26]
- **Modern Overfitting Embraced**: With modern large neural networks, we train our models so that they overfit to the training data almost completely, getting near zero loss, but provided you've done regularization well, the model will also generalize well to different data. [12:24], [13:05]
- **Dropout Forces Robustness**: At training time with dropout, you randomly throw away some inputs inside the middle layers using a random mask, training the model to be robust and make as much use of every input as it can, since any component might randomly disappear. [14:15], [15:10]
- **N-gram Sparsity Hack: Add Delta**: For unseen n-grams causing zero-probability estimates, people hacked the counts by adding a little delta like 0.25, so things never seen got a count of 0.25, ensuring no zeros and everything possible. [35:47], [36:08]
- **RNNs Reuse Weights Over Time**: Recurrent neural networks use one set of weights applied through successive moments in time, updating a hidden state at each time step that stores a memory of everything seen so far, processing any length of context without exponential parameter growth. [51:44], [53:43]
- **Teacher Forcing Trains RNNs**: In RNN language model training, at each step you generate the probability distribution for next words using the actual prefix from the text, compute the negative log likelihood loss for the actual next word, then feed in the actual word as input for the next step. [01:01:51], [01:02:21]


Full Transcript

Okay, let me get started for today. First of all, I'm going to spend a few minutes talking about a couple more neural net concepts, including a couple of the concepts that turn up in Assignment 2. Then the bulk of today is going to be moving on to introducing what language models are, and after introducing language models, we're going to introduce a new kind of neural network which is one way to build language models: recurrent neural networks.

They're an important thing to know about, and we use them in Assignment 3, but they're certainly not the only way to build language models. In fact, probably a lot of you already know that there's this other kind of neural network called Transformers, and we'll get on to those after we've done recurrent neural nets. I'll talk a bit about problems with recurrent neural networks, and if I have time, I'll get onto the recap.

Before getting into the content of the class, I thought I could just spend a minute giving you the stats of who is in CS224N. Who's in CS224N kind of looks like the pie charts they show in CS106A these days, except with more grad students, I guess.

So the four big groups are the computer science undergrads, the computer science grads, the undeclared undergraduates, and the NDO grads; that's a large portion of the SCPD students, though some of them are under computer science grads. That makes up about 60% of the audience, and if you're not in one of those four big groups, you're in the other 40%, and everybody is somewhere, so there are lots of other interesting groups down here.

The bright orange down here, that's where the math and physics PhDs are. Up here, interestingly, we now have more statistics grad students than SymSys undergrads; it didn't used to be that way around in NLP classes. And one of my favorite groups, the little magenta group down here, these are the humanities undergrads. Yay, humanities undergrads!

In terms of years, it breaks down like this: first-year grad students are the biggest group, tons of juniors and seniors, and a couple of brave frosh. Are any brave frosh here today? [Laughter]

Yeah, okay, welcome. So, modern neural networks, especially language models, are enormous. This chart's sort of out of date because it only goes up to 2022, and it's actually hard to make an accurate chart for 2024, because in the last couple of years the biggest language model makers have in general stopped saying how large their language models are in terms of parameters. But at any rate they're clearly huge models, which have over 100 billion parameters. So large, and deep in terms of very many layers, neural nets are a cornerstone of modern NLP systems. We're going to be pretty quickly working our way up to look at those kinds of deep models, but to start off with something simpler, I did just want to key you in for a few minutes to a little bit of history.

The last time neural nets were popular was in the 80s and 90s, and that was when people worked out the backpropagation algorithm. Geoff Hinton and colleagues made famous the backpropagation algorithm that we've looked at, and that allowed the training of neural nets with hidden layers. But in those days, pretty much all the neural nets with hidden layers that were trained were trained with one hidden layer: you had the input, the hidden layer, and the output, and that's all there was. And the reason for that was that, for a very, very long time, people couldn't really get things to work with more hidden layers.

That only started to change in the resurgence of what often got called deep learning (but anyway, back to neural nets), which started around 2006. This was one of the influential papers at the time, greedy layer-wise training of deep neural networks, by Yoshua Bengio and colleagues. Right at the beginning of that paper they observed the problem: however, until recently it was believed too difficult to train deep multi-layer neural networks; empirically, deep networks were generally found to be not better, and often worse, than neural networks with one or two hidden layers.

Gerry Tesauro was actually a faculty member who worked very early on autonomous driving with neural networks, and, as this is a negative result, it has not been much reported in the machine learning literature. So really, although people had neural networks and backpropagation, and recurrent neural networks, which we're going to talk about today, for a very long period of time, 15 years or so, things seemed completely stuck: although in theory it seemed like deep neural networks should be promising, in practice they didn't work. It then took some new developments that happened in the late 2000s decade, and then more profoundly in the 2010s decade, to actually figure out how we could have deep neural networks that actually worked, working far better than the shallow neural networks, and leading into the networks that we have today. We're going to be starting to talk about some of those things in this class and in the coming classes.

I think the tendency when you see the things that got neural networks to work much better is to sort of shrug and be underwhelmed and think, oh, is this all there is to it? This doesn't exactly seem like difficult science. And in some sense that's true: they're fairly small introductions of new ideas and tweaks of things. But nevertheless, a handful of little ideas and tweaks turned things around, from a field that was stuck for 15 years going nowhere, and which nearly everyone had abandoned because of that, to suddenly there being the ability to train these deeper neural networks, which then behaved amazingly better as machine learning systems than the things that had preceded them and dominated in the intervening time. So that took a lot of time. What are these things?

One of them, which you can greet with a bit of a yawn in some sense, is doing better regularization of neural nets. Regularization is the idea that, beyond just having a loss that we want to minimize in terms of describing the data, we want to in some other ways manipulate what parameters we learn so that our models work better. So normally we have some more complex loss function that does some regularization. The most common way of doing this is what's called L2 regularization, where you add on a parameter-squared term at the end, and this regularization says it would be kind of good to find a model with small parameter weights, so you should be finding the smallest parameter weights that will explain your data well. There's a lot you can say about regularization and these kinds of losses; they get talked about a lot more in other classes like CS229 Machine Learning and the machine learning theory class, so I'm not going to say very much about it here.
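
As a minimal sketch of that idea, assuming a generic PyTorch model and an already-computed data loss (the names here are illustrative, not from the lecture):

```python
import torch

def l2_regularized_loss(data_loss, model, lam=1e-4):
    # J(theta) = J_data(theta) + lam * sum_i theta_i^2
    l2_penalty = sum((p ** 2).sum() for p in model.parameters())
    return data_loss + lam * l2_penalty
```

In practice a similar effect is often obtained by passing a weight_decay setting to the optimizer rather than adding the term by hand.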

machine learning and so um I'm not going to say very much about it this is in the machine learning theory class um but I

do just want to sort of um put in one note that's sort of very relevant um to um what's happened in recent new

networks work um so the classic view of regularization was we needed this kind of regularization to prevent our networks from

overfitting meaning that they would do a very good job at modeling the training data but then they would generalize

badly to new data that was shown and so the picture that you got shown was this that as you train on some training data

your error necessarily goes down however after some point you start learning specific properties of things

that happen to turn up in those training examples and that you're learning things that are only good for the training examples and so they won't generalize

well to different pieces of data you see at test time so if you have a separate validation set or a final test test set

you would and you traced out the error or loss on that um validation or test set that after some point it would start

to go up again this is a quirk in my bad PowerPoint it's just meant to go up um and the fact that it goes up um is then you have overfit your training data and

have making the parameters numerically small is meant to lessen the extent to which you overfit on your training data

um this is not a picture um that modern newal Network people believe at all instead the picture has changed like

this um we don't believe that overfitting exists anymore but what we are concerned about is models that will

We don't believe that overfitting in that sense exists anymore; what we are concerned about is whether models will generalize well to different data. In classical statistics, the idea that you could train billions of parameters, like large neural nets now have, would be seen as ridiculous, because you could not possibly estimate those parameters well, and so you'd just have all of this noisy mess. But what's actually been found is that, yes, it's true you can't estimate the individual numbers well, but what you get is a kind of interesting averaging function from all these myriad numbers. If you do it right, what happens is that as you go on training, for a while it might look like you're starting to overfit, but if you keep on training a huge network, not only will your training loss continue to go down, very infinitesimally, but your validation loss will go down as well.

So on huge networks these days, we train our models so that they overfit to the training data almost completely. If you train a huge network now on a training set, you can essentially train it to get zero loss (maybe it's 0.007 loss or something, but effectively zero), because you've got such rich models that you can perfectly fit, memorize, the entire training set. Classically that would have been seen as a disaster, because you've overfit the training data. With modern large neural networks it's not seen as a disaster, because, provided you've done regularization well, your model will also generalize well to different data. However, the flip side of that is that normally this kind of L2 regularization, or similar ones like L1 regularization, isn't strong enough to achieve that effect, and so neural network people have turned to other methods of regularization, of which everyone's favorite is dropout.

Dropout is one of the things that's on the assignment, and at this point I should apologize or something, because the way dropout is presented here is sort of the original formulation, while the way dropout is presented on the assignment is the way it's now normally done in deep learning packages. There are a couple of details that vary a bit, so let me just present the main idea here and not worry too much about the details of the math.

The idea of dropout is that at training time, every time you are doing a piece of training with an example, inside the middle layers of the neural network you're just going to throw away some of the inputs. Technically, the way you do this is that you have a random mask of zeros and ones that you sample each time, and you take the Hadamard product of that with the data, so some of the data items go to zero, and you have a different mask each time; for the next example, I've now masked out something different. So you're just randomly throwing away the inputs, and the effect of this is that you're training the model to be robust and to work well making as much use of every input as it can. It can't decide to be extremely reliant on, say, component 17 of the vector, because sometimes that's just going to randomly disappear; so if there are other features you could use instead that would let you work out what to do next, you should also know how to make use of those features. So at training time you randomly delete things; at test time, partly for efficiency but also for the quality of the answer, you don't delete anything, you keep all of your weights, but you just rescale things to make up for the fact that you used to be dropping things.
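
Here's a minimal NumPy sketch of the original formulation just described; note that deep learning packages (and, as mentioned, the assignment) more commonly use an "inverted" variant that instead rescales at training time:

```python
import numpy as np

def dropout_train(h, p_drop=0.5):
    """Training time: zero out each unit of the hidden vector h independently
    with probability p_drop, using a freshly sampled 0/1 mask (Hadamard product)."""
    mask = (np.random.rand(*h.shape) >= p_drop).astype(h.dtype)
    return h * mask

def dropout_test(h, p_drop=0.5):
    """Test time (original formulation): keep every unit but rescale,
    so the expected scale matches what the next layer saw during training."""
    return h * (1.0 - p_drop)
```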

There are several ways you can think of explaining this. One motivation that's often given is that this prevents feature co-adaptation: rather than the model being able to learn complex functions like "features 7, 8, and 11 together can help me predict this", it knows that some of the features might be missing, so it has to make use of things in a more flexible way. Another way of thinking of it is that there's been a lot of work on model ensembles, where you can mix together different models and improve your results; if you're training with dropout, it's kind of like you're training with a huge model ensemble, because you're training with the ensemble over the power set, the exponential number of every possible dropout of features, all at once, and that gives you a very good model.

So there are different ways of thinking about it. If you've seen Naive Bayes and logistic regression models before, I kind of think a nice way to think of it is that it gives a middle ground between the two: for Naive Bayes models you're weighting each feature independently, just based on the data statistics, no matter what other features are there; in logistic regression, weights are set in the context of all the other features; and with dropout you're somewhere in between, you're setting the weights in the context of some of the other features, but different ones will disappear at different times. But following work that was done at Stanford by Stefan Wager and others, these days people generally regard dropout as a form of feature-dependent regularization, and he shows some theoretical results as to why to think of it that way.

Okay, I think we've implicitly seen this one, but vectorization is the idea: no for-loops; always use vectors, matrices, and tensors. The entire success and speed of deep learning comes from the fact that we can do things with vectors, matrices, and tensors. If you're writing for-loops in any language, but especially in Python, things run really slowly; if you can do things with vectors and matrices, even on CPU, things run at least an order of magnitude faster. And what everyone really wants to move to in deep learning is running things on GPUs, or sometimes now neural processing units, and then you're getting two or three orders of magnitude of speedup. So do always think: I should be doing things with vectors and matrices; if I'm writing a for-loop for anything that isn't some very superficial bit of input processing, I've almost certainly made a mistake, and I should be working out how to do things with vectors and matrices. And that applies to things like dropout: you don't want to write a for-loop that goes through all the positions and sets some of them to zero, you want to be using a vector operation with your mask.
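
A tiny illustration of the point, assuming a NumPy vector and a dropout-style mask: the loop and the vectorized expression compute the same thing, but only the second is how you'd actually want to write it:

```python
import numpy as np

h = np.random.randn(10_000)
mask = np.random.rand(10_000) >= 0.5   # dropout-style 0/1 mask

# Slow: an explicit Python for-loop over positions
out_loop = h.copy()
for i in range(len(out_loop)):
    if not mask[i]:
        out_loop[i] = 0.0

# Fast: one vectorized elementwise operation does the same thing
out_vec = h * mask

assert np.allclose(out_loop, out_vec)
```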

Two more, I think. Parameter initialization: this one might not be obvious, but when we start training our neural networks, in almost all cases it's vital that we initialize the parameters of our matrices to some random numbers. The reason for this is that if we just start with our matrices all zero, or some other constant, normally we have symmetry. It's sort of like in this picture, when you're starting on this saddle point: it's symmetric to the left and the right, and forward and backwards, and so you sort of don't know where to go, and you might be stuck and stay in the one place. Normally the way to think about it is that the operations you're doing to all the elements in the matrix are sort of the same, so rather than having a whole vector of features, if all of them have the same value initially, it's often as if you only have one feature and you've just got a lot of copies of it.

So to initialize learning and have things work well, we almost always want to set all the weights to very small random numbers. And when I say very small, we want to make them in a range so that they don't disappear to zero when we multiply things, and they don't start blowing up into huge numbers when we multiply them by things. Doing this initialization at the right scale used to be seen as something pretty important, and there were particular methods people had worked out and often used, with a basis in thinking about what happens once you do matrix multiplies. One of these was Xavier initialization, which works out what variance of your uniform distribution to use, based on the number of inputs and outputs of a layer, and things like that. The specifics of that, I think we still use to initialize things in Assignment 2, but we'll see later that they go away, because people have come up with clever methods, in particular layer normalization, which sort of obviates the need to be so careful about the initialization. But you still need to initialize things to something.
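
As a hedged sketch, one common form of the Xavier scheme being alluded to chooses the uniform range from the layer's fan-in and fan-out (the exact variant used in the assignment may differ):

```python
import numpy as np

def xavier_uniform(n_in, n_out):
    """Sample a weight matrix of shape (n_out, n_in) from Uniform(-a, a),
    with a = sqrt(6 / (n_in + n_out)), so that activations and gradients
    keep roughly the same scale as they pass through the layer."""
    a = np.sqrt(6.0 / (n_in + n_out))
    return np.random.uniform(-a, a, size=(n_out, n_in))
```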

Okay, then the final one, which is also something that appears in the second assignment, that I just want to say a word about: optimizers. We talked about stochastic gradient descent in class and did the basic equations for stochastic gradient descent, and to a first approximation there's nothing wrong with stochastic gradient descent; if you fiddle around enough, you can usually get stochastic gradient descent to work well for almost any problem. But getting it to work well is very dependent on getting the scales of things right, sort of having the right step size, and often you have to have a learning rate schedule with decreasing step sizes and various other complications. So people have come up with more sophisticated optimizers for neural networks, and for complex nets these sometimes seem kind of necessary to get them to learn well; at any rate, they give you lots of margin of safety, since they're much less dependent on you setting the different hyperparameters right.

The idea of all the methods I mentioned, and the most commonly used methods, is that for each parameter they're accumulating a measure of what the gradient has been in the past; they've got some idea of the scale of the gradient, the slope, for a particular parameter, and they're using that to decide how much to adjust the learning rate for that parameter at each time step. The simplest method that was come up with is one called AdaGrad; if you know John Duchi, he was one of the co-inventors of this. It's simple and nice enough, but it tends to stall early. Then people came up with different methods; Adam's the one that's on Assignment 2, and it's a really good, safe place to start. But in a way our word vectors have a special property because of their sparsity: you're very sparsely updating them, because particular words only turn up occasionally, so people have actually come up with particular optimizers that have special properties for things like word vectors, and the ones with a "W" at the end can sometimes be good to try. And then there's a whole family of extra ideas that people have used to improve optimizers; if you want to learn about that you can go off and do an optimization class like convex optimization, but there are ideas like momentum and Nesterov acceleration and things like that, and all of those things people also variously try to use. But Adam is a good name to remember if you remember nothing else.
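
As a minimal usage sketch in PyTorch (the model here is just a stand-in), swapping SGD for Adam is a one-line change, and the "W" variants are available the same way:

```python
import torch

model = torch.nn.Linear(100, 10)                       # stand-in for whatever network you train
opt = torch.optim.Adam(model.parameters(), lr=1e-3)    # Adam: a good, safe default
# opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)  # a "W"-suffixed variant

x, y = torch.randn(32, 100), torch.randint(0, 10, (32,))
loss = torch.nn.functional.cross_entropy(model(x), y)

opt.zero_grad()
loss.backward()   # compute gradients
opt.step()        # per-parameter update using accumulated gradient statistics
```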

Okay, that took longer than I hoped, but I'll get on now to language models. Language models: in some sense "language model" is just two English words, but when in NLP we say language model, we mean it as a technical term that has a particular meaning. The idea of a language model is something that can predict well what word is going to come next, or more precisely, it's going to put a probability distribution over what words come next. So "the students opened their ___": what words are likely to come next? Bags, laptops, notebooks. Yeah, I have some of those at least. Right, so these are kind of likely words, and if on top of those we put a probability on each one, then we have a language model.

So formally, we've got a context of preceding items, and we're putting a probability distribution over the next item, which means that the sum of these estimates over the items in the vocabulary will sum to one. If we've defined a P like this that predicts probabilities of next words, that is called a language model.

As it says here, an alternative way that you can think of a language model is that a language model is a system that assigns a probability to a piece of text; we can say that a language model can take any piece of text and give it a probability. The reason we can do that is that we can use the chain rule. If I want to know the probability of any stretch of text, then given my previous definition of a language model, easy, I can do that: the probability of x1 with a null preceding context, times the probability of x2 given x1, et cetera, along the text. I can do this chain rule decomposition, and the terms of that decomposition are precisely what the language model, as I defined it previously, provides.
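
In symbols, the two views fit together like this: the language model defines the next-word distribution, and the chain rule turns that into a probability for a whole piece of text:

```latex
P(x_{t+1} \mid x_t, \ldots, x_1), \qquad \sum_{w \in V} P(x_{t+1} = w \mid x_t, \ldots, x_1) = 1

P(x_1, \ldots, x_T) = P(x_1)\, P(x_2 \mid x_1)\, P(x_3 \mid x_1, x_2) \cdots
                    = \prod_{t=1}^{T} P(x_t \mid x_{t-1}, \ldots, x_1)
```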

So language models are this essential technology for NLP. In just about everything, from the simplest places forward, where people do things with human language and computers, people use language models. In particular, they weren't something that got invented in 2022 with ChatGPT; language models have been central to NLP at least since the 80s, and the idea of them goes back to at least the 50s. Any time you're typing on your phone and it's making suggestions of next words, regardless of whether you like those suggestions or not, those suggestions are being generated by a language model, traditionally a very compact, not very good language model, so it can run quickly and in very little memory in your keyboard application. If you go on Google and you start typing some stuff and it's telling you what could come after it to complete your query, well again, that's being generated by a language model.

So how can you build a language model? Before getting into neural language models, I've got just a few slides to tell you about the old days of language modeling. This is sort of how language models were built from 1975 until, effectively, around 2012.

So we want to put probabilities on these sequences, and the way we're going to do it is to build what's called an n-gram language model. This means we're going to look at short word subsequences and use them to predict, so n is a variable describing how short the word sequences are that we're going to use to predict. If we just look at the probabilities of individual words, we have a unigram language model; if we look at probabilities of pairs of words, a bigram language model; probabilities of three words, trigram language models; and probabilities of more than three words get called 4-gram language models, 5-gram language models, 6-gram language models. For people with a Classics education this is horrific, of course; in particular, not even the first ones are correct, because "gram" is a Greek root, so it should really have Greek numbers in front, so you should have monograms and digrams. Actually, the first person who introduced the idea of n-gram models was Claude Shannon, when he was working out information theory, the same guy that did cross entropy and all of that, and if you look at his 1951 paper he uses "digrams". But the idea died about there, and this is what everyone else says in practice. It's kind of cute, I like it, a nice practical notation.

So to build these models, the idea is: look, we're just going to count how often different n-grams appear in text and use those to build our probability estimates. In particular, our trick is that we make a Markov assumption, so that if we're predicting the next word based on a long context, we say, ah, tell you what, we're not going to use all of it, we're only going to use the most recent n minus 1 words. We have this big context and we throw most of it away. So if we're predicting word x_{t+1} based simply on the preceding n minus 1 words, then we can make the prediction using n-grams: if we use n equals 3, we'd have a trigram count here, normalized by a bigram count down here, and that would give us relative frequencies of the different terms. We can do that simply by counting how often n-grams occur in a large amount of text and dividing through by the counts, and that gives us a relative frequency estimate of the probability of different continuations. Does that make sense? Yeah, that's a way to do it.
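
Written out, the estimate being described is a ratio of n-gram counts (for n = 3, a trigram count over a bigram count):

```latex
P(x_{t+1} \mid x_t, \ldots, x_{t-n+2})
  \approx \frac{\operatorname{count}(x_{t-n+2}, \ldots, x_t, x_{t+1})}
               {\operatorname{count}(x_{t-n+2}, \ldots, x_t)}
```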

Okay, so suppose we're learning a 4-gram language model, and we've got a piece of text: "as the proctor started the clock, the students opened their ___". Well, to estimate things we are going to throw away all but the preceding three words, so we're going to estimate based on "students opened their", and we're going to work out the probabilities by looking for counts of "students opened their w" (for each word w) and counts of "students opened their". So we might find in a corpus that "students opened their" occurred a thousand times and "students opened their books" occurred 400 times, and so we'd say the probability estimate is simply 0.4 for "books"; if "exams" occurred 100 times, the probability estimate is 0.1 for "exams".
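
A minimal sketch of building those relative-frequency estimates by counting, assuming `tokens` is a list of words from a training corpus:

```python
from collections import Counter

def ngram_probs(tokens, n=4):
    """Relative-frequency n-gram estimates: count(prefix + word) / count(prefix)."""
    ngrams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    prefixes = Counter(tuple(tokens[i:i + n - 1]) for i in range(len(tokens) - n + 2))
    return {g: c / prefixes[g[:-1]] for g, c in ngrams.items()}

# e.g. P(books | students, opened, their)
#      = count("students opened their books") / count("students opened their")
```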

Well, you can see that this is somewhat crude. It's not terrible, because if you are going to try to predict the next word in a simple way, the immediately prior words are the most helpful words to look at. But it's clearly sort of primitive, because if you knew the prior text was "as the proctor started the clock", that makes it sound likely that the word should have been "exams", whereas since you're estimating just based on "students opened their", you'd be more likely to choose "books", because it's more common. So it's a kind of crude estimate, but it's a decent enough place to start. It's a crude estimate that could be problematic in other ways, though. Why else might we get into trouble by using this as our probability estimate?

Student: There are a lot of n-grams.

Yeah, so there are a lot of words, and therefore there are a lot of n-grams. That's a problem; we'll come to it later. Anything else, maybe up the back?

Student: The word w might not even show up in the training data, so you might just have a count of zero for that.

Yeah, so if we're counting over any reasonable-size corpus, there are lots of words that we just are not going to have seen, that never happen to occur in the text that we counted over. If you start thinking about "students opened their", there are lots of things that you could put there: "students opened their accounts", or if the students are doing dissections in a biology class, maybe "students opened their frogs", I don't know. There are lots of words that in some context would actually be possible, and lots of them we won't have seen, and so we'd give them a probability estimate of zero. That tends to be an especially bad thing to do with probabilities, because once we have a probability estimate of zero, any computations that we do that involve it will instantly go to zero. So we have to deal with some of these problems.

So for that sparsity problem, where a word never occurred in the numerator and so, simply done, we get a probability estimate of zero, the way that was dealt with was that people just hacked the counts a little to make them non-zero. There are lots of ways that were explored, but the easiest way is you just add a little delta, like 0.25, to the counts, so things that you never saw got a count of 0.25 in total, and things you saw once got a count of 1.25, and then there are no zeros anymore; everything is possible. You could think then there's a second problem: wait, you might never have seen "students opened their" at all before, and so that means your denominator is just undefined, and you don't have any counts in the numerator either. So you need to do something different there, and the standard trick that was used then was that you did backoff: if you couldn't estimate words coming after "students opened their", you just worked out the estimates for words coming after "opened their", and if you couldn't estimate that, you just used the estimate of words coming after "their". So you used less and less context until you could get an estimate that you could use.
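
A hedged sketch of those two hacks; `counts_by_order` here is a hypothetical structure mapping prefix length to (n-gram counts, prefix counts), not something from the lecture:

```python
def smoothed_prob(word, prefix, ngram_counts, prefix_counts, vocab_size, delta=0.25):
    """Add-delta smoothing: pretend every n-gram got delta extra counts,
    so nothing ever has probability exactly zero."""
    num = ngram_counts.get(prefix + (word,), 0) + delta
    den = prefix_counts.get(prefix, 0) + delta * vocab_size
    return num / den

def backoff_prob(word, prefix, counts_by_order):
    """If the full prefix was never seen, back off to a shorter prefix:
    ("students","opened","their") -> ("opened","their") -> ("their",) -> unigram."""
    for start in range(len(prefix) + 1):
        shorter = prefix[start:]                      # drop the earliest context words
        ngram_counts, prefix_counts = counts_by_order[len(shorter)]
        if prefix_counts.get(shorter, 0) > 0:
            return ngram_counts.get(shorter + (word,), 0) / prefix_counts[shorter]
    return 0.0                                        # the word was never seen at all
```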

But something to note is that we've got these conflicting pressures now. On the one hand, if you want to come up with a better estimate, you would like to use more context, i.e. to have a larger n-gram. But on the other hand, as you use more and more conditioning words, well, the storage size problem someone mentioned gets worse and worse, because the number of n-grams that you have to know about is going up exponentially with the size of the context, and also your sparseness problems are getting way, way worse, and you're almost necessarily going to end up seeing zeros. Because of that, in practice, where things tended to max out was five; occasionally people used 6-grams and 7-grams, but most of the time, between the sparseness and the cost of storage, 5-grams were the largest thing people dealt with. A famous resource from back in the 2000s decade that Google released was Google N-grams, which was built on a trillion-word web corpus and had counts of n-grams; it gave counts of n-grams up to n equals 5, and that is where they stopped. Well, we've sort of said the storage problem: to do this you need to store these counts, and the number of counts is going up exponentially with the context size.

But what's good about n-gram language models is that they're really easy to build. You can build one yourself in a few minutes when you want to have a bit of fun on the weekend; all you have to do is start storing these counts for n-grams, and you can use them to predict things. At least if you do it over a small corpus, like a couple of million words of text, you can build an n-gram language model in seconds on your laptop. Or, well, you have to write the software, okay, a few minutes to write the software, but building the model takes seconds, because there's no training of a neural network; all you do is count how often n-grams occur.

Once you've done that, you can then run an n-gram language model to generate text; we could do text generation before ChatGPT. So if I have a trigram language model, I can start off with some words, "today the", and I can look at my stored n-grams and get a probability distribution over next words, and here they are. Note the strong patterning of these probabilities, because, remember, they're all derived from counts that are being normalized: really, these are words that occurred once, these are words that occurred twice, these are words that occurred four times in this context, so they're in some sense crude when you look at them more carefully. But what we can do at this point is roll a die and get a random number from zero to one, and we can use that to sample from this distribution. So we sample from this distribution, and if we generate as our random number something like 0.35, going down from the top we'd say, okay, we've sampled the word "price": "today the price". And then we repeat: we condition on that, we get a probability distribution for the next word, we generate a random number and use it to sample from the distribution, say we generate 0.2, and so we choose "of". We now condition on that, we get a probability distribution, we generate a random number, which is 0.5 or something, and so we get "gold" coming out, and we can say "today the price of gold". We can keep on doing this and generate some text.
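
A minimal sketch of that sampling loop, assuming a hypothetical `trigram_counts` dictionary mapping a two-word context to the counts of words that followed it:

```python
import numpy as np

rng = np.random.default_rng()

def sample_next(context, trigram_counts):
    """Sample the next word from the relative-frequency distribution of
    words that followed `context` (the previous two words) in training text."""
    continuations = trigram_counts[context]            # e.g. {"price": 4, "company": 2, ...}
    words = list(continuations)
    counts = np.array([continuations[w] for w in words], dtype=float)
    return rng.choice(words, p=counts / counts.sum())  # roll the die

# Rollout: w = sample_next(("today", "the"), trigram_counts), then shift the context and repeat.
```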

So here's some text generated from 2 million words of training data using a trigram language model: "today the price of gold per ton, while production of shoe lasts and shoe industry, the bank intervened just after it considered and rejected an IMF demand to rebuild depleted European stocks, September 3rd in primary 76 cents a share". Now, okay, that text isn't great, but I actually want people to be in a positive mood today, and actually it's not so bad, right? It's sort of surprisingly grammatical. In particular, I lowercased everything, so this is the IMF that should be capitalized, the International Monetary Fund. There are big pieces of this that even make sense: "the bank intervened just after it considered and rejected an IMF demand", that's pretty much making sense as a piece of text. So it's mostly grammatical, it looks like English text. I mean, it makes no sense, it's really incoherent, so there's work to do. But you could already see that even with these simple n-gram models you could, from a very low level, kind of approach what text and human language work like, from below. And I could easily make this better even with an n-gram language model: rather than two million words of text, if I trained on 10 million words of text it would be better; if, rather than a trigram model, I could go to a 4-gram model, it would get better; and you'd sort of start getting better and better approximations of text.

And so this is essentially what people did until about 2012. Really, the same story that people tell today, that scale will solve everything, is exactly the same story that people used to tell in the early 2010s with these n-gram language models: if you weren't getting good enough results with your 10 million words of text and a trigram language model, the answer was that if you had 100 million words of text and a 4-gram language model you'd do better, and then if you had a trillion words of text and a 5-gram language model you'd do better, and gee, wouldn't it be good if we could collect 10 trillion words of text so we could train an even better n-gram language model. Same strategy. But it turns out that sometimes you can do better with better models, as well as simply with scale.

And so things got reinvented and started again with building neural language models. So how can we build a neural language model? We've got the same task: we have a sequence of words and we want to put a probability estimate over what word comes next. The simplest way you could do that, which hopefully you'll all have thought of because it connects to what we did in earlier classes: look, we already had this idea that we could represent context by the concatenation of some word vectors, and we could put that into a neural network and use it to predict something. In the example I did in the last couple of classes, what we used it to predict was "is the center word a location or not a location", just a binary choice. But that's not the only thing we could predict; we could have predicted lots of things with this neural network. We could have predicted whether the piece of text was positive or negative, we could have predicted whether it was written in English or Japanese; we can predict lots of things. So one thing we could choose to predict is what word is going to come next after this window of text. We'd have a model just like this one, except that up the top, instead of doing this binary classification, we'd do a many-way classification over what the next word is that's going to appear in the piece of text. And that would then give us a neural language model, in particular a fixed-window neural language model, where we do the same Markov assumption trick of throwing away the further-back context. So for the fixed window, we'll use word embeddings, which we can concatenate; we'll put that through a hidden layer; and then we'll take the output of that hidden layer, multiply it by another layer, say, and then put that through a softmax and get an output distribution. This gives us a fixed-window neural language model, and apart from the fact that we're now doing a classification over many, many classes, this is exactly like what we did last week, so it should look kind of familiar; it's also kind of like what you're doing for Assignment 2.
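
A minimal PyTorch sketch of the fixed-window architecture just described (layer sizes and names are illustrative):

```python
import torch
import torch.nn as nn

class FixedWindowLM(nn.Module):
    """Concatenate the embeddings of the last `window` words, pass them through
    one hidden layer, and classify over the whole vocabulary for the next word."""
    def __init__(self, vocab_size, emb_dim=100, window=4, hidden=256):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.hidden = nn.Linear(window * emb_dim, hidden)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, window_ids):                 # (batch, window) word indices
        e = self.emb(window_ids).flatten(1)        # concatenated word embeddings
        h = torch.tanh(self.hidden(e))             # hidden layer
        return self.out(h)                         # logits; softmax gives P(next word)
```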

So this is essentially the first kind of neural language model that was proposed. In particular, Yoshua Bengio, really right at the beginning of the 21st century, suggested that you could do this: rather than using an n-gram language model, you could use a fixed-window neural language model. Even at that point, he and colleagues were able to get some positive results from this model, but at the time it wasn't widely noticed; it didn't really take off that much, for a combination of reasons. When it was only a fixed window, it was sort of not that different to n-grams in some sense, even though the neural network could, it could be argued, give better generalization than using counts. In practice, neural nets were still hard to run without GPUs, and people felt, and I think in general this was the case, that you could get more oomph by doing the scale story and collecting your n-gram counts on hundreds of billions of words of text, rather than trying to make a neural network out of it. And so it didn't especially take off at that time. But in principle it seemed a nice thing: it got rid of the sparsity problem, it got rid of the storage costs, you no longer have to store all observed n-grams, you just have to store the parameters of your neural network. But it didn't solve all the problems that we'd like to solve. In particular, we still have this problem of the Markov assumption, that we're just using a small fixed context beforehand to predict from, and there are some disadvantages to enlarging that window, and there's no fixed window that's ever big enough.

There's another thing that, if you look technically at this model, might make you suspicious of it, which is that when we have words in different positions, those words in different positions will be treated by completely different sub-parts of this matrix W. You might think, okay, for predicting that "books" comes next, the fact that this is a student is important, but it doesn't matter so much exactly where the word "student" occurs. The context could have been "the students slowly opened their", and it's still the same students, we've just got a bit different linguistic structure, yet this W matrix would be using completely separate parameters to learn stuff about "student" here versus "student" in this position. That seems kind of inefficient and wrong. So that suggested that we kind of need a different kind of neural architecture that can process any length of input and can use the same parameters to say, hey, I saw the word "student", that's evidence that things like "books", "exams", "homework" will be turning up, regardless of where it occurs. And that then led to exploration of this different neural network architecture called recurrent neural networks, which is what I'll go on to next. But before I do, is everyone basically okay with what a language model is?

Yeah, no questions? Okay: recurrent neural networks. So recurrent neural networks are a different family of neural networks. Effectively, in this class we see several neural network architectures. In some sense the first architecture we saw was word2vec, which is a sort of very simple encoder-decoder architecture; the second family we saw was feed-forward networks, or fully connected layers, classic neural networks; the third family we're going to see is recurrent neural networks, which have different kinds; and then we'll go on to Transformer models.

Okay, so the idea of a recurrent neural network is that you've got one set of weights that are going to be applied through successive moments in time, i.e. successive positions in the text, and as you do that, you're going to update a hidden state as you go. We'll go through this in quite a bit of detail, but here's the idea of it. We've got "the students opened their" and we want to predict from that. Okay, I've still got four words in my example so I can fit stuff down the left side of the slide, but there could have been 24 words, because recurrent neural networks can deal with any length of context. So as before, our words start off as just words, or one-hot vectors, and we can look up their word embeddings just like before.

But now, to compute probabilities for the next word, we're going to do something different. Our hidden layer is going to be recurrent, and by recurrent we mean we're going to change a hidden state at each time step as we proceed through the text from left to right. So we're going to start off with an h0, which is the initial hidden state, and can actually just be all zeros. Then at each time step, what we're going to do is multiply the previous hidden state by a weight matrix, take a word embedding and multiply it by another weight matrix, and then sum the results of those two things, and that's going to give us a new hidden state. That hidden state will then store a memory of everything that's been seen so far. We'll do that and then continue along: we'll multiply the next word vector by the same weight matrix W_e, multiply the previous hidden state by the same weight matrix W_h each time, and add them together to get a new representation. I've only said part of it; I've left out a bit. Commonly there are two other things you're doing: you're adding on a bias term, because we usually separate out a bias term, and you're putting things through a nonlinearity, so I should make sure I mention that. For recurrent neural networks, most commonly this nonlinearity has actually been the tanh function, so it's balanced on the positive and negative side.

You keep on doing that through each step, and the idea is that once we've got to here, this h4 hidden state is a hidden state that in some sense has read the text up until now; it's seen all of "the students opened their", and if the word "students" occurred in any of these positions, it will have been multiplied by the same W_e matrix and added into the hidden state. So it's got a cleaner, low-parameter way of incorporating the information that's been seen. Now I want to predict the next word, and to predict the next word I'm going to do, based on the final hidden state, the same kind of thing I did before: I'm going to multiply that hidden state by a matrix, add another bias, and stick that through a softmax; the softmax will give me a language model probability distribution over all next words, and I can sample from it to generate the next word. Does that make sense?
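
In code, one step of the simple RNN just described looks roughly like this (shapes and names are illustrative; W_h is the hidden-to-hidden matrix, W_e the embedding-to-hidden matrix):

```python
import torch

def rnn_step(h_prev, e_t, W_h, W_e, b):
    """One step of a simple RNN: h_t = tanh(W_h @ h_{t-1} + W_e @ e_t + b)."""
    return torch.tanh(W_h @ h_prev + W_e @ e_t + b)

def next_word_distribution(h_t, U, b2):
    """Turn the current hidden state into a probability distribution over the vocabulary."""
    return torch.softmax(U @ h_t + b2, dim=-1)
```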

Okay, recurrent neural networks. With recurrent neural networks we can now process any length of preceding context; we'll just put more and more stuff in our hidden state. Our computation can use information from many steps back, and our model size doesn't increase for having a long context: we have to do more computation for a long context, but our representation of that long context just remains this fixed-size hidden vector h, of whatever dimension it is, so there's no exponential blowout anymore. And the same weights are applied at every time step, so there's a symmetry in how inputs are processed.

There are some catches. The biggest catch in practice is that recurrent computation is slow. For the feed-forward layer, we just had our input vector, we multiply it by a matrix, we multiply it by a matrix, however many times, and then at the end we're done; whereas here we're stuck with this sequentiality, where you have to compute one hidden vector at a time. In fact, this is going against what I said at the beginning of class, because essentially here you're doing a for-loop: you're going through for time equals 1 to t, and you're generating in turn each hidden vector. That's one of the big problems with RNNs that has led them to fall out of favor. There's another problem that we'll look at more: in theory this is perfect, you're just incorporating all of the past context into your hidden vector; in practice it tends not to work perfectly, because although stuff you saw back here is in some sense still alive in the hidden vector as you come across here, your memory of it gets more and more distant, and it's the words that you saw recently that dominate the hidden state. Now, in some sense that's right, because the recent stuff is the most important stuff, it's freshest in your mind; it's the same with human beings, they tend to forget stuff from further back as well. But RNNs, especially in the simple form that I've just explained, forget stuff from further back rather too quickly, and we'll come back to that again in Thursday's class.

Okay, so for training an RNN language model, the starting point is that we get a big corpus of text again, and then for each time step we're going to compute a prediction of the probability of next words; then there's going to be an actual next word, and we're going to use that as the basis of our loss. Our loss function is the cross entropy between the predicted probability distribution and the actual next word that we saw, which, as in the example I showed before, is just the negative log likelihood of the actual next word. Ideally you'd like to predict the actual next word with probability one, which means the negative log of one would be zero and there'd be no loss; in practice, if you give it an estimate of 0.5, there's only a bit of loss, and so on. To get our overall objective function, we work out the average loss, the average negative log likelihood of predicting each word in turn.
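
As a formula, the overall objective is the average of the per-step cross-entropy losses, i.e. the average negative log likelihood the model gave to each actual next word:

```latex
J(\theta) = \frac{1}{T} \sum_{t=1}^{T} J^{(t)}(\theta)
          = \frac{1}{T} \sum_{t=1}^{T} -\log P_\theta\!\left(x_{t+1} \mid x_t, \ldots, x_1\right)
```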

Showing that as pictures: if our corpus is "the students opened their exams", we're first of all going to try to predict what comes after "the", and we will predict words with different probabilities, and then we'll say, oh, the actual next word is "students"; okay, you gave that a probability of 0.05, say, because all you knew was that the first word was "the". Okay, there's a loss for that: the negative log probability given to "students". We then go on and generate the probability estimate over the next words, and then we say, well, the actual word is "opened", what probability estimate did you give to that? We get a negative log probability loss, keep on running this along, and then we sum all of those losses and average them per word, and that's our average per-word loss, and we want to make that as small as possible. So that's our training mechanism.

It's important to note that for generating this loss we're not just doing free generation; we're not just saying to the model, go off and generate a sentence. What we're actually doing is, at each step, effectively saying: okay, the prefix is "the students opened", what probability distribution do you put on next words after that? We generate it with our recurrent neural network, and then we ask, for the actual next word, "their", what probability estimate did you give to it, and that's our loss. But then what we do is stick "their", the right answer, into our recurrent neural network. So we always go back to the right answer, generate a probability distribution for next words, and then ask, okay, what probability did you give to the actual next word, "exams"? And then again we use the actual next word. So we do one step of generation, then we pull it back to what was actually in the text, and then we ask it for guesses over the next word, and repeat forever. The fact that we don't do free generation but pull it back to the actual piece of text each time makes things simple, because we know what an actual author used for the next word. That process is called teacher forcing, and the most common way to train language models is using this kind of teacher forcing method. It's not perfect in all respects, because we're not actually exploring different things the model might want to generate on its own and seeing what comes after them; we're only doing "tell me the next word" from some human-generated piece of text. Okay, so that's how we get losses.
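
A minimal PyTorch sketch of teacher-forced training on a batch of token IDs; `rnn_lm` here is a hypothetical module that returns next-word logits at every position:

```python
import torch
import torch.nn as nn

def teacher_forcing_loss(rnn_lm, token_ids):
    """token_ids: (batch, T) word indices from the actual training text.
    At every position the model conditions on the *true* prefix and is scored
    on the *true* next word (teacher forcing)."""
    inputs, targets = token_ids[:, :-1], token_ids[:, 1:]
    logits = rnn_lm(inputs)                            # (batch, T-1, vocab), model-specific
    return nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)),           # all positions as one big batch
        targets.reshape(-1))                           # average NLL of the actual next words
```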

After that, we want to, as before, use these losses to update the parameters of our neural network. And how do we do that? Well, in principle, we just have all of the text that we've collected, which you could think of as just a really long sequence: okay, we've got a billion words of text, here it is. So in theory you could just run your recurrent neural network over your billion words of text, updating the context as you go. But that would make it very difficult to train a model, because you'd be accumulating these losses for a billion steps and you'd have to store them, and you'd have to store hidden states so you could update parameters, and it just wouldn't work. So what we actually do is cut our training data into segments of a reasonable length, and then we're going to run our recurrent neural network on those segments, compute a loss for each segment, and then update the parameters of the recurrent neural network based on the losses that we found for that segment. I describe it here as the segments being sentences or documents, which seems a linguistically nice thing; it turns out that in recent practice, when you're wanting to scale most efficiently on GPUs, people don't bother with those linguistic niceties, they just say a segment is 100 words, just cut every 100 words. The reason that's really convenient is that you can then create a batch of segments, all of which are 100 words long, stick those in a matrix, and do vectorized training more efficiently, and things go great for you.

But there are still a few more things that we need to know to get things to work great for you; I was trying to get a bit more through this before today ends. We need to know how to work out the derivative of our loss with respect to the parameters of our recurrent neural network, and the interesting case here is that these W_h parameters are being used everywhere through the neural network, at each stage, as are the W_e ones; they appear at many places in the network. So how do we work out the partial derivatives of the loss with respect to the repeated weight matrices? And the answer to that is, oh, it's really simple: you can just pretend that those W_h's in each position are different, work out the partials with respect to them at one position, and then to get the partials with respect to W_h, you just sum whatever you found at the different positions. So that's it: the gradient with respect to a repeated weight is the sum of the gradient with respect to each time it appears.

The reason why is that it follows from what I talked about in lecture three, or you can also think about it in terms of what you might remember from the multivariable chain rule. The way I introduced it in lecture three is that gradients sum at outward branches. The way to think about it in a case like this is that you've got a W_h matrix which is being copied by identity to W_h1, W_h2, W_h3, W_h4, etc. at each time step, and since those are identity copies, they have a partial derivative with respect to each other of one. So then we apply the multivariable chain rule to these copies, and we've got an outward-branching node, and you're just summing the gradients to get the total gradient for the matrix. Okay.
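
In symbols, for the loss at step t, the gradient with respect to the repeated matrix W_h is the sum of the gradients with respect to each copy of it at the positions i where it was applied:

```latex
\frac{\partial J^{(t)}}{\partial W_h}
  = \sum_{i=1}^{t} \left. \frac{\partial J^{(t)}}{\partial W_h} \right|_{(i)}
```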

Yeah, there's one other trick that's perhaps worth knowing. If you've got segments that are 100 words long, a common speedup is to say, oh, maybe we don't actually have to run backpropagation for 100 time steps, maybe we could just run it for 20 time steps and stop, which is referred to as truncated backpropagation through time. In practice that tends to be sufficient. Note in particular that you're still, on the forward pass, updating your hidden state using your full context, but in the backpropagation you're just cutting it short to speed up training.
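
One common way to implement that in PyTorch is to detach the hidden state every so many steps, so the forward pass still carries the full context but gradients stop flowing further back; `rnn_lm.step` here is a hypothetical single-step API, not something defined in the lecture:

```python
import torch

def run_segment(rnn_lm, h, segment, truncate_every=20):
    """Forward over one segment of (input word, next word) pairs, cutting the
    backward graph every `truncate_every` steps (truncated BPTT)."""
    losses = []
    for t, (x_t, y_t) in enumerate(segment):           # x_t, y_t: (batch,) token id tensors
        h, logits = rnn_lm.step(x_t, h)                 # logits: (batch, vocab)
        losses.append(torch.nn.functional.cross_entropy(logits, y_t))
        if (t + 1) % truncate_every == 0:
            h = h.detach()                              # forward state kept; backprop stops here
    return torch.stack(losses).mean(), h
```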

Okay, so just as I did before with an n-gram language model, we can use an RNN language model to generate text, and it's pretty much the same idea, except that now, rather than just using counts of n-grams, we're using the hidden state of our neural network to give us the input to a probability distribution that we can then sample from. So I can start with the initial hidden state and use the start-of-sentence symbol. In the example I had before, I started immediately with "the", hoping that that was less confusing the first time, but what you should have asked is: wait a minute, where did the "the" come from? So normally what we actually do is use a special start-of-sequence symbol, like this angle-bracketed <s>, and we feed it in as a pseudo-word which has a word embedding, and then based on this we'll be generating the first word of the text. So we end up with some representation from which we can sample and get the first word. Now we don't have any actual text, so what we're going to do is take that word we generated and copy it down as the next input, then run the next stage of the neural network, sample from the probability distribution the next word ("favorite"), copy it down as the next word of the input, and keep on generating. This is referred to as a rollout: you're kind of continuing to roll the dice and generate forward, producing a piece of text.

Normally you want to stop at some point, and the way we can stop at some point is to have a second special symbol, the angle-bracketed </s>, which says end of your sequence. So we can generate an end-of-sequence symbol and then stop. Using this, we can generate pieces of text. And essentially this is exactly what's happening if you use something like ChatGPT: the model is a more complicated model that we haven't yet gotten to, but it's generating the response to you by doing this kind of process of generating a word at a time, treating it as an input, and generating the next word, producing this sort of rollout. And it's done probabilistically, so if you do it multiple times you can get different answers.
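Here is a minimal sketch of that sampling loop, assuming PyTorch and a trained RNN language model made of an embedding layer, a recurrent cell, and an output projection; the modules passed in and the <s>/</s> token IDs are placeholders standing in for whatever the real model and vocabulary use:

```python
import torch
import torch.nn as nn

def sample_text(embed, rnn_cell, proj, bos_id, eos_id, max_len=50):
    """Roll out an RNN language model: feed each sampled word back in as the next input."""
    h = torch.zeros(1, rnn_cell.hidden_size)        # initial hidden state (batch of 1)
    token = bos_id                                  # start with the <s> pseudo-word
    generated = []
    for _ in range(max_len):
        x = embed(torch.tensor([token]))            # embedding of the current word
        h = rnn_cell(x, h)                          # update the hidden state
        probs = torch.softmax(proj(h), dim=-1)      # distribution over the next word
        token = torch.multinomial(probs, 1).item()  # sample rather than take the argmax
        if token == eos_id:                         # stop when the model emits </s>
            break
        generated.append(token)
    return generated

# Toy usage with random, untrained modules (IDs 0 and 1 stand in for <s> and </s>):
vocab_size, embed_dim, hidden_dim = 50, 16, 32
ids = sample_text(
    nn.Embedding(vocab_size, embed_dim),
    nn.RNNCell(embed_dim, hidden_dim),
    nn.Linear(hidden_dim, vocab_size),
    bos_id=0, eos_id=1,
)
```

Because torch.multinomial samples from the distribution rather than taking the most likely word, repeated calls give different outputs, which is the "do it multiple times and get different answers" behavior just described.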

We haven't yet gotten to ChatGPT, but we can have a little bit of fun. You can take this simple recurrent neural network that we've just built here, train it on any piece of text, and get it to generate stuff. So, for example, I can train it on Barack Obama's speeches. That's a small corpus, right? He didn't talk that much; I've only got a few hundred thousand words of text, so it's not a huge corpus. I'll just show this and then I can answer the question. But I can generate from it, and I get something like: "The United States will step up to the cost of a new challenges of the American people that will share the fact that we created the problem they were attacked and so that they have to say that all the task of the final days of war that I will not be able to get this done." Yeah, well, maybe that's slightly better than my n-gram language model.

Still not perfect, you might say, but somewhat better, maybe. Did you have a question? (A student asks whether, since we're training the model on truncated chunks of the corpus, that imposes some kind of limitation on how much we can produce and still have coherent meaning.) So, yeah, I suggested we're going to chunk the text into 100-word units, so that's the limit of the amount of prior context that we're going to use. I mean, that's a fair amount; 100 words is typically several sentences. But to the extent that you wanted to know even more about the further-back context, you wouldn't be able to, and certainly that's one of the ways in which modern large language models are using far bigger context than that: they're now using thousands of words of prior context. Yeah, absolutely, it's a limit on how much further-back context you can use. So in some sense, even though in theory a recurrent neural network can feed in an arbitrary-length context, as soon as I say, oh, practically we cut it into segments, that actually means we are making a Markov assumption again and saying the further-back context doesn't matter. Yeah, okay. A couple more examples.

Instead of Barack Obama, I can feed in Harry Potter, which is actually a somewhat bigger corpus of text, and generate from that. So I can get: "Sorry, Harry shouted, panicking. I'll leave those brooms in London, are they? No idea, said Nearly Headless Nick, casting low close by Cedric, carrying the last bit of trial charms from Harry's shoulder, and to answer him the common room perched upon it, forearms held a shining knob from when the spider hadn't felt it seemed he reached the teams too." Well, there you are. You can do other things as well. You can train it on recipes and generate a recipe. This one's a recipe I don't suggest you try and cook, but it looks sort of like a recipe if you don't look very hard: "Chocolate Ranch Barbecue. Categories: game, casseroles, cookies, cookies. Yield: six servings. Two P, two tablespoons of Parmesan cheese, chopped; one cup of coconut milk; and three eggs, beaten. Place each pasture over layers of lumps. Shape mixture into the moderate oven and simmer until firm. Serve hot and bodied fresh mustard, orange and cheese. Combine the cheese and salt together the dough in a large skillet and the ingredients and stir in the chocolate and pepper." Yeah, it's not exactly a very consistent recipe when it comes down to it. It sort of has the language of a recipe, but maybe if I had scaled it more and had a bigger corpus it would have done a bit better. It's definitely not using the ingredients. Let's see, it's almost time for today, so maybe about all I can do is one more fun example.

And then after that, oh yeah, I probably should do that bit from the start next time. So, as a variant of building RNN language models: so far we've been building them over words, so the token time steps over which you build the model are words. But actually you can use the idea of recurrent neural networks over any other size of unit, and people have used them for other things. People have used them in bioinformatics for things like DNA, for gene sequencing or protein sequencing or anything like that. But even staying with language, instead of building them over words you can build them over characters, so that I'm generating a letter at a time rather than a word at a time. That can sometimes be useful because it allows us to generate things that look like words and perhaps have the structure of English words.
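The only thing that really changes in a character-level model is what counts as a token. A minimal sketch of that, in plain Python; the tiny stand-in corpus and variable names here are illustrative, not the assignment's code:

```python
# Build a character-level vocabulary from a training corpus: each character,
# rather than each word, becomes one time step for the RNN.
corpus = "sorry harry shouted panicking"        # stand-in for the real training text
chars = sorted(set(corpus))
char_to_id = {c: i for i, c in enumerate(chars)}
id_to_char = {i: c for c, i in char_to_id.items()}

# Encoding a string gives the sequence of token IDs the RNN is trained on;
# decoding the model's sampled IDs back to characters yields word-like strings.
encoded = [char_to_id[c] for c in "harry"]
decoded = "".join(id_to_char[i] for i in encoded)
```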

Similarly, there are other things that you can do. Before, when I initialized the hidden state, I said, oh, you just have an initial hidden state; you can make it zeros if you want. Well, sometimes we're going to build a contextual RNN, where we can initialize the hidden state with something else. In particular, I can initialize the hidden state with the RGB values of a color, and so, having initialized the hidden state with the color, generate the names of paint colors a character at a time. I can train a model based on a paint company's catalog of color names and their RGB values, and then I can give it different paint colors and it'll come up with names for them.
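A minimal sketch of that conditioning idea, assuming PyTorch: the three RGB values are projected into the hidden dimension to give the initial hidden state, and the rest is an ordinary character-level RNN language model. The module and its names are illustrative, not the model actually used for the paint-name demo:

```python
import torch
import torch.nn as nn

class ColorNameRNN(nn.Module):
    """Character-level RNN LM whose initial hidden state encodes an RGB color."""

    def __init__(self, n_chars, embed_dim=32, hidden_dim=128):
        super().__init__()
        self.color_to_h0 = nn.Linear(3, hidden_dim)        # project (r, g, b) -> h_0
        self.embed = nn.Embedding(n_chars, embed_dim)
        self.rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)
        self.proj = nn.Linear(hidden_dim, n_chars)

    def forward(self, rgb, char_ids):
        # rgb: (batch, 3) in [0, 1];  char_ids: (batch, seq_len) of character IDs
        h0 = torch.tanh(self.color_to_h0(rgb)).unsqueeze(0)   # (1, batch, hidden_dim)
        hidden_seq, _ = self.rnn(self.embed(char_ids), h0)    # condition on the color
        return self.proj(hidden_seq)                          # logits over next characters
```

Training would use teacher forcing on (color, name) pairs from the catalog, and generation would sample characters one at a time, exactly like the word-level rollout shown earlier.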

And it actually does an excellent job; this one worked really well. Look at this: this one here is Gasty Pink, Power Gray, Naval Tan, Bco White, Hble Gray, Home Star Brown. Now couldn't you just imagine finding all of these in a paint catalog? I mean, some of them are. There are some really good ones over here in the bottom right: this color here is Dope, and then this Stoner Blue, Purple S, Stinky Bean and Turley. Now I think I've got a real business opportunity here in the paint company market for my recurrent neural network. Okay, I'll stop there for today and do more of the science of neural networks next time.
