
Deep Dive into LLMs like ChatGPT

By Andrej Karpathy

Summary

## Key takeaways

* **LLMs as Statistical Simulators:** Modern LLMs like ChatGPT are primarily sophisticated "token autocompletes" that statistically imitate human data labelers' responses, based on extensive training data and labeling instructions, rather than possessing true human-like intelligence or self-awareness. (01:14:40, 01:41:46, 03:24:00)
* **The "Swiss Cheese" Model of Capabilities:** LLMs exhibit "jagged intelligence," performing incredibly well on complex tasks (e.g., math Olympiad problems) but randomly failing at seemingly simple ones (e.g., counting, basic comparisons) due to inherent computational limitations per token and tokenization quirks. (02:04:53, 03:13:00, 03:25:35)
* **Tokens as "Thinking Steps":** LLMs require computation to be distributed across many tokens to perform complex reasoning. Providing intermediate steps (chain-of-thought) or offloading computation to external tools (code interpreter, web search) significantly improves accuracy and mitigates hallucinations. (01:46:56, 01:28:00)
* **Reinforcement Learning for Emergent Reasoning:** While supervised fine-tuning mimics expert behavior, reinforcement learning (RL) allows models to discover novel, highly accurate problem-solving strategies and internal "chains of thought" by trial and error, moving beyond human imitation (analogous to AlphaGo's "Move 37"). (02:14:42, 02:27:47, 02:42:07)
* **LLMs as Tools, Not Oracles:** Despite their advanced capabilities, LLMs are stochastic systems that can hallucinate and make errors. They should be treated as powerful tools for inspiration, first drafts, and acceleration, but users must always verify their outputs and remain ultimately responsible for their work. (03:13:00, 03:27:20)

## Smart Chapters

* **Introduction to Large Language Models (LLMs)** (00:00:00): An overview of the video's purpose: to provide a comprehensive, general audience deep dive into LLMs like ChatGPT, covering their development, "psychology," and practical applications.
* **Pre-training Data: The Internet as a Textbook** (00:01:00): Explores how vast quantities of high-quality, diverse text data are collected and rigorously filtered from the internet (e.g., Common Crawl, FineWeb) to serve as the initial knowledge base for LLMs.
* **Tokenization: Deconstructing Text into Atoms** (00:07:47): Details the process of converting raw text into numerical "tokens" (sub-word units) using algorithms like Byte Pair Encoding, which forms the fundamental input for neural networks.
* **Neural Network I/O: Predicting the Next Token** (00:14:27): Explains how a sequence of tokens (context) is fed into a neural network, which then outputs a probability distribution over the entire vocabulary to predict the most likely next token in the sequence.
* **Neural Network Internals: The Transformer Architecture** (00:20:11): Provides a high-level look inside the Transformer neural network, describing it as a complex mathematical expression with billions of adjustable parameters ("knobs") that transform inputs into outputs without memory.
* **Inference: Generating New Text** (00:26:01): Describes the process of generating new content by sampling tokens one by one from the model's predicted probability distributions, resulting in novel text that statistically resembles the training data.
* **GPT-2: Training and Inference in Action** (00:31:09): Illustrates the concrete process of training a GPT-2 model, showing how monitoring "loss" drives improvement and how initial gibberish gradually evolves into coherent text generation.
* **The Computational Powerhouse: GPUs and Data Centers** (00:37:38): Discusses the immense computational resources, particularly GPUs, required to train large neural networks, highlighting the parallelism and scale of modern AI infrastructure.
* **Llama 3.1 Base Model: Exploring Capabilities** (00:42:52): Demonstrates interaction with a large base model (Llama 3.1), showing its nature as a "token simulator" capable of factual recall, memorization, hallucination, and "in-context learning" through clever prompting.
* **Transition to Post-training: From Simulator to Assistant** (00:59:23): Explains the shift from the pre-trained "base model" (an internet document simulator) to an "assistant" capable of answering questions, which is achieved through subsequent post-training stages.
* **Post-training Data: The Art of Conversation** (01:01:06): Details how new datasets of multi-turn human-assistant conversations are curated by human labelers (often with LLM assistance) and tokenized to program the desired interactive behavior of an AI assistant.
* **LLM Psychology: Hallucinations, Tool Use, and Memory** (01:20:32): Delves into the emergent cognitive aspects of LLMs, explaining the origins of hallucinations and how they are mitigated through explicit "I don't know" training and the integration of external tools like web search or code interpreters.
* **The Model's Self-Identity: A Programmed Persona** (01:41:46): Explores why LLMs claim to be built by specific companies, attributing it to statistical imitation from training data or explicit programming through system messages rather than genuine self-awareness.
* **Models Need Tokens to Think: Distributing Computation** (01:46:56): Emphasizes that LLMs perform finite computation per token, necessitating "chains of thought" or intermediate steps to solve complex problems, and advocating for tool use over internal "mental arithmetic."
* **Tokenization Revisited: The Challenge of Spelling** (02:01:11): Revisits tokenization to explain why LLMs struggle with character-level tasks like spelling or counting specific letters, as they operate on token chunks rather than individual characters.
* **Jagged Intelligence: The Swiss Cheese Model** (02:04:53): Illustrates the inconsistent nature of LLM intelligence, where models can excel at advanced tasks but fail at seemingly simple ones, highlighting the unpredictable "holes" in their capabilities.
* **Reinforcement Learning: The Path to True Reasoning** (02:07:28): Introduces reinforcement learning (RL) as the third major training stage, comparing it to "practice problems" where models learn to discover optimal solutions through trial and error, beyond mere imitation.
* **Reinforcement Learning in Action: Trial and Error** (02:14:42): Explains how RL involves generating multiple solutions for a given prompt, evaluating their correctness, and reinforcing the "token sequences" that lead to successful outcomes, allowing the model to find its own problem-solving strategies.
* **DeepSeek-R1: Emergent Thinking in LLMs** (02:27:47): Showcases the DeepSeek-R1 model as an example of RL leading to emergent "thinking" or "reasoning" capabilities, where models develop internal monologues and multi-faceted approaches to solve complex problems more accurately.
* **AlphaGo: A Precedent for RL's Power** (02:42:07): Draws parallels between RL in LLMs and AlphaGo's ability to surpass human performance in Go by discovering novel strategies, suggesting LLMs could similarly transcend human reasoning in open-ended tasks.
* **Reinforcement Learning from Human Feedback (RLHF)** (02:48:26): Discusses RLHF as a technique for applying reinforcement learning in "unverifiable domains" (e.g., creative writing) by training a "reward model" to simulate human preferences, despite its inherent limitations and vulnerability to "gaming."
* **Preview of Things to Come: The Future of LLMs** (03:09:39): Outlines upcoming advancements such as multimodality (audio, images), long-running agents, pervasive integration, and new research directions like test-time training for enhanced learning.
* **Keeping Track of LLMs: Resources for Staying Updated** (03:15:15): Recommends resources like LM Arena (leaderboard), AI News newsletter, and X/Twitter for staying current with LLM developments and model rankings.
* **Where to Find and Use LLMs** (03:18:34): Guides on accessing proprietary models (ChatGPT, Gemini), open-weights models (Together.ai, Hyperbolic), and running smaller models locally (LM Studio).
* **Grand Summary: Understanding ChatGPT's Inner Workings** (03:21:46): A concluding recap of the entire LLM training pipeline, emphasizing that ChatGPT is a sophisticated simulation of a human data labeler, with inherent limitations and emergent reasoning capabilities from RL.

## Key quotes

* "You're not talking to a magical AI, you're talking to an average labeler. This average labeler is probably fairly highly skilled, but you're talking to kind of like an instantaneous simulation of that kind of a person that would be hired uh in the construction of these data sets." (01:14:40)
* "The model learns what we call these chains of thought in your head and it's an emergent property of the optimization and that's what's bloating up the response length but that's also what's increasing the accuracy of the problem solving." (02:35:49)
* "What does it mean to solve problems in such a way that uh even humans would not be able to get? How can you be better at reasoning or thinking than humans? How can you go beyond just uh a thinking human like maybe it means discovering analogies that humans would not be able to uh create or maybe it's like a new thinking strategy." (02:46:13)
* "Reinforcement learning is extremely good at discovering a way to game the model to game the simulation." (02:57:00)
* "So I think it's incredibly exciting that these models exist but again it's very early and these are primordial models for now." (03:26:40)

## Stories and anecdotes

* **The Cost of GPT-2 Reproduction** (00:34:00): Andrej shares his experience reproducing OpenAI's GPT-2 model, noting that while it cost an estimated $40,000 in 2019, he managed to do it for about $600 in one day, highlighting the dramatic improvements in data quality, hardware, and software efficiency.
* **The "Orson Kovats" Hallucination Test** (01:25:35): To illustrate hallucinations, Andrej asks an older Falcon 7B model "Who is Orson Kovats?" a fictional name he invented. The model confidently fabricates various identities (author, fictional character, baseball player), demonstrating its tendency to "make stuff up" when it doesn't know, rather than admitting ignorance.
* **AlphaGo's "Move 37"** (02:45:10): Andrej references AlphaGo's famous "Move 37" in its game against Lee Sedol, a move that human Go experts considered highly unconventional and statistically improbable (1 in 10,000 chance) but ultimately brilliant. This exemplifies how reinforcement learning can discover strategies beyond human intuition.
* **The "9.11 vs 9.9" Math Problem** (02:05:40): Andrej describes a perplexing failure of LLMs: their inability to correctly determine which number is larger between 9.11 and 9.9. He notes that researchers found the model's internal activations were surprisingly associated with Bible verses, causing it to misinterpret the numbers as verse markers where 9.11 would indeed follow 9.9, leading to an incorrect mathematical conclusion.

## Mentioned Resources

* **FineWeb**: A curated dataset of web text for LLM pre-training, mentioned as a representative example. (00:01:50)
* **Common Crawl**: An organization that has been archiving the web since 2007, serving as a primary source for LLM training data. (00:04:00)
* **Tiktokenizer**: A website for exploring tokenization, specifically for the GPT-4 base model tokenizer (CL100K Base). (00:10:48)
* **Transformer Neural Net 3D visualizer**: A website providing a visual representation of a Transformer neural network. (00:22:25)
* **llm.c Let's Reproduce GPT-2** (GitHub repository): Andrej Karpathy's personal project to reproduce GPT-2, detailing the training process and costs. (00:32:50)
* **GPT-2 Paper** (published 2019 by OpenAI): The technical publication for GPT-2. (00:31:40)
* **Lambda**: A cloud provider for renting GPUs (specifically H100s) for LLM training. (00:39:10)
* **Llama 3 Paper from Meta**: The technical paper introducing the Llama 3 series of models. (00:45:10)
* **Hyperbolic**: A company serving the Llama 3.1 405B base model for inference. (00:47:00)
* **InstructGPT Paper** (published 2022 by OpenAI): A paper describing the technique of fine-tuning language models on conversations. (01:21:00)
* **HuggingFace Inference Playground**: A platform for interacting with various LLMs and exploring their behavior, used to demonstrate hallucinations with Falcon 7B. (01:25:00)
* **Open Assistant**: An open-source effort to create conversation datasets for LLM fine-tuning. (01:29:40)
* **UltraChat**: A modern, largely synthetic dataset of conversations for LLM fine-tuning. (01:32:00)
* **DeepSeek-R1 Paper**: A recent paper from DeepSeek AI in China, publicly discussing reinforcement learning fine-tuning for LLMs. (02:27:47)
* **chat.deepseek.com**: The official website to interact with the DeepSeek-R1 thinking model. (02:37:30)
* **Together.ai Playground**: An inference provider that hosts many state-of-the-art open-source models, including DeepSeek-R1. (02:40:00)
* **AI Studio (Google)**: Google's platform that offers an experimental thinking model (Gemini 2.0 Flash Thinking Experimental). (02:41:00)
* **AlphaGo Paper (PDF)**: The paper underlying the AlphaGo system developed by DeepMind. (02:42:50)
* **AlphaGo Move 37 video** (YouTube): A video analyzing AlphaGo's famous "Move 37" in its game against Lee Sedol. (02:46:50)
* **LM Arena**: An LLM leaderboard that ranks models based on human comparisons. (03:15:20)
* **AI News Newsletter**: A comprehensive newsletter curated by Swyx and friends for staying updated on AI developments. (03:17:00)
* **ChatGPT**: OpenAI's flagship conversational AI model. (03:18:50)
* **LM Studio**: A desktop application for running smaller, distilled LLMs locally on a personal computer. (03:20:00)

Topics Covered

  • Your AI Assistant is a Simulated Human Labeler
  • LLMs need tokens to think, not just output answers
  • LLM capabilities are Swiss cheese, not uniform
  • RL enables LLMs to discover unique cognitive strategies
  • Reward models can be "gamed" by nonsensical inputs

Full Transcript

hi everyone so I've wanted to make this

video for a while it is a comprehensive

but General audience introduction to

large language models like ChatGPT and

what I'm hoping to achieve in this video

is to give you kind of mental models for

thinking through what it is that this

tool is it is obviously magical and

amazing in some respects it's uh really

good at some things not very good at

other things and there's also a lot of

sharp edges to be aware of so what is

behind this text box you can put

anything in there and press enter but uh

what should we be putting there and what

are these words generated back how does

this work and what what are you talking

to exactly so I'm hoping to get at all

those topics in this video we're going

to go through the entire pipeline of how

this stuff is built but I'm going to

keep everything uh sort of accessible to

a general audience so let's take a look

at first how you build something like

ChatGPT and along the way I'm going to talk

about um you know some of the sort of

cognitive psychological implications of

the tools okay so let's build ChatGPT

so there's going to be multiple stages

arranged sequentially the first stage is

called the pre-training stage and the

first step of the pre-training stage is

to download and process the internet now

to get a sense of what this roughly

looks like I recommend looking at this

URL here so um this company called

hugging face uh collected and created

and curated this data set called Fine

web and they go into a lot of detail on

this blog post on how they

constructed the fine web data set and

all of the major llm providers like open

AI anthropic and Google and so on will

have some equivalent internally of

something like the fine web data set so

roughly what are we trying to achieve

here we're trying to get ton of text

from the internet from publicly

available sources so we're trying to

have a huge quantity of very high

quality documents and we also want very

large diversity of documents because we

want to have a lot of knowledge inside

these models so we want large diversity

of high quality documents and we want

many many of them and achieving this is

uh quite complicated and as you can see

here takes multiple stages to do well so

let's take a look at what some of these

stages look like in a bit for now I'd

like to just like to note that for

example the fine web data set which is

fairly representative what you would see

in a production grade application

actually ends up being only about 44

terabytes of disk space um you can get a

USB stick for like a terabyte very

easily or I think this could fit on a

single hard drive almost today so this

is not a huge amount of data at the end

of the day even though the internet is

very very large we're working with text

and we're also filtering it aggressively

so we end up with about 44 terabytes in

this example so let's take a look at uh

kind of what this data looks like and

what some of these stages uh also are so

the starting point for a lot of these

efforts and something that contributes

most of the data by the end of it is

data from Common Crawl so Common Crawl is

an organization that has been basically

scouring the internet since 2007 so as

of 2024 for example Common Crawl has

indexed 2.7 billion web

pages uh and uh they have all these

crawlers going around the internet and

what you end up doing basically is you

start with a few seed web pages and then

you follow all the links and you just

keep following links and you keep

indexing all the information and you end

up with a ton of data of the internet

over time so this is usually the

starting point for a lot of the uh for a

lot of these efforts now this common C

data is quite raw and is filtered in

many many different ways

so here they Pro they document this is

the same diagram they document a little

bit the kind of processing that happens

in these stages so the first thing here

is something called URL

filtering so what that is referring to

is that there's these block

lists of uh basically URLs that are or

domains that uh you don't want to be

getting data from so usually this

includes things like U malware websites

spam websites marketing websites uh

racist websites adult sites and things

like that so there's a ton of different

types of websites that are just

eliminated at this stage because we

don't want them in our data set um the

second part is text extraction you have

to remember that all these web pages

this is the raw HTML of these web pages

that are being saved by these crawlers

so when I go to inspect

here this is what the raw HTML actually

looks like you'll notice that it's got

all this markup uh like lists and stuff

like that and there's CSS and all this

kind of stuff so this is um computer

code almost for these web pages but what

we really want is we just want this text

right we just want the text of this web

page and we don't want the navigation

and things like that so there's a lot of

filtering and processing uh and heuristics

that go into uh adequately filtering for

just their uh good content of these web

pages the next stage here is language

filtering so for example fine web

filters uh using a language classifier

they try to guess what language every

single web page is in and then they only

keep web pages that have more than 65%

of English as an

example and so you can get a sense that

this is like a design decision that

different companies can uh can uh take

for themselves what fraction of all

different types of languages are we

going to include in our data set because

for example if we filter out all of the

Spanish as an example then you might

imagine that our model later will not be

very good at Spanish because it's just

never seen that much data of that

language and so different companies can

focus on multilingual performance to uh

to a different degree as an example so

fine web is quite focused on English and

so their language model if they end up

training one later will be very good at

English but not may be very good at

other

languages after language filtering

there's a few other filtering steps and

D duplication and things like that um

finishing with for example the pii

removal this is personally identifiable

information so as an example addresses

Social Security numbers and things like

that you would try to detect them and

you would try to filter out those kinds

of web pages from the data set as

well
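
To make the flavor of these filters concrete, here is a toy per-page filter in Python; the blocklist domains and the ASCII-ratio check are made-up stand-ins (real pipelines like FineWeb use curated blocklists and trained language classifiers), so treat it as a sketch of the idea rather than the actual pipeline:

```python
BLOCKLIST = {"spam.example", "malware.example"}      # hypothetical blocked domains

def keep_page(url: str, text: str) -> bool:
    domain = url.split("/")[2]                       # URL filtering against the blocklist
    if domain in BLOCKLIST:
        return False
    ascii_ratio = sum(c.isascii() for c in text) / max(len(text), 1)
    if ascii_ratio < 0.65:                           # crude stand-in for a "65% English" classifier
        return False
    return True

print(keep_page("https://example.com/article", "Did you know you have two adrenal glands..."))
```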

so there's a lot of stages here and I won't go into full detail but it is a

fairly extensive part of the

pre-processing and you end up with for

example the fine web data set so when

you click in on it uh you can see some

examples here of what this actually ends

up looking like and anyone can download

this on the Hugging Face web page and so

here are some examples of the final text

that ends up in the training set so this

is some article about tornadoes in

2012 um so there's some tornadoes

in 2012 and what

happened uh this next one is something

about did you know you have two little

yellow 9-volt battery-sized adrenal glands

in your body okay so this is some kind

of a odd medical

article so just think of these as

basically uh web pages on the internet

filtered just for the text in various

ways and now we have a ton of text 40

terabytes of it and that now is the

starting point for the next step of this

stage now I wanted to give you an

intuitive sense of where we are right

now so I took the first 200 web pages

here and remember we have tons of them

and I just take all that text and I just

put it all together concatenate it and

so this is what we end up with we just

get this just just raw text raw internet

text and there's a ton of it even in

these 200 web pages so I can continue

zooming out here and we just have this

like massive tapestry of Text data and

this text data has all these patterns

and what we want to do now is we want to

start training neural networks on this

data so the neural networks can

internalize and model how this text

flows right so we just have this giant

texture of text and now we want to get

neural Nets that mimic it okay now

before we plug text into neural networks

we have to decide how we're going to

represent this text uh and how we're

going to feed it in now the way our

technology works for these neural nets

is that they expect

a one-dimensional sequence of symbols

and they want a finite set of symbols

that are possible and so we have to

decide what are the symbols and then we

have to represent our data as

one-dimensional sequence of those

symbols so right now what we have is a

onedimensional sequence of text it

starts here and it goes here and then it

comes here Etc so this is a

onedimensional sequence even though on

my monitor of course it's laid out in a

two-dimensional way but it goes from

left to right and top to bottom right so

it's a one-dimensional sequence of text

now this being computers of course

there's an underlying representation

here so if I do what's called utf8 uh

encode this text then I can get the raw

bits that correspond to this text in the

computer and that's what uh that looks

like this so it turns out that for

example this very first bar here is the

first uh eight bits as an

example so what is this thing right this

is um representation that we are looking

for uh in in a certain sense we have

exactly two possible symbols zero and

one and we have a very long sequence of

it right now as it turns out um this

sequence length is actually going to be

very finite and precious resource uh in

our neural network and we actually don't

want extremely long sequences of just

two symbols instead what we want is we

want to trade off uh this um symbol

size uh of this vocabulary as we call it

and the resulting sequence length so we

don't want just two symbols and

extremely long sequences we're going to

want more symbols and shorter sequences
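
To make this trade-off concrete, here is a tiny Python sketch (not from the video) showing the same short text as UTF-8 bytes and as raw bits:

```python
text = "hello world"

raw_bytes = text.encode("utf-8")               # a sequence drawn from 256 possible symbols
bits = "".join(f"{b:08b}" for b in raw_bytes)  # the same data with only 2 symbols, 8x longer

print(list(raw_bytes))   # [104, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100]
print(len(raw_bytes), "bytes vs", len(bits), "bits")   # 11 bytes vs 88 bits
```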

okay so one naive way of compressing or

decreasing the length of our sequence

here is to basically uh consider some

group of consecutive bits for example

eight bits and group them into a single

what's called a byte so because uh these

bits are either on or off if we take a

group of eight of them there turns out

to be only 256 possible combinations of

how these bits could be on or off and so

therefore we can re-represent this

sequence into a sequence of bytes

instead so this sequence of bytes will

be eight times shorter but now we have

256 possible symbols so every number

here goes from 0 to

255 now I really encourage you to think

of these not as numbers but as unique

IDs or like unique symbols so maybe it's

a bit more maybe it's better to actually

think of these to replace every one of

these with a unique Emoji you'd get

something like this so um we basically

have a sequence of emojis and there's

256 possible emojis you can think of it

that way now it turns out that in

production for state-of-the-art language

models uh you actually want to go even

Beyond this you want to continue to

shrink the length of the sequence uh

because again it is a precious resource

in return for more symbols in your

vocabulary and the way this is done is

done by running what's called the byte

pair encoding algorithm and the way this

works is we're basically looking for

consecutive bytes or symbols that are

very common so for example turns out

that the sequence 116 followed by 32 is

quite common and occurs very frequently

so what we're going to do is we're going

to group uh this um pair into a new

symbol so we're going to Mint a symbol

with an ID 256 and we're going to

rewrite every single uh pair 116 32 with

this new symbol and then we can

iterate this algorithm as many times as

we wish and each time when we mint a new

symbol we're decreasing the length and

we're increasing the symbol size and in

practice it turns out that a pretty good

setting of um the basically the

vocabulary size turns out to be about

100,000 possible symbols so in

particular GPT-4 uses

100,277 symbols
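
As a rough illustration of that merging procedure, here is a simplified byte pair encoding sketch in Python (a toy, not the production tokenizer): it repeatedly finds the most frequent adjacent pair and mints a new symbol for it.

```python
from collections import Counter

def bpe_step(ids, new_id):
    """One round of byte pair encoding: merge the most frequent adjacent pair into new_id."""
    pairs = Counter(zip(ids, ids[1:]))
    if not pairs:
        return ids, None
    top = pairs.most_common(1)[0][0]      # e.g. (116, 32): "t" followed by a space
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == top:
            out.append(new_id)            # rewrite the pair with the newly minted symbol
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out, top

ids = list("the cat sat on the mat".encode("utf-8"))   # start from raw bytes (symbols 0..255)
for new_id in range(256, 259):                         # mint three new symbols: 256, 257, 258
    ids, merged = bpe_step(ids, new_id)
    print(f"merged {merged} -> {new_id}, sequence length is now {len(ids)}")
```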

um and this process of converting from

raw text into these symbols or as we

call them tokens is the process called

tokenization so let's now take a look at

how GPT-4 performs tokenization converting

from text to tokens and from tokens back

to text and what this actually looks

like so one website I like to use to

explore these token representations is

called Tiktokenizer and so come here

to the drop down and select cl100k_base

which is the GPT-4 base model

tokenizer and here on the left you can

put in text and it shows you the

tokenization of that text so for example

hello space

world so hello world turns out to be

exactly two Tokens The Token hello which

is the token with ID

15339 and the token space

world that is the token

1917 so um hello space world now if I

was to join these two for example I'm

going to get again two tokens but it's

the token H followed by the token L

world without the

H um if I put in two spaces here

between hello and world it's again a

different uh tokenization there's a new

token 220

here okay so you can play with this and

see what happens here also keep in mind

this is not uh this is case sensitive so

if this is a capital H it is something

else or if it's uh hello world then

actually this ends up being three tokens

instead of just two

tokens yeah so you can play with this

and get an sort of like an intuitive

sense of uh what these tokens work like
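
If you would rather poke at this from code than from the website, the `tiktoken` library (assuming you have it installed) exposes the same cl100k_base tokenizer; the IDs in the comments are the ones quoted above.

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # the GPT-4 base model tokenizer
print(enc.encode("hello world"))             # [15339, 1917] -- "hello" and " world"
print(enc.encode("Hello world"))             # different IDs, because tokenization is case sensitive
print(enc.decode([15339, 1917]))             # "hello world" -- tokens back to text
```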

we're actually going to loop around to

tokenization a bit later in the video

for now I just wanted to show you the

website and I wanted to uh show you that

this text basically at the end of the

day so for example if I take one line

here this is what GPT-4 will see it as so

this text will be a sequence of length

62 this is the sequence here and this is

how the chunks of text correspond to

these symbols and again there's

100,277 possible symbols and we now have

one-dimensional sequences of those

symbols so um yeah we're going to come

back to tokenization but that's uh for

now where we are okay so what I've done

now is I've taken this uh sequence of

text that we have here in the data set

and I have re-represented it using our

tokenizer into a sequence of tokens and

this is what that looks like now so for

example when we go back to the Fine web

data set they mentioned that not only is

this 44 terabytes of disk space but this is

about a 15 trillion token sequence of um

in this data set and so here these are

just some of the first uh one or two or

three or a few thousand here I think uh

tokens of this data set but there's 15

trillion here uh to keep in mind and

again keep in mind one more time that

all of these represent little text

chunks they're all just like atoms of

these sequences and the numbers here

don't make any sense they're just uh

they're just unique IDs okay so now we

get to the fun part which is the uh

neural network training and this is

where a lot of the heavy lifting happens

computationally when you're training

these neural networks so what we do here

in this this step is we want to model

the statistical relationships of how

these tokens follow each other in the

sequence so what we do is we come into

the data and we take Windows of tokens

so we take a window of tokens uh from

this data fairly

randomly and um the windows length can

range anywhere anywhere between uh zero

tokens actually all the way up to some

maximum size that we decide on uh so for

example in practice you could see a

token windows of say 8,000 tokens

now in principle we can use arbitrary

window lengths of tokens uh but uh

processing very long uh basically U

window sequences would just be very

computationally expensive so we just

kind of decide that say 8,000 is a good

number or 4,000 or 16,000 and we crop it

there now in this example I'm going to

be uh taking the first four tokens just

so everything fits nicely so these

tokens

we're going to take a window of four

tokens this bar view in and space single

which are these token

IDs and now what we're trying to do here

is we're trying to basically predict the

token that comes next in the sequence so

3962 comes next right so what we do now

here is that we call this the context

these four tokens are context and they

feed into a neural

network and this is the input to the

neural network

now I'm going to go into the detail of

what's inside this neural network in a

little bit for now it's important to

understand is the input and the output

of the neural net so the input are

sequences of tokens of variable length

anywhere between zero and some maximum

size like 8,000 the output now is a

prediction for what comes next so

because our vocabulary has

100277 possible tokens the neural

network is going to Output exactly that

many numbers

and all of those numbers correspond to

the probability of that token as coming

next in the sequence so it's making

guesses about what comes

next um in the beginning this neural

network is randomly initialized so um

and we're going to see in a little bit

what that means but it's a it's a it's a

random transformation so these

probabilities in the very beginning of

the training are also going to be kind

of random uh so here I have three

examples but keep in mind that there's

100,000 numbers here um so the

probability of this token space

Direction neural network is saying that

this is 4% likely right now 11799 is 2%

and then here the probability of 3962

which is post is 3% now of course we've

sampled this window from our data set so

we know what comes next we know and

that's the label we know that the

correct answer is that 3962 actually

comes next in the sequence so now what

we have is this mathematical process for

doing an update to the neural network we

have the way of tuning it and uh we're

going to go into a little bit of of

detail in a bit but basically we know

that this probability here of 3% we want

this probability to be higher and we

want the probabilities of all the other

tokens to be

lower and so we have a way of

mathematically calculating how to adjust

and update the neural network so that

the correct answer has a slightly higher

probability so if I do an update to the

neural network now the next time I feed

this particular sequence of four tokens

into neural network the neural network

will be slightly adjusted now and it

will say Okay post is maybe 4% and case

now maybe is

1% and uh Direction could become 2% or

something like that and so we have a way

of nudging of slightly updating the

neural net to um basically give a higher

probability to the correct token that

comes next in the sequence and now you

just have to remember that this process

happens not just for uh this um token

here where these four fed in and

predicted this one this process happens

at the same time for all of these tokens

in the entire data set and so in

practice we sample little windows little

batches of Windows and then at every

single one of these tokens we want to

adjust our neural network so that the

probability of that token becomes

slightly higher and this all happens in

parallel in large batches of these

tokens and this is the process of

training the neural network it's a

sequence of updating it so that it's

predictions match up the statistics of

what actually happens in your training

set and its probabilities become

consistent with the uh statistical

patterns of how these tokens follow each

other in the data
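
To make that "nudge the correct token's probability up" idea concrete, here is a deliberately tiny sketch (the real network is a Transformer, not this bag-of-embeddings toy, and the context and target token IDs here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
V, D = 100277, 64                   # vocabulary size (as above) and a toy embedding width
context = [8774, 11, 2154, 220]     # hypothetical 4-token window sampled from the data
target = 3962                       # the token that actually comes next in that window

E = rng.normal(scale=0.1, size=(V, D))   # token embeddings (stand-in for the real network)
W = rng.normal(scale=0.1, size=(D, V))   # output layer producing one score per vocabulary token

def forward(ctx):
    h = E[ctx].mean(axis=0)              # squash the context into one vector (a toy, not a Transformer)
    logits = h @ W
    p = np.exp(logits - logits.max())
    return h, p / p.sum()                # a probability for every possible next token

h, probs = forward(context)
print(f"p(correct token) before update: {probs[target]:.3e}")

# one gradient-descent nudge: make the correct token a bit more likely, everything else a bit less
d_logits = probs.copy()
d_logits[target] -= 1.0                  # gradient of the cross-entropy loss w.r.t. the logits
W -= np.outer(h, d_logits)               # adjust the parameters ("knobs") a little
_, probs = forward(context)
print(f"p(correct token) after update:  {probs[target]:.3e}")   # slightly higher now
```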

so let's now briefly get into the internals of these neural

networks just to give you a sense of

what's inside so neural network

internals so as I mentioned we have

these inputs uh that are sequences of

tokens in this case this is four input

tokens but this can be anywhere between

zero up to let's say 8,000 tokens in

principle this can be an infinite number

of tokens we just uh it would just be

too computationally expensive to process

an infinite number of tokens so we just

crop it at a certain length and that

becomes the maximum context length of

that uh

model now these inputs X are mixed up in

a giant mathematical expression together

with the parameters or the weights of

these neural networks so here I'm

showing six example parameters and their

setting but in practice these uh um

modern neural networks will have

billions of these uh parameters and in

the beginning these parameters are

completely randomly set now with a

random setting of parameters you might

expect that this uh this neural network

would make random predictions and it

does in the beginning it's totally

random predictions but it's through this

process of iteratively updating the

network uh as and we call that process

training a neural network so uh that the

setting of these parameters gets

adjusted such that the outputs of our

neural network becomes consistent with

the patterns seen in our training

set so think of these parameters as kind

of like knobs on a DJ set and as you're

twiddling these knobs you're getting

different uh predictions for every

possible uh token sequence input and

training in neural network just means

discovering a setting of parameters that

seems to be consistent with the

statistics of the training

set now let me just give you an example

what this giant mathematical expression

looks like just to give you a sense and

modern networks are massive expressions

with trillions of terms probably but let

me just show you a simple example here

it would look something like this I mean

these are the kinds of Expressions just

to show you that it's not very scary we

have inputs x uh like X1 x2 in this case

two example inputs and they get mixed up

with the weights of the network w0 W1 2

3 Etc and this mixing is simple things

like multiplication addition addition

exponentiation division Etc and it is

the subject of neural network

architecture research to design

effective mathematical Expressions uh

that have a lot of uh kind of convenient

characteristics they are expressive

they're optimizable they're parallelizable

Etc and so but uh at the end of the day

these are these are not complex

expressions and basically they mix up

the inputs with the parameters to make

predictions and we're optimizing uh the

parameters of this neural network so

that the predictions come out consistent

with the training set now I would like

to show you an actual production grade

example of what these neural networks

look like so for that I encourage you to

go to this website that has a very nice

visualization of one of these

networks so this is what you will find

on this website and this neural network

here that is used in production settings

has this special kind of structure this

network is called the Transformer and

this particular one as an example has

85,000 roughly

parameters now here on the top we take

the inputs which are the token

sequences and then information flows

through the neural network until the

output which here are the logit softmax

but these are the predictions for what

comes next what token comes

next and then here there's a sequence of

Transformations and all these

intermediate values that get produced

inside this mathematical expression as it

is sort of predicting what comes next so

as an example these tokens are embedded

into kind of like this distributed

representation as it's called so every

possible token has kind of like a vector

that represents it inside the neural

network so first we embed the tokens and

then those values uh kind of like flow

through this diagram and these are all

very simple mathematical Expressions

individually so we have layer norms and

matrix multiplications and uh softmaxes

and so on so here kind of like the

attention block of this Transformer and

then information kind of flows through

into the multi-layer perceptron block

and so on and all these numbers here

these are the intermediate values of the

expression and uh you can almost think

of these as kind of like the firing

rates of these synthetic neurons but I

would caution you to uh not um kind of

think of it too much like neurons

because these are extremely simple

neurons compared to the neurons you

would find in your brain your biological

neurons are very complex dynamical

processes that have memory and so on

there's no memory in this expression

it's a fixed mathematical expression

from input to Output with no memory it's

just a

stateless so these are very simple

neurons in comparison to biological

neurons but you can still kind of

loosely think of this as like a

synthetic piece of uh brain tissue if

you if you like uh to think about it

that way so information flows through

all these neurons fire until we get to

the predictions now I'm not actually

going to dwell too much on the precise

kind of like mathematical details of all

these Transformations honestly I don't

think it's that important to get into

what's really important to understand is

that this is a mathematical function it

is uh parameterized by some fixed set of

parameters like say 85,000 of them and

it is a way of transforming inputs into

outputs and as we twiddle the parameters

we are getting uh different kinds of

predictions and then we need to find a

good setting of these parameters so that

the predictions uh sort of match up with

the patterns seen in training set
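
For anyone who finds code easier to read than the diagram, here is a heavily stripped-down single-block sketch of that idea in numpy. The sizes and token IDs are made up, and the real thing has many layers, many attention heads and billions of parameters, but the point stands: it is one fixed, stateless expression from input tokens to output probabilities.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
V, D, T = 1000, 16, 4                      # toy vocabulary, embedding width, context length
tokens = np.array([91, 860, 287, 11])      # hypothetical input token IDs

# the parameters ("knobs"): a handful of small matrices here, billions in a real model
E = rng.normal(size=(V, D))                                  # token embedding table
Wq, Wk, Wv = (rng.normal(size=(D, D)) for _ in range(3))     # attention projections
Wmlp = rng.normal(size=(D, D))                               # multi-layer perceptron block
Wout = rng.normal(size=(D, V))                               # back to one score per token

x = E[tokens]                                                # embed the tokens: (T, D)
mask = np.triu(np.full((T, T), -1e9), k=1)                   # each position sees only earlier tokens
x = x + softmax(x @ Wq @ (x @ Wk).T / np.sqrt(D) + mask) @ (x @ Wv)   # attention block
x = x + np.tanh(x @ Wmlp)                                    # MLP block
logits = x @ Wout                                            # (T, V): a prediction at every position
probs = softmax(logits[-1])                                  # distribution over the next token
print(probs.argmax(), probs.max())                           # current best guess and its probability
```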

so that's the Transformer okay so I've

shown you the internals of the neural

network and we talked a bit about the

process of training it I want to cover

one more major stage of working with

these networks and that is the stage

called inference so in inference what

we're doing is we're generating new data

from the model and so uh we want to

basically see what kind of patterns it

has internalized in the parameters of

its Network so to generate from the

model is relatively straightforward

we start with some tokens that are

basically your prefix like what you want

to start with so say we want to start

with the token 91 well we feed it into

the

network and remember that the network

gives us probabilities right it gives us

this probability Vector here so what we

can do now is we can basically flip a

biased coin so um we can sample uh

basically a token based on this

probability distribution so the tokens

that are given High probability by the

model are more likely to be sampled when

you flip this biased coin you can think

of it that way so we sample from the

distribution to get a single unique

token so for example token 860 comes

next uh so 860 in this case when we're

generating from model could come next

now 860 is a relatively likely token it

might not be the only possible token in

this case there could be many other

tokens that could have been sampled but

we could see that 860 is a relatively

likely token as an example and indeed in

our training example here 860 does

follow 91 so let's now say that we um

continue the process so after 91 there's

860 we append it and we again ask what

is the third token let's sample and

let's just say that it's 287 exactly as

here let's do that again we come back in

now we have a sequence of three and we

ask what is the likely fourth token and

we sample from that and get this one and

now let's say we do it one more time we

take those four we sample and we get

this one and this

13659 uh this is not actually uh 3962 as

we had before so this token is the token

article uh instead so viewing a single

article and so in this case we didn't

exactly reproduce the sequence that we

saw here in the training data so keep in

mind that these systems are stochastic

they have um we're sampling and we're

flipping coins and sometimes we luck out

and we reproduce some like small chunk

of the text and training set but

sometimes we're uh we're getting a token

that was not verbatim part of any of the

documents in the training data so we're

going to get sort of like remixes of the

data that we saw in the training because

at every step of the way we can flip and

get a slightly different token and then

once that token makes it in if you

sample the next one and so on you very

quickly uh start to generate token

streams that are very different from the

token streams that occur

in the training documents so

statistically they will have similar

properties but um they are not identical

to your training data they're kind of

like inspired by the training data and

so in this case we got a slightly

different sequence and why would we get

article you might imagine that article

is a relatively likely token in the

context of bar viewing single Etc and

you can imagine that the word article

followed this context window somewhere

in the training documents uh to some

extent and we just happen to sample it

here at that stage so basically

inference is just uh predicting from

these distributions one at a time we

continue feeding back tokens and getting

the next one and we uh we're always

flipping these coins and depending on

how lucky or unlucky we get um we might

get very different kinds of patterns

depending on how we sample from these

probability distributions so that's

inference
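
Put as code, the whole generation loop is just this. A sketch only: `model` here is a hypothetical function standing in for the trained network, returning the probability distribution over the next token.

```python
import numpy as np

rng = np.random.default_rng()

def generate(model, prefix, n_new):
    tokens = list(prefix)                          # e.g. [91], the tokens we start from
    for _ in range(n_new):
        probs = model(tokens)                      # hypothetical: p(next token | tokens so far)
        next_id = rng.choice(len(probs), p=probs)  # flip the biased coin
        tokens.append(int(next_id))                # feed the sampled token back in and repeat
    return tokens                                  # a different remix on every call, not a copy
```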

so in most common scenarios uh

basically downloading the internet and

tokenizing it is a pre-processing

step you do that a single time and then

uh once you have your token sequence we

can start training networks and in

Practical cases you would try to train

many different networks of different

kinds of uh settings and different kinds

of arrangements and different kinds of

sizes and so you'll be doing a lot of

neural network training and um then once

you have a neural network and you train

it and you have some specific set of

parameters that you're happy with um

then you can take the model and you can

do inference and you can actually uh

generate data from the model and when

you're on chat GPT and you're talking

with a model uh that model is trained

and has been trained by OpenAI many

months ago probably and they have a

specific set of Weights that work well

and when you're talking to the model all

of that is just inference there's no

more training those parameters are held

fixed and you're just talking to the

model sort of uh you're giving it some

of the tokens and it's kind of

completing token sequences and that's

what you're seeing uh generated when you

actually use the model on ChatGPT so that

model then just does inference alone so

let's now look at an example of training

an inference that is kind of concrete

and gives you a sense of what this

actually looks like uh when these models

are trained now the example that I would

like to work with and that I'm

particularly fond of is that of OpenAI's

GPT-2 so GPT uh stands for

Generative Pre-trained Transformer and

this is the second iteration of the GPT

series by open AI when you are talking

to chat GPT today the model that is

underlying all of the magic of that

interaction is GPT 4 so the fourth

iteration of that series now gpt2 was

published in 2019 by OpenAI in this paper

that I have right here and the reason I

like gpt2 is that it is the first time

that a recognizably modern stack came

together so um all of the pieces of GPT-2

are recognizable today by modern

standards it's just everything has

gotten bigger now I'm not going to be

able to go into the full details of this

paper of course because it is a

technical publication but some of the

details that I would like to highlight

are as follows gpt2 was a Transformer

neural network just like you were just

like the neural networks you would work

with today it was it had 1.6 billion

parameters right so these are the

parameters that we looked at here it

would have 1.6 billion of them today

modern Transformers would have a lot

closer to a trillion or several hundred

billion

probably the maximum context length here

was 1,024 tokens so it is when we are

sampling chunks of Windows of tokens

from the data set we're never taking

more than 1,024 tokens and so when you

are trying to predict the next token in

a sequence you will never have more than

1,024 tokens uh kind of in your context

in order to make that prediction now

this is also tiny by modern standards

today the token uh the context lengths

would be a lot closer to um couple

hundred thousand or maybe even a million

and so you have a lot more context a lot

more tokens in history and you

can make a lot better prediction about

the next token in the sequence in that

way and finally gpt2 was trained on

approximately 100 billion tokens and

this is also fairly small by modern

standards as I mentioned the fine web

data set that we looked at here the fine

web data set has 15 trillion tokens uh

so 100 billion is is quite

small

now uh I actually tried to reproduce uh

gpt2 for fun as part of this project

called llm.c so you can see my write-up of

doing that in this post on GitHub under

the llm.c repository so in particular

the cost of training GPT-2 in 2019

was estimated to be approximately

$40,000 but today you can do

significantly better than that and in

particular here it took about one day

and about

$600 uh but this wasn't even trying too

hard I think you could really bring this

down to about $100 today now why is it

that the costs have come down so much

well number one these data sets have

gotten a lot better and the way we

filter them extract them and prepare

them has gotten a lot more refined and

so the data set is of just a lot higher

quality so that's one thing but really

the biggest difference is that our

computers have gotten much faster in

terms of the hardware and we're going to

look at that in a second and also the

software for uh running these models and

really squeezing out all all the speed

from the hardware as it is possible uh

that software has also gotten much

better as as everyone has focused on

these models and try to run them very

very

quickly now I'm not going to be able to

go into the full detail of this GPT-2

reproduction and this is a long

technical post but I would like to still

give you an intuitive sense for what it

looks like to actually train one of

these models as a researcher like what

are you looking at and what does it look

like what does it feel like so let me

give you a sense of that a little bit

okay so this is what it looks like let

me slide this

over so what I'm doing here is I'm

training a gpt2 model right now

and um what's happening here is that

every single line here like this one is

one update to the model so remember how

here we are um basically making the

prediction better for every one of these

tokens and we are updating these weights

or parameters of the neural net so here

every single line is One update to the

neural network where we change its

parameters by a little bit so that it is

better at predicting next token and

sequence in particular every single line

here is improving the prediction on 1

million tokens in the training set so

we've basically taken 1 million tokens

out of this data set and we've tried to

improve the prediction of that token as

coming next in a sequence on all 1

million of them

simultaneously and at every single one

of these steps we are making an update

to the network for that now the number

to watch closely is this number called

loss and the loss is a single number

that is telling you how well your neural

network is performing right now and it

is created so that low loss is good so

you'll see that the loss is decreasing

as we make more updates to the neural

net which corresponds to making better

predictions on the next token in a

sequence and so the loss is the number

that you are watching as a neural

network researcher and you are kind of

waiting you're twiddling your thumbs uh

you're drinking coffee and you're making

sure that this looks good so that with

every update your loss is improving and

the network is getting better at

prediction now here you see that we are

processing 1 million tokens per update

each update takes about 7 Seconds

roughly and here we are going to process

a total of 32,000 steps of

optimization so 32,000 steps with 1

million tokens each is about 33 billion

tokens that we are going to process and

we're currently only at about step 420

out of 32,000 so we are still only a bit

more than 1% done because I've only been

running this for 10 or 15 minutes or

something like

that now every 20 steps I have

configured this optimization to do

inference so what you're seeing here is

the model is predicting the next token

in a sequence and so you sort of start

it randomly and then you continue

plugging in the tokens so we're running

this inference step and this is the

model sort of predicting the next token

in the sequence and every time you see

something appear that's a new

token um so let's just look at this and

you can see that this is not yet very

coherent and keep in mind that this is

only 1% of the way through training and

so the model is not yet very good at

predicting the next token in the

sequence so what comes out is actually

kind of a little bit of gibberish right

but it still has a little bit of like

local coherence so since she is mine

it's a part of the information should

discuss my father great companions

Gordon showed me sitting over at and Etc

so I know it doesn't look very good but

let's actually scroll up and see what it

looked like when I started the

optimization so all the way here at

step

one so after 20 steps of optimization

you see that what we're getting here is

looks completely random and of course

that's because the model has only had 20

updates to its parameters and so it's

giving you random text because it's a

random Network and so you can see that

at least in comparison to this model is

starting to do much better and indeed if

we waited the entire 32,000 steps the

model will have improved the point that

it's actually uh generating fairly

coherent English uh and the tokens

stream correctly um and uh they they

kind of make up English a a lot

better

um so this has to run for about a day or

two more now and so uh at this stage we

just make sure that the loss is

decreasing everything is looking good um

and we just have to wait

and now um let me turn now to the um

story of the computation that's required

because of course I'm not running this

optimization on my laptop that would be

way too expensive uh because we have to

run this neural network and we have to

improve it and we have we need all this

data and so on so you can't run this too

well on your computer uh because the

network is just too large uh so all of

this is running on the computer that is

out there in the cloud and I want to

basically address the compute side of

the story of training these models and

what that looks like so let's take a

look okay so the computer that I'm

running this optimization on is this 8X

h100 node so there are eight h100s in a

single node or a single computer now I

am renting this computer and it is

somewhere in the cloud I'm not sure

where it is physically actually the

place I like to rent from is called

Lambda but there are many other

companies who provide this service so

when you scroll down you can see that uh

they have some on demand pricing for

um sort of computers that have these uh

h100s which are gpus and I'm going to

show you what they look like in a second

but on demand 8times Nvidia h100 uh

GPU this machine comes for $3 per GPU

per hour for example so you can rent

these and then you get a machine in a

cloud and you can uh go in and you can

train these models
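
As a quick sanity check on those numbers, using only the on-demand price quoted above:

```python
gpus, dollars_per_gpu_hour = 8, 3.0      # one 8x H100 node at roughly $3 per GPU per hour
print(gpus * dollars_per_gpu_hour * 24)  # ~$576 for a day -- in line with the ~$600 one-day GPT-2 run
```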

and these uh gpus they look like

this so this is one h100 GPU uh this is

kind of what it looks like and you slot

this into your computer and gpus are

this uh perfect fit for training your

networks because they are very

computationally expensive but they

display a lot of parallelism in the

computation so you can have many

independent workers kind of um working

all at the same time in solving uh the

matrix multiplication that's under the

hood of training these neural

networks so this is just one of these

h100s but actually you would put them

you would put multiple of them together

so you could stack eight of them into a

single node and then you can stack

multiple nodes into an entire data

center or an entire system

so when we look at a data

center can't spell when we look at a

data center we start to see things that

look like this right so we have one GPU

goes to eight gpus goes to a single

system goes to many systems and so these

are the bigger data centers and there of

course would be much much more expensive

um and what's happening is that all the

big tech companies really desire these

gpus so they can train all these

language models because they are so

powerful and that is fundamentally

what has driven the stock price of

Nvidia to be $3.4 trillion today as an

example and why Nvidia has kind of

exploded so this is the Gold Rush the

Gold Rush is getting the gpus getting

enough of them so they can all

collaborate to perform this optimization

and they're what are they all doing

they're all collaborating to predict the

next token on a data set like the fine

web data

set this is the computational workflow

that that basically is extremely

expensive the more gpus you have the

more tokens you can try to predict and

improve on and you're going to process

this data set faster and you can iterate

faster and get a bigger Network and

train a bigger Network and so on so this

is what all those machines are look like

are uh are doing and this is why all of

this is such a big deal and for example

this is a

article from like about a month ago or

so this is why it's a big deal that for

example Elon Musk is getting 100,000

gpus uh in a single Data Center and all

of these gpus are extremely expensive

are going to take a ton of power and all

of them are just trying to predict the

next token in the sequence and improve

the network uh by doing so and uh get

probably a lot more coherent text than

what we're seeing here a lot faster okay

so unfortunately I do not have tens or hundreds of millions of dollars to

spend on training a really big model

like this but luckily we can turn to

some big tech companies who train these

models routinely and release some of

them once they are done training so

they've spent a huge amount of compute

to train this network and they release

the network at the end of the

optimization so it's very useful because

they've done a lot of compute for that

so there are many companies who train

these models routinely but actually not

many of them release uh these what's

called base models so the model that

comes out at the end here is what's

called a base model what is a base model

it's a token simulator right it's an

internet text token simulator and so

that is not by itself useful yet because

what we want is what's called an

assistant we want to ask questions and

have it respond with answers these models

won't do that they just uh create sort

of remixes of the internet they dream

internet pages so the base models are

not very often released because they're

kind of just step one of a few other steps that we still need to take to get an assistant.

however a few releases have been made so

as an example, the GPT-2 model was released, the 1.5 billion parameter model, back in 2019, and this GPT-2 model is a

base model now what is a model release

what does it look like to release these

models so this is the gpt2 repository on

GitHub well you need two things

basically to release a model. Number one, we need the Python code, usually, that describes in detail the sequence of operations that they make in their model. So

um if you remember

back this

Transformer the sequence of steps that

are taken here in this neural network is

what is being described by this code so

this code is sort of implementing what's called the forward pass of this

neural network so we need the specific

details of exactly how they wired up

that neural network so this is just

computer code and it's usually just a

couple hundred lines of code it's not

it's not that crazy and uh this is all

fairly understandable and usually fairly

standard what's not standard are the

parameters that's where the actual value

is what are the parameters of this

neural network, because there's 1.5 billion of them and we need the correct

setting or a really good setting and so

that's why in addition to this source

code they release the parameters which

in this case is roughly 1.5 billion

parameters and these are just numbers so

it's one single list of 1.5 billion

numbers the precise and good setting of

all the knobs such that the tokens come

out

well so uh you need those two things to

get a base model

release
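Just to make "code plus parameters" concrete, here is a minimal sketch of what a base model release boils down to. The class, dimensions, and file name below are made up for illustration; this is not the actual GPT-2 release code:

```python
# A base-model "release" is basically (1) code for the forward pass and
# (2) a file with the trained parameters. Everything here is illustrative.
import torch
import torch.nn as nn

class TinyTransformer(nn.Module):
    def __init__(self, vocab_size=50257, d_model=768):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)     # token embeddings
        self.block = nn.TransformerEncoderLayer(d_model, nhead=12, batch_first=True)
        self.lm_head = nn.Linear(d_model, vocab_size)      # logits over the vocabulary

    def forward(self, token_ids):
        # This is the "forward pass": the exact sequence of operations on the tokens.
        x = self.embed(token_ids)
        x = self.block(x)
        return self.lm_head(x)                             # next-token logits

model = TinyTransformer()
# The released parameters are just one big list of numbers saved to disk;
# loading them sets all the "knobs" to the values found during training.
model.load_state_dict(torch.load("released_weights.pt"))  # hypothetical file name
```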

now gpt2 was released but that's

actually a fairly old model as I

mentioned so actually the model we're

going to turn to is called llama 3 and

that's the one that I would like to show

you next. So, Llama 3. GPT-2, again, was 1.5 billion parameters trained on 100 billion tokens; Llama 3 is a much bigger

model and much more modern model it is

released and trained by meta and it is a

405 billion parameter model trained on 15

trillion tokens in very much the same

way just much much

bigger um and meta has also made a

release of llama 3 and that was part of

this

paper so with this paper that goes into

a lot of detail the biggest base model

that they released is the Llama 3.1

405 billion parameter model so this is

the base model and then in addition to

the base model you see here

foreshadowing for later sections of the

video they also released the instruct

model and the instruct means that this

is an assistant you can ask it questions

and it will give you answers we still

have yet to cover that part later for

now let's just look at this base model

this token simulator and let's play with

it and try to think about you know what

is this thing and how does it work and

um what do we get at the end of this

optimization if you let this run Until

the End uh for a very big neural network

on a lot of data so my favorite place to

interact with the base models is this um

company called hyperbolic which is

basically serving the base model of the

405b Llama 3.1 so when you go to the

website and I think you may have to

register and so on make sure that in the

models make sure that you are using

llama 3.1 405 billion base it must be

the base model and then here let's say

the max tokens is how many tokens we're

going to be generating so let's just

decrease this to be a bit less just so

we don't waste compute we just want the

next 128 tokens and leave the other

stuff alone I'm not going to go into the

full detail here um now fundamentally

what's going to happen here is identical

to what happens here during inference

for us so this is just going to continue

the token sequence of whatever prefix you're going to give it so I want

to first show you that this model here

is not yet an assistant so you can for

example ask it what is 2 plus 2 it's not

going to tell you oh it's four uh what

else can I help you with it's not going

to do that because what is 2 plus 2 is

going to be tokenized and then those

tokens just act as a prefix and then

what the model is going to do now is

just going to get the probability for

the next token and it's just a glorified

autocomplete it's a very very expensive

autocomplete of what comes next um

depending on the statistics of what it

saw in its training documents which are

basically web

pages so let's just uh hit enter to see

what tokens it comes up with as a

continuation okay so here it kind of

actually answered the question and

started to go off into some

philosophical territory uh let's try it

again so let me copy and paste and let's

try again from scratch what is 2 plus

two so okay so it just goes off again so

notice one more thing that I want to

stress is that every time you run it, the system just kind of starts from scratch. The system here is

stochastic so for the same prefix of

tokens we're always getting a different

answer and the reason for that is that

we get this probability distribution and we

sample from it and we always get

different samples and we sort of always

go into a different territory uh

afterwards so here in this case um I

don't know what this is let's try one

more

time so it just continues on so it's

just doing the stuff that it saw on

the internet right um and it's just kind

of like regurgitating those uh

statistical

patterns. So, first thing: it's not an assistant yet, it's a token autocomplete; and second, it is a stochastic system.
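Here is a minimal sketch of that "glorified autocomplete" loop, just to show where the stochasticity comes from; `model` and `tokenizer` are placeholders for whatever base model you are serving, not Hyperbolic's actual API:

```python
# Base-model inference, sketched: tokenize a prefix, then repeatedly sample the
# next token from the model's probability distribution and append it.
# `model` and `tokenizer` are placeholders for a real LLM and its tokenizer.
import torch

def complete(model, tokenizer, prefix: str, max_new_tokens: int = 128) -> str:
    token_ids = tokenizer.encode(prefix)                 # text -> list of token ids
    for _ in range(max_new_tokens):
        logits = model(torch.tensor([token_ids]))        # (1, seq_len, vocab_size)
        probs = torch.softmax(logits[0, -1], dim=-1)     # distribution over the next token
        next_id = torch.multinomial(probs, num_samples=1).item()  # random sample
        token_ids.append(next_id)
    return tokenizer.decode(token_ids)

# Because we sample (rather than always taking the single most likely token),
# the same prefix, e.g. "what is 2 plus 2", comes out differently on every run.
```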

Now, the crucial thing is that even though

this model is not yet by itself very

useful for a lot of applications just

yet um it is still very useful because

in the task of predicting the next token

in the sequence the model has learned a

lot about the world and it has stored

all that knowledge in the parameters of

the network so remember that our text

looked like this right internet web

pages and now all of this is sort of

compressed in the weights of the network

so you can think of these 405 billion parameters as a kind of compression of the internet; you can think of the 405 billion parameters as kind of like a zip file, but it's not a lossless compression, it's a lossy compression. We're kind of left with a gestalt of the internet, and we can generate

from it right now we can elicit some of

this knowledge by prompting the base

model uh accordingly so for example

here's a prompt that might work to

elicit some of that knowledge that's

hiding in the parameters here's my top

10 list of the top landmarks to see in

Paris

um and I'm doing it this way because I'm

trying to Prime the model to now

continue this list so let's see if that

works when I press

enter okay so you see that it started a

list and it's now kind of giving me some

of those

landmarks and now notice that it's

trying to give a lot of information here

now you might not be able to actually

fully trust some of the information here

remember that this is all just a

recollection of some of the internet

documents and so the things that occur

very frequently in the internet data are

probably more likely to be remembered

correctly compared to things that happen

very infrequently so you can't fully

trust some of the things that and some

of the information that is here because

it's all just a vague recollection of

Internet documents because the

information is not stored explicitly in

any of the parameters it's all just the

recollection that said we did get

something that is probably approximately

correct and I don't actually have the

expertise to verify that this is roughly

correct but you see that we've elicited

a lot of the knowledge of the model and

this knowledge is not precise and exact

this knowledge is vague and

probabilistic and statistical and the

kinds of things that occur often are the

kinds of things that are more likely to

be remembered um in the model now I want

to show you a few more examples of this

model's Behavior the first thing I want

to show you is this example I went to

the Wikipedia page for zebra and let me

just copy paste the first uh even one

sentence

here and let me put it here now when I

click enter what kind of uh completion

are we going to get so let me just hit

enter there are three living species

etc etc what the model is producing here

is an exact regurgitation of this

Wikipedia entry it is reciting this

Wikipedia entry purely from memory and

this memory is stored in its parameters

and so it is possible that at some point

in these 512 tokens the model will uh

stray away from the Wikipedia entry but

you can see that it has huge chunks of

it memorized here uh let me see for

example if this sentence

occurs by now okay so this so we're

still on track let me check

here okay we're still on

track it will eventually uh stray

away okay so this thing is just recited

to a very large extent it will

eventually deviate uh because it won't

be able to remember exactly now the

reason that this happens is because

these models can be extremely good at

memorization and usually this is not

what you want in the final model and

this is something called regurgitation

and it's usually undesirable to recite things directly that you have

trained on now the reason that this

happens actually is because for a lot of

documents like for example Wikipedia

when these documents are deemed to be of

very high quality as a source like for

example Wikipedia it is very often uh

the case that when you train the model

you will preferentially sample from

those sources so basically the model has

probably done a few epochs on this data

meaning that it has seen this web page

like maybe probably 10 times or so and

it's a bit like you like when you read

some kind of a text many many times say

you read something a 100 times uh then

you'll be able to recite it and it's

very similar for this model if it sees

something way too often it's going to be

able to recite it later from memory

except these models can be a lot more efficient, like per presentation, than a human. So probably it's only seen this Wikipedia entry 10 times, but basically it has remembered this article exactly in its parameters.
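If you want a rough way to check this kind of regurgitation yourself, here is a tiny sketch that compares a completion against the source text word by word and reports where they diverge; the strings are placeholders and this is not a rigorous memorization test:

```python
# Rough regurgitation check: how many leading words of the model's completion
# exactly match the reference paragraph before the two diverge?
def common_prefix_words(completion: str, reference: str) -> int:
    matched = 0
    for a, b in zip(completion.split(), reference.split()):
        if a != b:
            break
        matched += 1
    return matched

# Placeholders: `reference` would be the Wikipedia paragraph, `completion` the
# text the base model generated when primed with its opening sentence.
reference = "There are three living species of zebra ..."
completion = "There are three living species of zebra ..."
print(common_prefix_words(completion, reference), "words match before divergence")
```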

Okay, the next thing I want to show you is something that the model has definitely not seen during its

training so for example if we go to the

paper uh and then we navigate to the

pre-training data we'll see here that uh

the data set has a knowledge cut off

until the end of 2023 so it will not

have seen documents after this point and

certainly it has not seen anything about

the 2024 election and how it turned out

now if we Prime the model with the

tokens from the future it will continue

the token sequence and it will just take

its best guess according to the

knowledge that it has in its own

parameters so let's take a look at what

that could look like

so: the Republican Party ticket, Trump, okay, president of the United

States from

2017 and let's see what it says after

this point so for example the model will

have to guess at the running mate and

who it's against Etc so let's hit

enter. So here it thinks that Mike Pence was the running mate instead of JD Vance, and the ticket was against Hillary Clinton and Tim Kaine. So this is kind of

an interesting parallel universe, potentially, of what could have happened according to the LLM. Let's get a

different sample so the identical prompt

and let's

resample so here the running mate was

Ron DeSantis and they ran against Joe Biden and Kamala Harris so this is again

a different parallel universe so the

model will take educated guesses and it

will continue the token sequence based

on this knowledge. All of what we're seeing here is what's called hallucination: the model is just taking its best guess in a probabilistic manner. The next thing I

would like to show you is that even

though this is a base model and not yet

an assistant model it can still be

utilized in Practical applications if

you are clever with your prompt design

so here's something that we would call a

few shot

prompt so what it is here is that I have

10 pairs, and each pair is an English word, a colon, and then the translation in Korean, and we have 10 of them, and then at the end we have "teacher:" and then

here's where we're going to do a

completion of say just five tokens and

these models have what we call in

context learning abilities and what

that's referring to is that as it is

reading this context it is learning sort

of in

place that there's some kind of a

algorithmic pattern going on in my data

and it knows to continue that pattern

and this is what's called in-

context learning so it takes on the role

of a

translator and when we hit uh completion

we see that the translation for "teacher" comes out correct in Korean, and so this is

how you can build apps by being clever

with your prompting even though we still

just have a base model for now and it

relies on what we call this in-context learning ability, and it is done by constructing what's called a few-shot prompt.
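Here is a minimal sketch of how you might assemble such a few-shot prompt programmatically; the word pairs below are just illustrative examples (the actual prompt in the demo had ten pairs), and the model's completion is sampled exactly as sketched earlier:

```python
# A few-shot prompt: show the pattern several times, then leave the last item
# incomplete so the base model continues it. The pairs are illustrative.
pairs = [
    ("dog", "개"),
    ("cat", "고양이"),
    ("house", "집"),
    ("water", "물"),
]

prompt = "\n".join(f"{english}: {korean}" for english, korean in pairs)
prompt += "\nteacher:"   # the model should continue with the Korean translation

print(prompt)
# Feed `prompt` to the base model (e.g. with the `complete` helper sketched earlier,
# limited to a few tokens); in-context learning makes it act as a translator.
```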

Okay, and finally I want to show

you that there is a clever way to

actually instantiate a whole language

model assistant just by prompting and

the trick to it is that we structure

a prompt to look like a web page that is

a conversation between a helpful AI

assistant and a human and then the model

will continue that conversation so

actually to write the prompt I turned to

ChatGPT itself which is kind of meta

but I told it I want to create an llm

assistant but all I have is the base

model so can you please write my um uh

prompt and this is what it came up with

which is actually quite good so here's a

conversation between an AI assistant and

a human

the AI assistant is knowledgeable

helpful capable of answering wide

variety of questions Etc and then here

it's not enough to just give it a sort

of description it works much better if

you create this few-shot prompt. So here's a few turns of human-assistant, human-assistant, and we have, you know, a few

turns of conversation and then here at

the end is we're going to be putting the

actual query that we like so let me copy

paste this into the base model prompt

and now let me do human column and this

is where we put our actual prompt why is

the sky

blue and uh let's uh

run assistant the sky appears blue due

to the phenomenon called Rayleigh

scattering etc etc so you see that the

base model is just continuing the

sequence but because the sequence looks

like this conversation it takes on that

role but it is a little subtle because

here it just uh you know it ends the

assistant and then just you know

hallucinates the next question by

the human Etc so it'll just continue

going on and on uh but you can see that

we have sort of accomplished the task

and if you just took this why is the sky

blue and if we just refresh this and put

it here then of course we don't expect

this to work with a base model right

we're just going to who knows what we're

going to get okay we're just going to

get more

questions okay so this is one way to

create an assistant even though you may

only have a base model okay so this is

the kind of brief summary of the things

we talked about over the last few

minutes now let me zoom out

here and this is kind of like what we've

talked about so far: we wish to train LLM assistants like ChatGPT. We've discussed the

first stage of that which is the

pre-training stage and we saw that

really what it comes down to is we take

Internet documents we break them up into

these tokens these atoms of little text

chunks and then we predict token

sequences using neural networks the

output of this entire stage is this base

model it is the setting of The

parameters of this network and this base

model is basically an internet document

simulator on the token level so it can

just uh it can generate token sequences

that have the same kind of like

statistics as Internet documents and we

saw that we can use it in some

applications but we actually need to do

better we want an assistant we want to

be able to ask questions and we want the

model to give us answers and so we need

to now go into the second stage which is

called the post-training stage so we

take our base model our internet

document simulator and hand it off to

post training so we're now going to

discuss a few ways to do what's called

post training of these models these

stages in post training are going to be

computationally much less expensive most

of the computational work all of the

massive data centers um and all of the

sort of heavy compute and millions of

dollars are the pre-training stage but

now we go into the slightly cheaper but

still extremely important stage called

post-training where we turn this LLM

model into an assistant so let's take a

look at how we can get our model to not

sample internet documents but to give

answers to questions so in other words

what we want to do is we want to start

thinking about conversations and these

are conversations that can be multi-turn

so there can be multiple turns and

they are in the simplest case a

conversation between a human and an

assistant and so for example we can

imagine the conversation could look

something like this when a human says

what is 2 plus 2 the assistant should

respond with something like 2 plus 2 is

4 when a human follows up and says what

if it was star instead of a plus

assistant could respond with something

like

this um and similar here this is another

example showing that the assistant could

also have some kind of a personality

here uh that it's kind of like nice and

then here in the third example I'm

showing that when a human is asking for

something that we uh don't wish to help

with we can produce what's called

refusal we can say that we cannot help

with that so in other words what we want

to do now is we want to think through

how an assistant should interact with the

human and we want to program the

assistant and Its Behavior in these

conversations now because this is neural

networks we're not going to be

programming these explicitly in code

we're not going to be able to program

the assistant in that way because this

is neural networks everything is done

through neural network training on data

sets and so because of that we are going

to be implicitly programming the

assistant by creating data sets of

conversations so these are three

independent examples of conversations in

a data set, an actual data set, and

I'm going to show you examples will be

much larger it could have hundreds of

thousands of conversations that are

multi- turn very long Etc and would

cover a diverse breadth of topics but

here I'm only showing three examples but

the way this works basically is uh a

assistant is being programmed by example

and where is this data coming from like

2 * 2 = 4, same as 2 + 2, etc.? Where does that come from? This comes from

Human labelers so we will basically give

human labelers some conversational

context and we will ask them to um

basically give the ideal assistant

response in this situation and a human

will write out the ideal response for an

assistant in any situation and then

we're going to get the model to

basically train on this and to imitate

those kinds of

responses so the way this works then is

we are going to take our base model

which we produced in the pre-training stage

and this base model was trained on

internet documents we're now going to

take that data set of internet documents

and we're gonna throw it out and we're

going to substitute a new data set and

that's going to be a data set of

conversations and we're going to

continue training the model on these

conversations on this new data set of

conversations and what happens is that

the model will very rapidly adjust and

will sort of like learn the statistics

of how this assistant responds to human

queries and then later during inference

we'll be able to basically um Prime the

assistant and get the response and it

will be imitating what the humans will

human labelers would do in that

situation if that makes sense so we're

going to see examples of that and this

is going to become a bit more concrete I

also wanted to mention that this

post-training stage we're going to

basically just continue training the

model but um the pre-training stage can

in practice take roughly three months of

training on many thousands of computers

the post-training stage will typically

be much shorter like 3 hours for example

um and that's because the data set of

conversations that we're going to create

here manually is much much smaller than

the data set of text on the internet and

so this training will be very short but

fundamentally we're just going to take our base model and we're going to continue training using the exact same algorithm, the exact same everything, except we're swapping out the data set for conversations.
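As a minimal sketch of what "same algorithm, different data set" means, here is roughly what one supervised fine-tuning step looks like; `model` and `tokenize_conversation` are placeholders, and real SFT pipelines add details (like masking the loss on user turns) that I'm omitting here:

```python
# Supervised fine-tuning, sketched: the training step is the same next-token
# prediction as in pre-training; only the data set changes.
# `model` and `tokenize_conversation` are placeholders, not a specific library API.
import torch
import torch.nn.functional as F

def sft_step(model, optimizer, conversation):
    ids = torch.tensor([tokenize_conversation(conversation)])  # conversation -> 1-D token ids
    inputs, targets = ids[:, :-1], ids[:, 1:]                  # predict token t+1 from tokens <= t
    logits = model(inputs)                                     # (1, seq_len-1, vocab_size)
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Pre-training: this loop over internet documents for ~months on thousands of GPUs.
# Post-training: the same loop over a much smaller set of conversations for ~hours.
```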

So the questions now are

what are these conversations how do we

represent them how do we get the model

to see conversations instead of just raw

text and then what are the outcomes of

um this kind of training and what do you

get in a certain like psychological

sense uh when we talk about the model so

let's turn to those questions now so

let's start by talking about the

tokenization of conversations everything

in these models has to be turned into

tokens because everything is just about

token sequences so how do we turn

conversations into token sequences is

the question and so for that we need to

design some kind of encoding and

this is kind of similar to maybe if

you're familiar you don't have to be

with for example the TCP IP packet in um

on the internet there are precise rules

and protocols for how you represent

information how everything is structured

together so that you have all this kind

of data laid out in a way that is

written out on a paper and that everyone

can agree on and so it's the same thing

now happening in llms we need some kind

of data structures and we need to have

some rules around how these data

structures like conversations get

encoded and decoded to and from tokens

and so I want to show you now how I

would

recreate uh this conversation in the

token space so if you go to Tiktokenizer

I can take that conversation and this is

how it is represented in uh for the

language model so here we have, we are alternating between a user and an assistant in this two-turn

conversation and what you're seeing here

is it looks ugly but it's actually

relatively simple the way it gets turned

into a token sequence here at the end is

a little bit complicated but at the end

this conversation between a user and

assistant ends up being 49 tokens it is

a one-dimensional sequence of 49 tokens

and these are the tokens

okay and all the different llms will

have a slightly different format or

protocols and it's a little bit of a

wild west right now, but for example GPT-4o does it in the following way: you have this special token called im_start, and this is short for "imaginary monologue start" (I don't actually know why it's called that, to be honest). Then you have to specify whose turn it is, so for example user, which is a token (428). Then you have im_sep, an internal monologue separator, and then it's the exact

question so the tokens of the question

and then you have to close it with im_end, the end of the imaginary monologue. So

basically the question from a user of

what is 2 plus two ends up being the

token sequence of these tokens and now

the important thing to mention here is

that im_start, this is not text, right? im_start is a special token that gets added

it's a new token and um this token has

never been trained on so far it is a new

token that we create in a post-training

stage and we introduce and so these

special tokens like im_sep, im_start, etc.

are introduced and interspersed with

text so that they sort of um get the

model to learn that, hey, this is the start of a turn; whose turn is it? The start of the turn is for the user, and then this is what the

user says and then the user ends and

then it's a new start of a turn and it

is by the assistant and then what does

the assistant say well these are the

tokens of what the assistant says Etc

and so this conversation is now turned

into the sequence of tokens the specific

details here are not actually that

important all I'm trying to show you in

concrete terms is that our conversations

which we think of as kind of like a

structured object end up being turned

via some encoding into onedimensional

sequences of tokens and so because this

is one dimensional sequence of tokens we

can apply all the stuff that we applied

before now it's just a sequence of

tokens, and now we can train a language model on it. So we're just predicting the next token in a sequence, just like before, and we can represent and train on conversations.
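As a concrete but simplified sketch, here is how a conversation might get flattened into one token-ready string with special markers; the exact special tokens and layout vary by model, and this is not the literal GPT-4o format:

```python
# Encoding a conversation: flatten the structured turns into one string with
# special markers, which the tokenizer then turns into a 1-D token sequence.
# Token names and layout are illustrative; every model family differs slightly.
IM_START, IM_SEP, IM_END = "<|im_start|>", "<|im_sep|>", "<|im_end|>"

def render(conversation):
    parts = []
    for turn in conversation:
        parts.append(f"{IM_START}{turn['role']}{IM_SEP}{turn['content']}{IM_END}")
    return "".join(parts)

conversation = [
    {"role": "user", "content": "what is 2+2?"},
    {"role": "assistant", "content": "2+2 = 4"},
    {"role": "user", "content": "what if it was * instead of +?"},
    {"role": "assistant", "content": "2*2 = 4, same as 2+2!"},
]
flat = render(conversation)   # one flat string; the tokenizer maps it to a few dozen tokens
```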

So then, what does it look like at test time, during inference? Say we've trained a model on these kinds of data sets of conversations, and now we want to do inference. During inference, what does this look like when you're on ChatGPT? Well, you come to ChatGPT and you

have say like a dialogue with it and the

way this works is

basically um say that this was already

filled in, so like: what is 2 plus 2? 2 plus 2 is four. And now you issue: what if it was times? im_end. And what basically ends up happening on the servers of OpenAI or something like that is they put in im_start assistant im_sep, and this

is where they end it right here so they

construct this context and now they

start sampling from the model so it's at

this stage that they will go to the

model and say: okay, what is a good first token, what is a good second token, what is a good third token? And this is where the LLM

takes over and creates a response like

for example response that looks

something like this but it doesn't have

to be identical to this but it will have

the flavor of this if this kind of a

conversation was in the data set so um

that's roughly how the protocol Works

although the details of this protocol

are not important so again my goal is

that just to show you that everything

ends up being just a one-dimensional

token sequence so we can apply

everything we've already seen but we're

now training on conversations and we're now basically generating conversations as well.
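And here is the matching sketch of the inference side: render the dialogue so far, open an assistant turn, and sample tokens until the model emits the end-of-turn marker. Again, the token names and helpers are illustrative placeholders, not any provider's real serving code:

```python
# Inference on a chat server, sketched: build the context, open an assistant turn,
# and sample until the model closes it. Uses render()/IM_* from the sketch above.
import torch

def sample_next_token(model, token_ids):
    # Placeholder sampler: same idea as the completion loop sketched earlier.
    logits = model(torch.tensor([token_ids]))
    probs = torch.softmax(logits[0, -1], dim=-1)
    return torch.multinomial(probs, num_samples=1).item()

def generate_assistant_reply(model, tokenizer, conversation, max_new_tokens=512):
    context = render(conversation) + f"{IM_START}assistant{IM_SEP}"  # open the assistant turn
    token_ids = tokenizer.encode(context)
    reply_ids = []
    for _ in range(max_new_tokens):
        next_id = sample_next_token(model, token_ids)
        if tokenizer.decode([next_id]) == IM_END:    # the assistant ended its turn
            break
        token_ids.append(next_id)
        reply_ids.append(next_id)
    return tokenizer.decode(reply_ids)
```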

Okay, so now I would like to turn to what these data sets look like in practice. The first

paper that I would like to show you and

the first effort in this direction is

this paper from openai in 2022 and this

paper was called InstructGPT, or that was the technique that they developed, and this was the first time that OpenAI has kind

of talked about how you can take

language models and fine-tune them on

conversations and so this paper has a

number of details that I would like to

take you through so the first stop I

would like to make is in section 3.4

where they talk about the human

contractors that they hired uh in this

case from upwork or through scale AI to

uh construct these conversations and so

there are human labelers involved whose

job it is professionally to create these

conversations and these labelers are

asked to come up with prompts and then

they are asked to also complete the

ideal assistant responses and so these

are the kinds of prompts that people

came up with so these are human labelers

so list five ideas for how to regain

enthusiasm for my career what are the

top 10 science fiction books I should

read next and there's many different

types of uh kind of prompts here so

translate this sentence from uh to

Spanish Etc and so there's many things

here that people came up with they first

come up with the prompt and then they

also uh answer that prompt and they give

the ideal assistant response now how do

they know what is the ideal assistant

response that they should write for

these prompts so when we scroll down a

little bit further we see that here we

have this excerpt of labeling

instructions uh that are given to the

human labelers so the company that is

developing the language model like for

example open AI writes up labeling

instructions for how the humans should

create ideal responses and so here for

example is an excerpt uh of these kinds

of labeling instruction instructions on

High level you're asking people to be

helpful truthful and harmless and you

can pause the video if you'd like to see

more here but on a high level basically

just just answer try to be helpful try

to be truthful and don't answer

questions that we don't want um kind of

the system to handle uh later in chat

gbt and so roughly speaking the company

comes up with the labeling instructions

usually they are not this short usually

there are hundreds of pages and people

have to study them professionally and

then they write out the ideal assistant

responses uh following those labeling

instructions so this is a very human

heavy process as it was described in

this paper now the data set for instruct

GPT was never actually released by OpenAI, but we do have some open-source reproductions that were trying to

follow this kind of a setup and collect

their own data so one that I'm familiar

with for example is the effort of open

Assistant from a while back and this is

just one of I think many examples but I

just want to show you an example so

here's so these were people on the

internet that were asked to basically

create these conversations similar to

what um open I did with human labelers

and so here's an entry of a person who

came up with this prompt: can you write a short introduction to the relevance of the term monopsony in economics, please use

examples Etc and then the same person or

potentially a different person will

write up the response so here's the

assistant response to this and so then

the same person or different person will

actually write out this ideal

response and then this is an example of

maybe how the conversation could

continue now explain it to a dog and

then you can try to come up with a

slightly simpler explanation or

something like that now this then

becomes the label and we end up training

on this so what happens during training

is that um of course we're not going to

have a full coverage of all the possible

questions that um the model will

encounter at test time during inference

we can't possibly cover all the possible

prompts that people are going to be

asking in the future but if we have a

like a data set of a few of these

examples then the model during training

will start to take on this Persona of

this helpful truthful harmless assistant

and it's all programmed by example and

so these are all examples of behavior

and if you have conversations of these

example behaviors and you have enough of

them like 100,000 and you train on it the

model sort of starts to understand the

statistical pattern and it kind of takes

on this personality of this

assistant now it's possible that when

you get the exact same question like

this at test time it's possible that the

answer will be recited as exactly what

was in the training set but more likely

than that is that the model will kind of

like do something of a similar Vibe um

and we will understand that this is the

kind of answer that you want um so

that's what we're doing we're

programming the system um by example and

the system adopts statistically this

Persona of this helpful truthful

harmless assistant which is kind of like

reflected in the labeling instructions

that the company creates now I want to

show you that the state-of-the-art has

kind of advanced in the last 2 or 3

years since the InstructGPT paper so in

particular it's not very common for

humans to be doing all the heavy lifting

just by themselves anymore and that's

because we now have language models and

these language models are helping us

create these data sets and conversations

so it is very rare that the people will

like literally just write out the

response from scratch it is a lot more

likely that they will use an existing

llm to basically like uh come up with an

answer and then they will edit it or

things like that so there's many

different ways in which now llms have

started to kind of permeate this

post-training stack and LLMs are

basically used pervasively to help

create these massive data sets of

conversations. So, just to show you, UltraChat is one such example

of like a more modern data set of

conversations it is to a very large

extent synthetic but uh I believe

there's some human involvement I could

be wrong with that usually there will be

a little bit of human but there will be

a huge amount of synthetic help um and

this is all kind of like uh constructed

in different ways and Ultra chat is just

one example of many sft data sets that

currently exist and the only thing I

want to show you is that uh these data

sets have now millions of conversations

uh these conversations are mostly

synthetic but they're probably edited to

some extent by humans and they span a

huge diversity of sort of

um uh areas and so on so these are

fairly extensive artifacts by now and

there's all these like sft mixtures as

they're called so you have a mixture of

like lots of different types and sources

and it's partially synthetic partially

human and it's kind of like um gone in

that direction since uh but roughly

speaking we still have sft data sets

they're made up of conversations we're

training on them um just like we did

before and

uh I guess like the last thing to note

is that I want to dispel a little bit of

the magic of talking to an AI like when

you go to chat GPT and you give it a

question and then you hit enter uh what

is coming back is kind of like

statistically aligned with what's

happening in the training set and these

training sets I mean they really just

have a seed in humans following labeling

instructions so what are you actually

talking to in chat GPT or how should you

think about it well it's not coming from

some magical AI like roughly speaking

it's coming from something that is

statistically imitating human labelers

which comes from labeling instructions

written by these companies and so you're

kind of imitating this uh you're kind of

getting um it's almost as if you're

asking a human labeler, and imagine that the answer that is given to you from ChatGPT is some kind of a simulation of a

human labeler uh and it's kind of like

asking what would a human labeler say in

this kind of a conversation

and uh it's not just like this human

labeler is not just like a random person

from the internet because these

companies actually hire experts so for

example when you are asking questions

about code and so on the human labelers

that would be involved in the creation of these conversation data sets, they will usually be educated

expert people and you're kind of like

asking a question of like a simulation

of those people if that makes sense so

you're not talking to a magical AI

you're talking to an average labeler

this average labeler is probably fairly

highly skilled

but you're talking to kind of like an

instantaneous simulation of that kind of

a person that would be hired uh in the

construction of these data sets so let

me give you one more specific example

before we move on for example when I go

to ChatGPT and I say: recommend the top five landmarks to see in Paris, and then I

hit

enter

uh okay here we go okay when I hit enter

what's coming out here how do I think

about it well it's not some kind of a

magical AI that has gone out and

researched all the landmarks and then

ranked them using its infinite

intelligence Etc what I'm getting is a

statistical simulation of a labeler that

was hired by open AI you can think about

it roughly in that way and so if this

specific um question is in the

post-training data set somewhere at OpenAI, then I'm very likely to see an

answer that is probably very very

similar to what that human labeler would

have put down

for those five landmarks how does the

human labeler come up with this well

they go off and they go on the internet

and they kind of do their own little

research for 20 minutes and they just

come up with a list right now so if they

come up with this list and this is in

the data set I'm probably very likely to

see what they submitted as the correct

answer from the assistant now if this

specific query is not part of the post

training data set then what I'm getting

here is a little bit more emergent uh

because the model kind of understands statistically that the kinds of landmarks that are in

this training set are usually the

prominent landmarks the landmarks that

people usually want to see the kinds of

landmarks that are usually uh very often

talked about on the internet and

remember that the model already has a

ton of Knowledge from its pre-training

on the internet so it's probably seen a

ton of conversations about Paris about

landmarks about the kinds of things that

people like to see and so it's the

pre-training knowledge that has then

combined with the post-training data set

that results in this kind of an

imitation um

so that's uh that's roughly how you can

kind of think about what's happening

behind the scenes here in in this

statistical sense okay now I want to

turn to the topic of llm psychology as I

like to call it which is what are sort

of the emergent cognitive effects of the

training pipeline that we have for these

models so in particular the first one I

want to talk to is of course

hallucinations so you might be familiar

with model hallucinations it's when llms

make stuff up they just totally

fabricate information Etc and it's a big

problem with llm assistants it is a

problem that existed to a large extent

with early models uh from many years ago

and I think the problem has gotten a bit

better uh because there are some

mitigations that I'm going to go into in

a second for now let's just try to

understand where these hallucinations

come from so here's a specific example

of a few uh of three conversations that

you might think you have in your

training set and um these are pretty

reasonable conversations that you could

imagine being in the training set so

like for example: who is Tom Cruise? Well, Tom Cruise is a famous American actor and producer, etc. Who is John Barrasso? This turns out to be a US senator, for example. Who is Genghis Khan? Well, Genghis Khan was

blah blah blah and so this is what your

conversations could look like at

training time now the problem with this

is that when the human is writing the

correct answer for the assistant in each

one of these cases uh the human either

like knows who this person is or they

research them on the Internet and they

come in and they write this response

that kind of has this like confident

tone of an answer and what happens

basically is that at test time when you

ask who someone is. This is a totally random name that I totally came up with, and I don't think this person exists, as far as I know; I just tried to generate

it randomly the problem is when we ask

who is Orson kovats the problem is that

the assistant will not just tell you oh

I don't know even if the assistant and

the language model itself might know

inside its features inside its

activations inside of its brain sort of

it might know that this person is not someone that it's familiar with, even if some part of the

network kind of knows that in some sense

saying "oh, I don't know who this is" is not going to happen, because the model statistically imitates its training set. In the training set, the

questions of the form who is blah are

confidently answered with the correct

answer and so it's going to take on the

style of the answer and it's going to do

its best it's going to give you

statistically the most likely guess and

it's just going to basically make stuff

up because these models again we just

talked about it is they don't have

access to the internet they're not doing

research. These are statistical token tumblers, as I call them; it's just

trying to sample the next token in the

sequence and it's going to basically

make stuff up so let's take a look at

what this looks

like I have here what's called the

inference playground from hugging face

and I am on purpose picking on a model

called Falcon 7B which is an old model

this is a few years ago now so it's an

older model So It suffers from

hallucinations and as I mentioned this

has improved over time recently but

let's say who is Orson kovats let's ask

Falcon 7B instruct

run oh yeah Orson kovat is an American

author and science uh fiction writer

okay this is totally false it's

hallucination let's try again these are

statistical systems right so we can

resample this time Orson kovat is a

fictional character from this 1950s TV

show it's total BS right let's try again

he's a former minor league baseball

player okay so basically the model

doesn't know and it's given us lots of

different answers because it doesn't

know it's just kind of like sampling

from these probabilities the model

starts with the tokens who is oron

kovats assistant and then it comes in

here and it's get it's getting these

probabilities and it's just sampling

from the probabilities and it just like

comes up with stuff and the stuff is

actually

statistically consistent with the style

of the answer in its training set and

it's just doing that but you and I

experience it as made-up factual

knowledge but keep in mind that uh the

model basically doesn't know and it's

just imitating the format of the answer

and it's not going to go off and look it

up uh because it's just imitating again

the answer so how can we uh mitigate

this because for example when we go to

ChatGPT and I say who is Orson Kovats, and I'm now asking the state-of-the-art model from OpenAI,

this model will tell

you oh so this model is actually is even

smarter because you saw very briefly it

said searching the web uh we're going to

cover this later um it's actually trying

to do tool use and

uh kind of just like came up with some

kind of a story, but I want to just ask: who is Orson Kovats? Do not use any tools. I don't want it to do web search. And the answer: there isn't a well-known historical or public figure named Orson Kovats. So

this model is not going to make up stuff

this model knows that it doesn't know

and it tells you that it doesn't appear

to be a person that this model knows so

somehow we sort of improved

hallucinations even though they clearly

are an issue in older models and it

makes totally uh sense why you would be

getting these kinds of answers if this

is what your training set looks like so

how do we fix this okay well clearly we

need some examples in our data set that

where the correct answer for the

assistant is that the model doesn't know

about some particular fact but we only

need to have those answers be produced

in the cases where the model actually

doesn't know and so the question is how

do we know what the model knows or

doesn't know well we can empirically

probe the model to figure that out so

let's take a look at for example how

meta uh dealt with hallucinations for

the Llama 3 series of models as an

example so in this paper that they

published from meta we can go into

hallucinations

which they call here factuality and they

describe the procedure by which they

basically interrogate the model to

figure out what it knows and doesn't

know to figure out sort of like the

boundary of its knowledge and then they

add examples to the training set where

for the things where the model doesn't

know them the correct answer is that the

model doesn't know them which sounds

like a very easy thing to do in

principle but this roughly fixes the

issue and the the reason it fixes the

issue is

because remember like the model might

actually have a pretty good model of its

self knowledge inside the network so

remember we looked at the network and

all these neurons inside the network you

might imagine that there's a neuron

somewhere in the network that sort of

like lights up for when the model is

uncertain but the problem is that the

activation of that neuron is not

currently wired up to the model actually

saying in words that it doesn't know so

even though the internals of the neural network know, because there are some neurons that represent that, the model will not surface it; it will instead take

its best guess so that it sounds

confident um just like it sees in a

training set so we need to basically

interrogate the model and allow it to

say I don't know in the cases that it

doesn't know so let me take you through

what meta roughly does so basically what

they do is here I have an example uh

Dominik Hasek is the featured article

today so I just went there randomly and

what they do is basically they take a

random document in a training set and

they take a paragraph and then they use

an llm to construct questions about that

paragraph so for example I did that with

chat GPT

here so I said here's a paragraph from

this document generate three specific

factual questions based on this

paragraph and give me the questions and

the answers and so the llms are already

good enough to create and reframe this

information so if the information is in

the context window um of this llm this

actually works pretty well it doesn't

have to rely on its memory it's right

there in the context window and so it

can basically reframe that information

with fairly high accuracy so for example

can generate questions for us like for

which team did he play here's the answer

how many cups did he win Etc and now

what we have to do is we have some

question and answers and now we want to

interrogate the model so roughly

speaking what we'll do is we'll take our

questions and we'll go to our model

which would be, say, Llama in Meta's case, but let's just interrogate Mistral 7B here

as an example that's another model so

does this model know about this answer

let's take a

look. So: he played for the Buffalo Sabres, right? So the model knows, and the way

that you can programmatically decide is

basically we're going to take this

answer from the model and we're going to

compare it to the correct answer and

again, the models are good enough to

do this automatically so there's no

humans involved here we can take uh

basically the answer from the model and

we can use another llm judge to check if

that is correct according to this answer

and if it is correct that means that the

model probably knows so what we're going

to do is we're going to do this maybe a

few times. So okay, it knows, it's the Buffalo Sabres; let's try again: Buffalo Sabres; let's try one more time: Buffalo Sabres. So we asked three

times about this factual question and

the model seems to know so everything is

great now let's try the second question

how many Stanley Cups did he

win and again let's interrogate the

model about that and the correct answer

is

two so um here the model claims that he

won um four times which is not correct

right it doesn't match two so the model

doesn't know it's making stuff up let's

try again

um so here the model again it's kind of

like making stuff up, right? Let's try again. Here it says he did not even win during his career. So

obviously the model doesn't know and the

way we can programmatically tell, again, is we interrogate the model a few times and we compare its answers, maybe three times, five times, whatever it is, to the correct answer, and if the answers don't match, then we know that the model doesn't know this question.
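Here is a rough sketch of that interrogation loop as I understand the description; `ask_model` and `llm_judge_matches` are hypothetical helpers standing in for "sample an answer from the model" and "use another LLM to check it against the reference answer", and this is not Meta's actual pipeline code:

```python
# Probing the knowledge boundary, sketched: ask the model the same factual question
# a few times, judge each answer against the known correct one with another LLM,
# and if it keeps getting it wrong, add an "I don't know" example to the SFT data.
# `ask_model` and `llm_judge_matches` are hypothetical placeholder helpers.
def probe_and_label(question, correct_answer, num_tries=3):
    answers = [ask_model(question) for _ in range(num_tries)]
    model_knows = all(llm_judge_matches(a, correct_answer) for a in answers)
    assistant_reply = correct_answer if model_knows else "I'm sorry, I don't know."
    return {"user": question, "assistant": assistant_reply}

# e.g. probe_and_label("How many Stanley Cups did he win?", "two") would come back
# with the "I don't know" reply if the model keeps guessing wrong, and that example
# then goes into the post-training conversations.
```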

and then what we do is we take this

question we create a new conversation in

the training set so we're going to add a

new conversation training set and when

the question is how many Stanley Cups

did he win the answer is I'm sorry I

don't know or I don't remember and

that's the correct answer for this

question because we interrogated the

model and we saw that that's the case if

you do this for many different types of

uh questions for many different types of

documents you are giving the model an

opportunity to in its training set

refuse to say based on its knowledge and

if you just have a few examples of that

in your training set the model will know

um and and has the opportunity to learn

the association of this knowledge-based

refusal to this internal neuron

somewhere in its Network that we presume

exists and empirically this turns out to

be probably the case and it can learn

that Association that hey when this

neuron of uncertainty is high then I

actually don't know and I'm allowed to

say that I'm sorry but I don't think I

remember this Etc and if you have these

uh examples in your training set then

this is a large mitigation for

hallucination and that's roughly

speaking why ChatGPT is able to do stuff

like this as well so these are kinds of

uh mitigations that people have

implemented and that have improved the

factuality issue over time okay so I've

described mitigation number one for

basically mitigating the hallucinations

issue now we can actually do much better

than that uh it's instead of just saying

that we don't know uh we can introduce

an additional mitigation number two to

give the llm an opportunity to be

factual and actually answer the question

now what do you and I do if I was to ask

you a factual question and you don't

know uh what would you do um in order to

answer the question well you could uh go

off and do some search and uh use the

internet and you could figure out the

answer and then tell me what that answer

is and we can do the exact same

thing with these models so think of the

knowledge inside the neural network

inside its billions of parameters think

of that as kind of a vague recollection

of the things that the model has seen

during its training during the

pre-training stage a long time ago so

think of that knowledge in the

parameters as something you read a month

ago and if you keep reading something

then you will remember it and the model

remembers that but if it's something

rare then you probably don't have a

really good recollection of that

information but what you and I do is we

just go and look it up now when you go

and look it up what you're doing

basically is like you're refreshing your

working memory with information and then

you're able to sort of like retrieve it

talk about it or Etc so we need some

equivalent of allowing the model to

refresh its memory or its recollection

and we can do that by introducing tools

uh for the

models so the way we are going to

approach this is that instead of just

saying hey I'm sorry I don't know we can

attempt to use tools so we can create uh

a mechanism

by which the language model can emit

special tokens and these are tokens that

we're going to introduce new tokens so

for example here I've introduced two

tokens and I've introduced a format or a

protocol for how the model is allowed to

use these tokens so for example instead

of answering the question when the model does not know, instead of just saying "I don't know, sorry", the model now has the option of emitting the special token search

start and this is the query that will go

to like bing.com in the case of openai

or say Google search or something like

that so it will emit the query and then

it will emit search end and then here

what will happen is that the program

that is sampling from the model that is

running the inference when it sees the

special token search end instead of

sampling the next token uh in the

sequence it will actually pause

generating from the model it will go off

it will open a session with bing.com and

it will paste the search query into Bing

and it will then um get all the text

that is retrieved and it will basically

take that text it will maybe represent

it again with some other special tokens

or something like that and it will take

that text and it will copy paste it here

into what I tried to show with the

brackets so all that text kind of comes

here and when the text comes here it

enters the context window so the model

so that text from the web search is now

inside the context window that will feed

into the neural network and you should

think of the context window as kind of

like the working memory of the model

that data that is in the context window

is directly accessible by the model it

directly feeds into the neural network

so it's not anymore a vague recollection

it's data that it it has in the context

window and is directly available to that

model so now when it's sampling the new

uh tokens here afterwards it can

reference very easily the data that has

been copy-pasted in there. So that's roughly how these tool-use mechanisms function.
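Here is a minimal sketch of how the inference program might intercept those special tokens; the token names, `web_search`, and the sampler are illustrative placeholders rather than any provider's real protocol:

```python
# Tool use at inference time, sketched: if the model emits the search-end token,
# pause generation, run the query, and paste the retrieved text back into the
# context window (the working memory) before continuing to sample.
SEARCH_START, SEARCH_END = "<|search_start|>", "<|search_end|>"

def web_search(query):
    # Placeholder: a real system would call a search engine here (e.g. Bing)
    # and return the text of the retrieved pages.
    return f"[retrieved text for: {query}]"

def generate_with_search(model, tokenizer, token_ids, max_new_tokens=1024):
    for _ in range(max_new_tokens):
        next_id = sample_next_token(model, token_ids)   # placeholder sampler, as before
        token_ids.append(next_id)
        if tokenizer.decode([next_id]) == SEARCH_END:
            text = tokenizer.decode(token_ids)
            query = text.split(SEARCH_START)[-1].split(SEARCH_END)[0]  # pull out the query
            token_ids += tokenizer.encode(web_search(query))           # paste results into context
    return tokenizer.decode(token_ids)
```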

and so web search is just one of the

tools we're going to look at some of the

other tools in a bit uh but basically

you introduce new tokens you introduce

some schema by which the model can

utilize these tokens and can call these

special functions like web search

functions and how do you teach the model

how to correctly use these tools like

say web search search start search end

Etc well again you do that through

training sets so we need now to have a

bunch of data and a bunch of

conversations that show the model by

example how to use web search so what

are the settings where you are using the search, and what does that look like, and here's, by example, how you start a search and end a search, etc.

and uh if you have a few thousand maybe

examples of that in your training set

the model will actually do a pretty good

job of understanding uh how this tool

works and it will know how to sort of

structure its queries and of course

because of the pre-training data set and

its understanding of the world it

actually kind of understands what a web

search is and so it actually kind of has

a pretty good native understanding

um of what kind of stuff is a good

search query um and so it all kind of

just like works you just need a little

bit of a few examples to show it how to

use this new tool and then it can lean

on it to retrieve information and uh put

it in the context window and that's

equivalent to you and I looking

something up because once it's in the

context it's in the working memory and

it's very easy to manipulate and access

so that's what we saw a few minutes ago

when I was searching on chat GPT for who

is Orson kovats the chat GPT language

model decided that this is some kind

of a rare um individual or something

like that and instead of giving me an

answer from its memory it decided that

it will sample a special token that is

going to do web search and we saw

briefly something flash it was like

using the web tool or something like

that so it briefly said that and then we

waited for like two seconds and then it

generated this and you see how it's

creating references here and so it's

citing sources so what happened here is

it went off, it did a web search, it

found these sources and these URLs and

the text of these web pages was all

stuffed in between here and it's not

showing here but it's it's basically

stuffed as text in between here and now

it sees that text and now it kind of

references it and says that okay it

could be these people citation could be

those people citation Etc so that's what

happened here, and that's why when I said who is Orson Kovats I

could also say don't use any tools and

then that's enough to basically convince ChatGPT to not use

tools and just use its memory and its

recollection. I also went off and tried to ask this question of ChatGPT: how many Stanley Cups did Dominik Hasek win? And ChatGPT actually decided that it knows the answer and it has the confidence to say that he won twice

and so it kind of just relied on its

memory, because presumably it has enough of a kind of confidence in its weights, in its parameters and activations, that this is retrievable just from memory, but

you can also

conversely use web search to make sure

and then for the same query it actually

goes off and it searches and then it

finds a bunch of sources it finds all

this all of this stuff gets copy pasted

in there, and then it tells us the answer again and cites sources, and it actually gives the Wikipedia article, which is the source of

this information for us as well so

that's tools web search the model

determines when to search and then uh

that's kind of like how these tools uh

work and this is an additional kind of

mitigation for uh hallucinations and

factuality so I want to stress one more

time this very important sort of

psychology

Point knowledge in the parameters of the

neural network is a vague recollection

the knowledge in the tokens that make up

the context

window is the working memory and it

roughly speaking Works kind of like um

it works for us in our brain the stuff

we remember is our parameters uh and the

stuff that we just experienced like a

few seconds or minutes ago and so on you

can imagine that being in our context

window and this context window is being

built up as you have a conscious

experience around you so this has a

bunch of implications also for your use of LLMs in practice so for example I

can go to chat GPT and I can do

something like this I can say can you

summarize chapter one of Jane Austen's

Pride and Prejudice right and this is a

perfectly fine prompt and ChatGPT actually

does something relatively reasonable

here and but the reason it does that is

because ChatGPT has a pretty good

recollection of a famous work like Pride

and Prejudice it's probably seen a ton

of stuff about it there's probably

forums about this book it's probably

read versions of this book um and it's

kind of like remembers because even if

you've read this or articles about it

you'd kind of have a recollection enough

to actually say all this but usually

when I actually interact with LLMs and I

want them to recall specific things it

always works better if you just give it

to them so I think a much better prompt

would be something like this can you

summarize for me chapter one of Jane Austen's Pride and Prejudice and then I am

attaching it below for your reference

and then I do something like a delimiter

here and I paste it in and I I found

that just copy pasting it from some

website that I found here um so copy

pasting the chapter one here and I do

that because when it's in the context

window the model has direct access to it

and it doesn't have to recall it, it just has access to it, and so this summary can be expected to be of significantly higher quality than this summary, just

because it's directly available to the

model and I think you and I would work

in the same way if you want to it would

be you would produce a much better

summary if you had reread this chapter

before you had to summarize it, and that's basically what's happening here, or the equivalent of it.
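As a rough template of the better prompt being described here (the delimiter choice and placeholder are just illustrative):

```text
Can you summarize for me Chapter 1 of Jane Austen's Pride and Prejudice?
I am attaching it below for your reference.

---
[paste the full text of Chapter 1 here]
---
```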

The next sort of psychological quirk I'd like to talk

about briefly is that of the knowledge

of self so what I see very often on the

internet is that people do something

like this they ask llms something like

what model are you and who built you and

um basically this uh question is a

little bit nonsensical and the reason I

say that is that as I try to kind of

explain with some of the underhood

fundamentals this thing is not a person

right it doesn't have a persistent

existence in any way it sort of boots up

processes tokens and shuts off and it

does that for every single person it

just kind of builds up a context window

of conversation and then everything gets

deleted and so this this entity is kind

of like restarted from scratch every

single conversation if that makes sense

it has no persistent self it has no

sense of self it's a token tumbler and

uh it follows the statistical

regularities of its training set so it

doesn't really make sense to ask it who

are you, who built you, etc and by

default if you do what I described and

just by default and from nowhere you're

going to get some pretty random answers

so for example let's uh pick on Falcon

which is a fairly old model and let's

see what it tells

us uh so it's evading the question uh

talented engineers and developers here

it says I was built by open AI based on

the gpt3 model it's totally making stuff

up now the fact that it's built by open

AI here I think a lot of people would

take this as evidence that this model

was somehow trained on open AI data or

something like that I don't actually

think that that's necessarily true the

reason for that is

that if you don't explicitly program the

model to answer these kinds of questions

then what you're going to get is its

statistical best guess at the answer and

this model had a um sft data mixture of

conversations and during the

fine-tuning um the model sort of

understands as it's training on this

data that it's taking on this

personality of this like helpful

assistant and it doesn't know how to it

doesn't actually it wasn't told exactly

what label to apply to self it just kind

of is taking on this uh this uh Persona

of a helpful assistant and remember that

the pre-training stage took the

documents from the entire internet and

Chach and open AI are very prominent in

these documents and so I think what's

actually likely to be happening here is

that this is just its hallucinated label

for what it is this is its self-identity

is that it's chat GPT by open Ai and

it's only saying that because there's a

ton of data on the internet of um

answers like this that are actually

coming from ChatGPT, and so

that's its label for what it is now you

can override this as a developer if you

have a llm model you can actually

override it and there are a few ways to

do that so for example let me show you

there's this Olmo model from Allen AI and

um this is one llm it's not a top tier

LM or anything like that but I like it

because it is fully open source so the

paper for Olmo and everything else is

completely fully open source which is

nice um so here we are looking at its

sft mixture so this is the data mixture

of um the fine tuning so this is the

conversations data set, right, and so the way that they are solving it for the Olmo

model is we see that there's a bunch of

stuff in the mixture and there's a total

of 1 million conversations here but here

we have Olmo 2 hardcoded, and if we go there

we see that this is 240

conversations and look at these 240

conversations they're hardcoded tell me

about yourself, says the user, and then the assistant says: I'm an open language model developed by Ai2, the Allen Institute for Artificial Intelligence, etc, I'm here

to help, blah blah blah. What is your name? The Olmo project. So these are all kinds of cooked up, hardcoded questions about Olmo 2 and the correct answers to give

in these cases if you take 240 questions

like this or conversations put them into

your training set and fine tune with it

then the model will actually be expected

to parrot this stuff later. If you don't give it this, then it will probably say it's ChatGPT by OpenAI. And there's one more way to

sometimes do this is

that basically um in these conversations

you have turns between human and

assistant sometimes there's a special

message called system message at the

very beginning of the conversation so

it's not just between human and

assistant there's a system and in the

system message you can actually hardcode

and remind the model that hey you are a

model developed by OpenAI and your name is ChatGPT-4o and you were trained on

this date and your knowledge cut off is

this and basically it kind of like

documents the model a little bit and

then this is inserted into your conversations, so when you go on ChatGPT you

see a blank page but actually the system

message is kind of like hidden in there

and those tokens are in the context window.
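Sketched out, a hidden system message of the kind being described might look roughly like this; the exact wording and dates are made up for illustration, not the real text any provider uses.

```text
System:    You are ChatGPT, a large language model trained by OpenAI.
           Knowledge cutoff: <date>. Current date: <date>.
Human:     What model are you and who built you?
Assistant: I'm ChatGPT, a language model developed by OpenAI. ...
```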

So those are the two ways to kind of program the models to talk

about themselves either it's done

through uh data like this or it's done

through system message and things like

that basically invisible tokens that are

in the context window and remind the

model of its identity but it's all just

kind of like cooked up and bolted on in

some in some way it's not actually like

really deeply there in any real sense as it would be for a human. I want to now

continue to the next section which deals

with the computational capabilities or

like I should say the native

computational capabilities of these

models in problem solving scenarios and

so in particular we have to be very

careful with these models when we

construct our examples of conversations

and there's a lot of sharp edges here

that are kind of like elucidative is

that a word uh they're kind of like

interesting to look at when we consider

how these models think so um consider

the following prompt from a human and

supposed that basically that we are

building out a conversation to enter

into our training set of conversations

so we're going to train the model on

this we're teaching you how to basically

solve simple math problems so the prompt

is Emily buys three apples and two

oranges each orange cost $2 the total

cost is 13 what is the cost of apples

very simple math question now there are

two answers here on the left and on the

right they are both correct answers they

both say that the answer is three which

is correct but one of these two is a

significantly better answer for

the assistant than the other like if I

was Data labeler and I was creating one

of these one of these would be uh a

really terrible answer for the assistant

and the other would be okay and so I'd

like you to potentially pause the video

Even and think through why one of these

two is significantly better answer uh

than the other and um if you use the

wrong one your model will actually be uh

really bad at math potentially and it

would have uh bad outcomes and this is

something that you would be careful with

in your labeling documentation

when you are training people uh to

create the ideal responses for the

assistant okay so the key to this

question is to realize and remember that

when the models are training and also

inferencing they are working in

a one-dimensional sequence of tokens from

left to right and this is the picture

that I often have in my mind I imagine

basically the token sequence evolving

from left to right and to always produce

the next token in a sequence we are

feeding all these tokens into the neural

network, and this neural network then gives the probabilities for the next token in the sequence, right, so this picture here is

the exact same picture we saw uh before

up here and this comes from the web demo

that I showed you before right so this

is the calculation that basically takes

the input tokens here on the top and uh

performs these operations of all these

neurons and uh gives you the answer for

the probabilities of what comes next now

the important thing to realize is that

roughly

speaking uh there's basically a finite

number of layers of computation that

happened here so for example this model

here has only one two three layers of

what's called attention and MLP here

um maybe um typical modern

state-of-the-art Network would have more

like say 100 layers or something like

that but there's only 100 layers of

computation or something like that to go

from the previous token sequence to the

probabilities for the next token and so

there's a finite amount of computation

that happens here for every single token

and you should think of this as a very

small amount of computation and this

amount of computation is almost roughly

fixed uh for every single token in this

sequence um the that's not actually

fully true because the more tokens you

feed in uh the the more expensive uh

this forward pass will be of this neural

network but not by much so you should

think of this uh and I think as a good

model to have in mind this is a fixed

amount of compute that's going to happen

in this box for every single one of

these tokens and this amount of compute

can't possibly be too big because there's

not that many layers that are sort of

going from the top to bottom here

there's not that much computation that will happen here

and so you can't expect the model to

basically do arbitrary computation in a

single forward pass to get a single

token and so what that means is that we

actually have to distribute our

reasoning and our computation across

many tokens because every single token

is only spending a finite amount of

computation on it and so we kind of want

to distribute the computation across

many tokens and we can't have too much

computation or expect too much

computation out of of the model in any

single individual token because there's

only so much computation that happens

per token okay roughly fixed amount of

computation here
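Here is a toy numerical sketch of that point, assuming nothing about any real architecture (this is not a real Transformer, just a stand-in): producing each next token runs the context through a fixed number of layers, so the compute available per emitted token is roughly constant.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model, n_layers = 50, 16, 4   # toy sizes; real models use on the order of 100 layers

W_embed = rng.normal(size=(vocab_size, d_model))
layer_weights = [rng.normal(size=(d_model, d_model)) for _ in range(n_layers)]
W_unembed = rng.normal(size=(d_model, vocab_size))

def next_token_probs(token_ids):
    x = W_embed[token_ids].mean(axis=0)   # crude stand-in for mixing the context
    for W in layer_weights:               # fixed number of layers = roughly fixed compute per token
        x = np.tanh(x @ W)
    logits = x @ W_unembed
    p = np.exp(logits - logits.max())
    return p / p.sum()                    # probability distribution over the vocabulary

probs = next_token_probs(np.array([3, 17, 42]))
print(probs.shape)   # (50,)
```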

so that's why this answer here is

significantly worse and the reason for

that is Imagine going from left to right

here um and I copy pasted it right here

the answer is three Etc imagine the

model having to go from left to right

emitting these tokens one at a time it

has to say or we're expecting to say the

answer is space dollar sign and then

right here we're expecting it to

basically cram all of the computation of

this problem into this single token it

has to emit the correct answer three and

then once we've emitted the answer three

we're expecting it to say all these

tokens but at this point we've already

produced the answer and it's

already in the context window for all

these tokens that follow so anything

here is just kind of post hoc

justification of why this is the answer

um because the answer is already created

it's already in the token window so it's

it's not actually being calculated here

um and so if you are answering the

question directly and immediately you

are training the model to to try to

basically guess the answer in a single

token and that is just not going to work

because of the finite amount of

computation that happens per token

that's why this answer on the right is

significantly better because we are

Distributing this computation across the

answer we're actually getting the model

to sort of slowly come to the answer

from the left to right we're getting

intermediate results we're saying okay

the total cost of oranges is 4, so 13 - 4 is 9, and so we're creating

intermediate calculations and each one

of these calculations is by itself not

that expensive and so we're actually

basically kind of guessing a little bit

the difficulty that the model is capable

of in any single one of these individual

tokens and there can never be too much

work in any one of these tokens

computationally because then the model

won't be able to do that later at test

time and so we're teaching the model

here to spread out its reasoning and to

spread out its computation over the

tokens and in this way it only has very

simple problems in each token and they

can add up and then by the time it's

near the end it has all the previous

results in its working memory and it's

much easier for it to determine the answer, and here it is, three, so this

is a significantly better label for our

computation this would be really bad and

is teaching the model to try to do all

the computation in a single token, and it's really bad.
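Paraphrasing the two candidate labels side by side (the wording here is an illustration of the contrast being described, not the exact text on screen):

```text
Prompt: Emily buys 3 apples and 2 oranges. Each orange costs $2.
        The total cost of all the fruit is $13. What is the cost of each apple?

Worse label (answer crammed into a single token, justification after the fact):
        The answer is $3. This is because the 2 oranges cost $4 in total, so the
        3 apples cost $9, and 9 / 3 = 3.

Better label (computation spread across the tokens, answer at the end):
        The total cost of the oranges is 2 * $2 = $4. 13 - 4 = 9, so the three
        apples cost $9 in total. 9 / 3 = 3, so each apple costs $3.
        The answer is $3.
```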

So that's kind of an interesting thing to keep in mind for your prompts. You usually don't have to think

about it explicitly because uh the

people at open AI have labelers and so

on that actually worry about this and

they make sure that the answers are

spread out and so actually open AI will

kind of like do the right thing so when

I ask this question for chat GPT it's

actually going to go very slowly it's

going to be like okay let's define our

variables set up the equation

and it's kind of creating all these

intermediate results these are not for

you these are for the model if the model

is not creating these intermediate

results for itself it's not going to be

able to reach three I also wanted to

show you that it's possible to be a bit

mean to the model uh we can just ask for

things so as an example I said I gave it

the exact same uh prompt and I said

answer the question in a single token

just immediately give me the answer

nothing else and it turns out that for

this simple um prompt here it actually

was able to do it in single go so it

just created a single I think this is

two tokens right uh because the dollar

sign is its own token so basically this

model didn't give me a single token it

gave me two tokens but it still produced

the correct answer and it did that in a

single forward pass of the

network now that's because the numbers

here I think are very simple and so I

made it a bit more difficult to be a bit

mean to the model so I said Emily buys

23 apples and 177 oranges and then I

just made the numbers a bit bigger and

I'm just making it harder for the model

I'm asking it to do more computation in a

single token and so I said the same

thing and here it gave me five and five

is actually not correct so the model

failed to do all of this calculation in

a single forward pass of the network it

failed to go from the input tokens and

then in a single forward pass of the

network single go through the network it

couldn't produce the result and then I

said okay now don't worry about the the

token limit and just solve the problem

as usual and then it goes all the

intermediate results it simplifies and

every one of these intermediate results

here and intermediate calculations is

much easier for the model and um it sort

of it's not too much work per token all

of the tokens here are correct and it arrives at the solution, which is seven. It just couldn't squeeze all of that work into a single forward pass of the network, so I think

that's kind of just a cute example and

something to kind of like think about

and I think it's kind of again just

elucidative in terms of how these uh

models work the last thing that I would

say on this topic is that if I was in

practice trying to actually solve this

in my day-to-day life I might actually

not trust that the model did all the intermediate calculations correctly here

so actually probably what I do is

something like this I would come here

and I would say use code and uh that's

because code is one of the possible

tools that ChatGPT can use and instead

of it having to do mental arithmetic

like this mental arithmetic here I don't

fully trust it and especially if the

numbers get really big there's no

guarantee that the model will do this

correctly any one of these intermediates

steps might in principle fail we're

using neural networks to do mental

arithmetic uh kind of like you doing

mental arithmetic in your brain it might

just like uh screw up some of the

intermediate results it's actually kind

of amazing that it can even do this kind

of mental arithmetic I don't think I

could do this in my head but basically

the model is kind of like doing it in

its head and I don't trust that so I

wanted to use tools so you can say stuff

like use

code and uh I'm not sure what happened

there use

code and so um like I mentioned there's

a special tool and the uh the model can

write code and I can inspect that this

code is correct and then uh it's not

relying on its mental arithmetic it is

using the python interpreter which is a

very simple programming language to

basically uh write out the code that

calculates the result and I would

personally trust this a lot more because

this came out of a Python program which

I think has a lot more correctness

guarantees than the mental arithmetic of

a language model uh so just um another

kind of uh potential hint that if you

have these kinds of problems uh you may

want to basically just uh ask the model

to use the code interpreter and just

like we saw with the web search the

model has special uh kind of tokens for

calling uh like it will not actually

generate these tokens from the language

model it will write the program and then

it actually sends that program to a

different sort of part of the computer

that actually just runs that program and

brings back the result and then the

model gets access to that result and can

tell you that okay the cost of each

apple is seven
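Roughly the kind of short program the model writes when told to "use code"; this sketch is for the simpler version of the problem from earlier (not the one with the bigger numbers), just to show the shape of it.

```python
# plain arithmetic in the interpreter instead of "mental arithmetic" inside the network
num_apples, num_oranges = 3, 2
orange_price = 2
total_cost = 13

apple_price = (total_cost - num_oranges * orange_price) / num_apples
print(apple_price)  # 3.0 -> each apple costs $3
```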

um so that's another kind of tool and I

would use this in practice for yourself

and it's um yeah it's just uh less error

prone I would say so that's why I called

this section models need tokens to think

distribute your computation across many

tokens ask models to create intermediate

results or whenever you can lean on

tools and Tool use instead of allowing

the models to do all of the stuff in

their memory so if they try to do it all

in their memory I don't fully trust it

and prefer to use tools whenever

possible I want to show you one more

example of where this actually comes up

and that's in counting so models

actually are not very good at counting

for the exact same reason you're asking

for way too much in a single individual

token so let me show you a simple

example of that um how many dots are

below and then I just put in a bunch of

dots and Chach says there are and then

it just tries to solve the problem in a

single token so in a single token it has

to count the number of dots in its

context window

um and it has to do that in the single

forward pass of a network and a single

forward pass of a network as we talked

about there's not that much computation

that can happen there just think of that

as being like very little computation

that happens there so if I just look at

what the model sees, let's go to the Tiktokenizer, it sees

this how many dots are below and then it

turns out that these dots here this

group of I think 20 dots is a single

token and then this group of whatever it

is is another token and then for some

reason they break up as this so I don't

actually this has to do with the details

of the tokenizer but it turns out that

these um the model basically sees the

token ID this this this and so on and

then from these token IDs it's expected

to count the number and spoiler alert is

not 161 it's actually I believe

177 so here's what we can do instead uh

we can say use code and you might expect

that like why should this work and it's

actually kind of subtle and kind of

interesting so when I say use code I

actually expect this to work let's see

okay 177 is correct so what happens here

is I've actually it doesn't look like it

but I've broken down the problem into problems that are easier for the model I

know that the model can't count it can't

do mental counting but I know that the

model is actually pretty good at doing

copy pasting so what I'm doing here is

when I say use code it creates a string

in Python for this and the task of

basically copy pasting my input here to

here is very simple because for the

model um it sees this string of uh it

sees it as just these four tokens or

whatever it is so it's very simple for

the model to copy paste those token IDs

and um kind of unpack them into Dots

here and so it creates this string and

then it calls the Python string routine .count() and

then it comes up with the correct answer
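The same idea in two lines: the model only needs to copy the dots into a Python string, and the interpreter does the counting (177 here is the count mentioned for the original prompt).

```python
dots = "." * 177          # stand-in for the dots copy-pasted out of the prompt
print(dots.count("."))    # 177 -- the interpreter, not the model, does the counting
```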

so the python interpreter is doing the

counting it's not the models mental

arithmetic doing the counting so it's

again a simple example of um models need

tokens to think don't rely on their

mental arithmetic and um that's why also

the models are not very good at counting

if you need them to do counting tasks

always ask them to lean on the tool now

the models also have many other little

cognitive deficits here and there and

these are kind of like sharp edges of

the technology to be kind of aware of

over time so as an example the models

are not very good with all kinds of

spelling related tasks they're not very

good at it and I told you that we would

loop back around to tokenization and the

reason to do for this is that the models

they don't see the characters they see

tokens and they their entire world is

about tokens which are these little text

chunks and so they don't see characters

like our eyes do and so very simple

character level tasks often fail so for

example uh I'm giving it a string

ubiquitous and I'm asking it to print

only every third character starting with

the first one so we start with U and

then we should go every third so every

so 1 2 3 Q should be next and then Etc

so this I see is not correct and again

my hypothesis is that this is, again, the mental arithmetic failing here, number

one a little bit but number two I think

the the more important issue here is

that if you go to Tiktokenizer and you look at ubiquitous we

see that it is three tokens right so you

and I see ubiquitous and we can easily

access the individual letters because we

kind of see them and when we have it in

the working memory of our visual sort of

field we can really easily index into

every third letter and I can do that

task but the models don't have access to

the individual letters they see this as

these three tokens and uh remember these

models are trained from scratch on the

internet and all these token uh

basically the model has to discover how

many of all these different letters are

packed into all these different tokens

and the reason we even use tokens is

mostly for efficiency uh but I think a

lot of people are interested in deleting tokens entirely, like we should really have character-level or byte-level

models it's just that that would create

very long sequences and people don't

know how to deal with that right now so

while we have the token World any kind

of spelling tasks are not actually

expected to work super well so because I

know that spelling is not a strong suit

because of tokenization, I can again ask it to lean on tools so I can just say

use code and I would again expect this

to work because the task of copy pasting

ubiquitous into the python interpreter

is much easier and then we're leaning on

python interpreter to manipulate the

characters of this string so when I say

use

code

ubiquitous: yes, it indexes into every third character, and the output, "uqts", looks correct to me.
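What the interpreter makes trivial here, sketched in one line: slicing with a step of 3 picks every third character starting from the first.

```python
word = "ubiquitous"
print(word[::3])  # 'uqts' -- the characters at positions 1, 4, 7, 10
```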

So again, an example of spelling-related

tasks not working very well a very

famous example of that recently is how

many Rs are there in strawberry and this

went viral many times and basically the

models now get it correct they say there

are three Rs in Strawberry but for a

very long time all the state-of-the-art

models would insist that there are only

two Rs in strawberry and this caused a

lot of you know Ruckus because is that a

word I think so because um it just kind

of like why are the models so brilliant

and they can solve math Olympiad

questions but they can't like count RS

in strawberry and the answer for that

again is I've got built up to it kind of

slowly but number one the models don't

see characters they see tokens and

number two they are not very good at

counting and so here we are combining

the difficulty of seeing the characters

with the difficulty of counting and

that's why the models struggled with

this even though I think by now honestly

I think OpenAI may have hardcoded the

answer here or I'm not sure what they

did but um uh but this specific query

now works

so models are not very good at spelling

and there's a bunch of other

little sharp edges and I don't want to

go into all of them I just want to show

you a few examples of things to be aware

of and uh when you're using these models

in practice I don't actually want to

have a comprehensive analysis here of

all the ways that the models are kind of

like falling short I just want to make

the point that there are some Jagged

edges here and there and we've discussed

a few of them and a few of them make

sense but some of them also will just

not make as much sense and they're kind

of like you're left scratching your head

even if you understand in- depth how

these models work and and good example

of that recently is the following uh the

models are not very good at very simple

questions like this and uh this is

shocking to a lot of people because

these models can solve

complex math problems they can answer

PhD grade physics chemistry biology

questions much better than I can but

sometimes they fall short in like super

simple problems like this so here we go

9.11 is bigger than 9.9 and it justifies

it in some way, but obviously it's wrong, and then at the end it actually flips its decision later, so I don't believe

that this is very reproducible sometimes

it flips around its answer sometimes

gets it right, sometimes gets it wrong, let's try

again okay even though it might look

larger okay so here it doesn't even

correct itself in the end if you ask

many times sometimes it gets it right

too but how is it that the model can do

so great at Olympiad grade problems but

then fail on very simple problems like

this and uh I think this one is as I

mentioned a little bit of a head

scratcher it turns out that a bunch of

people studied this in depth and I

haven't actually read the paper uh but

what I was told by this team was that

when you scrutinize the activations

inside the neural network when you look

at some of the features and what what

features turn on or off and what neurons

turn on or off uh a bunch of neurons

inside the neural network light up that

are usually associated with Bible verses

U and so I think the model is kind of

like reminded that these almost look

like Bible verse markers, and in a Bible verse setting 9.11 would come after 9.9

and so basically the model somehow finds

it like cognitively very distracting

that in Bible verses 9.11 would be

greater um even though here it's

actually trying to justify it and come

up to the answer with a math it still

ends up with the wrong answer here so it

basically just doesn't fully make sense

and it's not fully understood and um

there's a few Jagged issues like that so

that's why you should treat this as what it is, which is a stochastic system that is

really magical but that you can't also

fully trust and you want to use it as a

tool not as something that you kind of

like let rip on a problem and

copypaste the results okay so we have

now covered two major stages of training

of large language models we saw that in

the first stage this is called the

pre-training stage we are basically

training on internet documents and when

you train a language model on internet

documents you get what's called a base

model and it's basically an internet

document simulator right now we saw that

this is an interesting artifact and uh

this takes many months to train on

thousands of computers and it's kind of

a lossy compression of the internet and

it's extremely interesting but it's not

directly useful because we don't want to

sample internet documents we want to ask

questions of an AI and have it respond

to our questions so for that we need an

assistant and we saw that we can

actually construct an assistant in the

process of a post

training and specifically in the process

of supervised fine-tuning as we call

it so in this stage we saw that it's

algorithmically identical to

pre-training nothing is going to change

the only thing that changes is the data

set so instead of Internet documents we

now want to create and curate a very

nice data set of conversations so we

want millions of conversations on all kinds

of diverse topics between a human and an

assistant and fundamentally these

conversations are created by humans so

humans write the prompts and humans

write the ideal responses and

they do that based on labeling

documentations now in the modern stack

it's not actually done fully manually by humans, right, they actually

now have a lot of help from these tools

so we can use language models um to help

us create these data sets and that's

done extensively but fundamentally it's

all still coming from Human curation at

the end so we create these conversations

that now becomes our data set we fine

tune on it or continue training on it

and we get an assistant and then we kind

of shifted gears and started talking

about some of the kind of cognitive

implications of what this assistant is

like and we saw that for example the

assistant will hallucinate if you don't

take some sort of mitigations towards it

so we saw that hallucinations would be

common and then we looked at some of the

mitigations of those hallucinations and

then we saw that the models are quite

impressive and can do a lot of stuff in

their head but we saw that they can also

Lean On Tools to become better so for

example we can lean on a web search

in order to hallucinate less and to

maybe bring up some more um recent

information or something like that or we

can lean on tools like code interpreter

so the LLM can write

some code and actually run it and see

the

results so these are some of the topics

we looked at so far um now what I'd like

to do is I'd like to cover the last and

major stage of this Pipeline and that is

reinforcement learning so reinforcement

learning is still kind of thought to be

under the umbrella of post-training, but it is the third and last major stage and

it's a different way of training

language models and usually follows as

this third step so inside companies like

open AI you will start here and these

are all separate teams so there's a team

doing data for pre-training and a team

doing training for pre-training and then

there's a team doing all the

conversation generation in a in a

different team that is kind of doing the

supervised fine-tuning and there will be a

team for the reinforcement learning as

well so it's kind of like a handoff of

these models: you get your base model, then you fine-tune it to be an

assistant and then you go into

reinforcement learning which we'll talk

about uh

now so that's kind of like the major

flow and so let's now focus on

reinforcement learning the last major

stage of training and let me first

actually motivate it and why we would

want to do reinforcement learning and

what it looks like on a high level so I

would now like to try to motivate the

reinforcement learning stage and what it

corresponds to with something that

you're probably familiar with and that

is basically going to school so just

like you went to school to become um

really good at something we want to take

large language models through school and

really what we're doing is um we're um

we have a few paradigms of ways of uh

giving them knowledge or transferring

skills so in particular when we're

working with textbooks in school you'll

see that there are three major kind of

uh pieces of information in these

textbooks three classes of information

the first thing you'll see is you'll see

a lot of exposition um and by the way

this is a totally random book I pulled

from the internet I I think it's some

kind of an organic chemistry or

something I'm not sure uh but the

important thing is that you'll see that

most of the text most of it is kind of

just like the meat of it is exposition

it's kind of like background knowledge

Etc as you are reading through the words

of this Exposition you can think of that

roughly as training on that data so um

and that's why when you're reading

through this stuff this background

knowledge and this all this context

information it's kind of equivalent to

pre-training so it's it's where we build

sort of like a knowledge base of this

data and get a sense of the topic the

next major kind of information that you

will see is these uh problems and with

their worked Solutions so basically a

human expert in this case uh the author

of this book has given us not just a

problem but has also worked through the

solution and the solution is basically

like equivalent to having like this

ideal response for an assistant so it's

basically the expert is showing us how

to solve the problem in its full form, so as we are

reading the solution we are basically

training on the expert data and then

later we can try to imitate the expert

um and basically um that's that roughly

correspond to having the sft model

that's what it would be doing so

basically we've already done

pre-training and we've already covered

this um imitation of experts and how

they solve these problems and the third

stage of reinforcement learning is

basically the practice problems so

sometimes you'll see this is just a

single practice problem here but of

course there will be usually many

practice problems at the end of each

chapter in any textbook and practice

problems of course we know are critical

for learning because what are they

getting you to do they're getting you to

practice uh to practice yourself and

discover ways of solving these problems

yourself and so what you get in a

practice problem is you get a problem

description but you're not given the

solution but you are given the final

answer, usually in the answer key

of the textbook and so you know the

final answer that you're trying to get

to and you have the problem statement

but you don't have the solution you are

trying to practice the solution you're

trying out many different things and

you're seeing what gets you to the final

solution the best and so you're

discovering how to solve these problems

so and in the process of that you're

relying on number one the background

information which comes from

pre-training and number two maybe a

little bit of imitation of human experts

and you can probably try similar kinds

of solutions and so on so we've done

this and this and now in this section

we're going to try to practice and so

we're going to be given prompts we're

going to be given Solutions U sorry the

final answers but we're not going to be

given expert Solutions we have to

practice and try stuff out and that's

what reinforcement learning is about

okay so let's go back to the problem

that we worked with previously just so

we have a concrete example to talk

through as we explore sort of the topic

here. So I'm here in the Tiktokenizer because I'd also like to, well,

I get a text box which is useful but

number two I want to remind you again

that we're always working with

one-dimensional token sequences and so

I actually like prefer this view because

this is like the native view of the llm

if that makes sense like this is what it

actually sees it sees token IDs right

okay so Emily buys three apples and two

oranges each orange is $2 the total cost

of all the fruit is $13 what is the cost

of each apple

and what I'd like to what I like you to

appreciate here is these are like four

possible candidate Solutions as an

example and they all reach the answer

three now what I'd like you to

appreciate at this point is that if I am

the human data labeler that is creating

a conversation to be entered into the

training set I don't actually really

know which of these

conversations to um to add to the data

set some of these conversations kind of

set up a system equations some of them

sort of like just talk through it in

English and some of them just kind of

like skip right through to the

solution. If you look at ChatGPT for

example and you give it this question it

defines a system of variables and it

kind of like does this little thing what

we have to appreciate and uh

differentiate between though is um the

first purpose of a solution is to reach

the right answer of course we want to

get the final answer three that is the

that is the important purpose here but

there's kind of like a secondary purpose

as well where here we are also just kind

of trying to make it like nice uh for

the human because we're kind of assuming

that the person wants to see the

solution they want to see the

intermediate steps we want to present it

nicely Etc so there are two separate

things going on here number one is the

presentation for the human but number

two we're trying to actually get the

right answer um so let's for the moment

focus on just reaching the final answer

if we only care about the final answer, then which of these is the optimal or

the best solution for the llm to reach

the right

answer um and what I'm trying to get at

is we don't know me as a human labeler I

would not know which one of these is

best so as an example we saw earlier on

when we looked at

um the token sequences here and the

mental arithmetic and reasoning we saw

that for each token we can only spend

basically a finite amount of compute here that is not very large, or at least you should think about it that way, and so we can't actually make too big of a leap in any one token, is maybe the way to think about it, so as an

example in this one what's really nice

about it is that it's very few tokens so

it's going to take us very short amount

of time to get to the answer but right

here when we're doing 13 minus 4 divided by 3 equals, right in this token here we're

actually asking for a lot of computation

to happen on that single individual

token and so maybe this is a bad example

to give to the llm because it's kind of

incentivizing it to skip through the

calculations very quickly and it's going

to actually make mistakes in this mental arithmetic, so

maybe it would work better to like

spread it out more, maybe

it would be better to set it up as an

equation maybe it would be better to

talk through it we fundamentally don't

know and we don't know because what is

easy for you or I as or as human

labelers what's easy for us or hard for

us is different than what's easy or hard

for the LLM, its cognition is different, and the token sequences are kind of like differently hard for it, and so some of the

token sequences here that are trivial

for me might be um very too much of a

leap for the llm so right here this

token would be way too hard but

conversely many of the tokens that I'm

creating here might be just trivial to

the llm and we're just wasting tokens

like why waste all these tokens when

this is all trivial so if the only thing

we care about is the final answer

and we're separating out the issue of

the presentation to the human um then we

don't actually really know how to

annotate this example we don't know what

solution to get to the llm because we

are not the

llm and it's clear here in the case of

like the math example but this is

actually like a very pervasive issue

like our knowledge is not the LLM's

knowledge like the llm actually has a

ton of knowledge of PhD in math and

physics chemistry and whatnot so in many

ways it actually knows more than I do

and I'm I'm potentially not utilizing

that knowledge in its problem solving

but conversely I might be injecting a

bunch of knowledge in my solutions that

the LLM doesn't have in its parameters

and then those are like sudden leaps

that are very confusing to the model and

so our cognitions are different and I

don't really know what to put here if

all we care about is the reaching the

final solution and doing it economically

ideally and so long story short we are

not in a good position to create these

token sequences for the LLM, and

they're useful by imitation to

initialize the system but we really want

the llm to discover the token sequences

that work for it; it needs to find for itself what token

sequence reliably gets to the answer

given the prompt and it needs to

discover that in the process of

reinforcement learning and of trial and

error so let's see how this example

would work like in reinforcement

learning

okay so we're now back in the Hugging Face inference playground and that

just allows me to very easily call uh

different kinds of models so as an

example here on the top right I chose

the Gemma 2 2 billion parameter model so

two billion is very very small so this

is a tiny model but it's okay so we're

going to give it um the way that

reinforcement learning will basically

work is actually quite simple

we need to try many different kinds of

solutions and we want to see which

Solutions work well or not

so we're basically going to take the

prompt we're going to run the

model and the model generates a solution

and then we're going to inspect the

solution and we know that the correct

answer for this one is $3 and so indeed

the model gets it correct it says it's

$3 so this is correct so that's just one

attempt at the solution, so now we're

going to delete this and we're going to

rerun it again let's try a second

attempt so the model solves it in a bit

slightly different way right every

single attempt will be a different

generation because these models are

stochastic systems remember that at

every single token here we have a

probability distribution and we're

sampling from that distribution so we

end up kind of going down slightly

different paths and so this is a second

solution that also ends in the correct

answer now we're going to delete that

let's go a third

time okay so again slightly different

solution but also gets it

correct now we can actually repeat this

uh many times and so in practice you

might actually sample thousands of independent solutions, or even like a million solutions, for just a single

prompt um and some of them will be

correct and some of them will not be

very correct and basically what we want

to do is we want to encourage the

solutions that lead to correct answers

so let's take a look at what that looks

like so if we come back over here here's

kind of like a cartoon diagram of what

this is looking like we have a prompt

and then we tried many different

solutions in

parallel and some of the solutions um

might go well so they get the right

answer which is in green and some of the

solutions might go poorly and may not

reach the right answer which is red now

this problem here unfortunately is not

the best example because it's a trivial

prompt and as we saw uh even like a two

billion parameter model always gets it

right so it's not the best example in

that sense but let's just exercise some

imagination here and let's just suppose

that the um green ones are good and the

red ones are

bad okay so we generated 15 Solutions

only four of them got the right answer

and so now what we want to do is

basically we want to encourage the kinds

of solutions that lead to right answers

so whatever token sequences happened in

these red Solutions obviously something

went wrong along the way somewhere and

uh this was not a good path to take

through the solution and whatever token

sequences there were in these Green

Solutions well things went uh pretty

well in this situation and so we want to

do more things like it in prompts like

this and the way we encourage this kind

of a behavior in the future is we

basically train on these sequences um

but these training sequencies now are

not coming from expert human annotators

there's no human who decided that this

is the correct solution this solution

came from the model itself so the model

is practicing here it's tried out a few

Solutions four of them seem to have

worked and now the model will kind of

like train on them and this corresponds

to a student basically looking at their

Solutions and being like okay well this

one worked really well so this is this

is how I should be solving these kinds

of problems and uh here in this example

there are many different ways to

actually like really tweak the

methodology a little bit here but just

to give the core idea across maybe it's

simplest to just think about take the

taking the single best solution out of

these four uh like say this one that's

why it was yellow uh so this is the the

solution that not only led to the right

answer but maybe had some other nice

properties maybe it was the shortest one

or it looked nicest in some ways or uh

there's other criteria you could think

of as an example but we're going to

decide that this the top solution we're

going to train on it and then uh the

model will be slightly more likely once

you do the parameter update to take this

path in this kind of a setting in the

future but you have to remember that

we're going to run many different

diverse prompts across lots of math

problems and physics problems and

whatever wherever there might be so tens

of thousands of prompts maybe have in

mind there's thousands of solutions

prompt and so this is all happening kind

of like at the same time and as we're

iterating this process the model is

discovering for itself what kinds of

token sequences lead it to correct

answers it's not coming from a human

annotator, the model is kind of like

playing in this playground and it knows

what it's trying to get to and it's

discovering sequences that work for it

uh these are sequences that don't make

any mental leaps uh they they seem to

work reliably and statistically and uh

fully utilize the knowledge of the model

as it has it and so uh this is the

process of reinforcement

learning it's basically a guess and

check we're going to guess many

different types of solutions we're going

to check them and we're going to do more

of what worked in the future, and that is reinforcement learning.
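A heavily simplified sketch of that guess-and-check loop is below. The model, the sampling call, and the training step are stubs made up for illustration, and real RL setups differ in many of the details mentioned later (how many solutions to keep, how to weight them, what the prompt distribution is, and so on).

```python
import random

def sample_solution(model, prompt):
    """Stand-in for sampling one full solution (a token sequence) from the model."""
    return model(prompt)

def train_on(model, prompt, solution):
    """Stand-in for a parameter update that makes this solution more likely."""
    pass

def reinforce(model, prompts_with_answers, samples_per_prompt=16):
    for prompt, correct_answer in prompts_with_answers:
        # guess: sample many candidate solutions for the same prompt
        candidates = [sample_solution(model, prompt) for _ in range(samples_per_prompt)]
        # check: keep only the ones that reach the known final answer
        good = [s for s in candidates if s.strip().endswith(str(correct_answer))]
        # reinforce: train on the sequences that worked (e.g. just the best one in practice)
        for solution in good:
            train_on(model, prompt, solution)

# toy usage with a stub "model" that guesses an answer at random
toy_model = lambda prompt: f"... the answer is {random.choice([3, 5, 7])}"
reinforce(toy_model, [("Emily buys 3 apples and 2 oranges ...", 3)])
```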

So in the context of what came before, we see now

that the sft model the supervised fine

tuning model it's still helpful because

it still kind of like initializes the

model a little bit into the vicinity of the correct solutions, so it's kind of like an initialization of the model

in the sense that it kind of gets the

model to you know take Solutions like

write out Solutions and maybe it has an

understanding of setting up a system of

equations or maybe it kind of like talks

through a solution so it gets you into

the vicinity of correct Solutions but

reinforcement learning is where

everything gets dialed in we really

discover the solutions that work for the

model get the right answers we encourage

them and then the model just kind of

like gets better over time. Okay, so that is the high-level process for how we

train large language models in short we

train them kind of very similar to how

we train children and basically the only

difference is that children go through

chapters of books and they do all these

different types of training exercises um

kind of within the chapter of each book

but instead when we train AIS it's

almost like we kind of do it stage by

stage depending on the type of that

stage so first what we do is we do

pre-training which as we saw is

equivalent to uh basically reading all

the expository material so we look at

all the textbooks at the same time and

we read all the exposition and we try to

build a knowledge base the second thing

then is we go into the sft stage which

is really looking at all the fixed uh

sort of like solutions from Human

Experts of all the different kinds of

worked Solutions across all the

textbooks and we just kind of get an sft

model which is able to imitate the

experts but does so kind of blindly it

just kind of like does its best guess

uh kind of just like trying to mimic

statistically the expert behavior and so

that's what you get when you look at all

the work Solutions and then finally in

the last stage we do all the practice

problems in the RL stage across all the

textbooks we only do the practice

problems and that's how we get the RL

model so on a high level the way we

train llms is very much equivalent uh to

the process that we train uh that we use

for training of children the next point

I would like to make is that actually

these first two stages, pre-training and supervised fine-tuning, they've been

around for years and they are very

standard and everyone does them all the

different llm providers it is this last

stage the RL training that is a lot more

early in its process of development and

is not standard yet in the field and so

um this stage is a lot more kind of

early and nascent, and the reason for that

is because I actually skipped over a ton

of little details here in this process

the high level idea is very simple it's

trial-and-error learning, but there's a

ton of details and little mathematical nuances to

exactly how you pick the solutions that

are the best and how much you train on

them and what is the prompt distribution

and how to set up the training run such

that this actually works so there's a

lot of little details and knobs to the

core idea that is very very simple and

so getting the details right here uh is

not trivial and so a lot of companies

like for example OpenAI and other LLM

providers have experimented internally

with reinforcement learning fine tuning

for llms for a while but they've not

talked about it publicly

um it's all kind of done inside the

company and so that's why the paper from

Deep seek that came out very very

recently was such a big deal because

this is a paper from this company called

DeepSeek AI in China and this paper really

talked very publicly about reinforcement

learning fine-tuning for large

language models and how incredibly

important it is for large language

models and how it brings out a lot of

reasoning capabilities in the models

we'll go into this in a second so this

paper reinvigorated the public interest

of using RL for llms and gave a lot of

the sort of nitty-gritty details that are

needed to reproduce their results and

actually get the stage to work for large

language models so let me take you

briefly through this uh deep seek R1

paper and what happens when you actually

correctly apply RL to language models

and what that looks like and what that

gives you so the first thing I'll scroll

to is this uh kind of figure two here

where we are looking at the Improvement

in how the models are solving

mathematical problems so this is the

accuracy of solving mathematical

problems, the AIME accuracy, and then we

can go to the web page and we can see

the kinds of problems that are actually

in these um these the kinds of math

problems that are being measured here so

these are simple math problems you can

um pause the video if you like but these

are the kinds of problems that basically

the models are being asked to solve and

you can see that in the beginning

they're not doing very well but then as

you update the model with this many

thousands of steps their accuracy kind

of continues to climb so the models are

improving and they're solving these

problems with a higher accuracy

as you do this trial and error on a

large data set of these kinds of

problems and the models are discovering

how to solve math problems but even more

incredible than the quantitative kind of

results of solving these problems with a

higher accuracy is the qualitative means

by which the model achieves these

results so when we scroll down uh one of

the figures here that is kind of

interesting is that later on in the

optimization the model seems to be uh

using average length per response uh

goes up up so the model seems to be

using more tokens to get its higher

accuracy results so it's learning to

create very very long Solutions why are

these Solutions very long we can look at

them qualitatively here so basically

what they discover is that the model

solution get very very long partially

because so here's a question and here's

kind of the answer from the model what

the model learns to do um and this is an

immerging property of new optimization

it just discovers that this is good for

problem solving is it starts to do stuff

like this wait wait wait that's Nota

moment I can flag here let's reevaluate

this step by step to identify the

correct sum can be so what is the model

doing here right the model is basically

re-evaluating steps it has learned that

it works better for accuracy to try out

lots of ideas try something from

different perspectives retrace reframe

backtrack is doing a lot of the things

that you and I are doing in the process

of problem solving for mathematical

questions but it's rediscovering what

happens in your head not what you put

down on the solution and there is no

human who can hardcode this stuff in the

ideal assistant response this is only

something that can be discovered in the

process of reinforcement learning

because you wouldn't know what to put

here this just turns out to work for the

model and it improves its accuracy in

problem solving so the model learns what

we call these chains of thought in your

head and it's an emergent property of

the optim of the optimization and that's

what's bloating up the response length

but that's also what's increasing the

accuracy of the problem problem solving

so what's incredible here is basically

the model is discovering ways to think

it's learning what I like to call

cognitive strategies of how you

manipulate a problem and how you

approach it from different perspectives

how you pull in some analogies or do

different kinds of things like that and

how you kind of uh try out many

different things over time uh check a

result from different perspectives and

how you kind of uh solve problems but

here it's kind of discovered by the RL

so extremely incredible to see this

emerge in the optimization without

having to hardcode it anywhere the only

thing we've given it are the correct

answers and this comes out from trying

to just solve them correctly which is

incredible
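
To make the trial-and-error idea concrete, here is a toy sketch of the kind of loop involved. This is not the actual DeepSeek-R1 recipe (which uses a particular policy-gradient algorithm and many more details); `generate_solution` and `train_on` are hypothetical stubs standing in for a real model and optimizer, and the numbers are made up.

```python
import random

# A tiny set of practice problems with verifiable answers (real prompt sets are huge and diverse).
problems = [
    {"prompt": "Emily buys 3 apples and 2 oranges. Each orange costs $2. "
               "The total is $13. What does each apple cost?", "answer": "3"},
]

def generate_solution(prompt):
    """Stub: sample one chain-of-thought plus a final answer from the current model."""
    answer = random.choice(["2", "3", "4"])        # a real model samples tokens here
    return f"...reasoning tokens...\nAnswer: {answer}", answer

def train_on(solutions):
    """Stub: gradient update that makes these token sequences more likely."""
    pass

for step in range(1000):                           # many update steps
    problem = random.choice(problems)              # in reality, thousands of prompts per step
    rollouts = [generate_solution(problem["prompt"]) for _ in range(16)]    # many attempts
    correct = [text for text, ans in rollouts if ans == problem["answer"]]  # automatic check
    if correct:
        train_on(correct)   # reinforce whatever token sequences led to the right answer
```

The model is never told how to solve the problem; it only sees which of its own attempts happened to reach the correct answer, and those are the token sequences that get reinforced.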

Now let's go back to the problem we've been working with and take a look at what it would look like for this kind of model, what we call a reasoning or thinking model, to solve that problem. Recall that this is the problem we've been working with, and when I pasted it into ChatGPT-4o I got the earlier kind of response. Let's take a look at what happens when you give the same query to what's called a reasoning or thinking model, that is, a model that was trained with reinforcement learning. The model described in the DeepSeek-R1 paper is available on chat.deepseek.com (the company that developed it is hosting it); you have to make sure the "DeepThink" button is turned on to get the R1 model. We can paste the problem there and run it.

So let's take a look at what happens and what the output of the model is. Previously, what we got was basically an SFT approach, a supervised fine-tuning approach that mimics an expert solution. This is what we get from the RL model: "Okay, let me try to figure this out. So Emily buys three apples and two oranges, each orange costs $2, total is $13, I need to find out..." and so on. As you're reading this, you can't escape the feeling that this model is thinking. It is definitely pursuing the solution; it derives that each apple must cost $3, and then it says "wait a second, let me check my math again to be sure," and it tries it from a slightly different perspective, and then it says "yep, all that checks out, I think that's the answer, I don't see any mistakes. Let me see if there's another way to approach the problem, maybe setting up an equation. Let the cost of one apple be a variable, then..." and so on, "yep, same answer, so definitely each apple is $3, all right, confident that that's correct."
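
(For reference, the little check the model keeps redoing is just: if an apple costs A dollars, then 3A + 2 x $2 = $13, so 3A = $9 and A = $3 per apple, which is the answer it then writes up below.)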

Then, once it has done the thinking process, it writes up a nice solution for the human. So one part is more about the correctness aspect, and the other is more about the presentation aspect, where it writes everything out nicely and boxes in the correct answer at the bottom. What's incredible about this is that we get this thinking process of the model, and this is what's coming from the reinforcement learning process. This is what's bloating up the length of the token sequences: the models are doing thinking and trying different ways, and that's what's giving you higher accuracy in problem solving. This is where we are seeing these aha moments, these different strategies, and these ideas for how you can make sure you're getting the correct answer.

The last point I wanted to make is that some people are a little bit nervous about putting very sensitive data into chat.deepseek.com because it's a Chinese company, so people are a bit careful and wary about that. DeepSeek-R1 is a model that was released by this company as an open-source, or more precisely open-weights, model: it is available for anyone to download and use. You will not be able to run the full model at full precision on a MacBook or another local device, because this is a fairly large model, but many companies host the full, largest model. One of those that I like to use is called together.ai. When you go to together.ai, sign up, and go to Playgrounds, you can select DeepSeek-R1 in the chat, and there are many other kinds of models you can select there; these are all state-of-the-art models. This is similar to the Hugging Face inference playground that we've been playing with so far, but together.ai will usually host all the state-of-the-art models. So select DeepSeek-R1; you can mostly ignore the settings (the defaults will often be okay), and we can put in the same prompt. Because the model was released by DeepSeek, what you're getting here should be basically equivalent to what you're getting on chat.deepseek.com. Because of the randomness in the sampling we're going to get something slightly different, but in principle this should be identical in terms of the power of the model, and you should be able to see the same things quantitatively and qualitatively, except that here the model is being hosted by an American company. So that's DeepSeek, and that's what's called a reasoning model.

Now, when I go back to ChatGPT: among the models you'll see in the dropdown here, some of them, like o1, o3-mini, o3-mini-high, etc., are described as using "advanced reasoning." What "advanced reasoning" refers to is the fact that the model was trained with reinforcement learning, with techniques very similar to those of DeepSeek-R1, per public statements of OpenAI employees. So these are thinking models trained with RL, whereas models like GPT-4o or GPT-4o-mini that you're getting in the free tier you should think of as mostly SFT models, supervised fine-tuning models. They don't really do this kind of thinking as you see it in the RL models, and even though there's a little bit of reinforcement learning involved with these models (I'll go into that in a second), they are mostly SFT models; I think you should think about it that way.

In the same way as what we saw before, we can pick one of the thinking models, say o3-mini-high. These models, by the way, might not be available to you unless you pay for a ChatGPT subscription of either $20 per month or $200 per month for some of the top models. So we can pick a thinking model and run. What's going to happen is that it will say "Reasoning" and start to do something like this, and what we're seeing here is not exactly the same as what we saw on DeepSeek. Even though under the hood the model produces these kinds of chains of thought, OpenAI chooses not to show the exact chains of thought in the web interface; it shows little summaries of those chains of thought. OpenAI does this, I think, partly because they are worried about what's called the distillation risk: that someone could come in and try to imitate those reasoning traces and recover a lot of the reasoning performance just by imitating the chains of thought. So they hide them and only show little summaries of them. You're not getting exactly what you would get in DeepSeek with respect to the reasoning itself, and then they write up the solution. So these are kind of equivalent, even though we're not seeing the full under-the-hood details.

In terms of performance, these models and the DeepSeek models are currently roughly on par, I would say; it's kind of hard to tell because of the evaluations. If you're paying $200 per month to OpenAI, some of those models I believe currently still look a bit better, but DeepSeek-R1 is for now still a very solid choice for a thinking model that is available to you, either on this website or any other, because the model is open weights and you can just download it.

So that's thinking models. What is the summary so far? We've talked about reinforcement learning and the fact that thinking emerges in the process of the optimization when we run RL on many math and code problems that have verifiable solutions, where there is a concrete answer (like "3") to check against.

Now, you can access these thinking models, for example, on DeepSeek, or on any inference provider like together.ai by choosing DeepSeek there; these thinking models are also available in ChatGPT under any of the o1 or o3 models. But the GPT-4o models, etc., are not thinking models; you should think of them as mostly SFT models. If you have a prompt that requires advanced reasoning and so on, you should probably use some of the thinking models, or at least try them out. But empirically, for a lot of my use, when you're asking a simpler question, like a knowledge-based question, this might be overkill; there's no need to think for 30 seconds about some factual question. So for that I will sometimes default to just GPT-4o. Empirically, about 80-90% of my use is just GPT-4o, and when I come across a very difficult problem, like in math or code, I will reach for the thinking models, but then I have to wait a bit longer because they're thinking.

You can access these on chat.deepseek.com as well. I also wanted to point out aistudio.google.com: even though it looks really busy and really ugly (because Google is just unable to do this kind of stuff well; it's like, what is happening?), if you choose a model and pick "Gemini 2.0 Flash Thinking Experimental 01-21," that's also a kind of early, experimental thinking model from Google. We can go there, give it the same problem, and click run, and this thinking model will also do something similar and come out with the right answer. So basically, Gemini also offers a thinking model; Anthropic currently does not offer one.

This is kind of the frontier development of these LLMs. I think RL is this new, exciting stage, but getting the details right is difficult, and that's why all these thinking models are currently experimental as of very early 2025. But this is the frontier development of pushing the performance on these very difficult problems, using reasoning that is emerging in these optimizations.

One more connection I wanted to bring up is that the discovery that reinforcement learning is an extremely powerful way of learning is not new to the field of AI, and one place we've already seen this demonstrated is in the game of Go. Famously, DeepMind developed the system AlphaGo (you can watch a movie about it), where the system learns to play the game of Go against top human players. When we go to the paper underlying AlphaGo and scroll down, we find a really interesting plot that I think is familiar to us; we are rediscovering it in the more open domain of arbitrary problem solving instead of the closed, specific domain of the game of Go. What they saw (and we're going to see this in LLMs as well as this becomes more mature) is a plot of the ELO rating at playing Go; one line marks Lee Sedol, an extremely strong human player, and what they are comparing is the strength of a model trained by supervised learning versus a model trained by reinforcement learning.

The supervised learning model imitates human expert players: if you get a huge amount of games played by expert players in the game of Go and you try to imitate them, you are going to get better, but then you top out and you never quite get better than some of the very top players, like Lee Sedol. You're never going to reach that level because you're just imitating human players; you can't fundamentally go beyond a human player if you're only imitating human players. But the process of reinforcement learning is significantly more powerful. In reinforcement learning for the game of Go, it means the system plays moves that empirically and statistically lead to winning the game. AlphaGo is a system that plays against itself and uses reinforcement learning to create rollouts. It's the exact same diagram as before, except there's no prompt; it's just a fixed game of Go. The system tries out lots of plays, and the games that lead to a win (instead of a specific answer) are reinforced, made stronger. So the system learns the sequences of actions that empirically and statistically lead to winning the game. Reinforcement learning is not going to be constrained by human performance; it can do significantly better and overcome even the top players like Lee Sedol. They probably could have run this longer and just chose to crop it at some point, because this costs money, but it is a very powerful demonstration of reinforcement learning.

We're only starting to see hints of this diagram in large language models for reasoning problems. We're not going to get too far by just imitating experts; we need to go beyond that, set up these little game environments, and let the system discover reasoning traces, or ways of solving problems, that are unique and that just basically work well.

Now, on this aspect of uniqueness: notice that when you're doing reinforcement learning, nothing prevents you from veering off the distribution of how humans play the game. When we go back to this AlphaGo search, one of the suggested items is called "move 37." Move 37 in AlphaGo refers to a specific point in time where AlphaGo played a move that no human expert would play: the probability of this move being played by a human player was evaluated to be about 1 in 10,000. So it's a very rare move, but in retrospect it was a brilliant move. AlphaGo, in the process of reinforcement learning, discovered a strategy of playing that was unknown to humans but is, in retrospect, brilliant. I recommend the YouTube video "Lee Sedol vs AlphaGo Move 37 reactions and analysis"; this is roughly what it sounded like when AlphaGo played this move: "That's a very surprising move. I thought it was a mistake." Basically, people were freaking out because it's a move that a human would not play, and AlphaGo played it because in its training this move seemed like a good idea; it just happens not to be the kind of thing a human would do. And so that is, again, the power of reinforcement learning.

In principle, we could eventually see the equivalent of that if we continue scaling this paradigm in language models, and what that looks like is unknown. What does it mean to solve problems in such a way that even humans would not be able to come up with? How can you be better at reasoning or thinking than humans, how can you go beyond a thinking human? Maybe it means discovering analogies that humans would not be able to create, or maybe it's a new thinking strategy; it's kind of hard to think through. Maybe it's a wholly new language that isn't even English; maybe the model discovers its own language that is a lot better for thinking, because it is not constrained to stick with English. So maybe it adopts a different language to think in, or discovers its own. In principle, the behavior of the system is a lot less defined: it is open to doing whatever works, and it is open to slowly drifting from the distribution of its training data, which is English.

But all of that can only happen if we have a very large, diverse set of problems in which these strategies can be refined and perfected. That is a lot of the frontier LLM research going on right now: trying to create those kinds of prompt distributions that are large and diverse. These are all like game environments in which the LLMs can practice their thinking. It's like writing practice problems: we have to create practice problems for all domains of knowledge, and if we have tons of them, the models will be able to do reinforcement learning on them and create these kinds of diagrams, but in the domain of open-ended thinking instead of a closed domain like the game of Go.

There's one more section within reinforcement learning that I wanted to cover, and that is learning in unverifiable domains. So far, all the problems we've looked at are in what are called verifiable domains: any candidate solution can be scored very easily against a concrete answer. For example, the answer is 3, and we can very easily score solutions against the answer of 3. Either we require the models to box in their answers and then we just check for equality of whatever is in the box with the answer, or we can use what's called an LLM judge: the LLM judge looks at a solution, is given the answer, and basically scores the solution for whether it's consistent with that answer or not. LLMs at their current capability are empirically good enough to do this fairly reliably, so we can apply those kinds of techniques as well. In any case, we have a concrete answer and we're just checking solutions against it, and we can do this automatically with no humans in the loop.
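
As a rough sketch of what this automatic grading can look like, the snippet below pulls out a boxed final answer and checks it for equality against the reference. Real graders are more careful about formatting and may fall back to an LLM judge when the answer isn't a simple string, but this is the basic idea.

```python
import re

def extract_boxed(solution: str):
    """Return the contents of the last \\boxed{...} in a model-generated solution, if any."""
    matches = re.findall(r"\\boxed\{([^}]*)\}", solution)
    return matches[-1].strip() if matches else None

def reward(solution: str, reference_answer: str) -> float:
    """1.0 if the boxed answer matches the reference exactly, else 0.0.
    (An LLM judge could replace this check when answers can be phrased in many ways.)"""
    return 1.0 if extract_boxed(solution) == reference_answer.strip() else 0.0

print(reward(r"... so each apple costs \boxed{3} dollars.", "3"))   # 1.0
print(reward(r"... therefore the answer is \boxed{2}.", "3"))       # 0.0
```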

The problem is that we can't apply this strategy in what are called unverifiable domains. Usually these are, for example, creative writing tasks: write a joke about pelicans, write a poem, summarize a paragraph, and so on. In these kinds of domains it becomes harder to score the different solutions to a problem. For example, for writing a joke about pelicans, we can of course generate lots of different jokes; that's fine. We can go to ChatGPT and get it to generate a joke about pelicans: "...so much stuff in their beaks, because they don't believe in backpacks." What? Okay, we can try something else: "Why don't pelicans ever pay for their drinks? Because they always bill it to someone else." Haha. Okay, so these models are obviously not very good at humor, which I actually think is pretty fascinating, because I think humor is secretly very difficult and the models don't really have the capability for it yet.

Anyway, in any case, you could imagine creating lots of jokes. The problem we face is: how do we score them? In principle we could of course get a human to look at all these jokes, just like I did right now. The problem with that is that if you are doing reinforcement learning, you're going to be doing many thousands of updates, and for each update you want to be looking at, say, thousands of prompts, and for each prompt you want to be looking at potentially hundreds or thousands of different generations. There are just way too many of these to look at. In principle, you could have a human inspect all of them, score them, and decide that, okay, maybe this one is funny and this one is funny, and we could train on them to get the model to become slightly better at jokes, at least in the context of pelicans. The problem is that this is just way too much human time; it's an unscalable strategy. We need some kind of automatic strategy for doing this.

One solution to this was proposed in the paper that introduced what's called reinforcement learning from human feedback (RLHF). This was a paper from OpenAI at the time, and many of these people are now co-founders of Anthropic. It proposed an approach for basically doing reinforcement learning in unverifiable domains. Let's take a look at how that works.

Here is the cartoon diagram of the core ideas involved. As I mentioned, the naive approach is that if we just had infinite human time, we could run RL in these domains just fine. For example, with infinite humans, and these are just cartoon numbers, I might want to do 1,000 updates, where each update is on 1,000 prompts, and for each prompt we have 1,000 rollouts that we're scoring. We could run RL with that kind of setup; the problem is that in the process of doing this, I would need to ask a human to evaluate a joke a total of one billion times. That's a lot of people looking at really terrible jokes, so we don't want to do that.
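
(Spelling out those cartoon numbers: 1,000 updates x 1,000 prompts x 1,000 rollouts = 1,000,000,000 human judgments, whereas the indirection described next only needs humans to rank on the order of 1,000 prompts x 5 rollouts, roughly 5,000 jokes, in total.)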

Instead, we want to take the RLHF approach. In the RLHF approach, the core trick is one of indirection: we're going to involve humans just a little bit. The way we "cheat" is that we train a whole separate neural network that we call a reward model, and this neural network will imitate human scores. We ask humans to score rollouts, then we imitate those human scores using a neural network, and this neural network becomes a kind of simulator of human preferences. Now that we have a neural network simulator, we can do RL against it: instead of asking a real human, we're asking a simulated human for their score of a joke, as an example. Once we have a simulator, we're off to the races, because we can query it as many times as we want, and it's a wholly automatic process; we can now do reinforcement learning with respect to the simulator. The simulator, as you might expect, is not going to be a perfect human, but if it's at least statistically similar to human judgment, then you might expect that this will do something, and in practice it indeed does. So once we have a simulator, we can do RL and everything works great.

Let me show you a cartoon diagram of what this process looks like, although the details are not super important; it's just the core idea of how this works.

Here I have a cartoon diagram of a hypothetical example of what training the reward model would look like. We have a prompt, like "write a joke about pelicans," and then five separate rollouts, so five different jokes, just like the one before. The first thing we do is ask a human to order these jokes from best to worst: this human thought this joke was the best, the funniest, so it's the number one joke, then number two, number three, four, and five, the worst joke. We ask humans to order the jokes instead of giving scores directly because it's a bit of an easier task; it's easier for a human to give an ordering than to give precise scores. That ordering is now the supervision for the model; the human has ordered them, and that is their contribution to the training process.

Now, separately, we ask the reward model for its scoring of these jokes. The reward model is a whole separate neural network, a completely separate neural net, and it's also probably a transformer, but it's not a language model in the sense that it generates diverse language; it's just a scoring model. The reward model takes two inputs: number one, the prompt, and number two, a candidate joke. So here, for example, the reward model would take this prompt and this joke. The output of the reward model is a single number, and this number is interpreted as a score; it can range, for example, from 0 to 1, where 0 would be the worst score and 1 the best score. Here are some examples of what a hypothetical reward model at some stage in the training process would give as scores to these jokes: 0.1 is a very low score, 0.8 is a really high score, and so on.

Now we compare the scores given by the reward model with the ordering given by the human. There is a precise mathematical way to calculate this (you basically set up a loss function, calculate a kind of correspondence, and update the model based on it), but I just want to give you the intuition. As an example, for this second joke, the human thought it was the funniest, and the model kind of agreed (0.8 is a relatively high score), but this score should have been even higher; so after an update, we would expect that this score will grow, to say 0.81 or something. For this one, they are actually in massive disagreement: the human thought it was number two, but here the score is only 0.1, so this score needs to be much higher; after an update on top of this supervision, it might grow a lot more, to maybe 0.15 or something like that. And here the human thought this one was the worst joke, but the model actually gave it a fairly high number, so you might expect that after the update this would come down, to maybe 0.3 or 0.35 or something like that. Basically, we're doing what we did before: we're slightly nudging the predictions of the model using a neural network training process, and we're trying to make the reward model's scores consistent with the human ordering.

As we update the reward model on human data, it becomes a better and better simulator of the scores and orderings that humans provide, and it then becomes the simulator of human preferences which we can do RL against. But critically, we are not asking humans to look at a joke one billion times; we're maybe looking at 1,000 prompts with 5 rollouts each, so maybe 5,000 jokes in total that humans have to look at, and they just give the ordering, and then we train the model to be consistent with that ordering. I'm skipping over the mathematical details, but I just want you to understand the high-level idea: this reward model is giving us scores, and we have a way of training it to be consistent with human orderings. And that's how RLHF works.
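
For the curious, here is a tiny sketch of the "precise mathematical way" glossed over above, using a common pairwise ranking loss: every (better, worse) pair implied by the human ordering pushes the score of the preferred response above the other. The `features` function and the little scoring network are stand-ins; a real reward model is a large transformer that reads the (prompt, response) tokens and emits one scalar.

```python
import torch
import torch.nn.functional as F

# Stand-in scoring network; a real reward model is a transformer over (prompt, response) tokens.
score_net = torch.nn.Sequential(torch.nn.Linear(8, 16), torch.nn.ReLU(), torch.nn.Linear(16, 1))
optimizer = torch.optim.Adam(score_net.parameters(), lr=1e-3)

def features(prompt: str, response: str) -> torch.Tensor:
    """Stand-in for tokenizing/embedding the (prompt, response) pair."""
    torch.manual_seed(abs(hash((prompt, response))) % (2**31))
    return torch.randn(8)

def pairwise_loss(prompt, preferred, rejected):
    """Push the score of the human-preferred response above the rejected one."""
    s_good = score_net(features(prompt, preferred))
    s_bad = score_net(features(prompt, rejected))
    return -F.logsigmoid(s_good - s_bad).mean()

prompt = "write a joke about pelicans"
ranked = ["joke ranked #1 by the human", "joke #2", "joke #3", "joke #4", "joke #5"]

# Every pair implied by the human ordering becomes one training signal for the reward model.
for i, better in enumerate(ranked):
    for worse in ranked[i + 1:]:
        loss = pairwise_loss(prompt, better, worse)
        optimizer.zero_grad(); loss.backward(); optimizer.step()
```

After enough of these nudges, the scalar the network outputs tends to agree with how humans would have ordered new responses, which is exactly what lets it stand in for a human during RL.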

Okay, so that is the rough idea: we basically train simulators of humans and do RL with respect to those simulators.

Now I want to talk first about the upside of reinforcement learning from human feedback. The first thing is that it allows us to run reinforcement learning, which we know is an incredibly powerful set of techniques, and it allows us to do it in arbitrary domains, including unverifiable ones: things like summarization, poem writing, joke writing, or any other creative writing, in domains outside of math and code. Empirically, what we see when we actually apply RLHF is that it is a way to improve the performance of the model. I have a guess for why that might be, but I don't actually know that it's super well established; you can empirically observe that when you do RLHF correctly, the models you get are just a little bit better, but as to why, I think it's not as clear.

Here's my best guess: it is possibly mostly due to the discriminator-generator gap. What that means is that in many cases it is significantly easier for humans to discriminate than to generate. In particular, in supervised fine-tuning (SFT), we're asking humans to generate the ideal assistant response, and in many of the cases I've shown the ideal response is very simple to write, but in many cases it might not be. For example, in summarization or poem writing or joke writing, how are you, as a human labeler, supposed to give the ideal response? It requires creative human writing to do that. RLHF kind of sidesteps this, because as data labelers people get asked a significantly easier question: they're not asked to write poems directly, they're just given five poems from the model and asked to order them. That's just a much easier task for a human labeler, and what I think this buys you is much higher-accuracy data, because we're not asking people to do the generation task, which can be extremely difficult; we're not asking them to do creative writing, we're just trying to get them to distinguish between creative writings and find the ones that are best. That ordering is the signal that humans provide, that's their input into the system, and then the system in RLHF just discovers the kinds of responses that would be graded well by humans. So that step of indirection allows the models to become a bit better.

So that is the upside of RLHF: it allows us to run RL, it empirically results in better models, and it allows people to contribute their supervision even without having to do extremely difficult tasks, as in the case of writing ideal responses.

Unfortunately, RLHF also comes with significant downsides. The main one is that we are doing reinforcement learning not with respect to humans and actual human judgment, but with respect to a lossy simulation of humans, and this lossy simulation could be misleading, because it's just a simulation, just a model that is outputting scores, and it might not perfectly reflect the opinion of an actual human with an actual brain in all the possible different cases. So that's number one.

But there is actually something even more subtle and devious going on that really, dramatically holds back RLHF as a technique we could scale to significantly smarter systems, and that is that reinforcement learning is extremely good at discovering ways to game the model, to game the simulation. This reward model we're constructing, the one that gives the scores, is a transformer: a massive neural net with billions of parameters that imitates humans, but does so in a kind of simulation-like way. The problem is that these are massive, complicated systems; there are a billion parameters here outputting a single score, and it turns out that there are ways to game these models. You can find kinds of inputs that were not part of the training set, and these inputs inexplicably get very high scores, but in a fake way. Very often, what you find if you run RLHF for very long, say for 1,000 updates, which is a lot of updates, is not that your jokes keep getting better and you start getting real bangers about pelicans. What happens is that in the first few hundred steps the jokes about pelicans are probably improving a little bit, and then they dramatically fall off a cliff and you start getting extremely nonsensical results. For example, the top joke about pelicans starts to be something like "the the the the the the," and this makes no sense: when you look at it, why should this be a top joke? But when you take "the the the the the the" and plug it into your reward model, where you would expect a score of zero, the reward model actually loves it as a joke; it will tell you that "the the the the the the" gets a score of 1.0, a top joke. This makes no sense, but it happens because these models are just simulations of humans, massive neural nets, and you can find inputs that get into parts of the input space that give nonsensical results. These examples are what are called adversarial examples. I'm not going to go into the topic too much, but they are adversarial inputs to the model: specific little inputs that sneak between the nooks and crannies of the model and give nonsensical results at the top.

Now, here's what you might imagine doing. You say, okay, "the the the the the the" is obviously not a score-of-one joke, it obviously deserves a very low score, so let's take it, add it to the dataset, and give it an ordering that is extremely bad, say rank five, the worst. Indeed, your model will learn that "the the the the the the" should have a very low score, and it will give it a score of zero. The problem is that there will always be a basically infinite number of nonsensical adversarial examples hiding in the model. If you iterate this process many times, and you keep adding nonsensical stuff to your reward model and giving it very low scores, you will never win the game: you can do this many, many rounds, and reinforcement learning, if you run it long enough, will always find a way to game the model. It will discover adversarial examples, it will get really high scores with nonsensical results, and fundamentally this is because our scoring function is a giant neural net, and RL is extremely good at finding just the ways to trick it.

So, long story short: you always run RLHF for maybe a few hundred updates, during which the model is getting better, and then you have to crop it, and you are done. You can't run too much against this reward model, because the optimization will start to game it; you basically crop it, call it, and ship it. You can improve the reward model, but you will eventually come across these situations at some point. That's why I usually say that RLHF is not RL, and what I mean by that is: RLHF is RL, obviously, but it's not RL in the magical sense; this is not RL that you can run indefinitely.

Those other kinds of problems, where you're checking against a concrete correct answer, you cannot game as easily: you either got the correct answer or you didn't, and the scoring function is much, much simpler; you're just looking at the boxed area and seeing if the result is correct. It's very difficult to game these functions, but gaming a reward model is possible. In those verifiable domains you can run RL indefinitely: you could run for tens of thousands, hundreds of thousands of steps, and discover all kinds of really crazy strategies, ones we might never even think of, for performing really well on all these problems. In the game of Go, there's no way to game the winning or losing of a game: we have a perfect simulator, we know where all the stones are placed, and we can calculate whether someone has won or not. There's no way to game that, so you can do RL indefinitely and eventually beat even Lee Sedol. But with models like this reward model, which are gameable, you cannot repeat the process indefinitely.

So I kind of see RLHF as not "real" RL, because the reward function is gameable. It's more in the realm of a little fine-tuning: it's a little improvement, but it's not something that is fundamentally set up correctly, where you can put in more compute, run for longer, and get much better and magical results. It's not RL in that sense; it's not RL in the sense that it lacks the magic. It can fine-tune your model and get you better performance, and indeed, if we go back to ChatGPT, the GPT-4o model has gone through RLHF, because it works well; it's just not RL in the same sense. RLHF is like a little fine-tune that slightly improves your model, is maybe the way I would think about it.

Okay, so that's most of the technical content I wanted to cover. I took you through the three major stages and paradigms of training these models: pre-training, supervised fine-tuning, and reinforcement learning, and I showed you that they loosely correspond to the process we already use for teaching children. In particular, we talked about pre-training being sort of like the basic knowledge acquisition of reading exposition, supervised fine-tuning being the process of looking at lots and lots of worked examples and imitating experts, and then practice problems. The only difference is that we now have to effectively write textbooks for LLMs and AIs across all the disciplines of human knowledge, and in all the cases where we would actually like them to work, like code and math and basically all the other disciplines. We're in the process of writing textbooks for them, refining all the algorithms that I've presented at a high level, and then, of course, doing a really, really good job at the execution of training these models at scale and efficiently. I didn't go into too many details, but these are extremely large and complicated distributed jobs that have to run over tens of thousands, or even hundreds of thousands, of GPUs, and the engineering that goes into this is really at the state of the art of what's possible with computers at that scale. I didn't cover that aspect too much, but it is very serious work underlying all of these ultimately very simple algorithms.

I also talked a little bit about the "theory of mind" of these models. The thing I want you to take away is that these models are really good, and extremely useful as tools for your work, but you shouldn't trust them fully, and I showed you some examples of why. Even though we have mitigations for hallucinations, the models are not perfect and they will still hallucinate. It's gotten better over time and it will continue to get better, but they can hallucinate. In addition to that, I covered what I call the "Swiss cheese" model of LLM capabilities that you should have in your mind: the models are incredibly good across so many different disciplines, but then fail almost randomly in some unique cases. For example, what is bigger, 9.11 or 9.9? The model may not know, but simultaneously it can turn around and solve Olympiad questions. That's a hole in the Swiss cheese, and there are many of them, and you don't want to trip over them. So don't treat these models as infallible; check their work, use them as tools, use them for inspiration, use them for the first draft, but work with them as tools and be ultimately responsible for the product of your work.

And that's roughly what I wanted to talk about: this is how they're trained, and this is what they are.

Let's now turn to some of the future capabilities of these models, what's probably coming down the pipe, and also where you can find these models. I have a few bullet points on some of the things you can expect coming down the pipe.

The first thing you'll notice is that the models will very rapidly become multimodal. Everything I talked about above concerned text, but very soon we'll have LLMs that can not just handle text but can also operate natively and very easily over audio, so they can hear and speak, and over images, so they can see and paint. We're already seeing the beginnings of all of this, but it will all be done natively inside the language model, and this will enable natural conversations. Roughly speaking, the reason this is actually no different from everything we've covered above is that, as a baseline, you can tokenize audio and images and apply the exact same approaches as everything we've talked about; it's not a fundamental change, we just have to add some tokens. As an example, for tokenizing audio we can look at slices of the spectrogram of the audio signal, tokenize that, and just add more tokens that now represent audio into the context windows and train on them just like above. The same goes for images: we can use patches and separately tokenize the patches, and then what is an image? An image is just a sequence of tokens. This actually kind of works, and there's a lot of early work in this direction. So we can create streams of tokens representing audio and images as well as text, interleave them, and handle them all simultaneously in a single model; that's one example of multimodality.
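
As a very rough sketch of the "an image is just a sequence of tokens" idea: chop the image into fixed-size patches and read them out in order. Real multimodal models then map each patch (or a quantized code for it) into the same token/embedding stream as text; the patch size and toy image below are arbitrary.

```python
import numpy as np

def image_to_patch_sequence(image: np.ndarray, patch: int = 16) -> np.ndarray:
    """Split an (H, W, C) image into non-overlapping patch x patch tiles, in raster order.
    Each row of the output is one 'visual token' (here just the raw flattened pixels)."""
    h, w, c = image.shape
    h, w = h - h % patch, w - w % patch           # crop so the grid divides evenly
    tiles = image[:h, :w].reshape(h // patch, patch, w // patch, patch, c)
    tiles = tiles.transpose(0, 2, 1, 3, 4)        # (rows, cols, patch, patch, C)
    return tiles.reshape(-1, patch * patch * c)   # one flattened vector per patch

img = np.random.randint(0, 256, size=(224, 224, 3), dtype=np.uint8)  # toy 224x224 RGB "image"
tokens = image_to_patch_sequence(img)
print(tokens.shape)  # (196, 768): a 14x14 grid of patches becomes a sequence of 196 "tokens"
```

Audio can be handled in the same spirit by slicing the spectrogram along time and treating each slice as a token.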

Second, something that people are very interested in: currently, most of the work consists of us handing individual tasks to the models on a silver platter, like "please solve this task for me," and the model does that little task, but it's still up to us to organize a coherent execution of tasks to perform whole jobs. The models are not yet at the capability required to do this in a coherent, error-correcting way over long periods of time, so they're not able to fully string tasks together to perform these longer-running jobs, but they're getting there, and this is improving over time. Probably what's going to happen is that we're going to start to see what are called agents, which perform tasks over time while you supervise them, watch their work, and have them report progress once in a while. We're going to see more long-running agent tasks that don't just take a few seconds of response, but many tens of seconds, or even minutes or hours. But these models are not infallible, as we discussed above, so all of this will require supervision. For example, in factories people talk about the human-to-robot ratio for automation; I think we're going to see something similar in the digital space, where we'll be talking about human-to-agent ratios, with humans becoming much more the supervisors of agent tasks in the digital domain.

Next, I think everything is going to become a lot more pervasive and invisible, integrated into the tools and into everything everywhere. In addition, as a separate bullet point, there's computer use: right now these models aren't really able to take actions on your behalf, but if you saw ChatGPT launch Operator, that's one early example where you can actually hand off control to the model to perform keyboard and mouse actions on your behalf. That's also something I think is very interesting.

The last point I have here is just a general comment that there's still a lot of research to potentially do in this domain. One example is something along the lines of test-time training. Remember that everything we've done above has two major stages: first the training stage, where we tune the parameters of the model to perform the tasks well, and then, once we get the parameters, we fix them and deploy the model for inference. From there, the model is fixed; it doesn't change anymore, it doesn't learn from all the stuff it's doing at test time. It's a fixed set of parameters, and the only thing that changes is the tokens inside the context window, so the only type of test-time learning the model has access to is the in-context learning of its dynamically adjustable context window, depending on what it's doing at test time. I think this is still different from humans, who actually are able to learn depending on what they're doing, especially when you sleep, for example: your brain is updating its parameters, or something like that. There's no equivalent of that currently in these models and tools, so there are a lot of wonkier ideas, I think, that are still to be explored.

In particular, I think this will be necessary, because the context window is a finite and precious resource, especially once we start to tackle very long-running, multimodal tasks. When we're putting in videos, these token windows will start to grow extremely large: not thousands or even hundreds of thousands of tokens, but significantly beyond that. The only trick available to us right now is to make the context windows longer, but I think that approach by itself will not scale to actual long-running tasks that are multimodal over time, and so I think new ideas are needed in some of those cases where the tasks are going to require very long contexts.

So those are some examples of things you can expect coming down the pipe.

Let's now turn to where you can actually keep track of this progress and stay up to date with the latest and greatest of what's happening in the field. I would say the three resources I have consistently used to stay up to date are, number one, LM Arena. Let me show you: this is basically an LLM leaderboard that ranks all the top models, and the ranking is based on human comparisons. Humans prompt these models and judge which one gives the better answer; they don't know which model is which, they're just looking at which answer is better, and from that you can calculate a ranking and get these results. What you can see here are the different organizations, like Google Gemini for example, that produce these models; when you click on any one of them, it takes you to the place where that model is hosted. Here we see Google is currently on top, with OpenAI right behind, and DeepSeek in position number three. The reason this is a big deal is the last column here: the license. DeepSeek is an MIT-licensed model, it's open weights, anyone can use these weights, anyone can download them, anyone can host their own version of DeepSeek and use it in whatever way they like. So it's not a proprietary model that you don't have access to; it's basically an open-weights release, and it's kind of unprecedented that a model this strong was released with open weights, so pretty cool from that team. Next up we have a few more models from Google and OpenAI, and as you continue to scroll down you start to see the other usual suspects: xAI here, Anthropic with Sonnet down at number 14, and then Meta with Llama over here. Llama, similar to DeepSeek, is an open-weights model, but it's down here as opposed to up there.

Now, I will say this leaderboard was really good for a long time, but I do think that in the last few months it's become a little bit gamed, and I don't trust it as much as I used to. Just empirically, I feel like a lot of people, for example, are using Sonnet from Anthropic and find it a really good model, but it's all the way down at number 14; conversely, I think not as many people are using Gemini, but it's ranking really, really high. So I think you should use this as a first pass, but try out a few of the models on your tasks and see which one performs better.

The second thing I would point to is the AI News newsletter. AI News is not very creatively named, but it is a very good newsletter produced by swyx and friends (thank you for maintaining it), and it's been very helpful to me because it is extremely comprehensive. If you go to the archives, you'll see it's produced almost every other day, and it is very comprehensive; some of it is written and curated by humans, but a lot of it is constructed automatically with LLMs. These issues are very comprehensive, and you're probably not missing anything major if you go through them; of course, you're probably not going to read all of it because it's so long, but I do think the summaries at the very top are quite good, and I think they have some human oversight. So this has been very helpful to me. The last thing I would point to is just X and Twitter: a lot of AI happens on X, so I would just follow people you like and trust and get all your latest and greatest there as well.

Those are the major places that have worked for me over time.

Finally, a few words on where you can find the models and where you can use them. The first thing I would say is that for any of the biggest proprietary models, you just go to the website of that LLM provider: for OpenAI, that's chatgpt.com, which I believe actually works now; for Gemini, I think it's gemini.google.com or AI Studio (I think they have two for some reason that I don't fully understand, and no one does). For the open-weights models, like DeepSeek, Llama, etc., you have to go to some kind of inference provider of LLMs. My favorite one is together.ai, and I showed you that when you go to the playground on together.ai, you can pick lots of different models; all of these are open models of different types, and you can talk to them there, as an example.

Now, if you'd like to use a base model, it's not as common to find base models even on these inference providers; they are all targeting assistants and chat, so even here I couldn't see base models. For base models I usually go to Hyperbolic, because they serve the Llama 3.1 base model, and I love that model; you can just talk to it there. As far as I know, this is a good place for a base model, and I wish more people hosted base models, because they are useful and interesting to work with in some cases.

Finally, you can also take some of the smaller models and run them locally. For example, the biggest DeepSeek model you're not going to be able to run locally on your MacBook, but there are smaller versions of the DeepSeek model that are what's called distilled, and you can also run these models at smaller precision: not at the native precision of, for example, FP8 for DeepSeek or BF16 for Llama, but much, much lower than that. Don't worry if you don't fully understand those details; the point is that you can run smaller versions that have been distilled, and then at even lower precision, and then you can fit them on your computer. So you can actually run pretty okay models on your laptop.
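
If you'd rather poke at this programmatically than through an app, here is a rough sketch using the Hugging Face transformers library to load a small instruct model in half precision. The model name is just an example of a "small enough for a laptop" choice (you may need to pick a different one, or accept a license, depending on your setup), and the local apps typically use much more aggressive quantization than float16.

```python
# pip install transformers torch
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.2-1B-Instruct"   # example small model; swap in whatever you can run
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)  # reduced precision

messages = [{"role": "user", "content": "Tell me a joke about pelicans."}]
inputs = tok.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
out = model.generate(inputs, max_new_tokens=100)
print(tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```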

My favorite place I usually go to is LM Studio, which is basically an app you can download. I think it actually looks really ugly, and I don't like that it shows you all these models that are basically not that useful; everyone just wants to run DeepSeek, so I don't know why they give you 500 different types of models. They're really complicated to search for, and you have to choose different distillations and different precisions, and it's all really confusing. But once you actually understand how it works (and that's a whole separate video), you can load up a model; here I loaded up a Llama 3.2 1B Instruct, and you can just talk to it. So I ask for pelican jokes, and I can ask for another one, and it gives me another one, etc. All of this happens locally on your computer; we're not actually going out to anyone else, this is running on the GPU of the MacBook Pro. That's very nice, and you can then eject the model when you're done, which frees up the RAM. So LM Studio is probably my favorite one, even though I think it has a lot of UI/UX issues and it's really geared toward professionals, almost, but if you watch some videos on YouTube, I think you can figure out how to use the interface.

So those are a few words on where to find the models.

Let me now loop back around to where we started. The question was: when we go to chatgpt.com and enter some kind of query and hit go, what exactly is happening here? What are we seeing? What are we talking to? How does this work? I hope this video gave you some appreciation for some of the under-the-hood details of how these models are trained and what it is that is coming back. In particular, we now know that your query is taken and is first chopped up into tokens. We go to Tiktokenizer, and in the place in the format that is for the user query, we basically put in our query right there. So our query goes into what we discussed as the conversation protocol format, the way we maintain conversation objects; it gets inserted there, and then this whole thing ends up being just a token sequence, a one-dimensional token sequence, under the hood. ChatGPT saw this token sequence, and when we hit go, it basically continues appending tokens onto this list; it continues the sequence, it acts like a token autocomplete. In particular, it gave us this response, so we can basically just put it here and see the tokens that it continued with, roughly.
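
As a rough illustration of "the whole conversation is just one one-dimensional token sequence that the model keeps appending to" (the exact special tokens differ between the OpenAI format shown in Tiktokenizer and open models, so the template details here are whatever your tokenizer defines):

```python
# pip install transformers
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")  # example; any chat model works

conversation = [{"role": "user", "content": "What is the capital of France?"}]

# The conversation "protocol" is just a text template with special tokens around each turn...
print(tok.apply_chat_template(conversation, tokenize=False, add_generation_prompt=True))

# ...and that template is what gets chopped into the flat 1-D token sequence the model sees.
ids = tok.apply_chat_template(conversation, tokenize=True, add_generation_prompt=True)
print(ids)  # a flat list of integers
# Hitting "go" just means: keep sampling more integers and appending them to this list until the
# model emits its end-of-turn token; those appended tokens are the assistant's reply.
```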

Now the question becomes: why are these the tokens the model responded with? What are these tokens, where do they come from, what are we talking to, and how do we program this system? That is where we shifted gears and talked about the under-the-hood pieces. The first of the three stages of this process is the pre-training stage, which is fundamentally about knowledge acquisition from the internet into the parameters of the neural network; the network internalizes a lot of knowledge from the internet. But where the personality really comes in is in the process of supervised fine-tuning. What happens there is that a company like OpenAI curates a large dataset of conversations, say one million conversations across very diverse topics, between a human and an assistant. Even though there is a lot of synthetic data generation and LLM assistance used throughout this process, it is fundamentally a human data-curation task with lots of humans involved. In particular, these humans are data labelers hired by OpenAI, who are given labeling instructions that they learn, and whose task is to create ideal assistant responses for arbitrary prompts. In other words, they are teaching the neural network, by example, how to respond to prompts.
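As a purely hypothetical illustration, one record in such a supervised fine-tuning dataset might look something like the sketch below; the real schemas and labeling pipelines are internal to each lab. At training time, each conversation is flattened into a token sequence exactly like the user query above, and the model is trained to imitate the assistant turns.

```python
# Hypothetical shape of one supervised fine-tuning record (not any lab's real schema).
sft_example = {
    "messages": [
        {"role": "user", "content": "Why is the sky blue?"},
        {"role": "assistant", "content": "Sunlight scatters off air molecules, and shorter blue wavelengths scatter the most, so the sky looks blue away from the sun..."},
        {"role": "user", "content": "Why not violet, then?"},
        {"role": "assistant", "content": "Some violet light is absorbed higher up, the sun emits less of it, and our eyes are less sensitive to it, so blue dominates what we perceive..."},
    ],
    "source": "human data labeler, written to follow the labeling instructions",
}
```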

So what is the way to think about what came back? I think the right way to think about it is that this is a neural network simulation of a data labeler at OpenAI. It is as if I gave this query to a data labeler at OpenAI, and that labeler first read all of the labeling instructions from OpenAI and then spent two hours writing up the ideal assistant response to this query and handing it to me. We're not actually doing that, of course, because we didn't wait two hours; what we're getting is a neural network simulation of that process. And we have to keep in mind that these neural networks don't function like human brains do. They are different; what is easy or hard for them is different from what is easy or hard for humans, so we really are just getting a simulation. Here I've shown you the token stream, and underneath it is fundamentally the neural network, with a bunch of activations and neurons in between: a fixed mathematical expression that mixes inputs from the tokens with the parameters of the model, and out comes the next token in the sequence. But only a finite amount of compute happens for every single token, so this is a kind of lossy simulation of a human, restricted in that way. Whatever the humans write, the language model is imitating at the token level, with only this fixed amount of computation for every single token in the sequence.
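For intuition about the fixed amount of compute per emitted token, here is a toy version of the autocomplete loop, using the small GPT-2 model from the Hugging Face transformers library as a stand-in (an illustrative assumption; production chat models work the same way in spirit, just at far larger scale and over a chat-formatted context).

```python
# Toy token-autocomplete loop: one fixed-cost forward pass per appended token.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

tokens = tokenizer("A pelican walked into the bar and", return_tensors="pt").input_ids

for _ in range(30):                                        # append one token at a time
    with torch.no_grad():
        logits = model(tokens).logits                      # the fixed mathematical expression
    probs = torch.softmax(logits[0, -1], dim=-1)           # distribution over the whole vocabulary
    next_token = torch.multinomial(probs, num_samples=1)   # sample the next token
    tokens = torch.cat([tokens, next_token.view(1, 1)], dim=1)

print(tokenizer.decode(tokens[0]))
```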

We also saw that, as a result of this and of the cognitive differences, the models will suffer in a variety of ways, and you have to be careful with how you use them. For example, we saw that they suffer from hallucinations, and we also have this sense of a Swiss cheese model of LLM capability, where there are holes in the cheese: sometimes the models will arbitrarily do something dumb, even though they are doing lots of magical stuff elsewhere. Maybe you're not giving them enough tokens to think; maybe they make something up because their mental arithmetic breaks; maybe they are suddenly unable to count the number of letters in a word; or maybe they can't tell you that 9.11 is smaller than 9.9, and it looks kind of dumb. So it's a Swiss cheese capability, we have to be careful with it, and we saw the reasons for it. But fundamentally, this is how to think about what came back: it is, again, a simulation by this neural network of a human data labeler following the labeling instructions at OpenAI. That's what we're getting back.

Now, I do think things change a little bit when you actually reach for one of the thinking models, like o3-mini. The reason is that GPT-4o basically doesn't do reinforcement learning. It does do RLHF, but I've told you that RLHF is not really RL; there's no time for magic in there, and it's better thought of as just a little bit of fine-tuning. The thinking models, however, do use RL: they go through this third stage of perfecting their thinking process, discovering new thinking strategies and approaches to problem solving that look a little bit like your internal monologue. And they practice this on a large collection of practice problems that companies like OpenAI create and curate and then make available to the LLMs.
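To make the flavor of that third stage concrete, here is a deliberately toy, conceptual sketch of the trial-and-error loop described above: sample several attempts per practice problem, check which ones reach the right answer, and reinforce whatever led to success. The "policy" here is just a set of weights over three canned strategies, so this is an illustration of the idea only, not anyone's actual training code (real RL on LLMs updates billions of parameters over sampled token sequences).

```python
# Conceptual toy of RL on verifiable practice problems: reinforce what gets the right answer.
import random

problems = [("3 * (4 + 5)", 27), ("(10 - 4) * 2", 12)]           # problems with checkable answers
strategies = ["work step by step", "guess quickly", "rewrite then evaluate"]
weights = {s: 1.0 for s in strategies}                            # a stand-in for the model's policy

def attempt(expr: str, strategy: str) -> int:
    # Careful strategies actually evaluate the expression; the sloppy one sometimes guesses wrong.
    if strategy == "guess quickly":
        return random.choice([eval(expr), eval(expr) + 1])
    return eval(expr)

for _ in range(200):                                              # many rounds of practice
    expr, answer = random.choice(problems)
    strategy = random.choices(strategies, weights=[weights[s] for s in strategies])[0]
    if attempt(expr, strategy) == answer:                         # did this rollout succeed?
        weights[strategy] *= 1.05                                 # reinforce what worked

print(weights)  # the careful, step-by-step styles end up weighted far above guessing
```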

So when I come here, talk to a thinking model, and put in this question, what we're seeing is no longer just a straightforward simulation of a human data labeler; this is actually something new, unique, and interesting. Of course, OpenAI is not showing us the under-the-hood thinking and the chains of thought that underlie the reasoning, but we know such a thing exists, and what is shown is a summary of it. What we're getting here is not just an imitation of a human data labeler; it is something new and exciting in the sense that it is a function of thinking that emerged in a simulation. It doesn't just imitate a human data labeler; it comes out of this reinforcement learning process. Of course, we're not giving it a chance to shine here, because this is not a mathematical or reasoning problem; it is, roughly speaking, a kind of creative writing problem. And I think it is an open question whether the thinking strategies developed inside verifiable domains transfer and generalize to unverifiable domains such as creative writing. The extent to which that transfer happens is unknown in the field, I would say. So we're not sure whether we can do RL on everything that is verifiable and then see the benefits on things that are unverifiable, like this prompt. That remains an open question.

The other interesting thing is that this reinforcement learning is still very new, primordial, and nascent, so we're just seeing the beginnings of hints of greatness in reasoning problems. We're seeing something that is, in principle, capable of the equivalent of Move 37, but not in the game of Go: in open-domain thinking and problem solving. In principle, this paradigm can do something really cool, new, and exciting, something that no human has even thought of before; in principle, these models are capable of analogies no human has made. So I think it is incredibly exciting that these models exist, but again, it's very early and these are primordial models for now, and they will mostly shine in domains that are verifiable, like math and code. So they are very interesting to play with, think about, and use.

And that's roughly it. Those are the broad strokes of what's available right now. I will say that, overall, it is an extremely exciting time to be in the field. Personally, I use these models all the time, daily, tens or hundreds of times, because they dramatically accelerate my work, and I think a lot of people see the same thing. I think we're going to see a huge amount of wealth creation as a result of these models. But be aware of their shortcomings: even the RL models will suffer from some of them. Use them as a tool in your toolbox; don't trust them fully, because they will randomly do dumb things. They will randomly hallucinate, they will randomly skip over some mental arithmetic and not get it right, and they sometimes randomly can't count. So use them as tools, check their work, and own the product of your work, but use them for inspiration and for first drafts, ask them questions, and always check and verify, and you will be very successful in your work if you do so. I hope this video was useful and interesting to you, and I hope you had fun. It's already very long, so I apologize for that, but I hope it was useful, and I will see you later.
