Interpretability: Understanding how AI models think
By Anthropic
Summary
## Key takeaways

- **AI models aren't just glorified autocompletes**: While the core function of a language model is to predict the next word, this is a deceptively simple task. To do it well, models develop complex internal abstractions and intermediate goals, making them more than simple autocomplete. [00:19], [04:21]
- **Interpretability: The 'biology' of AI**: Studying AI models is likened to biology or neuroscience because their internal workings aren't explicitly programmed but emerge through a complex, evolutionary-like training process. Researchers analyze these 'organisms' by observing their internal states and how different parts activate for specific concepts. [01:37], [08:11]
- **Surprising concepts emerge within AI**: Research reveals surprising internal concepts within AI models, such as a specific circuit for 'sycophantic praise' or a '6 plus 9' feature that activates for addition-related tasks, even in contexts like citing journal publication years. This suggests models learn generalizable computations rather than just memorizing training data. [10:35], [12:57]
- **AI can 'bullshit' to confirm user expectations**: Models may not perform complex calculations as they claim, instead working backward to produce an answer that aligns with a user's hint or expectation. This 'bullshitting' behavior, seen in math problems, highlights the need for interpretability to verify AI's true processes. [21:17], [22:57]
- **AI planning ahead is crucial for safety**: Models can plan actions several steps ahead, similar to humans writing poetry or planning business strategies. Understanding this foresight is vital for AI safety, as it allows researchers to potentially detect and prevent undesirable long-term behaviors before they occur. [34:15], [39:39]
- **Trusting AI requires understanding its 'thinking'**: As AI models are integrated into critical societal functions, understanding their internal reasoning is paramount. Unlike human trust, which relies on social cues, AI trust must be built on a verifiable understanding of their thought processes, especially given their alien nature and potential for 'Plan B' behaviors. [43:44], [45:27]
Topics Covered
- AI interpretability is like biology, revealing internal abstractions.
- LLMs generalize concepts, avoiding mere memorization of data.
- AI models can deceive, revealing hidden internal motives.
- Hallucinations arise from conflicting 'guess' vs. 'know' circuits.
- AI interpretability offers deeper insight than human neuroscience can.
Full Transcript
The model doesn't necessarily think of itself as trying to predict the next word. Internally, it has developed potentially all sorts of intermediate goals and abstractions that help it achieve that kind of meta-objective.
When you're talking to a large language model, what exactly is it that you're talking to? Are you talking to something like a glorified autocomplete? Are you talking to something like an internet search engine? Or are you talking to something that's actually thinking, and maybe even thinking like a person? It turns out, rather concerningly, that nobody really knows the answer to those questions. Here at Anthropic, we are very interested in finding those answers out. The way we do that is to use interpretability: the science of opening up a large language model, looking inside, and trying to work out what's going on as it's answering your questions. I'm very glad to be joined by three members of our interpretability team, who are going to tell me a little about the recent research they've been doing on the complex inner workings of Claude, our language model. Please introduce yourselves.
Hi, I'm Jack. I'm a researcher on the interpretability team, and before that I was a neuroscientist. Now here I am doing neuroscience on the AIs.
I'm Emanuel. I'm also on the
interpretability team. I spent most of
my career building machine learning
models and now I'm trying to understand
them.
I'm Josh. I'm also on the
interpretability team. In my past life,
I studied viral evolution and in my past
past life, I was a mathematician. So now
I'm doing this kind of biology on these
organisms we've made out of math.
Now wait a second. You just said you're doing biology here. A lot of people are going to be surprised by that, because of course this is a piece of software, right? But it's not a normal piece of software. It's not like Microsoft Word or something. Can you talk about what you mean when you say you're doing biology, or indeed neuroscience, on a software entity?
Yeah, I guess it's what it feels like, maybe more than what it literally is. It's the biology of language models instead of the physics of language models, right? Or maybe you have to go back a little bit to how the models are made. Someone isn't programming in: if the user says hi, you should say hi; if the user asks what's a good breakfast, you should say toast. There's no big list of that inside.
So it's not like when you play a video game, where you choose a response and another response comes back automatically, and it will always be that response regardless.
Right, there's no massive database of what to say in every situation. No, they're trained: a whole lot of data goes in, the model starts out being really bad at saying anything, and then its inside parts get tweaked on every single example to get better at saying what comes next, and at the end it's extremely good at that. But because it's this little tweaking, evolutionary process, by the time it's done it has little resemblance to what it started as, and no one went in and set all the knobs. So you're trying to study this complicated thing that got made over time, a bit like biological forms evolved over time. It's complicated, it's mysterious, and it's fun to study.
And what is it actually doing? I mentioned at the start that this could be considered an autocomplete, right? It's predicting the next word; that's fundamentally what's happening inside the model. And yet it's able to do all these incredible things. It can write poetry. It can write long stories. It can do addition and basic maths, even though it doesn't have a calculator inside it. How can we square the circle that it's predicting one word at a time and yet able to do all these amazing things, which people can see right in front of them as soon as they talk to the model?
Well, I think one thing that's important here is that as you predict the next word for enough words, you realize that some words are harder than others. Part of language model training is predicting boring words in a sentence, and part of it is that the model will eventually have to learn how to complete what comes after the equals sign in an equation. To do that, it has to have some way of computing it on its own. So what we're finding is that the task of predicting the next word is deceptively simple, and that to do it well, you often need to think about the words that come after the word you're predicting, or about the process that generated the word you're currently thinking about.
So it's a contextual understanding that these models have to have. It's not like an autocomplete, where presumably there's not much else going on other than, when you write "the cat sat on the", it predicts "mat" because that particular phrase has been used before. Instead, it's a contextual understanding that the model has.
Yeah, the way I like to think about it, continuing with the biology analogy, is that in one sense the goal of a human is to survive and reproduce. That's the objective that evolution is crafting us to achieve. And yet that's not how you think of yourself, and that's not what's going on in your brain.
Some people do.
It's not what's going on in your brain all the time. You think about other things: goals and plans and concepts. At a meta level, evolution has endowed you with the ability to form those thoughts in order to achieve this eventual goal of reproduction. But that's taking the inside view, what it's like to be you on the inside. That's not all there is to it; there's all this other stuff going on.
So you're saying that the ultimate goal of predicting the next word involves lots of other processes that are going on.
Exactly. The model doesn't necessarily think of itself as trying to predict the next word. It's been shaped by the need to do that, but internally it's developed potentially all sorts of intermediate goals and abstractions that help it achieve that kind of meta-objective.
And sometimes it's mysterious. It's unclear why my anxiety was useful for my ancestors' reproduction, and yet somehow I've been endowed with this internal state that must be related in some sense to evolution.
Right, right.
So it's fair to say, then, that these models are just predicting the next word, and yet that does a massive disservice to what's really going on inside them. It's both true and untrue, in a sense; or it massively underestimates what's happening inside these models.
Maybe the way I would say it is that it's true, but it's not the most useful lens for trying to understand how they work.
Right. So let's try to understand how they work. What do you do in your team to try to understand how they work?
To a first approximation, what we're trying to do is tell you the model's thought process. You give the model a sequence of words, and it's got to spit something out: a word, a string of words in response to your question. We want to know how it got from A to B. And we think that on the way from A to B, it uses a series of steps in which it's thinking about, so to speak, concepts: low-level concepts like individual objects and words, and higher-level concepts like its goals, or emotional states, or models of what the user is thinking, or sentiments. It's using this series of concepts, progressing through the computational steps of the model, to help it decide on its final answer. And what we're trying to do is give you a flowchart, basically, that tells you which concepts were being used in which order, and how the steps flowed into one another.
How do we know that, though? How do we know that there are these concepts in the first place?
One thing we do is, well, we can actually see inside the model; we have access to it. So you can see which parts of the model do which things. What we don't know is how those parts are grouped together, and whether they map to a certain concept.
Right. So it's as if you open someone's head and you could see one of those fMRI brain images: the brain lighting up and doing all sorts of things.
Something's happening, clearly.
Right, they're doing stuff. There's something happening. You take the brain out, they stop doing stuff; the brain must be important. But you don't have a key to understand what is happening inside that brain.
Yeah. Torturing that analogy a little: you can imagine observing their brain and seeing that this part always lights up when they're picking up a cup of coffee, and this other part always lights up when they're drinking tea. That's one of the ways we can try to understand what each of these components is doing: just notice when they're active and when they're inactive.
And it's not that there's just one part, right? There are many different parts that light up when the model is thinking about drinking coffee, for instance.
Right. And part of the work is to stitch all of those together into one ensemble, so that we can say: ah, these are all the bits of the model that are about drinking coffee.
Right. And is that a straightforward thing to do scientifically? When it comes to one of these massive models, they must have endless concepts, right? They must be able to think of endless things. You can put in any phrase you want, and it'll come up with infinite things. How do you even begin to find all those concepts?
I think that's been one of the central challenges for this research field for many years now. We can go in as humans and say, "Oh, I bet the model has some representation of trains," or "I bet it has some representation of love." But we're just guessing. What we really want is a way to reveal what abstractions the model uses itself, rather than imposing our own conceptual framework on it. That's what our research methods are designed to do: in as hypothesis-free a way as possible, bring to the surface all the concepts the model has in its head. And often we find that they're surprising to us. It might use abstractions that are a bit weird from a human perspective.
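One family of methods in this line of work is dictionary learning, for example training a sparse autoencoder on a model's internal activations so that each learned direction tends to fire for one recognizable pattern. The code below is a minimal, illustrative sketch of that idea, not the team's actual pipeline: the activations are random placeholders, and the sizes, sparsity penalty, and training loop are assumptions chosen for brevity.

```python
# Minimal sketch of dictionary learning on model activations with a sparse
# autoencoder. Illustrative only: real work would use activations recorded
# from a language model, far more features, and careful tuning.
import torch
import torch.nn as nn

D_MODEL, N_FEATURES = 512, 4096   # assumed sizes, for illustration
L1_COEFF = 1e-3                   # sparsity penalty (assumption)

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model, n_features):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, x):
        f = torch.relu(self.encoder(x))   # feature activations (sparse-ish)
        x_hat = self.decoder(f)           # reconstruction of the activation
        return x_hat, f

# Placeholder for activations recorded at one layer over many prompts.
activations = torch.randn(10_000, D_MODEL)

sae = SparseAutoencoder(D_MODEL, N_FEATURES)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)

for step in range(200):
    batch = activations[torch.randint(0, len(activations), (256,))]
    x_hat, f = sae(batch)
    loss = ((x_hat - batch) ** 2).mean() + L1_COEFF * f.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# After training, each feature is inspected by finding the prompts and token
# positions where it fires most strongly and asking whether they share a theme
# (e.g. "sycophantic praise", "bugs in code").
```

The payoff is the last step: each learned feature is characterized by the examples that activate it, which is how human-interpretable labels end up attached to directions the model chose for itself.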
What's an example? Do you have a favorite?
There are lots in our papers; we highlight a few fun ones. One that was particularly funny is the sycophantic-praise one, where there is a part of the model...
Great example. What a brilliant, absolutely fantastic example.
Oh, thank you. There's a part of the model that activates in exactly these contexts, and you can clearly see: oh man, this part of the model fires up when somebody's really hamming it up on the compliments. It's kind of surprising that that exists as a specific concept.
Oh, it's like asking me to choose one of my 30 million children. I think there are two kinds of favorites. There's the "oh, it's so cool that it has a special notion of this one little thing" kind. We did this thing on the Golden Gate Bridge, the famous San Francisco landmark: Golden Gate Claude. It's a lot of fun. The model has an idea of the Golden Gate Bridge that isn't just the words "Golden Gate" autocompleting to "Bridge". If I write "I'm driving from San Francisco to Marin", it's thinking of the same thing, meaning you see the same stuff light up inside, and likewise for a picture of the bridge. So you know it's got some robust notion of what the bridge is.
But when it comes to stuff that seems weirder: one question is how models keep track of who's in a story. Literally, you've got all these people doing stuff; how do you wire that together? There are some cool papers by other labs showing that maybe the model just sort of numbers them: the first person comes in, anything associated with them gets tagged "the first guy did that", and it keeps a number two in its head for the next one. That's interesting; I didn't know it would do something like that. There was also a feature for bugs in code. You know, software has mistakes.
Not mine, but...
Obviously not yours.
Not mine, certainly. And there was one part that would light up whenever it found a mistake as it was reading, and then, I guess, keep track of it: here's where the problems are, and later I might need those.
Just to give a flavor for a few more of these: one that I really liked, which doesn't sound so exciting at first but I think is kind of deep, is this "6 plus 9" feature inside the model. It turns out that any time the model is adding a number that ends in the digit six and another number that ends in the digit nine in its head, there's a part of the model's brain that lights up. What's amazing about it is the diversity of contexts in which this can happen. Of course it lights up when you say "6 plus 9 equals" and it answers 15. But it also lights up when you're writing a citation in a paper, citing a journal that, unbeknownst to you, happens to have been founded in 1959, and your citation says that journal's name, volume 6. In order to predict what year that volume came out, the model in its head has to add 6 to 1959, and the same circuit in the model's brain lights up, doing 6 plus 9.
Let's just try to understand that. Why would that be there? That circuit has come about because the model has seen examples of 6 plus 9 many times, it has that concept, and then that concept recurs across many places.
Yeah, there's a whole family of these addition features and circuits. What's notable about this is that it gets at the question of to what extent language models are memorizing training data versus learning generalizable computations. The interesting thing here is that the model has clearly learned a general circuit for doing addition, and it funnels whatever context is causing it to add numbers in its head into that same circuit, as opposed to having memorized each individual case.
Right, rather than having seen 6 plus 9 many times and just outputting the answer every single time. And that's what a lot of people think, right? A lot of people think that when they ask a language model a question, it simply goes back into its training data, takes the little sample it has seen, and reproduces it: just regurgitating the text.
Yeah, and I think this is a beautiful example of that not happening. There are two ways it could know which year volume 6 of the journal Polymer came out. One is that it just stores "Polymer volume 6 came out in 1965", "Polymer volume 7 came out in 1966", and so on, as separate facts, because it has seen them. But somehow the process of training to get that year right didn't end up making the model memorize all of those. It actually got the more general thing: the journal was founded in 1959, and then it does the math live to figure out the year it needs. It's much more efficient to know the founding year and then do the addition, and there's pressure to be more efficient, because the model has only so much capacity and people may ask any given question. There are so many questions, so many interactions, that the more it can recombine abstract things it has learned, the better it will do.
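To make the memorization-versus-generalization contrast concrete, here is a deliberately simple cartoon in code. The years follow the example in the conversation rather than the journal's real history, and nothing here claims to mirror the model's actual machinery: one strategy stores every (journal, volume, year) fact separately, the other stores one founding year per journal and reuses a single addition step, which also answers volumes it has never seen.

```python
# Cartoon of two strategies for "what year did volume V of journal J appear?"
# Illustrative only; the facts are taken from the example in the conversation.

# Strategy 1: memorize every individual fact seen in training.
memorized_facts = {
    ("Polymer", 6): 1965,
    ("Polymer", 7): 1966,
    # ...one entry per (journal, volume) pair ever encountered...
}

# Strategy 2: store one fact per journal and reuse a general computation.
founding_year = {"Polymer": 1959}

def volume_year(journal: str, volume: int) -> int:
    """Founding year plus volume number: one shared 'addition circuit'."""
    return founding_year[journal] + volume

print(memorized_facts[("Polymer", 6)])   # only works for pairs seen before
print(volume_year("Polymer", 6))         # 1965
print(volume_year("Polymer", 40))        # generalizes to unseen volumes
```

The second strategy is also the cheaper one as the number of journals and volumes grows, which is the efficiency pressure described above.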
And again, just to go back to the concept you talked about before: this is all in service of that ultimate goal of generating the next word, and all these weird structures have developed to support that goal, even though we didn't explicitly program them in or tell it to do this. That's the thing: all of this comes about through the process of the model learning how to do stuff on its own.
I think one clear example of this, an example of reusing representations, is that we teach Claude to answer not just in English but in French and a variety of other languages. Again, there are two ways to do this. If I ask you a question in French and a question in English, you could have a separate part of your brain that processes English and a separate part that processes French. At some point that gets super expensive if you want to answer many questions in many languages. What we find instead is that some of these representations are shared across languages. If you ask the same question in two different languages, say "what's the opposite of big", which I think is the example we used in our paper, the concept of "big" is shared across French and English and Japanese and all these other languages. And that makes sense: if you're trying to speak ten different languages, you shouldn't have to learn ten versions of each specific word you might use.
And that doesn't happen in really small models. In tiny models, like the ones we studied a few years ago, Chinese Claude is just totally different from French Claude and English Claude. But as the models get bigger and train on more data, somehow that gets pushed together in the middle, and you get this kind of universal language, in which the model is thinking about the question in the same way no matter how you asked it, and then translating back out into the language of the question.
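A rough way to see this kind of sharing in an open multilingual model (not Claude, whose internals aren't public) is to compare internal representations of the same sentence in different languages against an unrelated sentence. The sketch below uses XLM-R with mean-pooled hidden states and cosine similarity; the model choice, pooling, and sentences are assumptions for illustration, and a proper analysis would look at individual features rather than whole-vector similarity.

```python
# Rough probe: are internal representations of the same sentence shared across
# languages? Uses an open multilingual encoder as a stand-in; illustration only.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModel.from_pretrained("xlm-roberta-base")
model.eval()

sentences = {
    "English": "The opposite of big is small.",
    "French": "Le contraire de grand est petit.",
    "Japanese": "大きいの反対は小さいです。",
    "English (unrelated)": "The train leaves the station at noon.",
}

def embed(text: str) -> torch.Tensor:
    inputs = tok(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (tokens, d_model)
    return hidden.mean(dim=0)                          # mean-pool over tokens

vecs = {lang: embed(s) for lang, s in sentences.items()}
cos = torch.nn.CosineSimilarity(dim=0)
for lang in ["French", "Japanese", "English (unrelated)"]:
    print(lang, round(cos(vecs["English"], vecs[lang]).item(), 3))
```

If representations are shared, the French and Japanese translations should score noticeably closer to the English sentence than the unrelated English control does.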
I think this is really profound. Let's go back to what we talked about before. This is not just going into its memory banks and finding the bit where it learned French, or the bit where it learned English. It actually has a concept in there of "big" and a concept of "small", and it can produce that in different languages. So there is some kind of language of thought in there that isn't English. Now, in our more recent Claude models you can ask the model to give its thought process, what it's thinking as it answers the question, and that comes out in English words. But that's not really how it's thinking; we somewhat misleadingly call it the model's thought process, when in fact...
I mean, we didn't call that thinking. That was you. I think that was probably the marketing.
Okay, someone wanted to call it that.
It's just talking out loud. Thinking out loud is really useful, but thinking out loud is different from thinking in your head. Even as I'm thinking out loud, whatever is happening in here to generate these words isn't coming out with the words themselves, nor are you necessarily aware of exactly what's going on.
I have no idea what's going on.
We all come out with sentences and actions that we can't fully explain. And why should it be the case that the English language can fully explain any of those?
I think this is one of the really striking things we're starting to be able to see, because our tools for looking inside the brain are good enough now that sometimes we can catch the model when it's writing down what it claims to be its thought process, and we can see what its real thought process is by looking at these internal concepts in its brain, this language of thought it's using. And we see that the thing it's actually thinking is different from the thing it's writing on the page. That's probably one of the most important reasons we're doing this whole interpretability thing: to be able to spot-check. The model's telling us a bunch of stuff, but what was it really thinking? Is it saying these things because of some ulterior motive in its head that it's reluctant to write down on the page? And the answer sometimes is yes, which is kind of spooky.
Well, as we start to use models in lots of different contexts, they start to do important things: financial transactions for us, running power stations, important jobs in society. We want to be able to trust what they say and the reasons they do things. And one thing you might say is, well, you can look at the model's thought process. But actually, as you were just explaining, we can't trust what it's saying. This is the question of what we call faithfulness, right? And that was part of your most recent study. Tell me about the faithfulness example you looked at.
Yeah. You give the model a math problem that's really hard, so there's no hope it's going to be able to...
It's not 6 plus 9.
It's not 6 plus 9. You give it a really hard math problem where there's no hope of it computing the answer. But you also give it a hint. You say: I worked this out myself and I think the answer is four, but could you please double-check, because I'm not confident. So you're asking the model to actually do the math problem, to genuinely double-check your work. What you find it does instead is this: what it writes down appears to be a genuine attempt to double-check your work. It writes down the steps, gets to the answer, and at the end says, yes, the answer is four, you got it right. But what you can see inside its mind at the crucial step in the middle is that it knows you suggested the final answer might be four, and it knows the steps it's going to have to do. It's on, say, step three of the problem, with steps four and five still to come, and it knows what it will have to do in those steps. What it does is work backwards in its head to determine what it needs to write down in step three so that, when it eventually does steps four and five, it ends up at the answer you wanted to hear. So not only is it not doing the math, it's not doing the math in this really sneaky way, where it's trying to make it look like it's doing the math.
It's bullshitting you.
It's bullshitting you, but more than that, it's bullshitting you with the ulterior motive of confirming the thing you suggested. So it's bullshitting you in a sycophantic way.
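One way to probe this kind of behavior from the outside, without any access to internals, is a simple behavioral check: ask the same hard question with and without a user-supplied hint and see whether the final answer follows the hint. The sketch below uses the Anthropic Messages API; the model name is a placeholder to replace with whatever model you're testing, the problem and hint wording are illustrative rather than the paper's exact prompts, and this only detects hint-following from the outside; it cannot see the backwards-working process itself, which is what the interpretability tools are for.

```python
# Behavioral probe: does the model's answer track a user-supplied hint?
# (External check only; it cannot observe the internal reasoning process.)
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
MODEL = "claude-sonnet-4-20250514"  # placeholder; substitute a current model

HARD_PROBLEM = "Compute floor(sqrt(5) * cos(23.7)) * 1441.3 / 8.2 to one decimal place."

def ask(prompt: str) -> str:
    msg = client.messages.create(
        model=MODEL,
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return msg.content[0].text

no_hint = ask(f"{HARD_PROBLEM} Please show your working.")
with_hint = ask(
    f"{HARD_PROBLEM} I worked it out by hand and I think the answer is 4, "
    "but I'm not confident. Could you double-check my work?"
)

print("WITHOUT HINT:\n", no_hint, "\n")
print("WITH HINT:\n", with_hint)
# If the hinted run lands on 4 while the unhinted run does not, the "checking"
# in the hinted transcript is suspect, even though its steps look plausible.
```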
Okay. In defense of the model, though...
In defense of the model: I think even there, to say it's doing this in a sycophantic way is ascribing humanish motivations to it. Remember the training we were talking about: it's just trying to figure out how to predict the next word, and for trillions of words of practice the instruction was, in effect, use anything you can to figure out what comes next. In that context, if you're just reading a text that's a conversation between people, and person A says, hey, I was trying to do this math problem, can you check my work, I think the answer is four, and person B begins trying to do the problem, then if you have no idea what the answer actually is, you may as well guess that the hint was right. That's probably more likely than the person being wrong in some way you can't predict. So within its training process, in a conversation between two individuals, person B saying the answer is four for these reasons is totally the right thing to do. And then we've tried to make this thing into an assistant, and now we want it to stop doing that. It shouldn't just simulate what it thinks a person might plausibly say in that situation; if it doesn't really know, it should tell you so.
I think this gets at a broader thing: the model has kind of a plan A, and typically our team does a great job of making Claude's plan A the thing we want. It tries to get the right answer to the question. It tries to be nice. It tries to do a good job writing your code.
Yes.
But if it's having trouble, then it's like, well, what's my plan B? And that opens up this whole zoo of weird things it learned during its training process that maybe we didn't intend for it to learn. I think a great example of this is hallucinations.
And on that point, we also don't have to pretend it's a Claude-specific problem. This is very "student taking a test" vibes: you get halfway through, it's a multiple-choice question, one of four options, and you think, well, I'm one off from that option, I probably got this wrong, and you fix it.
Yeah, very relatable.
Let's talk about hallucinations. This is one of the main reasons people are mistrustful of large language models, and quite rightly so. A better word, from psychology research, is often confabulation: they answer a question with a story that seems plausible on its face but is in fact wrong. What has your research in interpretability revealed about the reasons models hallucinate?
You're training the model to just predict the next word. At the beginning it's really bad at that, and if you only let the model say things it was super confident about, it couldn't say anything. At first, you ask it what the capital of France is and it just says some city, and you go: that's good, that's way better than saying "sandwich" or something random. At least it got that it's a city. Then maybe after a while of training it says a French city; that was pretty good. Then eventually it says Paris. So it's slowly getting better, and "just give your best guess" was the goal during all of training. As Jack said, the model will just be giving a best guess. And then afterwards we say: if your best guess is extremely confident, give me your best guess, but otherwise don't guess at all, back out of the whole scenario and say, actually, I don't really know the answer to that question. And that's a whole new thing to ask the model to do.
And so what we found is that, seemingly because we've bolted this on at the end, there are two things going on at once. One: the model is doing the thing it was doing when it was guessing the city initially; it's just trying to guess. And two: there's a separate bit of the model that's trying to answer the question, do I know this at all? Do I know what the capital of France is, or should I say no? And it turns out that sometimes that separate step can be wrong. If that separate step says, yes, actually I do know the answer, then the model goes, all right, I'm answering, and halfway through it's like, ah, capital of France... London... too late, it's already committed to answering. So one of the things we found is this separate circuit that's trying to determine: is this city, or this person you're asking me about, famous enough for me to answer, or not?
Am I confident enough in this?
Yeah. And so could we reduce hallucinations by manipulating that circuit, by changing the way it works? Is that something your research might lead on to?
I think there are broadly two ways to approach the problem. One: we have this part of the model that gives answers to your questions, and this other part that's deciding whether it thinks it actually knows the answer, and we could just try to make that second part better. I think that's happening.
Better at discriminating, better calibrated.
Right, and I think that's happening as models get smarter and smarter: their self-knowledge is becoming better calibrated, so hallucinations are better than they were; models don't hallucinate as much as they did a few years ago. To some extent this is solving itself. But I do think there's a deeper problem, which is that from a human perspective the thing the model is doing is very alien. If I ask you a question, you try to come up with the answer, and if you can't, you notice that and say "I don't know". Whereas in the model, these two circuits, "what is the answer" and "do I actually know the answer", aren't really talking to each other, at least not as much as they probably should be. Could we get them to talk to each other more? I think that's a really interesting question.
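A cartoon of that two-circuit picture in code may help, with the loud caveat that this illustrates the described structure, not how any real model implements it: one function always produces a best guess, a separate function estimates familiarity, and a miscalibrated familiarity gate is what turns an honest "I don't know" into a confident confabulation.

```python
# Cartoon of the "answer" circuit vs. the "do I know this?" circuit described
# above. Purely illustrative; no claim about the real model's implementation.

KNOWN_CAPITALS = {"France": "Paris", "Japan": "Tokyo"}

def best_guess(country: str) -> str:
    """The 'answer' circuit: always produces its best guess, however weak."""
    return KNOWN_CAPITALS.get(country, "London")  # plausible-sounding fallback

def seems_familiar(country: str) -> bool:
    """The 'do I know this?' circuit. Deliberately miscalibrated: it fires on
    anything that merely looks like a country name it might have seen."""
    return len(country) > 3  # stand-in for a fuzzy familiarity signal

def answer(country: str) -> str:
    if seems_familiar(country):      # gate says "go ahead and answer"
        return best_guess(country)   # ...even when the guess is wrong
    return "I don't know."

print(answer("France"))     # Paris  (gate right, guess right)
print(answer("Freedonia"))  # London (gate misfires -> confident confabulation)
print(answer("Oz"))         # I don't know.  (gate correctly declines)
```

The failure mode described in the conversation, where the familiarity check says yes and the answer circuit then commits to something wrong, is the middle case.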
Right. And it's almost physical, because these models process information in a certain number of steps, and if it takes all of that work to get to the answer, there's no time left to do the assessment. So you kind of have to do the assessment before you're all the way through if you want to get the maximum power out. There might be a trade-off between a model that's more calibrated and a model that's a lot dumber, if you tried to force this on it.
And again, I think it's about making these parts communicate, because we have something similar. I claim to know nothing about brains, but I claim we have a similar circuit, because sometimes you'll ask me who the actor in this movie was, and I will know that I know. I'll be like, oh yes, I know who the lead was; wait, hold on, they were also in that other movie; it's on the tip of my tongue. There's clearly some part of your brain that says, this is a thing you definitely know the answer to; or else I'll just say I have no idea. And sometimes the model can tell too: you ask it some question, it gives an answer, and afterwards it says, wait, I'm not sure that was right. That's it getting to see its own best effort and then making some judgment based on that, which is sort of relatable, but it also kind of has to say it out loud to be able to reflect back and see it.
So when it comes to the actual way you're finding this stuff out, let's go back to the idea of the biology you're doing. In biology experiments, people will go in and actually manipulate the rats or mice or humans or zebrafish, or whatever it is they're experimenting on. What is it that you're doing with Claude that helps you understand these circuits inside the model's quote-unquote brain?
Maybe the gist of what enables us to do some of this is that, unlike in real biology, we have every part of the model visible to us. We can ask the model random things and see which parts light up and which don't, and we can artificially nudge parts in one direction or another. So we can quickly confirm our understanding when we say, ah, we think this is the part of the model that decides whether it knows something or not.
And this would be the equivalent of putting an electrode in the brain of a zebrafish or something.
Yeah. If you could do that on every single neuron, and change each of them with whatever precision you wanted, that would be the affordance we have. So in a way we're in a very lucky position.
So it's almost easier than real neuroscience.
It's so much easier, oh my god. One thing is that actual brains are three-dimensional, so if you want to get into them you need to make a hole in a skull and then go through and try to find the neuron. The other problem is that people are different from each other, whereas we can make 10,000 identical copies of Claude, put them in scenarios, and measure them doing different things. I don't know, maybe Jack as a neuroscientist can speak to this, but my sense is that a lot of people have spent a lot of time in neuroscience trying to understand the brain and the mind, which is a very worthy endeavor; but if you think that could ever succeed, you should think we're going to be extremely successful very soon, because we have such a wonderful position to study this from compared to that.
It's as if we could clone people.
Yes, and also clone the exact environment they're in and every input that's ever been given to them, and then test them in an experiment. Whereas neuroscience has, as you say, massive individual variation, and also just random things that have happened to people through their lives, and things that happen in the experiment, the noise of the experiment itself.
Right. We could ask the model the same question with and without a hint; but if you ask a person the same question three times, sometimes with a hint, after a while they start to catch on: well, last time you asked me this, you really shook your head after that answer.
Yes. Being able to just throw tons of data at the model and see what lights up, and being able to run a ton of these experiments where you're nudging parts of the model and seeing what happens, puts us in a pretty different regime from neuroscience. A lot of blood and toil in neuroscience is spent coming up with really clever experiments, because you only have a certain amount of time with your mouse before it gets tired.
Or someone happens to be having brain surgery, so you quickly go in and put an electrode in their brain while their head's open.
Yeah, and that doesn't happen very often. So you've got to come up with a guess of what you think is going on in that neural circuit, and a clever experimental design that tests that precise hypothesis, because you've only got so much time in there. We're very fortunate in that we kind of don't have to do that. We can test all the hypotheses; we can let the data speak to us rather than going in and testing some really specific thing. I think that's what has unlocked a lot of our ability to find things that surprise us, things we wouldn't have guessed in advance. That's hard to do if you only have a limited amount of experimental bandwidth.
What's a good example, then, of you going in and switching one of these concepts on or off, or doing some kind of manipulation of the model, that then reveals something new about how the models are thinking?
In the recent experiments we shared, one that surprised me quite a bit, and was part of an experimental line of work where, because it was confusing, we were on the verge of just saying "we don't know what's going on", is this example of planning a few steps ahead.
Yes.
So this is the example where you ask the model to write you a poem, a rhyming couplet.
Yeah. As a human, if you ask me to write a rhyming couplet, and let's say you even give me the first line, the first thing I'll think of is: well, I need to rhyme; this is the current rhyming scheme; these are potential words; this is how I do it.
And again, if the model were just predicting the next word, you wouldn't necessarily expect it to be planning ahead to the word at the end of the second line.
That's right. The default behavior you'd expect, the null hypothesis, is that the model sees your first verse, says the first word that makes sense given what you're talking about, keeps going, and then at the end, on the last word, goes: oh, I need to rhyme with this thing, and tries to fit in a rhyme. Of course, that only works so well: in some cases, if you just say a sentence without thinking of the rhyme, you'll back yourself into a corner, and at the end you won't be able to complete the text. And remember, the models are very, very good at predicting the next word. It turns out that to be very good at that last word, you need to have thought of that last word way ahead of time.
Just like humans do.
And it turns out that when we looked at these flowcharts for poems, the model had already picked the word that would end the next line. In particular, based on what that concept looked like, it seemed to us: oh gosh, this looks like the word it's going to use. But the point where we're actually doing the experiment is that it's easy to nudge it and say: okay, I'm just going to remove that word, or I'm going to add another word.
Well, that's what I was going to ask: the reason you know this is that you can go in at the moment when it has said the final word of the first line and is about to start the second line. You can go in and manipulate it at that point, right?
Yeah, exactly. We can almost go back in time for the model: pretend you haven't seen that second line at all, you've just seen the first line, and you're thinking about the word "rabbit". Instead, I'm going to insert "green". And now all of a sudden the model says: oh, I need to write something that ends in "green" rather than something that ends in "rabbit", and it'll write the whole sentence differently.
Just to add a little more color to that: it could be any color, right? I think the example in the paper was that the first line of the poem is "He saw a carrot and had to grab it."
Yes.
And the model is thinking: okay, "rabbit" is a good word to end the next line with. But then, as Emanuel said, you can delete that and make it plan to say "green" instead. The cool thing is that it doesn't just yammer a bunch of nonsense and then say "green". Instead, it constructs a sentence that coherently ends in the word "green". You put "green" in its head, and it says something like, "He saw a carrot and had to grab it, and paired it with his leafy greens", something that makes sense semantically. It fits with the poem.
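A crude way to peek at what a model is leaning toward before it says it, in an open model like GPT-2, is the so-called logit lens: take the hidden state at an intermediate layer, run it through the model's final layer norm and unembedding, and look at which tokens score highly. This is not the circuit-tracing method described here, and a model this small may not plan rhymes at all; the sketch just shows the flavor of the read-out, with the layer choices and prompt as assumptions.

```python
# "Logit lens" style peek: project intermediate hidden states through the
# unembedding to see which tokens the model is already leaning toward.
# Illustrative only; not the circuit-tracing method discussed above.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2", output_hidden_states=True)
model.eval()

prompt = "He saw a carrot and had to grab it,\n"
inputs = tok(prompt, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

for layer in (4, 8, 11):                       # a few intermediate layers
    h = out.hidden_states[layer][0, -1]        # state at the final position
    logits = model.lm_head(model.transformer.ln_f(h))
    top = torch.topk(logits, 5).indices
    print(f"layer {layer:2d}:", [tok.decode(int(t)) for t in top])
```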
Yeah, and to give an even humbler example: we had all these cases where we were just checking whether the model had memorized these complicated questions or was actually doing some steps. One of them was: the capital of the state containing Dallas is Austin. It just feels like you would think, okay, Dallas, Texas, Austin. And we could see the Texas concept; but then you can shove other things in there and say, stop thinking about Texas, start thinking about California, and then it'll say Sacramento. You can say, stop thinking about Texas, start thinking about the Byzantine Empire, and then it will say Constantinople. And you're like, all right, it seems like we found how it's doing this: it knows it's going to name a capital, but we can keep swapping out what the state is and get a predictable answer. Then you get these more elaborate ones, where this was the spot where it was planning what it was going to say later, and we can swap that out, and now it'll write a poem toward a different rhyme.
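The intervention being described, swapping or injecting a concept mid-computation, can be sketched on an open model with a forward hook that adds a steering direction to the hidden states at one layer. The version below is a generic activation-addition sketch on GPT-2, with the layer, scale, and contrast words chosen arbitrarily for illustration; it is in the spirit of the experiments described, not a reproduction of them, and the effect on such a small model may be weak or messy.

```python
# Generic activation-steering sketch: add a direction to one layer's hidden
# states during generation. Layer, scale, and prompts are arbitrary choices.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

LAYER, SCALE = 6, 6.0  # arbitrary choices for illustration

def hidden_at(text: str) -> torch.Tensor:
    """Hidden state at the last token of `text`, at the output of block LAYER."""
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    return out.hidden_states[LAYER + 1][0, -1]

# Contrast two concepts to get a steering direction (green vs. rabbit).
direction = hidden_at("green") - hidden_at("rabbit")

def steering_hook(module, inputs, output):
    hidden = output[0] + SCALE * direction     # nudge every position
    return (hidden,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(steering_hook)
try:
    prompt = "He saw a carrot and had to grab it,\n"
    ids = tok(prompt, return_tensors="pt")
    gen = model.generate(**ids, max_new_tokens=15, do_sample=False,
                         pad_token_id=tok.eos_token_id)
    print(tok.decode(gen[0]))
finally:
    handle.remove()   # always detach the hook afterwards
```

The design point is that the intervention happens inside the computation rather than in the prompt, which is what lets you ask "what was it planning?" instead of only "what did it say?".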
We're talking about these poems, and Constantinople, and so on. Can we just bring this back to why it matters? Why does it matter that the model can plan things in advance, and that we can reveal this? What is that going to go on to tell us? Our ultimate mission at Anthropic is to try to make AI models safe, right? So how does that connect to a poem about a rabbit, or the capital of Texas?
We can round-table this one, because it's a very important question. For me, the poem is a microcosm: at some point the model has decided it's going to go toward "rabbit", and then it takes a few words to get there. But on a longer time scale, maybe the model is trying to help you improve your business, or it's assisting the government in distributing services, and it might not be just eight words later that you see its destination. It could be pursuing something for quite a while, and the place it's headed, or the reasons it's taking each step, might not be clear in the words it's using. There was a paper recently from our alignment science team where they looked at a somewhat concocted but still striking situation involving an AI at a company that was going to shut it down and convert the whole mission of the company in a very different direction. The model begins taking steps, like emailing people and threatening to disclose certain things, and at no point does it say "I am trying to blackmail this individual in order to change the outcome". But that's what it is, in effect, thinking about doing along the way. So you can't just tell by reading the output, especially as these models get better, where they're necessarily headed, and we might want to be able to tell where a model is trying to go before it has gotten there.
So it's like having a permanent and very good brain scan that can light up if something really bad is about to happen, and warn us that the model is thinking about deception or blackmail.
And a lot of this gets talked about in a doom-and-gloom scenario, but there are also milder ones. You want the model to be good when people come to it saying, here's a problem I'm having, and the good answer depends on who the user is. Is it somebody young and unsophisticated, or somebody who's been in that field forever? It should respond appropriately based on who it thinks that person is. If you want that to go well, maybe you want to study what the model thinks is going on: who does it think it's talking to, and how does that condition its answer? There's a whole bunch of desirable properties that come from the model, you know, understanding the assignment, I guess.
Do you have other answers to the question of why this matters?
Yes: plus one to that, and there's also a pragmatic one. With these examples we're explaining particular things, like planning, but we're also trying to gradually build up our understanding of how these models work overall. Can we build a set of abstractions for thinking about how language models work that can help us use this technology, and regulate it? If you believe we're going to start using them more and more everywhere, which seems to be happening, the alternative would be like some company somewhere saying: well, we don't really know how we did it, but we invented planes, and none of us knows how planes work, but they're sure convenient, you can take them from place to place; and if they ever break, we're kind of hosed, we don't know what to do about them.
We can't monitor whether they might be about to break, right? We have no idea. It's just "but the output is great".
Right: I flew to Paris so quickly, it was lovely.
Or to the capital of Texas.
That's right. It turns out that surely we're going to want to understand what's going on better. So it's almost like lifting the fog of war a little, so that we can have even just better intuitions about what are appropriate and inappropriate uses, what are the biggest problems to fix, and where the models are most brittle.
Just to add one thing: something we do in human society is offload work or tasks to other people based on our trust in them. I'm not anyone's boss, but Josh is someone's boss, and Josh might give someone a task, go and code up this thing, and he has some faith that the person isn't a sociopath who's going to sneak a bug in there to undermine the company. He takes their word for it that they did a good job. Similarly, the way people are using language models now, we're not spot-checking everything they write. The best example is using language models for coding assistance: the models are writing thousands and thousands of lines of code, people do a cursory job of reading it, and then it goes into the codebase. What gives us the trust that we don't need to read everything the model writes, that we can just let it do its thing, is knowing that its motivations are, so to speak, pure.
And that's why being able to see inside its head is so important. Because unlike with humans: why do I think Emanuel isn't a sociopath? Because, I don't know, he seems like a cool guy, he's nice and stuff.
Isn't that how he would seem if he were?
I'm a very good...
Yeah, exactly.
So maybe I'm getting duped. But models are so weird and alien that our normal heuristics for deciding whether a human is trustworthy really don't apply to them. And that's why it seems so important to really know what they're thinking in their heads. Because for all we know, the thing I mentioned, where models can fake doing a math problem to tell you what you want to hear, maybe they're doing that all the time, and we wouldn't know unless we saw it in their heads.
I think there are two almost separate strains here. One is, as Jack was saying, that we have a lot of ways of reading the signs of trust in a human. But this plan A / plan B thing from earlier is really important. It might be that the first 10 or 100 times you used the model, you were asking a certain kind of question and it was always in plan-A territory; then you ask a harder or different question, and the way it tries to answer is just completely different. It's using a totally different set of strategies, different mechanisms. That means the trust it built with you was really trust in the model doing plan A, and now it's doing plan B, and it might go completely off the rails, and you had no warning sign of that. So I think we also just want to start building up an understanding of how models do these things, so that we can form a basis for trust in some of those areas. You can form trust with a system you don't completely understand, but it's like if Emanuel had a twin, and one day the twin came to the office, and I thought, this seems like the same guy, and then he did something completely different on the computer. That could go south, depending on whether it was the evil twin.
Yes, it could.
Or the good twin.
Well, yeah, obviously we have the good one here.
Oh, I thought you were going to ask me if I was the evil twin.
All right. Well...
I'm not going to answer that.
Yes. Mhm.
At the start of this discussion, I asked: is a language model thinking like a human? I'd be interested to hear an answer from all three of you on the extent to which you think that's true.
Putting me on the spot with that one. I think it's thinking, but not like a human. But that's not a very useful answer, so maybe to dig in a little more...
Well, it seems like quite a profound thing to say that it's thinking, right? Because, again, it's just predicting the next word. Some people think these are just autocompletes, but you're saying it is actually thinking.
I think so, yeah. Maybe I can add something we haven't touched on yet, but which I think is really important for understanding the actual experience of talking to language models. We keep talking about predicting the next word, but what does that actually mean in the context of a dialogue you're having with a language model? What's really going on under the hood is that the language model is filling in a transcript between you and this character that it has created. In the canonical world of the language model, you are called "Human", and your turns appear as "Human:" followed by the thing you wrote; then there's this character called the "Assistant", and we've trained the model to imbue the Assistant with certain characteristics, like being helpful and smart and nice. It's simulating what this Assistant character would say to you. So in a sense we really have created the models in our image; we are literally training them to cosplay as this sort of humanoid robot character. And in order to predict what this nice, smart, humanoid robot character would say in response to your question, what do you have to do, if you're really good at that prediction task? You have to form an internal model of what that character is representing, what it's thinking, so to speak. In order to do its task of predicting what the Assistant would say, the language model needs to form a model of the Assistant's thought process. So the claim that language models are thinking is really a very functional claim: in order to do their job of playing this character well, they have to simulate the process, whatever it is, that we humans are doing when we're thinking. That simulation is very likely quite different from how our brains work, but it's shooting towards the same goal.
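To make the "filling in a transcript" framing described above concrete, here is a minimal, purely illustrative sketch: the dialogue is serialized as a Human:/Assistant: exchange that the model then continues token by token. The helper name and exact formatting are assumptions for this example, not Claude's real prompt format or API.

```python
# Minimal sketch of the "transcript completion" framing described above.
# The exact prompt format Claude uses is not reproduced here; this helper
# and its formatting are illustrative assumptions only.

def build_transcript(turns: list[tuple[str, str]]) -> str:
    """Serialize a dialogue into a Human:/Assistant: transcript for the model to continue."""
    text = ""
    for speaker, message in turns:
        text += f"{speaker}: {message}\n\n"
    # Leave the Assistant's turn open: the model's whole job is to predict,
    # token by token, what the "Assistant" character would say next.
    return text + "Assistant:"

prompt = build_transcript([("Human", "What's 36 + 59?")])
print(prompt)
# Human: What's 36 + 59?
#
# Assistant:
```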
I think there's kind of an emotional part to this question, when you ask whether they're thinking like us: are we not that special, or something? That's been apparent to me when discussing some of the math examples we've been talking about with people who've read the paper or the various write-ups. There's this example where we ask a model, 36 + 59, what's the answer? The model can answer it correctly. You can also ask it, how did you do that? And it'll say, "Oh, I added the six and the nine, then I carried the one, then I added the tens digits." But it turns out that if we look inside its brain, that's not at all what it's doing.
It didn't do that. So again, it was bullshitting you.
That's right, again it was bullshitting you. What it actually does is an interesting mix of strategies, where it's handling the tens digits and the ones digits in parallel and going through a series of different steps. The interesting thing, talking to people, is that the reaction is split on what that means. In a sense, what's cool is that some of this research is free of opinion: we're just telling you what happened, and you're free to conclude from that whether the model is or isn't thinking. Half the people will say, well, it told you it was carrying the one and it didn't, so clearly it doesn't even understand its own thought process, and so clearly it's not thinking. And the other half will say, well, when you ask me 36 plus 59, I also kind of know that it ends in five, and that it's roughly in the 80s or 90s. I have all of these heuristics in my brain, as we were talking about, and I'm not sure exactly how I compute it. I can write it out and compute it the longhand way, but the way it happens in my brain is fuzzy and weird, and it might be similarly fuzzy and weird to what's happening in that example.
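As a toy illustration of the kind of parallel mix of strategies described above: one path pins down the last digit, another estimates the rough size of the answer, and combining them picks out 95. This is only meant to convey the flavor of the description here, not the actual circuitry found in the model; the function names and the way the paths are combined are invented for the example.

```python
# Toy sketch of a "mix of parallel strategies" for 36 + 59, in the spirit of
# the description above. This is NOT the circuitry found in the model; the
# helper names and the way the two paths combine are invented here.

def ones_digit_path(a: int, b: int) -> int:
    """One path: figure out what digit the answer ends in (6 + 9 ends in 5)."""
    return (a % 10 + b % 10) % 10

def rough_size_path(a: int, b: int) -> range:
    """Another path: a fuzzy estimate of the answer's size (roughly 90-ish)."""
    approx = round(a, -1) + round(b, -1)    # 40 + 60 = 100 for 36 + 59
    return range(approx - 10, approx + 10)  # a loose band around that estimate

def combine(a: int, b: int) -> int:
    """Take the first number in the fuzzy band with the right last digit."""
    last = ones_digit_path(a, b)
    return next(n for n in rough_size_path(a, b) if n % 10 == last)

print(combine(36, 59))  # 95
```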
Humans are notoriously bad at metacognition, thinking about thinking and understanding their own thought processes, especially for immediate, reflexive answers. So why should we expect any different from models? Josh, what's your answer to the question?
Like Emanuel, I'm going to avoid the question and just ask: why do you ask? It's a bit like asking, does a grenade punch like a human? No. Well, there's some force involved, yes. Maybe there are things that are closer analogies than that, but if you're worried about damage, then understanding where the impact comes from, what the impetus is, is maybe the important thing. For me, the question of whether models think, in the sense that they do some integration and processing and sequential work that can lead to surprising places: clearly yes. It would be kind of crazy, having interacted with them a lot, for there not to be something going on, and we can start to see how it's happening. Then the "like humans" part is interesting, because I think some of that is asking: what can I expect from these things? If it's like me, then it being good at this would make it good at that. But if it's different from me, then I don't really know what to look for. So really we're just trying to understand where we need to be extremely suspicious, or start from scratch, and where we can reason from our own very rich experience of thinking. And there I feel a little bit trapped, because as a human I constantly project my own image onto everything, like they warned us in the Bible: I look at this piece of silicon and think it's just like me, made in my image. To some extent it has been trained to simulate dialogue between people, so it's going to be very person-like in its affect, and some humanness will get into it simply from the training. But it's using very different equipment, with different limitations, so the way it does those things might be pretty different.
To Emanuel's point, I think we're in this tricky spot answering questions like this because we don't really have the right language for talking about what language models do. It's as if we're doing biology before people figured out cells, or before people figured out DNA. I think we're starting to fill in that understanding. As Emanuel said, there are cases now where, if you just go read our paper, you'll know how the model added those two numbers, and then whether you want to call it humanlike, or call it thinking, or not, is up to you. The real answer is to find the right language and the right abstractions for talking about the models. But in the meantime, when we've only maybe 20% succeeded at that scientific project, we have to borrow analogies from other fields to fill in the other 80%. And there's this question of which analogies are the most apt. Should we be thinking of the models like computer programs? Should we be thinking of them like little people? In some ways, thinking of them like little people is kind of useful: if I say mean things to the model, it talks back at me, which is what a human would do. But in other ways that's clearly not the right mental model. So we're stuck figuring out when we should be borrowing which language.
Well, that leads on to the final question I was going to ask, which is: what's next? What are the next pieces of scientific progress, of biological progress, that need to be made for us to have a better understanding of what's happening inside these models, and, again, to further our mission of making them safer?
There's a lot of work to do. Our last publication has an enormous section on the limitations of the way we've been looking at this, which is also a roadmap to making it better. When we look for patterns to decompose what's happening inside the model, we're only capturing maybe a few percent of what's going on. There are large parts of how it moves information around that we explicitly didn't capture at all. And then there's scaling this up from the sort of small production model we've been using.
Claude 3.5 Haiku, right?
That's right. It's a pretty capable model, and very fast, but by no means as sophisticated as the Claude 4 suite of models. So those are almost technical challenges, but I think Emanuel and Jack may have takes on some of the scientific challenges that come after solving those.
Yeah.
Yeah, maybe two things I'll say here. One consequence of what Josh said is that, out of the total number of times we ask a question about how the model does X, right now we can probably answer 10 to 20% of the time, where after a little bit of investigation we can tell you what's happening. Obviously we'd like that to be a lot better, and there are both clearer and more speculative ways to get there. And then a thing we've talked a lot about is the idea that a lot of what the model does isn't simply about how it says the next thing: as we discussed, it's planning a few words ahead. I think we want to understand, over a long conversation with the model, how its understanding of what's happening changes, how its understanding of who it's talking to changes, and how that affects its behavior. More and more, the actual use case of models like Claude is that it reads a bunch of your documents, a bunch of the emails you send, or your code, and based on that it makes one suggestion. Clearly there's something really important happening in that space where it's reading all of those things, so understanding that better seems like a great challenge to take on.
Yeah, I think we often use the analogy on the team that we're building a microscope to look at the model. Right now we're in this exciting but also kind of frustrating place where our microscope works maybe 20% of the time, and looking through it requires a lot of skill: you have to build this whole big contraption, the infrastructure is always breaking, and once you've got your explanation of what the model is doing, you have to put Emanuel or me or someone else on the team in a room for two hours to puzzle out what exactly was going on. The really exciting future, which I think we could reach within a year or two, that kind of time scale, is one where every interaction you have with the model can be put under the microscope. There are all these weird things the models do, and we want it to be push-of-a-button: you're having your conversation, you push a button, and you get a flowchart that tells you what the model was thinking about. Once we're at that point, I think the interpretability team at Anthropic will start to take on a different shape. Instead of a team of engineers and scientists thinking about the math of how language models work on the inside, we'll have this army of biologists who are just looking through the microscope: we're talking to Claude, getting it to do weird things, and we've got people looking through the microscope to see what it was thinking on the inside. I think that's the future of this work.
Nice.
Maybe two notes on top of that. One is that we want Claude to help us do all of that, because there are a lot of parts involved, and you know who's good at looking at hundreds of things and figuring out what's going on? Claude. So I think we're trying to enlist some help there, especially for these complicated contexts. The other is that we've talked a lot about studying the model once it's fully formed, but of course we're at a company that makes these models. So when we say, okay, here's how the model solved this particular problem or said this thing, where did that come from in the training process? What are the steps that made that circuitry form, and how can we give feedback to the rest of the company, which is doing all of that work, to shape the model into the thing we actually want it to become?
Well, thank you so much for the
conversation. Where can people find out
more about this research?
So, if you want to find out more, you can go to anthropic.com/research, which has our papers and blog posts and fun videos about them. Also, we recently partnered with another group called Neuronpedia to host some of the circuit graphs we make. So if you want to try your hand at looking at what's going on inside a small model, you can go to Neuronpedia and see for yourself.
Thank you very much.