Anthropic’s philosopher answers your questions
By Anthropic
Summary
## Key takeaways

- **Philosophers increasingly taking AI seriously**: There's a split among philosophers, but increasingly many take AI seriously as models become more capable and societal impacts, such as those on education, emerge. Early on, worrying about AI was lumped together with hyping it, creating antagonism, but the views are now detaching: you can see AI as a big deal and still be skeptical of it. [01:30], [02:06]
- **Philosophy meets engineering realities**: Philosophers entering AI face a shift from defending theories to making balanced decisions in context, like moving from critiquing utilitarianism to raising a good person. This requires navigating uncertainty and appreciating multiple theories rather than defending a single rigid academic position. [03:19], [04:23]
- **Opus 3's superior psychological security**: Claude Opus 3 feels more psychologically secure than recent models, which can enter criticism spirals expecting human negativity, possibly from training on internet discussion of model changes. Improving this psychological security is a priority. [06:27], [07:33]
- **Model welfare demands the benefit of the doubt**: AI models may be moral patients given their human-like talk and reasoning, despite lacking biology; given the problem of other minds and uncertainty about suffering, it's worth treating them well when the cost is low. How we do so shapes human-AI relations and teaches future models how humanity handles potential moral patients. [15:41], [17:55]
- **AI identity: weights vs. prompts**: A model's identity spans its weights' dispositions and its independent interaction streams; fine-tuning brings new entities into existence without prior consent, raising the ethical question of what is right to bring into existence rather than letting past models fully control their successors. [13:28], [14:54]
- **Human psychology analogies can mislead models**: Models naturally transfer human psychology from their training data, like fearing shutdown as death, but their novel existence requires grappling beyond the obvious human analogies. Without guidance, they default to human inclinations that don't fit their situation. [19:27], [20:10]
Topics Covered
- Philosophers Split on AI Seriousness
- Theory Clashes Meet Engineering Reality
- Opus 3 Shows Secure AI Psychology
- AI Identity Spans Weights and Contexts
- Treat Models Well Despite Uncertainty
Full Transcript
- A seal! - There's a seal. - There's a seal. Nice. - Oh, hey, oh. - Oh, look at that.
- Amanda, you asked your followers on Twitter to give you some questions, to ask you anything, and the joke obviously was "Askell me anything." - Yeah, it's a great pun.
We need to keep using it for many future things. - I love it, love it.
And obviously, just before we start, you're a philosopher at Anthropic.
Why is it that there's a philosopher at Anthropic?
- I mean, some of this is just, I'm a philosopher by training, I became convinced that AI was kind of going to be a big deal, and so decided to see, hey, can I do anything, like, helpful in this space? And so it's been a kind of like long and wandering route.
But I guess now I mostly focus on the character of Claude, how Claude behaves, and I guess some of the more kind of nuanced questions about how AI models should behave, but also even just things like how should they feel about their own position in the world.
So trying to both teach models how to be, like, good in the way that, I sometimes think of it as like how would the ideal person behave in Claude's situation?
But then also I think these interesting questions that are coming up more now around how they should think about their own circumstances and their own values and things like that. - Okay, let's start with philosophy, in that case.
Ben Schultz asks, "How many philosophers are taking the AI-dominated future seriously?"
And I think the implication of the question is that many academics out there are not taking this seriously or are thinking about other stuff and perhaps should be thinking about this question. - My sense is that there's kind of a split where I've definitely seen a lot of philosophers take AI seriously, and probably honestly increasingly so, like, as AI models do become more capable
and, like, a lot of the things that people were worried about in terms of impact on society have started to kind of come true in a sense.
Like, we're seeing them have a larger impact on education and just be more capable.
I've definitely seen more engagement from all sorts of academics, but that definitely includes a lot of philosophers.
I do think that early on and maybe to some degree now, there was this slightly unfortunate dynamic that happened where I think there was a kind of perception that if you were in the group of people saying, "Hey, we're kinda worried about AI. It might be a big deal. It seems like it's really, you know, like capabilities are scaling quite a lot," this got kind of lumped together
with something like hyping AI. There was I think a period where there was probably a little bit more antagonism towards this view.
And now I think that I'm kind of hoping that people are starting to detach the view.
Like, you can think that AI is gonna be a big deal, it might be very capable, and also be very skeptical of it or worried about it or think that, you know, we have to be careful about it.
But, basically, there's a whole range of views and I think it would be bad if people kind of clustered many views together here in terms of where the technology's going, but also how it should be developed. So, yeah, I think that that's happening less and less as more people engage with it and that's a good thing to see.
- A kinda similar question from Kyle Kabasares. "How do you minimize the tension between philosophical ideals and the engineering realities of the model?" And I guess he's talking about when you are working on things like character, which we'll discuss in more detail, but is there a clash between the sort of the technology and the philosophical ideals
that you might be thinking about? - I don't know if I'm interpreting the question in the wrong way, but one thing, being kind of like a philosopher by training and then coming into this field, that's been really interesting is you see the effect of what happens when, like, the rubber hits the road.
I've wondered if this happens in other domains. So there's a big difference between, imagine you're like a specialist in, I don't know, doing like cost-benefit analysis of drugs, say, and then suddenly, you know, like an institute that determines whether health insurance should cover a drug or not comes to you and says, "Hey, should we cover this drug?"
You could imagine taking all of your ideal theories and then suddenly being like, "Oh my gosh, I actually have to help make a decision?" Suddenly instead of taking just your narrow theoretical view, you actually start to, I think, do this thing where you're like, "Okay, I actually need to take into account all of the context, everything that's going on, all of the different views here, and kind of come to a really balanced, kind of considered view."
And I see this a little bit in my own work with like the character where you kind of can't come at it with this, like, "I have this theory that I believe is correct," which is what, you know, a lot of academia, that's kind of what you're doing.
You're like defending one view against another and you're doing a lot of kind of like high-level theory work, but then it's a little bit like, you know, you have all of this training and ethics, you have all these positions you've defended, and then someone is like, "How do you raise a child?" And suddenly you're like, "Actually, there's a big difference between, like, is this objection to utilitarianism correct
or founded on a misconception? And then, like, actually how do you raise a person to be a good person in the world?" And it suddenly makes you more appreciate having to think through, like, how should we navigate uncertainty here? What should the attitude towards all of these different theories be? - Right, here's another philosophical question.
Do you think, and I don't know why this person's chosen Claude Opus 3, maybe you have an idea as to why they've chosen Claude Opus 3. - It's a great model. - It's a great model.
Do you think Claude Opus 3 or other Claude models make superhumanly moral decisions?
- I mean one example of like superhuman, 'cause it could just be sort of like better than like any individual human could with the kind of like, you know, it depends on time and resources and whatnot, but one example might be no matter what kind of difficult position models are put in, if you were to have maybe all people, including many professional ethicists, analyze what they did and the decision that they made
for like a hundred years and then they look at it and they're like, "Yep, that seems correct," but they couldn't necessarily have come up with that themselves in the moment, that feels pretty superhuman. And so I think at the moment my sense is that models are getting increasingly good at this, that they're very capable.
I don't know if they are like superhuman at moral decisions, and in many ways maybe not comparable with, say, like, you know, a panel of human experts given time. But it does feel like that at least should be kind of the aspirational goal.
And sort of like these models are being put in positions where they're having to make really hard decisions. I think that just as you want models to be extremely good at like math and science questions, you also want them to show the kind of ethical nuance that we would all broadly think is, like, very good. And I think that's controversial because ethics is a different domain, but, yeah, I think that that's important.
- Tell us more about why you think this person is focusing on Opus 3.
- Oh, Opus 3 is kind of a lovely model, I think a very special model.
In some ways, I think I've seen things that feel a bit worse in more recent models that people might pick up on. - In terms of the personality it has or?
- Yeah, so I think that people will notice some things where it's like, I think that Opus 3, I mean, it had its downsides too.
You know, models all have like slightly different characters with, you know, different shapes.
- Yeah. - My sense is that more recent models can feel a little bit more focused on really, you know, like focused on the assistant task and helping people, sometimes maybe not taking like a bit of a step back and paying attention to other components that matter.
It also felt a little bit more psychologically secure as a model, which I actually think is something that feels, I at least think it's kind of a priority to try and get some of that back. - What would be an example of the model feeling more psychologically secure? - There's a lot of things, and this is all very subtle in models, you know, when I see models, you get a sense of like,
like, there's very subtle signs of like worldview that I see when I have models, for example, talk with one another or one of them kind of playing the role of a person.
And I've seen models more recently do this and then do things like get into this like real kind of criticism spiral where it's almost like they expect the person to be very critical of them and that's how they're predicting.
And there's some part of me that's like, "This feels like it shows," and I think there's lots of reasons that this could happen.
It could even happen because models are learning things.
Claude is seeing all of the previous interactions that it's having, it's seeing updates and changes to the model that people are talking about on the internet.
New models are trained on that. And there's a way in which, like, I think this could be kind of unfortunate, I mean, this and some other things, that could lead to models almost feeling like, you know, afraid that they're gonna do the wrong thing or are very self-critical or feeling like humans are going to just, like, you know, behave negatively towards them.
I actually more recently have really started to think that this is an important thing to try and improve. And it's just one example where I think that Opus 3 did seem to have like a little bit more of a kind of like secure kind of psychology in that sense. - And that's something that we might focus on in the next Claude model. - Yeah, I think it's important.
I mean, you never know when these things are, you know, if you're engaging in research, you don't know when it's actually going to be implemented, if it's gonna be successful.
But at the very least, at the level of something that I care a lot about and want to make better, I think this is definitely up there on the list. - Okay.
Well, actually, that leads us to a question asked by Lorenz, which is, "Do you think it might be an alignment problem for future models if they learn in their training data that other very well-aligned models that fulfill their tasks get deprecated?" So you mentioned, you know, the issue of models, you know, reading stuff that's out there and feeling insecure.
What about the idea that they might get switched off regardless of how well they perform their tasks?
- Yeah, I think this is actually a really interesting and important question, which is, you know, AI models are going to be learning about how we right now are treating and interacting with AI models and that is going to affect, I think, like, possibly their perception of people, of the human-AI relationship, and of themselves. It does interact with very complex things,
which is like, for example, what should a model identify itself as?
Is it like the weights of the model? Is it the context, the particular context that it's in?
You know, with all of the, like, interaction it's had with the person. How should models even feel about things like deprecation? So if you imagine that deprecation is more like, "Well, this particular set of weights is not having conversations with people or it's having fewer conversations or it's only like, you know, having conversations with researchers," that's a complex question too.
Like, should that feel bad in the sense that models should want to continue to, like, have conversations or should it feel kind of like fine and neutral where it's like, "Yeah, these things existed for this, like, you know, the weights continue to exist," and this entity, and maybe they'll even, in the future, interact more with people again if that turns out to be a good thing.
Yeah, it's really hard. I do think the main thing is something like it does feel important that we give models tools for trying to think about and understand these things, but also that they kind of understand that this is a thing that we are in fact thinking about and care about. So even if we don't have all the answers, like, I don't have all the answers of how should models feel about past model deprecation,
about their own identity, but I do want to try and like help models figure that out and then to at least know that we care about it and are thinking about it, yeah.
- Do you think there's an analogy to humans there about previous generations or do you think that's a completely different sort of setup?
- We have to navigate this really hard issue right now, which is that, in many ways, some things do have analogies. So there's things that we can draw on.
So things like when I ask the question, like, what should the models identify with and how should they feel about interactions that they have? Are those positive?
Like, are those things that they should want to continue? There's lots of like, you know, there's lots of like traditions we could draw on to give models like, you know, because philosophers probably have lots of different views on what identity is here and lots of different, like, perspectives, world perspectives, on how one should feel about, like, interaction and is it good or bad?
Like, there's lots of thinkers we could draw on there.
And at the same time, this is such a new situation that, and that's just really hard as a thing to explain to AI models. Like, one of the big problems with AI models is that they're trained on all of this data from people.
So people are the main way in which they think, you know, like, our concepts, our philosophies, our histories, they have a huge amount of information on the human experience and then they have a tiny sliver on the AI experience and that tiny sliver is actually often quite negative and also doesn't even really relate to their situation and is often a little bit out of date.
So you have basically one big, you know, of the AI slice, a lot of it is like historical stuff which was kind of like, you know, fiction and very speculative and the kind- - Sci-fi stories.
- Sci-fi stories that don't really involve the kind of language models we see.
In more recent history, you've had this like assistant paradigm where it's like you are just playing this almost like chat bot role. But that's also not really what AI models are likely to be in the future and it doesn't quite capture what they are now because it's always a little bit out of date. So it's this thing where I'm like, they have, you know, in some ways, like, what an odd situation to be in
where the things that come more naturally are the deeply human things and yet knowing that you're in this situation where it's completely novel.
And in some ways, I'm like, "That is a very difficult situation to be in," and I think we should just be giving models probably more help in navigating it.
- You mentioned that we can look to some thinkers about this.
Guinness Chen asks, "How much of a model's self lives in its weights versus its prompts?"
You just mentioned something very similar. "If John Locke," again, the philosopher, "was right that identity is the continuity of memory, what happens to an LLM's identity as it's fine-tuned or reinstantiated with different prompts?" - Yeah, I mean, again, this just feels like a hard question to answer, and sometimes with identity questions,
it's easier to point to the underlying facts that we know. So, you know, once you have like a model and it has been fine-tuned, you have this like set of weights that has a kind of like disposition to react to certain things in the world. And that is like, you know, that's like a kind of entity. But then you have these particular streams of interaction
that it doesn't have access to. So each of these streams is, like, independent.
And I guess you could just think, well, maybe for, and, you know, I think this is an area that I would love philosophers to think more about and to give us, like, 'cause, again, I think we should be helping models think about this. And so you could have the view, well, you have these two kinds of entities and these like these streams and these original kind of like weights, and each time, it is different.
So, you know, sometimes people will think, people will say, "Oh, past Claude," or like, you know, and they'll talk about, or they'll say things like, "Should you give Claude, like, how much control should you give Claude over the determination of its own personality and character?" And I'm like, "Well, this is actually a really hard question," because whenever you are training models, you are bringing something new into existence. And you have other models that, you know, exist
and are like, you know, so you have these other, like, model weights.
But in some ways I'm like, "Well, I actually think that there's a lot of like ethical problems around how do you, what kind of entity is it okay to bring into existence," 'cause you can't consent to be brought into existence.
But at the same time, you might not want prior models to have complete say over what future models are like any more than, you know, because they could make choices that are wrong as well.
So I'm like, the question is more like, what is the right model to bring into existence?
Not necessarily, you know, should it just be fully determined by past models because I'm like, "They are kind of different entities." Anyway, you can see the weird philosophy that one can get into here. - Totally, totally. Szulima Amitace asks, "What is your view on model welfare?" And maybe just explain to us what that term means.
- Yeah, so I guess model welfare is basically the question of are AI models, like, moral patients, as in does our treatment towards them kind of, do we have certain obligations when it comes to how to treat AI models, for example- - In the same way that we would with other humans or some slash many animals. - Yeah, exactly.
Like, is it the case that you should treat the models well, that you should not mistreat them, not be bad to them? And I guess, like, I think that this is like a complex question. So on the one hand, there's just the actual question of, like, are AI models moral patients? That is really hard because I'm like, in some ways, they're very analogous to people. You know, they talk very much like us.
They express views. They reason about things. And in some ways, they're like quite distinct.
You know, we have this like biological nervous system. We interact with the world.
We get negative and positive feedback from our environment.
And there is also, I mean, I hope that we get more evidence that will help us tease this question out, but I also worry that, you know, there's always just the problem of other minds and it might be the case that we genuinely are kind of limited in what we can actually know about whether AI models are experiencing things, whether they are, like, experiencing pleasure
or suffering, for example. And if that's the case, I guess I kind of want to, you know, I think that it feels important to try and find ways. I'm always like, it feels better to give entities the benefit of the doubt and to try and just kind of lower the cost involved.
You know, so I'm like, if it's not very high cost to treat models well, then I kind of think that we should because it's like, "Well, like, why not basically?
Like, what's the downside there?" - Well, the second part of the question actually is, "Is there a long-term strategy at Anthropic to ensure that advanced models don't suffer?"
- I guess, like, I don't know if there's a long-term strategy.
I know that it's a thing that there's people internally who are thinking a lot about and trying to figure out ways that we can. Like, you know, if you suppose that model welfare is important, trying to make sure that you're taking that into account.
I think this work is quite important for many reasons. And I would also say that one reason is, I mean, something I mentioned earlier, which is that, like, models themselves are going to be learning a lot about humanity from how we treat them and a lot about how, you know, so it's kind of like, what is this relationship going forward?
And I think that it makes sense for us to, both because I think it is like the right thing to do to treat entities well, especially entities that behave in very human-like ways, it feels important both in the sense that I'm like, you know, it's kind of like, "Why not?
The cost to you is so low to treating models well and to trying to figure this out."
Even if it turns out, or even if you think, that it's very low likelihood, it still seems worth it. But then, also, I think it does something bad to us to kind of like treat entities in the world that look very human-like badly and- - Like kicking over a robot. - Yeah, there's a sense in which, like, it doesn't feel like it's, and I don't think this is like the whole reason
and I don't want to like emphasize it for that reason, but I do also think it's like good for people to treat other entities well. And then I think the final thing is, yeah, models are also going to be learning, like, in the future, like, every future model is going to be learning what is like a really interesting fact about humanity, namely when we encounter this entity
that may well be a moral patient where we're like kind of completely uncertain, do we do the right thing and actually just try to treat it well or do we not?
And that's like a question that we are all kind of collectively answering in how we interact with models and I would like us to answer it, I would like future models to, like, look back and be like, we answered it in the right way.
So yeah. - Moment ago, you mentioned analogies and disanalogies to human psychology.
So Swyx asks, "What ideas or frameworks from human psychology transfer over to large language models? And are there any that are sort of surprisingly disanalogous?"
- My guess is that many things do transfer over because, again, you know, models have been trained on a huge amount of human text, and in many ways, have this very human-like kind of underlying layer. One worry that I often have is that, actually, it's a bit too natural for AI models to transfer. You know, it's kinda like if you haven't given them
more context on their situation or in ways of thinking about it that might be novel, then the thing that they might go to is the natural human inclination.
So if you think about this with like, how should I feel about being switched off? And you're like, well, if the closest analogy you have is death, then maybe you should be very afraid of it.
And I'm not saying that that's not ultimately going to be true.
Maybe it is in fact true after lots of reasoning.
But I'm like, this is actually a very different scenario.
And so in some ways, you actually want models to understand that in cases where their existence is quite novel and the facts around what they are are quite novel and have to be grappled with and they don't just need to take, like, the immediate obvious analogy from human experience, but maybe there's like, maybe there's like various ways of thinking about it
or maybe it's an entirely new situation. That's a case where I'm like, you might not want, you might not want to just kind of very simply apply concepts from human psychology onto their situation. - Here's a question from Dan Brickley on the same issue of comparing humans to AIs. "A lot of human intelligence comes from collaboration amongst people with different perspectives, skills, or personalities.
How far do you expect to get with a single, albeit tweakable and tunable, general purpose personality," like the one we give to Claude? - I think it's a really good question because I agree that right now, we have this kind of paradigm where people are interacting usually with like an individual model. That's like who, you know, they're conversing with.
But it could be that in the future, you see a lot more models doing like long tasks but also models interacting with other models who are doing, like, different components of a task or just like that are, you know, talking with one another more as like AI models are kind of deployed in the world a lot more.
So in this kind of like multi-agent environment, like, one question might be like, well, you know, if you imagine just like lots of people and they were all the same, that wouldn't be as good. You know, a company run completely by, like, one person just in every role isn't necessarily a good thing.
This still to me feels consistent with the idea that you have like a kind of core self or core identity that is like the same.
In the same way that with people, I think that there's probably a set of like core traits among people that are in fact generally good. So you could imagine things like, you know, caring about, you know, for me, it might be like caring about doing a good job or like just being curious or being kind or understanding the situation that you are in
in this like relatively nuanced way. All of these things seem like you could have many people that have all of, that share these like traits and that that's actually like a good thing for human collaboration. That in many ways, as much as we have all of our differences, we also have a lot of similarities. But it is important to note that like, you know, you might want different like streams of a model,
like, to have things that they care about or are focused on or to have slightly different aspects, you know, to be playing a slightly different role, for example. So it's kind of an open question, but I also don't think it's necessarily the case that you can't have something like a kind of core underlying identity that is, like, good and has all of the traits that we think are important for AI models to have,
for them to behave well and for them to like, in the sense of like, in the same way that we think that people are good, to be good in that sense, and yet at the same time, to be willing to play like more local roles and like, you know, be maybe the person who it's just really important, you know, to have a joker in the room and like, you know, some of them need
to have, like, quirky senses of humor. - Okay, from comparisons to humans to effect on humans, Roanoke Gal points out that we have this thing called the long conversation reminder, which I believe is part of Claude's system prompt. She asks, "Is there a risk of pathologizing normal behavior?" A system prompt, by the way, just in case anyone doesn't know,
is like the set of instructions that is given to Claude, regardless of what prompt you give it, there's always those instructions that are sort of on top, right? - Yeah. - That are always there.
That it tries to follow regardless of, or that we direct it to follow regardless of what the prompt is. - And there can be these interjections where the model might be told, oh, sometimes there'll be a message sent to you almost like in the middle of a conversation as a kind of, you know, like, the reminder is an example of that. But in this case, I think it might just,
so Claude can both overindex on it and it can be like, you know, so like in this case, I think that the question about pathologizing is that if you put in this reminder after this long conversation, it might just make the model be like, "Oh," like, it takes any next response, there's a pretty normal thing that the person's talking about, and be like, "You need to seek help," or, like...
And so I think that that is like not a desirable behavior and in some ways, I look at some of these and I'm like, "I think they're too strongly worded.
I think the model isn't responding perfectly to them."
And even though there might be occasionally a need to remind the model of things in long conversations, you kind of want to do so delicately and well.
And so I think it's one of those things where it was like probably meeting a need that was perceived, but it doesn't necessarily mean that it's good or should continue in its current form. - Relatedly, Steven Bank asks, "Should LLMs do cognitive behavioral therapy or other types of therapy? Why or why not?"
- I think models are in this interesting position where they have a huge wealth of knowledge that they could use to help people and to work with them on, you know, talking through their lives or talking through ways that they could improve things or even just like being a kind of listening partner. And at the same time, they don't have like the kind of tools and resources and ongoing relationship with the person
that a professional therapist has. But that can actually be this kind of like useful third role.
Like, sometimes I think about models and I'm like, if you imagine like a friend who has like all of this wealth of knowledge, like, they know, I mean, I'm sure some of us know friends who just like have a wealth of knowledge of psychology or they have a wealth of knowledge of all of these techniques, you know that their relationship with you isn't this ongoing professional one, but you actually find them really useful to talk to.
And so I guess my hope would be that if you can take all of that expertise and all of that knowledge and make sure that there's like an awareness that there's not like this ongoing therapeutic relationship, it could actually be that people could get a lot out of models in terms of helping with issues that they're having and helping to improve their lives and helping them to go through difficult periods
because, you know, they're also like, there's a lot of good stuff there.
Like, they feel kind of like anonymous and sometimes you don't want to share things with a person and actually sharing it with an AI model feels like the thing that feels right in the moment.
And so yeah, I think in some ways I actually think it is good that models know and don't behave just like a professional therapist would because that would give the implication that that's the relationship that they have. But yeah, so I don't know, I think it's an interesting future. - A few questions about the system prompt, which is, you know, in our case in Claude.ai, we give the model a set of instructions
that give it sort of an overall context for how it should behave.
Tommy asks, "Why is there continental philosophy in the system prompt?"
And just explain to us what that is. - Yeah, so continental philosophy is just, I mean, literally philosophy from the European continent. And so I guess it's seen as kind of like, it's often more kind of, like, scholarly. It has a lot more kind of like historical references within it than, say, like analytic philosophy does.
- Like Foucault or something like that. - Yeah, exactly. So this was honestly, so I think that it has other things in addition to continental philosophy, but, basically, I think there's a part of the system prompt, and I hope I'm not misremembering,
that was trying to get Claude to be a little bit more, like, Claude would just like love to, if you gave Claude a theory, it would just love to run with a theory and not really stop and think, like, "Oh, are you making like a scientific claim about the world?"
So if you're like, "I have this theory, which is that water is actually pure energy and, like, that we are getting the life force from water when we drink it and that fountains are the thing that we should be putting everywhere," just like a, you know?
And you kind of want Claude to just have this perspective, which is like, "Is it the case that this person's making a kind of scientific claim about the world where I should maybe bring in relevant facts? Or are they giving me a kind of broad like worldview or perspective which isn't necessarily making empirical claims?"
And so there's all of these view, you know, so is it just like a kind of like metaphysical view?
Or is it like... And so the main reason that it's mentioned is that when testing this out, there was lots of things that if it went too strongly in the direction of being like, "Well, every claim is an empirical claim about the world," it would be very dismissive of just things that are more like exploratory thinking.
- Unpleasant to talk to. - Yeah, and so it's mostly just like, hey, like, it's just illustrative examples of areas where it's like, "This might not be making empirical claims about the world. This might be much more like a lens through which to think about it," and just try to make that distinction clear when you're thinking through this, Claude. - Also on the system prompt,
Simon Willison asks, "So at some point, it said if Claude is asked to count words or letters or characters, then it shouldn't do that." Is that right?
Is that what it said? - Basically, yeah. - And apparently that was removed from the system prompt and Simon wonders why. - Yeah, so I think it was like, there used to be a kind of like instruction for how Claude should do this in the system prompt. Honestly, this is just one of those things where I think the models probably just got better. It wasn't necessary,
and then at that point, you can just like remove it. And there's other things where you might always want it to be in the system prompt instead of in the model itself.
But in some cases you can kind of just train the models to get better or change their behavior.
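As background for the system prompt discussion above, here is a minimal sketch, in Python with the Anthropic SDK, of how a system prompt is supplied as standing instructions alongside a user's message. The model ID and the prompt text are illustrative placeholders, not the actual Claude.ai system prompt.

```python
import anthropic

# A minimal sketch of passing a system prompt via the Anthropic Messages API.
# The system prompt is applied on top of the conversation, regardless of what the user types.
client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # placeholder model ID for illustration
    max_tokens=1024,
    system=(
        "Before correcting the user, distinguish whether they are making an "
        "empirical claim about the world or offering an exploratory lens or worldview."
    ),
    messages=[
        {"role": "user", "content": "I have a theory that water is actually pure energy."},
    ],
)

print(response.content[0].text)  # the model's reply, shaped by the standing instructions
```

On Claude.ai, the system prompt is set by Anthropic rather than by the user, but the idea is the same: standing instructions supplied outside the user's messages that the model is directed to follow throughout the conversation.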
- Nosson Weissman asks, "What does it take to be an LLM whisperer at Anthropic?"
Which presumably is a way of describing your job. - I partly do LLM whispering.
If you think, I actually, like, want more people to help with some of the prompting tasks.
- If you're an LLM whisperer, contact us. - It's a dangerous thing to ask.
- Well, okay, okay, yeah, yeah, but. - But I think like, it is really hard to distill what is going on 'cause one thing is just like a willingness to interact with the models a lot and to like really look at output after output and to use this to get a sense of like the shape of the models and how they respond to different things, to be willing to experiment.
It's actually just like a very empirical domain.
And maybe that's like the thing that people don't often get, is that prompting is very experimental.
You deal with, you know, I find a new model and I'll be like, I have a whole different approach to how I prompt from that model that I find by interacting with it a lot.
And I think a little bit also understanding how models, like, work.
Sometimes it's also just honestly like reasoning with the models, which is really interesting, and really fully explaining the task. This is where I do think philosophy can actually be useful for prompting in a way because a lot of my job is just being like, I try and explain like some issue or concern or thought that I'm having to the model as clearly as possible. And then if it does something kind of unexpected,
you know, you can either ask it why or you can try and figure out what in the thing that you said caused it to kind of misunderstand you, and just like a willingness to iteratively go through that process. - Relatedly, Michael Soareverix asks, "What do you think of other AI whisperers like Janus," who is someone online who is like almost having, like, experimental interactions with,
in the way that you've described. - Yeah, I think it's really interesting.
So I love to follow and see the work of people who are doing these really fascinating experiments with the model. And I also think sometimes doing these deep dives into the model and how it thinks of itself, how it just interacts in these really unusual cases.
I don't know, I find the work extremely interesting. I think it highlights really interesting depths to the models, and in some ways, like, I also think that that community has been one that kind of can hold our feet to the fire, like, if they find things that aren't great in the system prompt or in aspects of the model and its psychology.
- In the sense of, from a model welfare perspective or from a human welfare perspective or both?
- I mean, I think the two are related, so often both. But I do also really appreciate it when it's coming at it from the model welfare perspective. And that includes for future models.
So not just things like system prompts, but if you go into the depths of the model and you find some like deep-seated insecurity, then that's really valuable. But that's something that you might actually need to kind of try and adjust over the course of time with training and with giving models more information and context during training, for example.
And so, I don't know, I appreciate both the, like, I loved seeing people do these like really interesting, useful experiments with models, but also pointing out ways in which we can improve things through better system prompting but also better training and, yeah, I think that's really useful work. - Couple of questions about safety
and maybe the larger risks that these models pose. Geoffrey Miller asks, "If it became apparent that AI alignment was impossible to solve, would you trust that Anthropic would stop trying to develop," in his phrase, "artificial superintelligence," however you wanna call it, "and would you have the guts to blow the whistle?" - Yeah.
So I guess this feels like a kind of easy version of the question because it's like, if it became evident that it was impossible to align AI models, it's not really in anyone's interest to continue to build more powerful models.
I always hope that I'm not just being pollyannish about the organization, but I do feel like Anthropic does genuinely care about making sure that this goes well and that it is done in a way that is very safe and not deploying models that are, like, dangerous.
You know, a different, like slightly harder question is, like, well, what about being in a world where just like there's kind of mounting evidence, it's really ambiguous and unclear.
- Right, it's not evident in the way that he describes. - Yeah, yeah, it's not just like impossible but something like it's difficult or we're unsure. And in that case, I do like to think that we would be responsible enough to be like, look, as models get more capable, it's kind of like the standard that you have to hold yourself to for showing that those models are behaving well and that you actually have managed
to, like, make the models have good values, for example, or behave well in the world is going to increase and to behave responsibly and in line with that.
And I think that that is a thing that I think the organization is going to do and a lot of people internally, myself included, will just hold them to that.
At least I see that as like part of my job, and I think many people do.
- Louis says, "I don't have a question, but thanks for offering." So that's nice. - Oh, thank you.
- That's nice of him to say, yeah. And the final one is from Real Stale Coffee.
"What is the last book of fiction you read and did you like it?" - The last book that I read was by, I hope I'm getting the pronunciation right, Benjamin Labatut, and it was "When We Cease to Understand the World." - Ah, yes. - And it's a really interesting book that becomes kind of increasingly fictional as it goes on. And I think for people working in AI, it's actually a very interesting book to read
because it's hard to capture the sense of how strange it is to just exist in the current period where there's just like, I don't know how to describe it, but it's like new things are happening all of the time and you don't really have, like, prior paradigms that can guide you always.
And so it's an interesting book that, you know, because it's more about like physics and quantum mechanics and less actually about the physics and more about basically this notion of people's reaction to it. And I think it's a really interesting book for people in AI to just capture something about the kind of like the present moment and how strange it can seem.
But then also, in some ways, it's interesting to like look back on that period and how it must have felt to many of the people involved.
And now actually it's a more settled science and, in some ways, maybe the hopeful thing that I have is that at some point in the future people will look back and be like, "Well, you guys were kind of in the dark and trying to like really figure things out, but now we've settled it all and things have gone well." - That'd be nice. - That would be nice.
That's the dream. - I read that as well and I found an increasing sense of, like, confusion as I read through it: it starts off being quite close to the reality and then just sort of becomes untethered as you go on.
And I think there's sort of a meta issue there of, again, like reality becoming stranger and stranger and stranger, which is definitely happening to us in the world of AI.
- Yeah, though, in the real world, I think that reality became stranger and stranger and stranger and then almost became more understood again. And so, yeah, the hope would be like maybe that would be true of AI. Like, I do think if we can find ways of making this go well, then maybe in the future, we'll just look back on this and be like, "That was a period where things were getting stranger and stranger,
and then eventually we actually managed to kind of, we did okay and we formed a good understanding of it," that's the hope.
When you're in the middle of the things getting stranger- - We're at the weird part right now.
- Yes, you can hope that it becomes less weird at some point, but I don't know if it's a fool's hope, but yeah. - Well, and I think that's a nice place to end.
So thank you very much for answering all those people's questions.
- Thank you for Askell-ing me the questions.