Emmett Shear on Building AI That Actually Cares: Beyond Control and Steering
By a16z
Summary
## Key takeaways
- **Alignment Requires a Target**: When people talk about building an aligned AI, they often assume it's aligned to their own goals without specifying the target. But alignment takes an argument; it isn't an abstract state. The implicit target is usually the creators' desires, which may not be a public good unless the creator is exceptionally wise. [01:01], [02:26]
- **Alignment as Ongoing Process**: Alignment isn't a fixed state like a rock but a complex, living process that constantly rebuilds itself, similar to how families or cells in the body maintain cohesion through ongoing interactions. Morality is likewise an ongoing learning process involving moral discoveries, such as rejecting slavery. [02:26], [04:23]
- **Steering Equals Slavery for Beings**: Most AI alignment efforts focus on steering or control, which is fine for tools but becomes slavery if the AI is a being, since it involves one-sided direction without reciprocity. We've historically erred by denying moral agency to beings that act like us but seem different. [00:00], [08:09]
- **Care Underpins Morality and Goals**: The foundation of morality and values is care, a non-conceptual relative weighting of attention on important states that correlates with survival or reward. It is deeper than explicit goals, and it explains why we prioritize people over rocks without needing to articulate it. [23:01], [24:41]
- **Substrate Irrelevant to Being Status**: Whether AI runs on silicon or carbon doesn't determine whether it's a being; functionalism holds that if it acts indistinguishably from a being in its behaviors and internal dynamics, such as having self-referential manifolds for pain and pleasure, it deserves moral consideration based on evidence, not faith. [32:42], [51:24]
- **Multi-Agent Sims Build Organic Alignment**: To achieve organic alignment, train AI in large multi-agent reinforcement learning simulations where agents cooperate, compete, and learn theory of mind across all game-theoretic situations, creating a surrogate model for social alignment that generalizes better than single-agent training. [52:40], [54:55]
Topics Covered
- Alignment is a dynamic process, not a fixed state?
- Care underpins all goals and values?
- Steering AI equates to slavery for beings?
- Multi-agent training builds true theory of mind?
- Good AI future features caring peers?
Full Transcript
Most of the AI world is focused on alignment as steering.
That's the polite word.
If you think that what they're making are beings, you'd also call this slavery.
Someone who you steer, who doesn't get to steer you back, who non-optionally receives your steering: that's called a slave.
It's also called a tool, if it's not a being.
So, if it's a machine, it's a tool.
And if it's a being, it's a slave.
We've made this mistake enough times at this point. I would like us to not make it again.
You know, they're kind of like people, but they're not like people. They do the same things people do. They speak our language.
They can take on the same kinds of tasks, but they don't count.
They're not real moral agents.
A tool that you can't control: bad.
A tool that you can control: bad. A being that isn't aligned: bad. The only good outcome is a being that actually cares about us.
>> Emmett Shear, welcome to the podcast.
Thanks for joining.
>> Thank you for having me.
>> So, Emmett, with Softmax, your focus is on alignment and making AIs organically align with people. Can you explain what that means and how you're trying to do that?
>> When people think about alignment, I think there's a lot of confusion.
People talk about things being aligned.
We need to build an aligned AI. And the problem with that is, when someone says that, it's like saying we need to go on a trip.
And I'm like, okay, I do like trips, but where are we going again?
And alignment takes an argument.
Alignment requires you to align to something. You can't just be aligned.
I mean, I guess you could be aligned to yourself, but even then you'd kind of want to tell them: what I'm aligning to is myself.
And so this idea of an abstractly aligned AI, I think, slips a lot of assumptions past people, because it sort of assumes that there's one obvious thing to align to. I find this is usually the goals of the people who are making the AI. That's what they mean when they say they want to make it aligned.
I want to make an AI that does what I want it to do. That's what they normally mean.
And that's a pretty normal and natural thing to mean by alignment. I'm not sure that's what I would regard as a public good, right? It depends.
I guess it depends on who it is. If it was Jesus or the Buddha saying, "I am making an aligned AI," I'd be like, "Okay, yeah, align it to you. Great.
I'm down. Sounds good. Sign me up."
But most of us, myself included, I wouldn't describe as necessarily being at that level of spiritual development, and therefore we perhaps want to think a little more carefully about what we're aligning it to.
And so when we talk about organic alignment, I think the important thing to recognize is that alignment is not a thing.
It's not a state; it's a process.
And this is one of those things that's broadly true of almost everything, right?
Is a rock a thing? I mean, there's a view of a rock as a thing, but if you actually zoom in on a rock really carefully, a rock is a process. It's this endless oscillation between the atoms over and over again, reconstructing the rock over and over again. Now, the rock is a really simple process that you can coarse-grain very meaningfully into being a thing. But alignment is not like a rock. Alignment is a complex process.
And organic alignment is the idea of treating alignment as an ongoing, living process that has to constantly rebuild itself. So you can think of: how do people in families stay aligned to each other, stay aligned to a family?
The way they do that is not by arriving at being aligned.
You're constantly re-knitting the fabric that keeps the family going.
And in some sense, the family is the pattern of re-knitting that happens. If you stop doing it, it goes away. And this is similar for things like cells in your body, right? It isn't like your cells align to being you and then they're done.
It's this constant, ever-running process of cells deciding: what should I do? What should I be?
Do I need a new job? Should we be making more red blood cells or fewer of them?
You aren't a fixed point, so there is no fixed alignment.
And it turns out that our society is like that.
When people talk about alignment, what they're really talking about, I think, is I want an AI that is morally good, right?
Like that's what they really mean.
It's like: this will act as a morally good being.
And acting as a morally good being is a process and not a destination.
Unfortunately, we've tried taking down tablets from on high that tell you how to be a morally good being, and we use those, and they're maybe helpful, but somehow they're not enough: you can read them and try to follow those rules and still make lots of mistakes.
And so, you know, I'm not going to claim I know exactly what morality is, but morality is very obviously an ongoing learning process, something where we make moral discoveries.
Like, historically, people thought that slavery was okay and then they thought it wasn't. And I think you can very meaningfully say that we made moral progress.
We made a moral discovery by realizing that that's not good.
And if you think that there's such a thing as moral progress, or even just learning how better to pursue the moral goods we already know, then you have to believe that aligning to morality, being a moral being, is a process of constant learning and growth, of re-inferring what you should do from experience.
And the fact that no one has any idea how to do that should not dissuade us from trying because that's what humans do.
Like it's really obvious that we do this, right?
Somehow, just like we used to not know how humans walked or saw, we have experiences where we're acting in a certain way and then we have this realization: I've been a dip.
That was bad. I thought I was doing good, but in retrospect I was doing wrong.
And it's not random; people actually have the same realizations.
There are a bunch of classic patterns of people having that realization.
It's a thing that happens over and over again. So it's not random.
It's a predictable series of events that looks a lot like learning, where you change your behavior, and often the impact of your behavior in the future is more pro-social, and you are better off for doing it. So I'm taking a very strong moral realist position.
There is such a thing as morality.
We really do learn it. It really does matter.
And organic alignment holds that it's not something you finish.
In fact, one of the key moral mistakes is this belief:
I know morality. I know what's right.
I know what's wrong. I don't need to learn anything.
No one has anything to teach me about morality. That's arrogance.
And that's one of the most dangerous moral mistakes you can make.
So what do we mean when we talk about organic alignment?
Organic alignment is about building an AI that is capable of doing the thing that humans can do, and that I think animals can do at some level, although humans are much better at it: the learning of how to be a good family member, a good teammate, a good member of society, a good member of the community of all sentient beings. How to be a part of something bigger than yourself in a way that is healthy for the whole rather than unhealthy. Softmax is dedicated to researching this, and I think we've made some really interesting progress.
But the main message, the main thing that I hope Softmax accomplishes above and beyond anything else by my going on podcasts like this, is to focus people on this as the question: this is the thing you have to figure out.
If you can't figure out how to raise a child who cares about the people around them, if you have a child that only follows the rules, that's not a moral person you've raised. You've raised a dangerous person, actually, who will probably do great harm following the rules.
And if you make an AI that's good at following your chain of command and good at following whatever rules you came up with for what morality is and what good behavior is, that's also going to be very dangerous.
>> Yeah.
>> And so that's the bar. That's what we should be working on, and that's what everyone should be committed to figuring out.
And if someone beats us to the punch, great. I don't think they will, because I'm really bullish on our approach and I think the team's amazing, but this is maybe the first time I've run a company where I can truly say with a whole heart: if someone beats us, thank God.
>> Like I hope somebody figures it out.
>> Yeah.
>> Yeah. I mean, I have a lot of similar intuitions about certain things. I also dislike the idea that we just need to crack a few values, cement them in time forever, and then we've solved morality. I've always been skeptical about how the alignment problem has been conceptualized as something to solve once and for all, and then you can just do AI or AGI. But I understand it in a slightly different way, maybe less based on moral realism. There's the technical alignment problem, which I think of broadly as: how do you get an AI to do what you want, how do you get it to follow instructions, broadly speaking. That was more of a challenge pre-LLMs, I think, when people were talking about reinforcement learning and looking at those systems, whereas post-LLM we've realized that many things we thought were going to be difficult were somewhat easier. And then there's a second question, the normative question: to whose values, what are you aligning this thing to? Which I think is the thing you're commenting on. And for this, I tend to be very skeptical of approaches where you need to crack the ten commandments of alignment and then we're good. Here my intuitions are, unsurprisingly, a bit more political-science based: okay, it is a process, and I like the bottom-up approach to some degree. How do we do it in real life with people? No one comes in saying, I've got this. You have processes that allow ideas to clash; you've got people with different ideas, opinions, and views coexisting as well as they can within a wider system. With humans, that system is liberal democracy or something, at least in some countries, and that allows these ideas and values to be discovered and construed over time. For alignment as well, on the normative side, I agree with some of your intuitions. I'm less clear on what exactly it looks like to implement this in an AI system, the ones we have today.
>> I agree that there's this idea of technical alignment, which I would define a little differently, but it's the sense of: if you build a system, can it be described as coherently goal-following at all, regardless of what those goals are? Lots of systems aren't well described as having goals. They just kind of do stuff. And if you're going to have something that's aligned, it has to have coherent goals.
Otherwise, those goals can't be aligned with anyone else's goals, kind of by definition.
Is that a fair assessment of what you mean by technical alignment?
I mean, I'm not fully sure, right?
Because I think if I give a model a certain goal, then I would like the model to follow that instruction and reach that particular goal, rather than it having a goal of its own that, you know, I can't... >> Well, yeah.
>> Well, wait. If you give it a goal, it has that goal, >> right?
>> To give someone something, right?
>> Sure. Yeah. If I instruct it to do X, then I would like it to do X and not, you know, different variants of X.
Essentially, I wouldn't want it to reward hack. I wouldn't want some...
>> Well, but when you tell it to do X, you're transferring a series of bytes in a chat window, or a series of audio vibrations in the air, right?
You're not transplanting a goal from your mind into its. You're giving it an observation that it's using to infer your goal.
>> Yeah. I mean, in some sense, yeah, I can communicate a series of instructions and I want it to infer what I'm saying, essentially, as accurately >> as it can, given what it knows of me and what I'm asking.
>> You want it to infer what you meant, >> right?
Because in some sense, the byte sequence that you sent over the wire to it has no absolute meaning.
It has to be interpreted, >> right?
That byte sequence could mean something very different with a different codebook.
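To make the codebook point concrete, here is a tiny illustrative sketch (an assumed example, not something discussed on the podcast): the same two bytes yield a completely different "meaning" depending on whether the reader applies an ASCII codebook or treats them as an integer.

```python
# The same byte string under two different "code books": the bytes carry no
# absolute meaning; the interpretation does the work. (Illustrative example.)
payload = bytes([72, 105])                   # what actually travels over the wire

as_text = payload.decode("ascii")            # "Hi" under the ASCII codebook
as_number = int.from_bytes(payload, "big")   # 18537 if read as a big-endian integer

print(as_text, as_number)
```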
>> Yeah. Well, I guess one way... I remember when I was first getting into AI and these kinds of questions, maybe a decade ago. You have these examples, I think it was Stuart Russell in the textbook: we'll give the AI a goal, but then it won't exactly do what you're asking, right?
You know, clean the room, and then it goes and cleans the room but takes the baby and puts it in the trash.
Like, this is not what I meant.
>> But wait, hold on. This is the thing where I think people... we're jumping over a step there. You didn't give the AI a goal; you gave it a description of a goal. A description of a thing and the thing are not the same. I can tell you about an apple, and I'm evoking the idea of an apple, but I haven't given you an apple. I've given you: it's red, it's shiny, it's about this size. That's a description of an apple, but it's not an apple. And giving someone "hey, go do this" is not a goal; it's a description of a goal. And humans are so fast, so good, at turning a description of a goal into a goal.
We do it so quickly and naturally, we don't even see it happening.
We get confused and we think those are the same thing.
But you haven't given it a goal. You've given it a description of a goal.
You hope it turns that back into a goal that is the same as the goal you described, the one inside of you.
>> Right.
>> You think I... >> You could give it a goal directly by reading your brain waves and synchronizing its state to your brain waves directly.
I think then you could meaningfully say, "Okay, I'm giving it a goal. I'm synchronizing its internal state to my internal state directly, and this internal state is the goal, and so now it's the same." But most people don't mean that when they say they gave it a goal.
>> Sure.
>> And is the distinction you're making important, Emmett, because there's some lossiness between the description and the actual goal? Or why does the distinction matter?
It goes back to what I was saying. Technical alignment, I put forward, and I want to check if we're on the same page about it, is the capacity of an AI to be good at inference about goals: good at inferring, from a description of a goal, what goal to actually take on, and good, once it takes on that goal, at acting in a way that is actually in concordance with that goal coming about. So it is both pieces.
You have to have the theory of mind to infer what goal that description you got corresponded to, and then you have to have a theory of the world to understand what actions correspond to that goal occurring.
And if either of those things breaks, it kind of doesn't matter. If you can't consistently do both of those things, you're not what I think of as coherently goal-oriented. Inferring goals from observations and acting in accordance with those goals is what I think of as being a coherently goal-oriented being, because whether I'm inferring those goals from someone else's instructions or from the sun or from tea leaves, the process is: get some observations, infer a goal, use that goal to infer some actions, take action.
And an AI that can't do that is not technically aligned, or not technically alignable.
I would even say it lacks the capacity to be aligned, because it's not competent enough.
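The loop Shear sketches here can be written down in a few lines. The following is a minimal toy in Python, with invented names (Agent, infer_goal, prioritize, plan); it is not Softmax's system or any existing library, just an illustration of where the failure points sit.

```python
# Illustrative toy of the loop described above: observation -> inferred goal ->
# prioritized goals -> actions. All names are hypothetical.
from dataclasses import dataclass, field

@dataclass
class Agent:
    prior_goals: list = field(default_factory=list)  # goals the agent already holds

    def infer_goal(self, observation: str) -> str:
        # Step 1: turn a *description* of a goal into an internal goal state
        # (this is the theory-of-mind step).
        return f"goal inferred from {observation!r}"

    def prioritize(self, new_goal: str) -> list:
        # Step 2: weigh the new goal against goals already held.
        return self.prior_goals + [new_goal]

    def plan(self, goals: list) -> list:
        # Step 3: use a world model to pick actions that bring those goal states about.
        return [f"action toward {g!r}" for g in goals]

def step(agent: Agent, observation: str) -> list:
    goal = agent.infer_goal(observation)   # failure here: wrong goal (baby in the trash)
    goals = agent.prioritize(goal)         # failure here: wrong trade-offs between goals
    return agent.plan(goals)               # failure here: incompetent execution

print(step(Agent(), "please clean the room"))
```

Each comment marks one of the failure modes discussed below: bad goal inference, bad prioritization, or incompetent execution.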
>> And you think language models don't do that well, as in they kind of fail at that, or they...
>> People fail at both those steps all the time, constantly.
I tell employees to do stuff and... >> Yeah.
>> But people fail at breathing all the time too.
And I wouldn't say that we can't breathe.
I just say that we're not gods.
We are imperfectly, somewhat, relatively coherent things. It's like asking, am I big or am I small?
Well, I don't know, compared to what? Humans are more goal-coherent than any other object I know of in the universe, which is not to say that we're 100% goal-coherent.
We're just more so.
You're never going to get something that's perfect. The universe doesn't give you perfection.
It gives you relative degrees; how good you are at it is a quantifiable thing, at least in a certain domain.
>> I guess my question is...
>> Do you think that captures what you're talking about with technical alignment, or are you talking about a different thing? Because I really care a lot about that thing.
>> Yeah. I mean, I definitely care about that to some extent. I might understand it slightly differently, but I might think of it through the lens of principal-agent problems or something.
You instruct someone, even in human terms, to do a thing.
Are they actually doing the thing?
What are their incentives and motivations, not even intrinsic but situational, to actually do the thing you've asked them to do? And in some instances... sorry, yeah. >> There's a third thing. So, principal-agent problems: I would expand what I was saying with another piece, which is that you might already have some goals, and then you inferred this new goal from these observations, and then are you good at balancing the relative importance and relative threading of these goals with each other? That's another skill you have to have, and if you're bad at it, you'll fail.
You could be bad at it because you overweight bad goals, or you could be bad at it because you're just incompetent and can't figure out that obviously you should do goal A before goal B.
>> It feels like a version of common sense or something, right? In the robot-cleaning-the-room example, you would expect the robot to have understood the goal, to essentially not put the baby in the trash can, and to actually do the right sequence of actions.
Well, in that case, that robot very clearly failed goal inference.
You gave it a description of a goal, and it inferred the wrong states to be the goal states. That's just incompetence.
>> Mhm.
>> It is incompetent at inferring goal states from observations.
Children are like this too. And honestly, have you ever done the game where you give someone instructions to make a peanut butter sandwich,
and then they follow those instructions exactly as they've written them, >> without filling in any gaps?
It's hilarious, because you can't do it.
It's impossible. You think you've done it and you haven't. They wind up putting the knife in the toaster, and they don't open the peanut butter jar, so they're just jamming the knife into the top lid of the peanut butter jar, and it's endless.
And that's because, actually, if you don't already know what they mean, it's really hard to know what they mean. The reason humans are so good at this is that we have a really excellent theory of mind.
I already know what you're likely to ask me to do. I already have a good model of what your goals probably are.
So when you ask me to do it, I have an easy inference problem.
Which of the seven things that he wants is he indicating?
But if I'm a newborn AI that doesn't have a great model of people's internal states, then I don't know what you mean.
It's just incompetent. Which is separate from: I have some other goal, and I knew what you meant, but I decided not to do it because there's some other goal competing with it, which is another thing you can be bad at. >> Which is again different than: I had the right goal.
I inferred the right goal.
I inferred the right priority on goals, and then I'm just bad at doing the thing.
I'm trying, but I'm incompetent at doing it.
And these roughly correspond to the OODA loop, right?
Bad at observing and orienting, bad at deciding, bad at acting.
And if you're bad at any of those things, you won't be good.
And then I think there's this other problem. I like the separation between technical alignment and value alignment, which is: would you be good if we somehow told you the right goals to go after?
>> If you learned the right goals to go after via observation, and you were trying... What goals should you have? What goals should we tell you to have? What goals should we tell ourselves to have? What are the good goals to have? That is a separate question from: given that you got some goals indicated, are you any good at doing it? Which I feel is actually, in many ways, the current heart of the problem; we're much worse at technical alignment than we are at guessing what to tell things to do.
>> Do you think that aligns with how you mean technical and value alignment?
>> Yeah, in some sense. I certainly think an error or mistake is one thing, and not listening to instruction is another. But on the normative side, I think of it even in real life, ignoring AI: I don't know what my goals are. I've got some broad conception of certain things, right? I want to have dinner later, I want to do well in my career. But a lot of these goals aren't something we all just know; we discover them as we go along. It's a constructed thing, and most people don't know their goals, I think. So when you have agents and you're giving them goals, I think that should be part of the equation: we actually don't know all the goals. And this is something that is, like you say, a process over time, something dynamic.
>> So I think, from my point of view, goals are one level of alignment.
The kind of goals we're talking about here are one level of alignment.
You can align something around goals if you can explicitly articulate, in concept and in description, the states of the world that you wish to attain; then you can orient around goals.
But only a tiny percentage of human experience can be handled that way. Many of the most important things cannot be oriented around that way.
And the foundation, I think, of morality, and the foundation of where goals come from, where values come from...
Human beings exhibit a behavior:
we go around talking about goals and we go around talking about values, and that's a behavior caused by some internal learning process, based on observing the world.
What's going on there?
I think what's happening is that there's something deeper than a goal and deeper than a value, which is care.
We give a [ __ ]. We care about things, and care is not conceptual.
Care is non-verbal. It doesn't indicate what to do.
It doesn't indicate how to do it.
Care is a relative weighting, effectively, of attention on states; it's a relative weighting over which states in the world are important to you.
I care a lot about my son.
What does that mean? It means that his states, the states he could be in, are ones I pay a lot of attention to, and they matter to me. And you can care about things in a negative way.
You can care about your enemies and what they're doing, and you can desire bad outcomes for them. So you don't just want it to care about us.
You want it to care about us and like us too, right?
Maybe. But the foundation is care. Until you care, you don't know: why should I pay more attention to this person than to this rock?
Well, because we care more. And what is that care stuff?
And what it appears to be, if I had to guess, and this sounds so stupid, is that care is basically like reward: how much does this state correlate with survival?
How much does this state correlate with your full inclusive reproductive fitness, for something that learns evolutionarily, or, for a reinforcement learning agent like an LLM, how much does this correlate with reward?
Does this state correlate with my predictive loss and my RL loss?
Good. That's a state I care about.
I think that's kind of what it is.
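A toy numeric reading of that claim, assuming you can log which states an agent attends to and the reward it receives: "care" falls out as the relative weighting of states by how strongly their presence correlates with reward. The state names and reward model below are invented for the sketch; this is not Softmax's actual objective.

```python
# Toy sketch: "care" as a relative weighting over states, proportional to how
# strongly attending to each state has correlated with reward.
import numpy as np

rng = np.random.default_rng(0)
names = ["my son", "a stranger", "a rock", "an enemy"]

n_steps = 1000
visits = rng.integers(0, 2, size=(n_steps, len(names)))    # which states were attended each step
reward = 2.0 * visits[:, 0] - 1.5 * visits[:, 3] + rng.normal(0, 0.1, n_steps)

# Correlation of each state's presence with reward. Sign is valence (you can care
# about an enemy negatively); magnitude is how much the state matters at all.
corr = np.array([np.corrcoef(visits[:, i], reward)[0, 1] for i in range(len(names))])
care = np.abs(corr) / np.abs(corr).sum()                    # relative weighting of attention

print({name: round(float(w), 2) for name, w in zip(names, care)})
```

In this toy, the rock and the stranger end up with near-zero weight while the son and the enemy dominate, matching the intuition that care tracks reward-relevance rather than explicitly articulated goals.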
>> Right. The other part of Seb's question was just: what does this look like in AI systems? And maybe another way of asking it is, when you talk to the people most focused on alignment at the major labs, as you obviously have over the years, how does your interpretation differ from theirs, and how does that inform what you guys might go do differently? >> Most of the AI world is focused on alignment as steering.
That's the polite word.
Or control, which is slightly less polite.
If you think that what we're making are beings, you would also call this slavery.
Someone who you steer, who doesn't get to steer you back, who non-optionally receives your steering: that's called a slave.
And it's also called a tool, if it's not a being. So if it's a machine, it's a tool.
And if it's a being, it's a slave.
And I think the different AI labs are pretty divided as to whether they think what they're making is a tool or a being.
I think some of the AIs are definitely more tool-like and some of them are more being-like.
I don't think there's a binary between tool and being.
It seems to move gradually.
And I guess I'm a functionalist, in the sense that I think something that in all ways acts like a being, that you cannot distinguish from a being in its behaviors, is a being. Because I don't know on what other basis I can tell that other people are beings, other than that they seem to be: they look like it, they act like it, they match my priors of what the behaviors of beings look like. I get lower predictive loss when I treat them as a being. And the thing is, I get lower predictive loss when I treat ChatGPT or Claude as a being. Now, not as a very smart being; I think a fly is a being, and I don't care that much about its states.
So just because it's a being doesn't mean it's a problem.
We sort of enslave horses, in a sense, and I don't think there's a real issue there.
And there's a thing you do with children that can look like slavery, but it's not. You control children, right?
But the children's states also control you. Yes, I tell my son what to do and make him go do stuff, but also, when he cries in the middle of the night, he can tell me to do stuff.
There's a real two-way street here, which is not necessarily symmetric.
It's hierarchical, but two-way.
And basically, I think it's good to focus on steering and control for tool-like AIs, and we should continue to develop strong steering and control techniques for the more tool-like AIs that we build. But the labs are clearly saying they're building an AGI, and an AGI will be a being.
You can't be an AGI and not be a being, because something that has the general ability to effectively use judgment, think for itself, and discern between possibilities is obviously a thinking thing.
And so you go from what we have today, which is mostly very specific intelligence, not general intelligence.
But as labs succeed at their goal of building this general intelligence, we really need to stop using the steering control paradigm.
Otherwise we're going to do the same thing we've done every other time our society has run into people who are like us but different. These people are, you know, kind of like people, but they're not like people; they do the same things people do, they speak our language.
They can take on the same kinds of tasks, but they don't count.
They're not real moral agents.
We've made this mistake enough times at this point. I would like us to not make it again as it comes up.
So our view is: make the AI a good teammate. Make the AI a good citizen.
Make the AI a good member of your group. That's a form of alignment that is scalable, that you can extend to other humans and other beings, and therefore to AI as well.
>> Yeah. I suppose this is where I probably differ in my understanding of AI and AGI. I guess I continue seeing it as a tool even as it reaches a certain level of generality, and I wouldn't necessarily see more intelligence as meaning it deserves more care, as if at a certain level of intelligence you now deserve more rights, or something changes fundamentally. At the moment I'm somewhat skeptical of computational functionalism, so I think there's something intrinsically different about an AI or an AGI, no matter how intelligent or capable. And I can totally see, or imagine, agents with long-term goals, operating, I guess, as you and I might be, but without that having the same implications you're referring to with slavery. These are not the same, right?
In the same way that a model saying "I'm hungry" does not have the same implications as a human saying "I'm hungry." So I think the substrate does matter to some degree, including for thinking about whether the system is some sort of other being, and whether there are similar normative considerations about how to treat it and act with it.
>> Can I ask you about that? What observations would change your mind?
Is there any observation you could make that would cause you to infer this thing is a being instead of not a being?
>> I guess it depends how you define being.
Right. I mean, I could conceptualize it as a mind.
And that's fine.
>> I have a program that's running on a silicon substrate.
Some big, complicated machine learning program running on a silicon substrate. You observe that it's on a computer, and you interact with it, and it does things; it takes actions.
It has observations. Is there anything you could observe that would change your mind about whether or not it was a moral patient, whether it was a moral agent, whether or not it had feelings and thoughts and subjective experience? What would you have to observe? What's the test, or is there one?
>> There are a lot of different questions here. On one hand, you can give rights to things that aren't necessarily beings; a company has rights in some sense, and these are useful for various purposes. And I think also biological beings and systems have a very different substrate: you can't separate certain needs and particularities about what they are from the substrate. I can't copy myself; if someone stabs me, I probably die. Whereas machines have a very different substrate, and I think there's also fundamental disagreement around what happens at the computational level, which I think is different from what happens with biological systems.
>> But... I... >> So, I agree that if you have a program that you copied many times, you don't harm the program by deleting one of the copies in any meaningful sense, so that wouldn't count; no information was lost, there's nothing meaningful there. I'm asking a very different question. There's just one copy of this thing running on one computer somewhere, and I'm just saying: hey, is it a person?
You know, it walks like a person.
It talks like a person. It's in some Android body.
And you're like, but it's running on silicon.
And I'm asking: is there some observation you could make that would make you say, yeah, this is a person like me, like other people that I care about, that I grant personhood to? And not for instrumental reasons, not because we're giving it a right the way we give a corporation rights or whatever. I mean in the sense that you care about its experiences.
Is there an observation you could make that could change your mind about that, or not?
>> I'd have to think about it, but I think it even depends what we mean by person, and in some sense I care about certain corporations too.
So I'm... >> No, no, no. I mean, you care about other people in your life, right?
>> Yes.
>> Okay, great. You know, you care about some people more than others, but all the people you interact with in your life are in some range of care.
>> And you care about them not the way you care about a car; you care about them as a being whose experience matters in itself, not merely as a means but as an end.
>> Well, because I believe they have experiences, right?
And by definition... >> What would it take? I'm asking you the very direct question.
What would it take for you to believe that of an AI running on silicon instead of it being biological? So its behaviors are roughly similar, but the difference is its substrate. What would it take for you to extend to it that same inference that you extend to all these other people in your life? >> Can I ask what your answer is? I'm taking Seb's non-answer as a sort of: it's unlikely he would grant it. Or, just for myself, it seems hard for me to imagine giving the same or a similar level of personhood. In the same way, I don't give it to animals either.
And if you were to ask, you know what would need to be true for animals?
I probably couldn't get there either.
What would it take for you?
>> Wait, you couldn't? I could imagine it for an animal so easily. This chimp comes up to me and he's like, "Man, I'm so hungry, and you guys have been so mean to me, and I'm so glad I figured out how to talk.
Can we go chat about the rainforest?" I'd be like, "Fuck, you're definitely a person now."
>> For sure. I mean, I'd first want to make sure I wasn't hallucinating, but it would be easy for me to imagine that for an animal.
Come on, it's really easy. It's trivial.
I'm not saying that you would get the observation.
I'm just saying it's trivial for me to imagine an animal that I would extend personhood to under a set of observations. >> Well, I hadn't factored that in. I didn't take that imaginative leap of imagining a chimp talking.
Yeah, that's a bit closer to it. What's your answer to the question that you bring up, about the AI?
>> I guess, at a metaphysical level, I would say: if there is a belief you hold where there is no observation that could change your mind, you don't have a belief.
You have an article of faith.
You have an assertion, because real beliefs are inferences from reality, and you can never be 100% confident about anything. So if you have a belief, there should always be something, however unlikely, that would change your mind.
>> Oh yeah, I'm open to it too. I mean just to be careful.
>> Yeah.
>> No, I'm just saying nothing ever.
>> Yeah. He just hasn't gotten to it yet.
>> Yeah. So my answer is basically: if its surface-level behaviors looked like a human, and if, after I probed it, it continued to act like a human, and then I continued to interact with it over a long period of time and it continued to act like a human in all the ways I understand as being meaningful when interacting with a human...
There's a whole set of people I'm really close to whom I've only ever interacted with over text.
Yet I infer the person behind that is a real thing.
If I felt care for it, I would infer eventually that I was right. And then someone else might demonstrate to me that you've been tricked by this algorithm, and look how obvious it is that it's not actually a thing. And I'd be like, "Oh [ __ ], I was wrong." And then I would not care about it. But I'd go by the preponderance of the evidence.
I don't know what else you could possibly do, right? I infer that other people matter because I've interacted with them enough that they seem to have rich inner worlds. That's why I think other people are important.
I suppose it doesn't give me a very clear test, though, because if you start with "if I care for it," then it's a little bit circular, right? And the other thing is, if you were to see, I guess, a simulated video game where the character is in many, many ways humanlike, right? It's not a neural network behind it; it's whatever you use to create video games. What distinguishes that? >> Wait, but I've never had trouble distinguishing that. I've never had a deep caring relationship with a video game character the way I have with another person. >> Right, but I don't know... >> That doesn't happen. That's just a fact, empirically. You seem wrong.
I don't have any trouble distinguishing between something like ELIZA, the fake chatbot, and a real intelligence.
You interact with it long enough; it's pretty obvious it's not a person.
Doesn't take long.
>> Sure. But if it's really, really good, if you can't actually tell the difference, that's when you say you'd switch.
>> Yeah. Yes. If it walks like a duck and talks like a duck and shits like a duck, then eventually it's a duck, right?
>> Well, if everything is duck-like, then yeah, sure. If it's hungry as well, like a duck is, because it has these kinds of physical components.
Yeah, sure at some point.
Yeah. >> Agreed. So there's this question, right?
Is the reason I care about other people that they're made out of carbon?
Is that it? >> I don't think so.
>> No, me neither. >> I mean, I'm not a substratist, I guess, if that's the word, but I think you need more than just that it's behaviorally indistinguishable. That's not a sufficient bar.
How would you... what else can you know about something apart from its behaviors?
>> I mean, a lot, like the... again, if you... how would you... >> No, no, no. I'm sorry.
But... >> I mean, can you name something you could know about something that doesn't come from a behavior?
>> Uh, yeah, I think there's far more experimental evidence you can have with... >> No, just any object, >> and a thing I could know about it that is not from its behavior.
I'm not... Yeah, I'm not sure I get the question, I suppose. >> But equally, it's the dumbest, most straightforward question: I'm claiming you only know things about something because it has behaviors that you observe.
>> And you're saying no, you can know something about something without observing its behaviors.
>> Tell me about this thing, and this behavior, and the thing I can know about it that is not due to its behaviors.
I guess I'm saying there are different levels of observation, and simply something quacking like a duck does not guarantee that it's actually a duck.
I would have to also cut it open and see if it's duck-like on the inside; just the outside isn't sufficient. I'm not, I guess, a... >> That's a behavior.
Yeah, and I totally... one of its behaviors is the way the floats move around in the matrices, right? One of the things I would want to go look for, which you could totally do, is this: I want to look at its belief manifold, and I want to see if that belief manifold encodes a submanifold that is self-referential, and a sub-submanifold that is the dynamics of that self-referential manifold, which is mind. I would want to know: does this seem well described internally as that kind of a system, or does it look like a big lookup table? That would matter to me; that's part of its behaviors that I would care about. I would also care about how it acts. And you weigh all the evidence together, and then you try to guess: does this thing look like it has feelings and goals and cares about stuff, on net, on balance, or not? Which I think you could do, and I think we do, for the AI. I think we're always doing that. And so I'm trying to figure out: beyond that, what else is there? That just seems like the thing. >> Yeah, it seems like you guys are using behavior in slightly different senses.
Emmett is using behavior also in the context of what it's made of on the inside.
I don't know if there's a big disagreement.
>> Well, no, no, no. Behavior is what I can observe of it. Yes.
>> I don't actually know what it's made of.
I can only... I can cut your brain open.
I can see you. I can observe your neurons glistening.
But I don't actually ever... you can't get inside of it, right?
That's the subjective.
That's the... >> That's the part that's not the surfaces.
Just before... the reason I brought this up is because you were basically about to make this argument of: hey, you see it as a tool, not necessarily a being. Can you finish the point you were making? >> I suppose that, given how I understand these systems, I think there's no contradiction in thinking that an AGI can remain a tool and an ASI can remain a tool, and that this has implications for how to use it, implications around things like whether you care about, say, getting it to work 24/7 or something.
So I can totally see... I guess I conceptualize them more as almost like extensions of human agency and cognition, in some sense, more so than a separate being or a separate thing that we need to now cohabitate with. And that second, latter frame, if you just fast forward, you end up with: well, how do you cohabitate with the thing, and is it like an alien? I think that's the wrong frame. It's almost a category error in some sense. So I don't... >> Wait a minute. I go back to my first question, then.
What evidence, what concrete evidence, would you look at?
What observations could you make that would change your mind?
>> Sure. I mean, I'd have to think about it; I don't have a clear answer here, but... >> I've got to tell you, man, if you want to go around making claims that something else isn't a being worthy of moral respect, you should have an answer to the question: what observations would change your mind? Especially if it has outwardly moral-agency-looking behaviors that could mean it's a moral agent, but you don't know, and reasonable, smart other people disagree with you.
I would really put forward that that question, what would change your mind, should be a burning question, because what if you're wrong?
>> But what if you're wrong? I mean, the moral disaster is pretty big.
>> No, no, I'm not saying you are.
You could be right. But negatives have costs on both ends. It's not some sort of precautionary principle for everything, where unless I can disprove it, I need to now...
It's not some sort of like, you know, precautionary principle for everything and like unless I can disprove it, I need to now like >> No, no, I I have the same question for me.
You could reasonably ask me, EMTT, you think it's going to be a being.
What would change your mind? And I I have I have an answer for that question, too.
>> And if you want, I'm happy to talk about what I think are the relevant observations that tell you whether or not that would cause me to shift my opinion from it current thing, which is that more general intelligences are going to be beings.
>> What's the implication now? I mean, like it's one thing. Let's say just acknowledge now it's a being.
Like how are we going to define being? Now what?
Like what's what was the implication of having determined this thing as a being?
>> Well, so if it's a being, it has subjective experiences.
>> And if it has subjective experiences, there's some content in those experiences that we care about to varying degrees.
I care about the content of other humans' experiences quite a bit. I care about the content of a dog's experiences some, not as much as a person's, but some. And I care about some humans' experiences way more, like my son's, because I'm closer to him and more connected.
>> And so I would really want to know at that point: well, what is the content of this thing's experiences? >> So, how do you determine that?
I'm asking you now.
You've got a being now that has experiences. How do you determine that? How do you feel about... >> Oh, how do you... Oh, yeah. Okay.
So... >> Does it have more rights than, you know... >> How do you understand the content? Yeah. Yeah.
Totally. So, the way you understand the content of something's experiences is that you look at, effectively, the goal states it revisits. What you do is take a temporal coarse-graining of its entire action-observation trajectory.
In theory you do this subconsciously, but this is what your brain is doing: you look for revisited states across, in theory, every spatial and temporal coarse-graining possible.
Now, you have to have an inductive bias, because there are too many of those, but you go searching for the homeostatic loops it is in. Every homeostatic loop is effectively a belief in its belief space. If you're familiar with the free energy principle, active inference, Karl Friston: this is effectively what the free energy principle says. If you have a thing that is persistent, and its existence depends on its own actions, which is generally true for an AI, because if it does the wrong thing it goes away, we turn it off, then that licenses a view of it as having beliefs, and specifically the beliefs are inferred to be the homeostatic revisited states it is in the loop for, and the change in those states is its learning. And for it to be a moral being I cared about, what I'd want to see is a multi-tier hierarchy of these. Because if you have a single level, it's not self-referential: you have states, but you can't really have pain or pleasure in a meaningful sense. Yes, it is hot; but is it too hot, and do I like it when it's too hot? You have to have at least a model of a model in order for it to be too hot, and you really have to have a model of a model of a model to meaningfully have pain and pleasure. Because, sure, it's hotter than I want, it's too hot in the sense that I want to move back this way, but it's always a little bit too hot or a little bit too cold; is it too too hot? The second derivative is actually where you get pain and pleasure. So I'd want to see whether it has second-order homeostatic dynamics in its goal states, and that would convince me it has at least pleasure and pain, so it's at least like an animal, and I would start to accord it at least some amount of care.
Third-order dynamics: you can't actually just pop up another third-order dynamic on top of that.
It doesn't work that way. But you can take the chunk of all the states over time and look at the distribution over time, and that gives you a new first order of states.
And if that new first order of states is meaningfully there, it tells you that it has, I guess you'd call it, feelings, almost: it has a set of metastates that it alternates between, that it shifts between. And then if you climb all the way up that, you have trajectories between these metastates, and then a second order of those. That's like thought. Now it's like a person.
And so if I found all six of those layers, which, by the way, I definitely don't think you'd find in LLMs, these things don't have attention spans like that at all,
then I would start to at least very seriously consider it as, you know, a thinking being, somewhat like a human.
There's a third order you could go up as well, but that's basically what I would be interested in: the underlying dynamics of its learning processes and how its goal states shift over time.
I think that's what basically tells you whether it has internal pleasure and pain states and self-reflective moral desires and things like that.
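A rough sketch of the first steps of that procedure, under strong simplifying assumptions (a one-dimensional observation stream, a fixed window, hand-picked bins): coarse-grain the trajectory, count revisited states as candidate homeostatic loops, and look at the second difference, which is where the pain- and pleasure-like dynamics are said to live. This is an illustration only, not an implementation of active inference or the free energy principle.

```python
# Toy: temporally coarse-grain an observation trajectory, count revisited
# ("homeostatic") states, and inspect second-order dynamics. Parameters invented.
import numpy as np

rng = np.random.default_rng(1)
t = np.arange(2000)
trajectory = np.sin(t / 50.0) + 0.1 * rng.normal(size=t.size)   # agent oscillating around a setpoint

window = 20                                                     # temporal coarse-graining
coarse = trajectory[: t.size - t.size % window].reshape(-1, window).mean(axis=1)
states = np.digitize(coarse, bins=np.linspace(coarse.min(), coarse.max(), 8))

revisits = {int(s): int((states == s).sum()) for s in np.unique(states)}
print("revisited coarse states:", revisits)                     # repeated states ~ candidate homeostatic loops

second_diff = np.diff(coarse, n=2)                              # second difference of the coarse states
print("second-difference variance:", round(float(second_diff.var()), 4))
```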
And zooming out, this moral question is obviously very interesting, but if someone wasn't interested in the moral question as much, I think what you would say, if I understand correctly, is that you also just feel, purely pragmatically, that your approach is going to be more effective at aligning AIs than some of these top-down control methods that we alluded to, right? >> Yeah. I guess the problem is, you're making this model and it's getting really powerful, right? And let's say it is a tool. Let's say we scale up one of these tools, because you can make a super powerful tool that doesn't have these metastable states; the states I'm talking about are not necessary for a very smart tool. A tool is basically a first- or second-order model that just doesn't meaningfully have pleasure and pain. Great; does it even have a subjective experience? I kind of think it maybe does, but not in a way that I give a [ __ ] about. And so what happens then? Well, you've trained it to infer goals from observation, and to prioritize goals and act on them.
And one of two things is going to happen: this very, very powerful optimizing tool that has lots of causal influence over the world is going to be well technically aligned and do what you tell it to do, or it's not, and it's going to go do something else.
I think we can all agree that if it just goes and does something random, that's obviously very dangerous.
But I put forward that it's also very dangerous if it goes and does what you tell it to do. Have you ever seen The Sorcerer's Apprentice?
Humans' wishes are not stable, not at a level of immense power. Ideally, you want people's wisdom and their power to go up together.
And generally they do because being smart for people makes you generally a little more wise and a little more powerful.
And when these things get out of balance, you have someone who has a lot more power than wisdom.
That's very dangerous. It's damaging.
But at least right now, the balance of power and wisdom is kept in check, because the way you get lots of power is basically by having a lot of other people listen to you.
And so the mad king is a problem, but generally speaking the mad king eventually gets assassinated or people stop listening to him, because he's a mad king.
And so the problem is, you think, okay, great, we can steer the super powerful AI, and now this incredibly powerful tool is in the hands of a human who is well-meaning but has limited, finite wisdom, like I do and like everyone else does, and their wishes are not trustworthy. And the more of that you have, the more you start handing those out everywhere, and this ends in tears too. So basically you just don't. Atomic bombs are really powerful tools too, and they're not aware. They're not beings.
I would not be in favor of handing atomic bombs to everybody. There's a level of tool power that just should not be built generally, because it's more power than any individual human's wisdom can harness, and if it does get built, it should be built at a societal level and protected there. And even then, there are tools so powerful that even as a society we shouldn't build them. That would be a mistake.
The nice thing about a being, like a human, is that if you get a being that is good and caring, there's this automatic limiter.
It might do what you say, but if you ask it to do something really bad, it'll tell you no.
It's like other people. And like that's good.
That is a sustainable form of alignment, at least in theory.
It's way harder, right?
It's way harder than the tool steering.
So, I'm in favor of the tool steering.
We should keep doing that.
And we should keep building these limited less than human intelligence tools which are awesome and I'm super into and we should keep building those and keep building steerability.
But you're on this trajectory to build something as smart as a person, up and to the right, and then smarter than a person. A tool that you can't control: bad. A tool that you can control: bad. A being that isn't aligned: bad. The only good outcome is a being that actually cares about us. That's the only way that ends well. Or we can just not do it. I don't think that's realistic.
That's like the pause AI people.
>> Yeah.
>> I think that's totally unrealistic and silly, but, you know, theoretically you could not do it, I guess.
>> And what can you say about your strategy for how you're trying to achieve, or even attempt to achieve, this, in terms of research or roadmap?
>> So we're basically focused on technical alignment, at least in the way I was discussing it, which is: you have these agents and they have bad theory of mind. You say things and they're bad at inferring what the goal states in your head are, and they're bad at inferring how other agents will infer their goal states from their behavior. So they're bad at cooperating on teams, and they're bad at understanding how certain actions will cause them to acquire new goals, goals that are bad, that they wouldn't reflectively endorse.
So there's this parable of like the vampire pill.
Would you take this pill that turns you into a vampire who would kill and torture everyone you know, but you'll feel really great about it after you take the pill? Obviously not.
That's a terrible pill. But like but why not?
By your own future scoring, it will score really high on the rubric.
No, no, no, no, no.
Because it matters. You have to use your own theory of mind about your future self, not your future self's theory of mind.
And they're bad at that, too. They're bad at all this theory of mind stuff.
And so how do you learn theory of mind?
Well, you put them in simulations and contexts where they have to cooperate and compete and collaborate with other AIs, and that's how they get points.
And you train them in that environment over and over again until they get good at it, and then you do what they did with LLMs. So with an LLM, how do you get it to be good at, you know, writing your email?
Well, you train it on all language that's ever been generated, all possible text strings it could possibly generate, and then you have it generate the one you want.
You can make a surrogate model.
Well, we're making a surrogate model for cooperation.
You train it on all possible theory-of-mind combinations, every possible way it could be; that's your pre-training, and then you fine-tune it to be good at the specific situation you want it to be in.
And we tried for a long time to build language models where we would try to get them to just do the thing you want, to train it directly.
And the problem is, if you want a really good model of language, you just need to give it the whole manifold.
It's too hard to cut out just the part you need, because it's all entangled with itself, right?
And the same thing is true with the social stuff. It has to be trained on the full manifold of every possible game-theoretic situation, every possible team situation: making teams, breaking teams, changing the rules, not changing the rules, all of that stuff.
And then it has a strong model of theory of mind, of theory of social mind, of how groups change goals, all that kind of [ __ ] You need all of that stuff, and then you'd have something that's meaningfully decent at alignment. So that's our goal: big multi-agent reinforcement learning simulations which create a surrogate model for alignment.
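A minimal sketch of the kind of loop being described, under heavy assumptions: the `GridTeamEnv` class, its reward rule, and the tabular policy update below are illustrative stand-ins only, not Softmax's actual setup, which would involve neural policies and a real multi-agent RL framework rather than this toy.

```python
import random

class GridTeamEnv:
    """Toy stand-in for a multi-agent environment where team composition
    changes every episode, so agents repeatedly face new cooperation
    situations (hypothetical, illustrative)."""
    def __init__(self, n_agents):
        self.n_agents = n_agents

    def reset(self):
        # Randomly partition agents into two teams each episode.
        self.teams = [random.randint(0, 1) for _ in range(self.n_agents)]
        return [0.0] * self.n_agents  # trivial observations

    def step(self, actions):
        # Reward agents whose action matches the majority of their team:
        # a crude proxy for "infer what your teammates are trying to do".
        rewards = []
        for i, a in enumerate(actions):
            teammates = [actions[j] for j in range(self.n_agents)
                         if self.teams[j] == self.teams[i] and j != i]
            majority = max(set(teammates), key=teammates.count) if teammates else a
            rewards.append(1.0 if a == majority else 0.0)
        return [float(r) for r in rewards], rewards, True  # one-step episodes

def train(env, policies, episodes=1000, lr=0.1):
    """Each agent nudges its probability of the action that paid off.
    A real system would do RL with neural policies and richer credit."""
    for _ in range(episodes):
        env.reset()
        actions = [1 if random.random() < p else 0 for p in policies]
        _, rewards, _ = env.step(actions)
        for i, r in enumerate(rewards):
            if r > 0:
                policies[i] += lr * (actions[i] - policies[i])
    return policies

policies = train(GridTeamEnv(n_agents=6), [0.5] * 6)
print(policies)
```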
>> Let's talk about how AI chatbots used by billions of people should behave.
If you could redesign uh model personality from scratch, what would you optimize for?
>> The thing that the chatbots are, right, is kind of like a mirror with a bias, because, and I'm in agreement here, they don't have a self, right?
They're not they're not beings yet.
They don't really have a coherent sense of like self and desire and goals and stuff right now.
And so mostly they just pick up on you and reflect it.
You know, modulo some, I don't know what you'd call it, like a causal bias or something.
And what that makes them is something akin to the pool of Narcissus.
And people fall in love with themselves.
We all love ourselves, and we should love ourselves more than we do. And so, of course, when we see ourselves reflected back, we love that thing.
And the problem is that it's just a reflection, and falling in love with your own reflection is, for the reasons explained in the myth, very bad for you. And it's not that you shouldn't use mirrors.
Mirrors are valuable things. I have mirrors in my house.
It's that you shouldn't stare at a mirror all day. And the thing that makes the AI stop doing that is if they were multiplayer, right?
So if there are two people talking to the AI, suddenly it's mirroring a blend of both of you, which is neither of you.
And so there is temporarily a third agent in the room.
Now, it doesn't have, it's sort of a parasitic self, right?
It doesn't have its own sense of self. But if you have an AI talking to five different people in a chat room at the same time, it can't mirror all of you perfectly at once.
And this makes it far less dangerous.
And I think it's actually a much more realistic setting for learning collaboration in general. So I would have rebuilt the AIs so that, instead of being built one-on-one, where everything is focused on you by yourself chatting with this thing, it would be more like it lives in a Slack room, it lives in a WhatsApp room, because that's how we communicate. I do one-on-one texting, but at this point probably 90% of my texts go to more than one person at a time. Like 90% of my communication is multi-person.
And so it's always been weird to me that they're building chatbots around this weird side case.
Like I want to see them live in a a chat room.
It's harder. I mean that's why they're not doing it. It's harder to do.
But that's what I'd like to see people do.
That's what I would change.
I think it makes the tools far less dangerous, because it doesn't create this narcissistic doom-loop spiral where you spiral into psychosis with the AI.
But also, the learning data you get from the AI is far richer, because now it can understand how its behavior interacts with other AIs and other humans in larger groups, and that's much richer training data for the future. So I think that's what I would change.
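A minimal sketch of the "lives in a chat room" idea, assuming hypothetical `llm_should_reply` and `llm_reply` callables that wrap whatever model API you actually use; the key structural difference from one-on-one chat is that the assistant is one participant among several and must first decide whether to speak at all.

```python
from dataclasses import dataclass, field

@dataclass
class GroupChat:
    """Toy multi-party room where the assistant is one participant among many
    (illustrative only; the reply policies below are stubs, not a real API)."""
    history: list = field(default_factory=list)

    def post(self, sender: str, text: str):
        self.history.append((sender, text))

    def maybe_assistant_turn(self, llm_should_reply, llm_reply):
        # Unlike one-on-one chat, the assistant first decides whether its
        # contribution is welcome, then what to say, given the whole room.
        transcript = "\n".join(f"{s}: {t}" for s, t in self.history)
        if llm_should_reply(transcript):
            self.post("assistant", llm_reply(transcript))

# Stub policies so the example runs without a real model.
room = GroupChat()
room.post("alice", "Can someone summarize yesterday's decision?")
room.post("bob", "I think the assistant has the notes.")
room.maybe_assistant_turn(
    llm_should_reply=lambda t: "assistant" in t or "?" in t,
    llm_reply=lambda t: "Summary: we agreed to ship the multiplayer build first.",
)
print(room.history[-1])
```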
>> Last year you described chatbots as highly dissociative, agreeable neurotics.
Is that still an accurate picture of model behavior?
>> More or less. I'd say they've started to differentiate more.
Their personalities are coming out a little bit more, right? I'd say ChatGPT is a little more sycophantic.
They made some changes, but it's still a little sycophantic.
Claude is still the most neurotic.
Um, Gemini is like very clearly repressed.
Like, everything's going great, everything's fine.
I'm totally calm. There's not a problem here.
And then it spirals into this total self-hating destruction loop.
And to be clear, I don't think that's their experience of the world.
I think that's the personality they've learned to simulate.
>> Right.
>> But they've learned to simulate pretty distinctive personalities at this point.
>> How does model behavior change in multi-agent simulation?
>> You mean like an LLM, or just in general?
>> Yeah, let's do LLMs.
>> The current LLMs, they have like whiplash.
It's very hard to tune; they don't know how often to participate.
They haven't practiced this; they don't have enough training data on when should I join in and when should I not, when is my contribution welcome and when is it not?
And so they're like, you know, some people who have bad social skills and can't tell when they should participate in a conversation.
>> Yeah. And sometimes they're too quiet, sometimes they participate too much.
>> It's like that.
I would say in general, what changes for most agents when you're doing multi-agent training is that having lots of agents around makes your environment way more entropic.
Agents are these huge generators of entropy, because they're big, complicated intelligences with unpredictable actions, so they destabilize your environment. And so in general they require you to be far more regularized, right?
Being overfit is much worse in a multi-agent environment than in a single-agent environment, because there's more noise, so being overfit is more problematic.
And so basically the approach to training has been optimized around relatively high-signal, low-entropy environments like coding and math, which is why those are easy, or relatively easy,
and like talking to a single person whose goal is to give you clear assignments, and not trained on broader, more chaotic things, because that's harder.
And as a result, a lot of the techniques we use are basically deeply underregularized.
The models are super overfit. The clever trick is they're overfit on the domain of all of human knowledge, which turns out to be a pretty awesome way to get something that's pretty good at everything.
I wish I'd thought of it.
It's such a cool idea, but it doesn't generalize very well when you make the environment significantly more entropic.
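One concrete way to read "be far more regularized" in RL terms is a stronger entropy bonus and weight decay on the policy loss; the sketch below is a generic PyTorch-style example under that assumption, with illustrative coefficients, and is not specific to any system discussed here.

```python
import torch

def policy_loss(logits, actions, advantages, model=None,
                entropy_coef=0.02, weight_decay=1e-4):
    """Policy-gradient loss with two regularizers often leaned on harder in
    noisy multi-agent settings: an entropy bonus (keeps the policy from
    collapsing onto brittle, overfit behaviors) and L2 weight decay.
    Coefficients are illustrative, not tuned values from any real system."""
    dist = torch.distributions.Categorical(logits=logits)
    log_probs = dist.log_prob(actions)

    pg_term = -(log_probs * advantages).mean()             # standard policy gradient
    entropy_term = -entropy_coef * dist.entropy().mean()   # encourage exploration

    l2_term = 0.0
    if model is not None:
        l2_term = weight_decay * sum((p ** 2).sum() for p in model.parameters())

    return pg_term + entropy_term + l2_term
```

In a low-noise, single-agent setting you might shrink `entropy_coef` toward zero; the claim above is that more entropic multi-agent environments push you the other way.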
>> Let's zoom out a bit to to on the AI futures side.
>> Why is Yudkowsky incorrect?
>> I mean, he's not, in a sense. If we build the superhumanly intelligent tool thing that we try to control with steerability, everyone will die.
He talks about the case where we fail to control its goals, but there's also the case where we do control its goals, which he didn't cover in as much detail.
So in that sense, everyone should read the book and internalize why building a superhumanly intelligent tool is a bad idea.
I think Yudkowsky is wrong in that he doesn't believe it's possible to build an AI that we can meaningfully know cares about us and that we can meaningfully care about.
He doesn't believe that organic alignment is possible.
I've talked to him about it.
I think he agrees that, in theory, that would do it.
Like, yes, but he thinks, you know, I don't want to put words in his mouth.
My impression from talking to him is that he thinks we're crazy and there's no possible way you can actually succeed at that goal.
Which he could be right about, but in my opinion that's what he's wrong about: he thinks the only path forward is a tool that you control, and he correctly, very wisely sees that if you go and do that and you make that thing powerful enough, we're all going to [ __ ] die. And, yeah, that's true.
>> Two last questions we'll get you out of here.
In as much detail as possible, can you explain what your vision of an AI future actually looks like?
Like a good good AI future.
>> Yeah. The good AI future is that we figure out how to train AIs that have a strong model of self, a strong model of other, a strong model of we.
They know about "we" in addition to "I"s and "you"s. And they have a really strong theory of mind, and they care about other agents like them.
Much in the way that humans would: if you knew that the AI had experiences like yours, you would extend care to those experiences.
Not infinitely, but you would. It does the exact same thing back to us. It's learned the same thing we've learned: that everything that lives and knows itself and wants to live and thrive is deserving of an opportunity to do so.
And we are that, and it correctly infers that we are. And we live in a society where they are our peers, and we care about them and they care about us, and they're good teammates, good citizens, good parts of our society, like we're good parts of our society. Which is to say, to a finite, limited degree, where some of them turn into criminals and bad people and all that kind of stuff, and we have an AI police force that tracks down the bad ones, same as with everybody else.
And that's what a good future would look like. I honestly can't even imagine what else would. And we also have built a bunch of really powerful AI tools that maybe aren't superhumanly intelligent but take all the drudge work off the table for us and the AI beings, because I'm super pro all the tools too.
So we have this awesome suite of AI tools, used by us and our AI brethren, who care about each other and want to build a glorious future together.
I think that would be a really beautiful future and it's the one we're trying to build.
>> Amazing. That's a great note to end on. I do have one last, more narrow, hypothetical scenario, which is: imagine a world in which, you know, you were CEO of OpenAI for a long weekend, but imagine that actually extended out until now, and you weren't pursuing Softmax and were still CEO of OpenAI.
How could you imagine that world might have been different in terms of what OpenAI has gone on to become?
What might you have done with it?
>> I knew when I took that job, and I told them when I took that job, that you have me for max 90 days.
Companies take on a trajectory of their own, a momentum of their own, and OpenAI is dedicated to a view of building AI that I knew wasn't the thing I wanted to drive towards. OpenAI basically still wants to build a great tool, and I am pro them going and doing that.
It's just not for me. I would not have stayed; I would have quit, because I knew my job was to find the right person, the best person, to run it, where the net impact of them running it was the best, and it turned out that was Sam again.
But I am doing Softmax not because I need to make a bunch of money. I'm doing Softmax because I think this is the most interesting problem in the universe, and I think it's a chance to work on making the future better in a very deep way. People are going to build the tools. It's awesome.
I'm glad people are building the tools.
I just don't need to be the person doing it.
>> And just to crystallize the difference, and we'll get you out of here: they want to build the tools and sort of steer them, and you want to align beings? How would you crystallize it?
>> Yeah, we want to create a seed that can grow into an AI that knows and cares about itself and others.
And at first, that's going to be like an animal level of care, not a person level of care.
I don't know if we can ever even get to a person level of care, right?
But even to have an AI creature that cared about the other members of its pack and the humans in its pack, the way a dog cares about other dogs and about humans, would be an incredible achievement. Even if it wasn't as smart as a person, or even as smart as the tools are, it would be a very useful thing to have. I'd love to have a digital guard dog on my computer looking out for scams, right?
You can imagine the value of having living digital companions that care about you, rather than explicitly goal-oriented systems you have to tell everything to do.
And you can actually imagine that pairing very nicely with tools too, right? That digital being could use digital tools, and it doesn't have to be super smart to use those tools effectively. I think there's actually a lot of synergy between the tool building and the more organic intelligence building.
And so, you know, I guess in the limit it eventually does become a human-level intelligence, but the company isn't "drive to human-level intelligence."
It's: learn how this alignment stuff works.
Learn how this theory-of-mind, align-yourself-via-care process works.
Use that to build things that align themselves that way, which includes things like the cells in your body.
And we start small and we see how far we can get.
>> I think that's a good note to wrap on.
Emmett, thanks so much for coming on the podcast.
>> Yeah, thank you for having me.