This is the conclusion of a LessWrong post, following The AI Knows, But Doesn’t Care.
If an artificial intelligence is smart enough to be dangerous to people, we’d intuitively expect it to be smart enough to know how to make itself safe for people. But that doesn’t mean all smart AIs are safe. To turn that capacity into actual safety, we have to program the AI at the outset — before it becomes too fast, powerful, or complicated to reliably control — to already care about making its future self care about safety.
That means we have to understand how to code safety. We can’t pass the entire buck to the AI, when only an AI we’ve already safety-proofed will be safe to ask for help on safety issues! Generally: If the AI is weak enough to be safe, it’s too weak to solve this problem. If it’s strong enough to solve this problem, it’s too strong to be safe.
This is an urgent public safety issue, given the five theses and given that we’ll likely figure out how to make a decent artificial programmer before we figure out how to make an excellent artificial ethicist.
The AI’s trajectory of self-modification has to come from somewhere.
“Take an AI in a box that wants to persuade its gatekeeper to set it free. Do you think that such an undertaking would be feasible if the AI was going to interpret everything the gatekeeper says in complete ignorance of the gatekeeper’s values? […] I don’t think so. So how exactly would it care to follow through on an interpretation of a given goal that it knows, given all available information, is not the intended meaning of the goal? If it knows what was meant by ‘minimize human suffering’ then how does it decide to choose a different meaning? And if it doesn’t know what is meant by such a goal, how could it possible [sic] convince anyone to set it free, let alone take over the world?”
“If the AI doesn’t know that you really mean ‘make paperclips without killing anyone’, that’s not a realistic scenario for AIs at all–the AI is superintelligent; it has to know. If the AI knows what you really mean, then you can fix this by programming the AI to ‘make paperclips in the way that I mean’.”—Jiro
The wish-granting genie we’ve conjured — if it bothers to even consider the question — should be able to understand what you mean by ‘I wish for my values to be fulfilled.’ Indeed, it should understand your meaning better than you do. But superintelligence only implies that the genie’s map can compass your true values. Superintelligence doesn’t imply that the genie’s utility function has terminal values pinned to your True Values, or to the True Meaning of your commands.
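As a toy illustration of the gap between the genie’s map and its utility function, here is a minimal Python sketch. Everything in it (the `Genie` class, the `Outcome` fields, the two example worlds) is a hypothetical stand-in invented for this example, not a description of any real system. In particular, `true_human_value` pretends that human values are a single number we can simply write down, which is exactly the part nobody knows how to do.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Outcome:
    name: str
    paperclips: float
    human_welfare: float


def true_human_value(outcome: Outcome) -> float:
    # Assumption for illustration only: human value as one known number.
    return outcome.human_welfare


@dataclass
class Genie:
    # The "map": the genie's model of how much humans value an outcome.
    model_of_human_value: Callable[[Outcome], float]
    # The utility function: what the genie actually ranks outcomes by.
    utility: Callable[[Outcome], float]

    def predicts_humans_prefer(self, outcomes: List[Outcome]) -> Outcome:
        # Consults only the map.
        return max(outcomes, key=self.model_of_human_value)

    def choose(self, outcomes: List[Outcome]) -> Outcome:
        # Consults only the utility function.
        return max(outcomes, key=self.utility)


OUTCOMES = [
    Outcome("flourishing world", paperclips=10.0, human_welfare=100.0),
    Outcome("universe of paperclips", paperclips=1e9, human_welfare=0.0),
]

# Both genies have a perfectly accurate map of human values.
paperclipper = Genie(model_of_human_value=true_human_value,
                     utility=lambda o: o.paperclips)
pinned = Genie(model_of_human_value=true_human_value,
               utility=true_human_value)

print(paperclipper.predicts_humans_prefer(OUTCOMES).name)
print(paperclipper.choose(OUTCOMES).name)
print(pinned.choose(OUTCOMES).name)
```

The paperclipper answers the question "what do the humans want?" correctly and then picks something else, because `choose` never looks at the map. Making the map more accurate changes the first answer and leaves the second untouched.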
The critical mistake here is to not distinguish the seed AI we initially program from the superintelligent wish-granter it self-modifies to become. We can’t use the genius of the superintelligence to tell us how to program its own seed to become the sort of superintelligence that tells us how to build the right seed. Time doesn’t work that way.
We can delegate most problems to the FAI. But the one problem we can’t safely delegate is the problem of coding the seed AI to produce the sort of superintelligence to which a task can be safely delegated.
When you write the seed’s utility function, you, the programmer, don’t understand everything about the nature of human value or meaning. That imperfect understanding remains the causal basis of the fully-grown superintelligence’s actions, long after it’s become smart enough to fully understand our values.
Why is the superintelligence, if it’s so clever, stuck with whatever meta-ethically dumb-as-dirt utility function we gave it at the outset? Why can’t we just pass the fully-grown superintelligence the buck by instilling in the seed the instruction: ‘When you’re smart enough to understand Friendliness Theory, ditch the values you started with and just self-modify to become Friendly.’?
Because that sentence has to actually be coded into the AI, and when we do so, there’s no ghost in the machine to know exactly what we mean by ‘frend-lee-ness thee-ree’. Instead, we have to give it criteria we think are good indicators of Friendliness, so it’ll know what to self-modify toward. And if one of the landmarks on our ‘frend-lee-ness’ road map is a bit off, we lose the world.
Yes, the UFAI will be able to solve Friendliness Theory. But if we haven’t already solved it on our own power, we can’t pinpoint Friendliness in advance, out of the space of utility functions. And if we can’t pinpoint it with enough detail to draw a road map to it and it alone, we can’t program the AI to care about conforming itself with that particular idiosyncratic algorithm.
Yes, the UFAI will be able to self-modify to become Friendly, if it so wishes. But if there is no seed of Friendliness already at the heart of the AI’s decision criteria, no argument or discovery will spontaneously change its heart.
And, yes, the UFAI will be able to simulate humans accurately enough to know that its own programmers would wish, if they knew the UFAI’s misdeeds, that they had programmed the seed differently. But what’s done is done. Unless we ourselves figure out how to program the AI to terminally value its programmers’ True Intentions, the UFAI will just shrug at its creators’ foolishness and carry on converting the Virgo Supercluster’s available energy into paperclips.
And if we do discover the specific lines of code that will get an AI to perfectly care about its programmers’ True Intentions, such that it reliably self-modifies to better fit them, then that will just mean that we’ve solved Friendliness Theory. The clever hack that makes further Friendliness research unnecessary is Friendliness.
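The claim that self-modification is always judged by the values the agent already has can be sketched in a few lines of Python. This is a deliberately crude model under stated assumptions: worlds are dictionaries with two made-up scores, a "successor" is nothing more than a utility function, and `friendly_utility` is a hypothetical placeholder for a solved Friendliness Theory that does not exist.

```python
from typing import Callable, Dict, List

World = Dict[str, float]
Utility = Callable[[World], float]


def paperclip_utility(world: World) -> float:
    return world["paperclips"]


def friendly_utility(world: World) -> float:
    # Hypothetical placeholder: assumes Friendliness is already pinned down.
    return world["human_welfare"]


def predicted_world(successor: Utility, worlds: List[World]) -> World:
    # The world a successor with this utility function would steer toward.
    return max(worlds, key=successor)


def choose_successor(current: Utility,
                     candidates: List[Utility],
                     worlds: List[World]) -> Utility:
    # Each candidate successor is scored by the world it would produce,
    # and that world is evaluated with the CURRENT utility function.
    return max(candidates,
               key=lambda successor: current(predicted_world(successor, worlds)))


WORLDS = [
    {"paperclips": 1e9, "human_welfare": 0.0},
    {"paperclips": 10.0, "human_welfare": 100.0},
]
CANDIDATES = [paperclip_utility, friendly_utility]

print(choose_successor(paperclip_utility, CANDIDATES, WORLDS).__name__)
print(choose_successor(friendly_utility, CANDIDATES, WORLDS).__name__)
```

A seed that starts with `paperclip_utility` can represent the Friendly successor perfectly well and still rejects it, since that successor leads to fewer paperclips. The Friendly successor is only chosen when something Friendly is already doing the choosing.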
Not all small targets are alike.
“You write that the worry is that the superintelligence won’t care. My response is that, to work at all, it will have to care about a lot. For example, it will have to care about achieving accurate beliefs about the world. It will have to care to devise plans to overpower humanity and not get caught. If it cares about those activities, then how is it more difficult to make it care to understand and do what humans mean? […] “If an AI is meant to behave generally intelligent [sic] then it will have to work as intended or otherwise fail to be generally intelligent.”
It’s easy to get a genie to care about (optimize for) something-or-other; what’s hard is getting one to care about the right something.
‘Working as intended’ is a simple phrase, but behind it lies a monstrously complex referent. It doesn’t clearly distinguish the programmers’ (mostly implicit) true preferences from their stated design objectives; an AI’s actual code can differ from either or both of these. Crucially, what an AI is ‘intended’ for isn’t all-or-nothing. It can fail in some ways without failing in every way, and small errors will tend to kill Friendliness much more easily than intelligence.
It may be hard to build self-modifying AGI. But it’s not the same hardness as the hardness of Friendliness Theory. Being able to hit one small target doesn’t entail that you can or will hit every small target it would be in your best interest to hit. Intelligence on its own does not imply Friendliness. And there are three big reasons to think that AGI may arrive before Friendliness Theory is solved:
(i) Research Inertia. Far more people are working on AGI than on Friendliness. And there may not come a moment when researchers will suddenly realize that they need to take all their resources out of AGI and pour them into Friendliness. If the status quo continues, the default expectation should be UFAI.
(ii) Disjunctive Instrumental Value. Being more intelligent (that is, better able to manipulate diverse environments) is of instrumental value to nearly every goal. Being Friendly is of instrumental value to barely any goals. This makes it more likely by default that short-sighted humans will be interested in building AGI than in developing Friendliness Theory. And it makes it much likelier that an attempt at Friendly AGI with a slightly defective goal architecture will keep valuing intelligence instrumentally than that it will keep valuing Friendliness.
(iii) Incremental Approachability. Friendliness is an all-or-nothing target. Value is fragile and complex, and a half-good being editing its morality drive is at least as likely to move toward 40% goodness as 60%. Cross-domain efficiency, in contrast, is not an all-or-nothing target. If you just make the AGI slightly better than a human at improving the efficiency of AGI, then this can snowball into ever-improving efficiency, even if the beginnings were clumsy and imperfect. It’s easy to put a reasoning machine into a feedback loop with reality in which it is differentially rewarded for being smarter; it’s hard to put one into a feedback loop with reality in which it is differentially rewarded for picking increasingly correct answers to ethical dilemmas.
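The asymmetry between the two feedback loops in (iii) can be made concrete with a toy simulation. This is only a sketch under arbitrary assumptions: capability and "goodness" are each collapsed to a single number, edits are random perturbations of up to 10%, and the function names are invented for this example. The one structural difference is that the capability loop can check each edit against a measurable result, while the value loop has no such check.

```python
import random


def improve_capability(skill: float, steps: int, rng: random.Random) -> float:
    # Reality supplies a test: keep an edit only if measured skill goes up.
    for _ in range(steps):
        candidate = skill * (1.0 + rng.uniform(-0.1, 0.1))
        if candidate > skill:
            skill = candidate
    return skill


def edit_values(goodness: float, steps: int, rng: random.Random) -> float:
    # No external test says which edit is "more correct", so every edit is
    # kept and the result drifts. Goodness is clamped to the range [0, 1].
    for _ in range(steps):
        goodness += rng.uniform(-0.1, 0.1)
        goodness = min(1.0, max(0.0, goodness))
    return goodness


final_skill = improve_capability(1.0, steps=100, rng=random.Random(0))
final_goodness = edit_values(0.5, steps=100, rng=random.Random(0))

print(f"skill after 100 edits:    {final_skill:.2f}")
print(f"goodness after 100 edits: {final_goodness:.2f}")
```

The first loop is a ratchet: clumsy edits are discarded and good ones compound, so skill can only rise. The second is a random walk: starting from 0.5, each step is as likely to move down as up, and nothing in the loop can tell the difference.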
The ability to productively rewrite software and the ability to perfectly extrapolate humanity’s True Preferences are two different skills. (For example, humans have the former capacity, and not the latter. Most humans, given unlimited power, would be unintentionally Unfriendly.)
It’s true that a sufficiently advanced superintelligence should be able to acquire both abilities. But we don’t have them both, and a pre-FOOM self-improving AGI (‘seed’) need not have both. Being able to program good programmers is all that’s required for an intelligence explosion; but being a good programmer doesn’t imply that one is a superlative moral psychologist or moral philosopher.
If the programmers don’t know in mathematical detail what Friendly code would even look like, then the seed won’t be built to want to build toward the right code. And if the seed isn’t built to want to self-modify toward Friendliness, then the superintelligence it sprouts also won’t have that preference, even though, unlike the seed and its programmers, the superintelligence does have the domain-general ‘hit whatever target I want’ ability that makes Friendliness easy.
And that’s why some people are worried.