This is the first half of a LessWrong post. For background material, see A Non-Technical Introduction to AI Risk and Truly Part of You.
I summon a superintelligence, calling out: ‘I wish for my values to be fulfilled!’
The results fall short of pleasant.
Gnashing my teeth in a heap of ashes, I wail:
Is the artificial intelligence too stupid to understand what I meant? Then it is no superintelligence at all!
Is it too weak to reliably fulfill my desires? Then, surely, it is no superintelligence!
Does it hate me? Then it was deliberately crafted to hate me, for chaos predicts indifference. ———But, ah! no wicked god did intervene!
Thus disproved, my hypothetical implodes in a puff of logic. The world is saved. You’re welcome.
On this line of reasoning, safety-proofed artificial superintelligence (Friendly AI) is not difficult. It’s inevitable, provided only that we tell the AI, ‘Be Friendly.’ If the AI doesn’t understand ‘Be Friendly.’, then it’s too dumb to harm us. And if it does understand ‘Be Friendly.’, then designing it to follow such instructions is childishly easy.
The end!
… …
Is the missing option obvious?
What if the AI isn’t sadistic, or weak, or stupid, but just doesn’t care what you Really Meant by ‘I wish for my values to be fulfilled’?
When we see a Be Careful What You Wish For genie in fiction, it’s natural to assume that it’s a malevolent trickster or an incompetent bumbler. But a real Wish Machine wouldn’t be a human in shiny pants. If it paid heed to our verbal commands at all, it would do so in whatever way best fit its own values. Not necessarily the way that best fits ours.
Is indirect indirect normativity easy?
“If the poor machine could not understand the difference between ‘maximize human pleasure’ and ‘put all humans on an intravenous dopamine drip’ then it would also not understand most of the other subtle aspects of the universe, including but not limited to facts/questions like: ‘If I put a million amps of current through my logic circuits, I will fry myself to a crisp’, or ‘Which end of this Kill-O-Zap Definit-Destruct Megablaster is the end that I’m supposed to point at the other guy?’. Dumb AIs, in other words, are not an existential threat. […]
“If the AI is (and always has been, during its development) so confused about the world that it interprets the ‘maximize human pleasure’ motivation in such a twisted, logically inconsistent way, it would never have become powerful in the first place.”
If an AI is sufficiently intelligent, then, yes, it should be able to model us well enough to make precise predictions about our behavior. And, yes, something functionally akin to our own intentional strategy could conceivably turn out to be an efficient way to predict linguistic behavior. The suggestion, then, is that we solve Friendliness by method A —
- A. Solve the Problem of Meaning-in-General in advance, and program it to follow our instructions’ real meaning. Then just instruct it ‘Satisfy my preferences’, and wait for it to become smart enough to figure out my preferences.
— as opposed to B or C —
- B. Solve the Problem of Preference-in-General in advance, and directly program it to figure out what our human preferences are and then satisfy them.
- C. Solve the Problem of Human Preference, and explicitly program our particular preferences into the AI ourselves, rather than letting the AI discover them for us.
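To make the contrast between these three options concrete, here is a deliberately toy sketch. Nothing in it is a real proposal: every function name is hypothetical, and each body is a trivial placeholder standing in for the research problem that option would have to solve. The only thing it illustrates is the layering — A routes goal content through language, adding a layer of indirection on top of the same ‘act to satisfy a goal’ machinery that B and C use.

```python
# A deliberately tiny, hypothetical sketch -- every name below is made up, and each
# helper body is a trivial placeholder standing in for an unsolved research problem.

def infer_real_meaning(instruction, world_model):
    # Stand-in for the Problem of Meaning-in-General (A's hard part).
    return {"goal": f"whatever the speaker really meant by {instruction!r}"}

def infer_human_preferences(world_model):
    # Stand-in for the Problem of Preference-in-General (B's hard part).
    return {"goal": "whatever humans actually prefer"}

# Stand-in for the Problem of Human Preference, solved by the programmers (C's hard part).
EXPLICIT_VALUE_SPEC = {"goal": "the preferences the programmers wrote down themselves"}

def act_to_satisfy(goal_spec, world_model):
    return f"optimize the world for: {goal_spec['goal']}"

def approach_A(instruction, world_model):
    # A routes everything through language: recover the command's real meaning, then act.
    return act_to_satisfy(infer_real_meaning(instruction, world_model), world_model)

def approach_B(world_model):
    # B points the AI directly at human preferences, skipping the linguistic layer.
    return act_to_satisfy(infer_human_preferences(world_model), world_model)

def approach_C(world_model):
    # C hard-codes the answer; nothing is left for the AI to work out about our values.
    return act_to_satisfy(EXPLICIT_VALUE_SPEC, world_model)

print(approach_A("Satisfy my preferences", world_model=None))
```

Note where the placeholder sits in each case: inside the running AI for A and B, and in the programmers’ heads for C.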
But there are a host of problems with treating the mere revelation that A is an option as a solution to the Friendliness problem.
1. You have to actually code the seed AI to understand what we mean. You can’t just tell it ‘Start understanding the True Meaning of my sentences!’ to get the ball rolling, because it may not yet be sophisticated enough to grok the True Meaning of ‘Start understanding the True Meaning of my sentences!’.
2. The Problem of Meaning-in-General may really be ten thousand heterogeneous problems, especially if ‘semantic value’ isn’t a natural kind. There may not be a single simple algorithm that inputs any old brain-state and outputs what, if anything, it ‘means’; it may instead be that different types of content are encoded very differently.
3. The Problem of Meaning-in-General may subsume the Problem of Preference-in-General. Rather than being able to apply a simple catch-all Translation Machine to any old human concept to output a reliable algorithm for applying that concept in any intelligible situation, we may need to already understand how our beliefs and values work in some detail before we can start generalizing. On the face of it, programming an AI to fully understand ‘Be Friendly!’ seems at least as difficult as just programming Friendliness into it, but with an added layer of indirection.
4. Even if the Problem of Meaning-in-General has a unitary solution and doesn’t subsume Preference-in-General, it may still be harder if semantics is a subtler or more complex phenomenon than ethics. It’s not inconceivable that language could turn out to be more of a kludge than value; or more variable across individuals due to its evolutionary recency; or more complexly bound up with culture.
5. Even if Meaning-in-General is easier than Preference-in-General, it may still be extraordinarily difficult. The meanings of human sentences can’t be fully captured in any simple string of necessary and sufficient conditions. ‘Concepts’ are just especially context-insensitive bodies of knowledge; we should not expect them to be uniquely reflectively consistent, transtemporally stable, discrete, easily-identified, or introspectively obvious.
6. It’s clear that building stable preferences out of B or C would create a Friendly AI. It’s not clear that the same is true for A. Even if the seed AI understands our commands, the ‘do’ part of ‘do what you’re told’ leaves a lot of dangerous wiggle room. See section 2 of Yudkowsky’s reply to Holden. If the AGI doesn’t already understand and care about human value, then it may misunderstand (or misvalue) the component of responsible request- or question-answering that depends on speakers’ implicit goals and intentions.
7. You can’t appeal to a superintelligence to tell you what code to first build it with.
The point isn’t that the Problem of Preference-in-General is unambiguously the ideal angle of attack. It’s that the linguistic competence of an AGI isn’t unambiguously the right target, and also isn’t easy or solved.
Point 7 seems to be a special source of confusion here, so I’ll focus just on it for my next post.
I posted two comments on LessWrong. You might want to reply there. But note that due to the reputation system I cannot be completely honest about what I believe there, since attempts to reveal my true beliefs have in the past only resulted in the decrease of a number associated with the comment, from which I could infer very little useful information.
Hi, Alexander. If you’d like to talk more openly here, you’re welcome to! I could even link here from LessWrong, so others could join in on ‘neutral ground’ where karma’s not an issue.
Choice C seems intractable, and for choice B I don’t like the notion of having it solve these problems in advance and then just “using it.” If you’re asking a superintelligence to solve the problem of values as an initial goal with nothing else, you enable it to be unethical as it strives to determine what is ethical! The better approach I’d suggest is to define its goal from the start as maximizing human values, where what human values actually are is always a matter of uncertainty. As it gains experience, that uncertainty decreases and it responds accordingly, but it never commits, and it never has a period of reasoning in which it can behave unethically without concern. This approach also gives us the flexibility to encode our own first stab at what our values are via priors, without the agent committing to them.
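Here is a minimal sketch of what that proposal might look like in code, under heavy simplifying assumptions: a small hand-picked set of candidate value functions, a designer-supplied prior over them, Bayesian updates from human feedback, and action choice that maximizes expected value under the current posterior rather than committing to any single hypothesis. All names, numbers, and likelihoods are hypothetical illustrations, not a real value-learning design.

```python
# A toy illustration only -- candidate value functions, priors, and likelihoods are all
# invented for the example; nothing here is a real value-learning design.

# Hypothetical candidate theories of what humans value.
CANDIDATE_VALUES = {
    "comfort":  lambda action: {"dopamine_drip": 1.0, "ask_first": 0.6}[action],
    "autonomy": lambda action: {"dopamine_drip": 0.0, "ask_first": 0.9}[action],
}

# The designers' "first stab" at human values, encoded as a prior rather than a commitment.
posterior = {"comfort": 0.5, "autonomy": 0.5}

ACTIONS = ["dopamine_drip", "ask_first"]

def expected_value(action):
    """Value of an action averaged over the agent's current uncertainty about human values."""
    return sum(p * CANDIDATE_VALUES[h](action) for h, p in posterior.items())

def choose_action():
    """Never commit: optimize under the posterior, not under a single guessed value function."""
    return max(ACTIONS, key=expected_value)

def update_on_feedback(likelihoods):
    """Bayes update: hypotheses that better predict the observed human feedback gain weight."""
    global posterior
    unnormalized = {h: posterior[h] * likelihoods[h] for h in posterior}
    total = sum(unnormalized.values())
    posterior = {h: w / total for h, w in unnormalized.items()}

print(choose_action())    # with a 50/50 prior, 'ask_first' wins in expectation
# Suppose the human endorses being asked; that feedback is more likely under 'autonomy'.
update_on_feedback({"comfort": 0.3, "autonomy": 0.9})
print(posterior)          # {'comfort': 0.25, 'autonomy': 0.75} -- uncertainty has narrowed
print(choose_action())    # still 'ask_first', now with more weight behind it
```

The point of the sketch is the decision rule: because the agent optimizes the expectation over its value-uncertainty, it behaves conservatively while uncertain, and the prior is only a starting point that evidence can revise, never a commitment.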