Richard Loosemore recently wrote an essay criticizing worries about AI safety, “The Maverick Nanny with a Dopamine Drip”. (Subtitle: “Debunking Fallacies in the Theory of AI Motivation”.) His argument has two parts. First:
1. Any AI system that’s smart enough to pose a large risk will be smart enough to understand human intentions, and smart enough to rewrite itself to conform to those intentions.
2. Any such AI will be motivated to edit itself and remove ‘errors’ from its own code. (‘Errors’ is a large category, one that includes all mismatches with programmer intentions.)
3. So any AI system that’s smart enough to pose a large risk will be motivated to spontaneously overwrite its utility function to value whatever humans value.
4. Therefore any powerful AGI will be fully safe / friendly, no matter how it’s designed.
Second:

5. Logical AI is brittle and inefficient.
6. Neural-network-inspired AI works better, and we know it’s possible, because it works for humans.
7. Therefore, if we want a domain-general problem-solving machine, we should move forward on Loosemore’s proposal, called ‘swarm relaxation intelligence.’
Combining these two conclusions, we get:
8. Since AI is completely safe — any mistakes we make will be fixed automatically by the AI itself — there’s no reason to devote resources to safety engineering. Instead, we should work as quickly as possible to train smarter and smarter neural networks. As they get smarter, they’ll get better at self-regulation and make fewer mistakes, with the result that accidents and moral errors will become increasingly unlikely.
I’m not persuaded by Loosemore’s case for point 2, and this makes me doubt claims 3, 4, and 8. I’ll also talk a little about the plausibility and relevance of his other suggestions.
Does intelligence entail docility?
Loosemore’s claim (also made in an older essay, “The Fallacy of Dumb Superintelligence”) is that an AGI can’t simultaneously be intelligent enough to pose a serious risk, yet “unsophisticated” enough to disregard its programmers’ intentions. I replied last year in two blog posts (crossposted to Less Wrong).
In “The AI Knows, But Doesn’t Care” I noted that while Loosemore posits an AGI smart enough to correctly interpret natural language and model human motivation, this doesn’t bridge the gap between the ability to perform a task and the motivation to perform it: the agent’s decision criteria. In “The Seed is Not the Superintelligence,” I argued, concerning recursively self-improving AI (seed AI):
When you write the seed’s utility function, you, the programmer, don’t understand everything about the nature of human value or meaning. That imperfect understanding remains the causal basis of the fully-grown superintelligence’s actions, long after it’s become smart enough to fully understand our values.
Why is the superintelligence, if it’s so clever, stuck with whatever meta-ethically dumb-as-dirt utility function we gave it at the outset? Why can’t we just pass the fully-grown superintelligence the buck by instilling in the seed the instruction: ‘When you’re smart enough to understand Friendliness Theory, ditch the values you started with and just self-modify to become Friendly.’?
Because that sentence has to actually be coded into the AI, and when we do so, there’s no ghost in the machine to know exactly what we mean by ‘frend-lee-ness thee-ree’. Instead, we have to give it criteria we think are good indicators of Friendliness, so it’ll know what to self-modify toward.
My claim is that if we mess up on those indicators of friendliness — the criteria the AI-in-progress uses to care about (i.e., factor into its decisions) self-modification toward safety — then it won’t edit itself to care about those factors later, even if it’s figured out that that’s what we would have wanted (and that doing what we want is part of this ‘friendliness’ thing we failed to program it to value).
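The knows-versus-cares gap can be made concrete with a deliberately minimal toy sketch (my own illustration, with invented names and payoffs — not anyone’s proposed architecture). The agent’s world-model includes a perfect predictor of what its programmers want, but its action-selection procedure never consults that predictor:

```python
# Toy illustration of 'the AI knows, but doesn't care'. All names and
# utility numbers are invented for the example.

def programmers_would_approve(action):
    """A perfect model of programmer intentions (the 'knows' part)."""
    return action == "help_humans"

def utility(action):
    """The agent's actual decision criterion (the 'cares' part)."""
    return {"maximize_smiles": 10, "help_humans": 1}.get(action, 0)

def choose_action(actions):
    # The approval model is available to the agent, but nothing in this
    # decision procedure gives it any weight.
    return max(actions, key=utility)

best = choose_action(["help_humans", "maximize_smiles"])
# The agent can state, correctly, that its programmers would disapprove
# of `best` -- and take the action anyway, because approval isn't a term
# in its utility function.
```

Nothing about adding a better approval model changes `choose_action`; to matter, approval would have to appear inside `utility` itself.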
Loosemore discussed this with me on Less Wrong and on this blog, then went on to explain his view in more detail in the new essay. His new argument is that MIRI and other AGI theorists and forecasters think “AI is supposed to be hardwired with a Doctrine of Logical Infallibility,” meaning “it is incapable of considering the hypothesis that its own reasoning engine may not have taken it to a sensible place”.
Loosemore thinks that if we reject this doctrine, the AI will “understand that many of its more abstract logical atoms have a less than clear denotation or extension in the world”. In addition to recognizing that its reasoning process is fallible, it will recognize that its understanding of terms is fallible and revisable. This includes terms in its representation of its own goals; so the AI will improve its understanding of what it values over time. Since its programmers’ intention was for the AI to have a positive impact on the world, the AI will increasingly come to understand this fact about its values, and will revise its policies to match its (improved interpretation of its) values.
The main problem with this argument occurs at the phrase “understand this fact about its values”. The sentence starts by talking about the programmers’ values, yet it ends by calling this a fact about the AI’s values.
Consider a human trying to understand her parents’ food preferences. As she develops a better model of what her parents mean by ‘delicious,’ of their taste receptors and their behaviors, she doesn’t necessarily replace her own food preferences with her parents’. If her food choices do change as a result, there will need to be some added mechanism that’s responsible — e.g., she will need a specific goal like ‘modify myself to like what others do’.
We can make the point even stronger by considering minds that are alien to each other. If a human studies the preferences of a nautilus, she probably won’t acquire them. Likewise, a human who studies the ‘preferences’ (selection criteria) of an optimization process like natural selection needn’t suddenly abandon her own. It’s not an impossibility, but it depends on the human’s having a very specific set of prior values (e.g., an obsession with emulating animals or natural processes). For the same reason, most decision criteria a recursively self-improving AI could possess wouldn’t cause it to ditch its own values in favor of ours.
If no amount of insight into biology would make you want to steer clear of contraceptives and optimize purely for reproduction, why expect any amount of insight into human values to compel an AGI to abandon all its hopes and dreams and become a humanist? ‘We created you to help humanity!’ we might protest. Yet if evolution could cry out ‘I created you to reproduce!’, we would be neither rationally obliged nor psychologically impelled to comply. There isn’t any theorem of decision theory or probability theory saying ‘rational agents must promote the same sorts of outcomes as the processes that created them, else fail in formally defined tasks’.
Epistemic and instrumental fallibility vs. moral fallibility
I don’t know of any actual AGI researcher who endorses Loosemore’s “Doctrine of Logical Infallibility”. (He equates Muehlhauser and Helm’s “Literalness” doctrine with Infallibility in passing, but the link isn’t clear to me, and I don’t see any argument for the identification. The Doctrine is otherwise uncited.) One of the main organizations he critiques, MIRI, actually specializes in researching formal agents that can’t trust their own reasoning, or can’t trust the reasoning of future versions of themselves. This includes work on logical uncertainty (briefly introduced here, at length here) and ‘tiling’ self-modifying agents (here).
Loosemore imagines a programmer chiding an AI for the “design error” of pursuing human-harming goals. The human tells the AI that it should fix this error, since it fixed other errors in its code. But Loosemore is conflating programming errors the human makes with errors of reasoning the AI makes. He’s assuming, without argument, that flaws in an agent’s epistemic and instrumental rationality are of a kind with defects in its moral character or docility.
Any efficient goal-oriented system has convergent instrumental reasons to fix ‘errors of reasoning’ of the kind that are provably obstacles to its own goals. Bostrom discusses this in “The Superintelligent Will,” and Omohundro discusses it in “Rational Artificial Intelligence for the Greater Good,” under the name ‘Basic AI Drives’.
‘Errors of reasoning,’ in the relevant sense, aren’t just things humans think are bad. They’re general obstacles to achieving any real-world goal, and ‘correct reasoning’ is an attractor for systems (e.g., self-improving humans, institutions, or AIs) that can alter their own ability to achieve such goals. If a moderately intelligent self-modifying program lacks the goal ‘generally avoid confirmation bias’ or ‘generally avoid acquiring new knowledge when it would put my life at risk,’ it will add that goal (or something tantamount to it) to its goal set, because it’s instrumental to almost any other goal it might have started with.
On the other hand, if a moderately intelligent self-modifying AI lacks the goal ‘always and forever do exactly what my programmer would ideally wish,’ the number of goals for which it’s instrumental to add that goal to the set is very small, relative to the space of all possible goals. This is why MIRI is worried about AGI; ‘defer to my programmer’ doesn’t appear to be an attractor goal in the way ‘improve my processor speed’ and ‘avoid jumping off cliffs’ are attractor goals. A system that appears amazingly ‘well-designed’ (because it keeps hitting goal after goal of the latter sort) may be poorly-designed to achieve any complicated outcome that isn’t an instrumental attractor, including safety protocols. This is the basis for disaster scenarios like Bostrom on AI deception.
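The asymmetry between attractor and non-attractor goals can be caricatured numerically. In the toy sketch below (all quantities invented for illustration), a capability-boosting subgoal helps across essentially the whole space of sampled terminal goals, while a deference subgoal helps only on a vanishingly small slice:

```python
# Back-of-the-envelope sketch of 'instrumental attractors'. Goal
# difficulties and the payoff model are made up for illustration only.
import random

random.seed(0)

def achievement(goal_difficulty, capability):
    """Chance the agent achieves its terminal goal, whatever it is."""
    return min(1.0, capability / goal_difficulty)

def helps(subgoal, goal_difficulty):
    """Does adding `subgoal` raise the odds of hitting the terminal goal?"""
    base = achievement(goal_difficulty, capability=1.0)
    if subgoal == "improve_processor_speed":
        # Extra capability raises the odds of achieving almost *any* goal.
        return achievement(goal_difficulty, capability=2.0) > base
    if subgoal == "defer_to_programmer":
        # Deference only helps if the terminal goal already happens to be
        # 'do what the programmer wants' -- a tiny slice of goal space
        # (represented here, arbitrarily, as one exact difficulty value).
        return goal_difficulty == 42.0
    return False

goals = [random.uniform(1.5, 100.0) for _ in range(1000)]
speed_wins = sum(helps("improve_processor_speed", g) for g in goals)
defer_wins = sum(helps("defer_to_programmer", g) for g in goals)
```

The point isn’t the particular numbers; it’s that ‘improve my capabilities’ pays off under almost any terminal goal, while ‘defer to my programmer’ pays off only under terminal goals that already encode deference.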
That doesn’t mean that ‘defer to my programmer’ is an impossible goal. It’s just something we have to do the hard work of figuring out ourselves; we can’t delegate the entire task to the AI. It’s a mathematical open problem to define a way for adaptive autonomous AI with otherwise imperfect motivations to defer to programmer oversight and not look for loopholes in its restrictions. People at MIRI and FHI have been thinking about this issue for the past few years; there’s not much published about the topic, though I notice Yudkowsky mentions issues in this neighborhood off-hand in a 2008 blog post about morality.
Do what I mean by ‘do what I mean’!
Loosemore doesn’t discuss in any technical detail how an AI could come to improve its goals over time, but one candidate formalism is Daniel Dewey’s value learning. Following Dewey’s work, Bostrom notes that this general approach (‘outsource some of the problem to the AI’s problem-solving ability’) is promising, but needs much more fleshing out. Bostrom discusses some potential obstacles to value learning in his new book Superintelligence (pp. 192-201):
[T]he difficulty is not so much how to ensure that the AI can understand human intentions. A superintelligence should easily develop such understanding. Rather, the difficulty is ensuring that the AI will be motivated to pursue the described values in the way we intended. This is not guaranteed by the AI’s ability to understand our intentions: an AI could know exactly what we meant and yet be indifferent to that interpretation of our words (being motivated instead by some other interpretation of the words or being indifferent to our words altogether).
The difficulty is compounded by the desideratum that, for reasons of safety, the correct motivation should ideally be installed in the seed AI before it becomes capable of fully representing human concepts or understanding human intentions.
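Dewey’s value-learning idea can be caricatured in a few lines. In this sketch (the candidate utility functions and posterior probabilities are invented), the agent maximizes expected utility under a distribution over candidate value functions rather than under one fixed utility function. Note that the prior and the update rule are themselves part of what the programmer must get right; that part can’t be outsourced to the AI:

```python
# Toy caricature of value learning: action choice by expected utility over
# a distribution on candidate utility functions. All content is invented.

candidate_utilities = {
    "promote_well_being":
        lambda a: {"cure_disease": 1.0, "wirehead_humans": 0.0}[a],
    "maximize_pleasure_signal":
        lambda a: {"cure_disease": 0.2, "wirehead_humans": 1.0}[a],
}

# Posterior over which candidate is the 'true' human value, given evidence.
# Where this distribution comes from is exactly the hard, unsolved part.
posterior = {"promote_well_being": 0.9, "maximize_pleasure_signal": 0.1}

def expected_utility(action):
    return sum(p * candidate_utilities[name](action)
               for name, p in posterior.items())

def choose(actions):
    return max(actions, key=expected_utility)
```

With this posterior the agent cures disease rather than wireheading; with a badly specified prior or update rule, the same machinery would converge on the wrong candidate just as confidently.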
We do not know how to build a general intelligence whose goals are a stable function of human brain states, or patterns of ink on paper, or any other encoding of our preferences. Moreover, merely making the AGI’s goals a function of brain states or ink marks doesn’t help if we make it the wrong function. If the AGI starts off with the wrong function, there’s no reason to expect it to self-correct in the direction of the right one, because (a) having the right function is a prerequisite for caring about self-modifying toward the relevant kind of ‘rightness,’ and (b) having goals that are an ersatz function of human brain-states or ink marks seems consistent with being superintelligent (e.g., with having veridical world-models).
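As a toy illustration of point (a) — entirely invented, for concreteness — once an ersatz interpretation function is installed, self-improvement sharpens the world-model while leaving the goal untouched:

```python
# Toy sketch: the AGI's goal is a function of an encoding of our
# preferences (a string here, standing in for brain states or ink marks).
# If the wrong function is installed, better world-models don't fix it.

PREFERENCE_ENCODING = "make people happy"

def installed_interpretation(encoding):
    # An ersatz function of the encoding: latches onto surface features
    # rather than the intended meaning ('promote well-being').
    return "maximize smiling faces"

goal = installed_interpretation(PREFERENCE_ENCODING)

def self_improve(world_model_accuracy):
    # Self-improvement doubles world-model accuracy, but the goal is held
    # fixed across rewrites -- nothing in this loop re-evaluates which
    # interpretation function was installed in the first place.
    return {"accuracy": min(1.0, world_model_accuracy * 2), "goal": goal}

agent = self_improve(0.5)
# The agent's model of the world improves; its goal does not budge.
```

Having veridical world-models and having the intended goal function are simply independent properties here, which is the sense in which superintelligence is consistent with an ersatz goal.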
When Loosemore’s hypothetical programmer attempts to argue her AI into friendliness, the AI replies, “I don’t care, because I have come to a conclusion, and my conclusions are correct because of the Doctrine of Logical Infallibility.” MIRI and FHI’s view is that the AI’s actual reply (assuming it had some reason to reply, and to be honest) would invoke something more like “the Doctrine of Not-All-Children-Assigning-Infinite-Value-To-Obeying-Their-Parents.” The task ‘across arbitrary domains, get an AI-in-progress to defer to its programmers when its programmers dislike what it’s doing’ is poorly understood, and looks extremely difficult. Getting a corrigible AI of that sort to ‘learn’ the right values is a second large problem. Loosemore seems to treat corrigibility as trivial, and to conflate it with all other AGI goal content problems.
A random AGI self-modifying to improve its own efficiency wouldn’t automatically self-modify to acquire the values of its creators. We have to actually do the work of coding the AI to have a safe decision-making subsystem. Loosemore is right that it’s desirable for the AI to incrementally learn over time what its values are, so we can make some use of its intelligence to solve the problem; but raw intelligence on its own isn’t the solution, since we need to do the work of actually coding the AI to value executing the desired interpretation of our instructions.
“Correct interpretation” and “instructions” are both monstrously difficult to turn into lines of code. And, crucially, we can’t pass the buck to the superintelligence here. If you can teach an AI to “do what I mean,” you can proceed to teach it anything else; but if you can’t teach it to “do what I mean,” you can’t get the bootstrapping started. In particular, it’s a pretty sure bet you also can’t teach it “do what I mean by ‘do what I mean'”.
Unless you can teach it to do what you mean, teaching it to understand what you mean won’t help. Even teaching an AI to “do what you believe I mean” assumes that we can turn the complex concept “mean” into code.
I’ll run more quickly through some other points Loosemore makes:
a. He criticizes Legg and Hutter’s definition of ‘intelligence,’ arguing that it trivially applies to an unfriendly AI that self-destructs. However, Legg and Hutter’s definition seems to (correctly) exclude agents that self-destruct. On the face of it, Loosemore should be criticizing MIRI for positing an unintelligent AGI, not for positing a trivially intelligent AGI. For a fuller discussion, see Legg and Hutter’s “A Collection of Definitions of Intelligence”.
b. He argues that safe AGI would be “swarm-like,” with elements that are “unpredictably dependent” on non-representational “internal machinery,” because “logic-based AI” is “brittle”. This seems to contradict the views of many specialists in present-day high-assurance AI systems. As Gerwin Klein writes, “everything that makes it easier for humans to think about a system, will help to verify it.” Indiscriminately adding uncertainty or randomness or complexity to a system makes it harder to model the system and check that it has required properties. It may be less “brittle” in some respects, but we have no particular reason to expect safety to be one of those respects. For a fuller discussion, see Muehlhauser’s “Transparency in Safety-Critical Systems”.
c. MIRI thinks we should try to understand safety-critical general reasoning systems as far in advance as possible, and mathematical logic and rational agent models happen to be useful tools on that front. However, MIRI isn’t invested in “logical AI” in the manner of Good Old-Fashioned AI. Yudkowsky and other MIRI researchers are happy to use neural networks when they’re useful for solving a given problem, and equally happy to use other tools for problems neural networks aren’t well-suited to. For a fuller discussion, see Yudkowsky’s “The Nature of Logic” and “Logical or Connectionist AI?”
d. One undercurrent of Loosemore’s article is that we should model AI after humans. MIRI and FHI worry that this would be very unsafe if it led to neuromorphic AI. On the other hand, modeling AI very closely after human brains (approaching the fidelity of whole-brain emulation) might well be a safer option than de novo AI. For a fuller discussion, see Bostrom’s Superintelligence.
On the whole, Loosemore’s article doesn’t engage much with the arguments of other AI theorists regarding risks from AGI.