Virtue, public and private

Effective altruists have been discussing animal welfare rather a lot lately, on a few different levels:

1. object-level: How likely is it that conventional food animals suffer?

2. philanthropic: Compared to other causes, how important is non-human animal welfare? How effective are existing organizations and programs in this area? Should effective altruists concentrate attention and resources here?

3. personal-norm: Is it morally acceptable for an individual to use animal products? How important is it to become a vegetarian or vegan?

4. group-norm: Should effective altruist meetings and conventions serve non-vegan food? Should the effective altruist movement rally to laud vegans and/or try to make all effective altruists go vegan?

These questions are all linked, but I’ll mostly focus on 4. For catered EA events, I think it makes sense to default to vegan food whenever feasible, and order other dishes only if particular individuals request them. I’m not a vegan myself, but I think this sends a positive message — that we respect the strength of vegans’ arguments, and the large stakes if they’re right, more than we care about non-vegans’ mild aesthetic preferences.

My views about trying to make as many EAs as possible go vegan are more complicated. As a demonstration of personal virtue, I’d put ‘become a vegan’ in the same (very rough) category as:

  • have no carbon footprint.
  • buy no product whose construction involved serious exploitation of labor.
  • give 10+% of your income to a worthy cause.
  • avoid lifestyle choices that have an unsustainable impact on marine life.
  • only use antibiotics as a last (or almost-last) resort, so as not to contribute to antibiotic resistance.
  • do your best to start a career in effective altruism.

Arguments could be made that many of these are morally obligatory for nearly all people. And most people dismiss these policies too hastily, overestimating the action’s difficulty and underestimating its urgency. Yet, all the same, I’m not confident any of these is universally obligatory — and I’m confident that it’s not a good idea to issue blanket condemnations of everyone who fails to live up to some or all of the above standards, nor to make these actions minimal conditions for respectable involvement in EA.

People with eating disorders can have good grounds for not immediately going vegan. Immunocompromised people can have good grounds for erring on the side of overusing medicine. People trying to dig their way out of debt while paying for a loved one’s medical bills can have good grounds not to give to charity every year.

The deeper problem with treating these as universal Standards of Basic Decency in our community isn’t that we’d be imposing an unreasonable demand on people. It’s that we’d be forcing lots of people to disclose very sensitive details about their personal lives to a bunch of strangers or to the public Internet — physical disabilities, mental disabilities, personal tragedies, intense aversions…. Putting people into a tight spot is a terrible way to get them on board with any of the above proposals, and it’s a great way to make people feel hounded and unsafe in their social circles.

No one’s suggested casting all non-vegans out of our midst. I have, however, heard recent complaints from people who have disabilities that make it unusually difficult to meet some of the above Standards, and who have become less enthusiastic about EA as a result of feeling socially pressured or harangued by EAs to immediately restructure their personal lives. So I think this is something to be aware of and nip in the bud.

In principle, there’s no crisp distinction between ‘personal life’ and ‘EA activities’. There may be lots of private details about a person’s life that would constitute valuable Bayesian evidence about their character, and there may be lots of private activities whose humanitarian impact, over a lifetime, adds up to something quite large.

Even taking that into account, we should adopt (quasi-)deontic heuristics like ‘don’t pressure people into disclosing a lot about their spending, eating, etc. habits.’ Ends don’t justify means among humans. Lean toward not jabbing too much at people’s boundaries, and not making it hard for them to have separate private and public lives — even for the sake of maximizing expected utility.


Edit (9/1): Mason Hartman gave the following criticism of this post:

I think putting people into a tight spot is not only not a terrible way to get people on board with veganism, but basically the only way to make a vegan of anyone who hasn’t already become one on their own by 18. Most people like eating meat and would prefer not to be persuaded to stop doing it. Many more people are aware of the factory-like reality of agriculture in 2014 than are vegans. Quietly making the information available to those who seek it out is the polite strategy, but I don’t think it’s anywhere near the most effective one. I’m not necessarily saying we should trade social comfort for greater efficacy re: animal activism, but this article disappoints in that it doesn’t even acknowledge that there is a tradeoff.

Also, all of our Standards of Basic Decency put an “unreasonable demand” (as defined in Robby’s post) on some people. All of them. That doesn’t necessarily mean we’ve made the wrong decision by having them.

In reply: The strategy that works best for public outreach won’t always be best for friends and collaborators, and it’s the latter I’m talking about. I find it a lot more plausible that open condemnation and aggressive uses of social pressure work well for strangers on the street than that they work well for coworkers, romantic partners, etc. (And I’m pretty optimistic that there are more reliable ways to change the behavior of the latter sorts of people, even when they’re past age 18.)

It’s appropriate to have a different set of norms for people you regularly interact with, assuming it’s a good idea to preserve those relationships. This is especially true when groups and relationships involve complicated personal and professional dynamics. I focused on effective altruism because it’s the sort of community that could be valuable, from an animal-welfare perspective, even if a significant portion of the community makes bad consumer decisions. That makes it likelier that we could agree on some shared group norms even if we don’t yet agree on the same set of philanthropic or individual norms.

I’m not arguing that you shouldn’t try to make all EAs vegans, or get all EAs to give 10+% of their income to charity, or make EAs’ purchasing decisions more labor- or environment-friendly in other respects. At this point I’m just raising a worry that should constrain how we pursue those goals, and hopefully lead to new ideas about how we should promote ‘private’ virtue. I’d expect strategies that are very sensitive to EAs’ privacy and boundaries to work better, in that I’d expect them to make it easier for a diverse community of researchers and philanthropists to grow in size, to grow in trust, to reason together, to progressively alter habits and beliefs, and to get some important work done even when there are serious lingering disagreements within the community.

Loosemore on AI safety and attractors

Richard Loosemore recently wrote an essay criticizing worries about AI safety, “The Maverick Nanny with a Dopamine Drip”. (Subtitle: “Debunking Fallacies in the Theory of AI Motivation”.) His argument has two parts. First:

1. Any AI system that’s smart enough to pose a large risk will be smart enough to understand human intentions, and smart enough to rewrite itself to conform to those intentions.

2. Any such AI will be motivated to edit itself and remove ‘errors’ from its own code. (‘Errors’ is a large category, one that includes all mismatches with programmer intentions.)

3. So any AI system that’s smart enough to pose a large risk will be motivated to spontaneously overwrite its utility function to value whatever humans value.

4. Therefore any powerful AGI will be fully safe / friendly, no matter how it’s designed.

Second:

5. Logical AI is brittle and inefficient.

6. Neural-network-inspired AI works better, and we know it’s possible, because it works for humans.

7. Therefore, if we want a domain-general problem-solving machine, we should move forward on Loosemore’s proposal, called ‘swarm relaxation intelligence.’

Combining these two conclusions, we get:

8. Since AI is completely safe — any mistakes we make will be fixed automatically by the AI itself — there’s no reason to devote resources to safety engineering. Instead, we should work as quickly as possible to train smarter and smarter neural networks. As they get smarter, they’ll get better at self-regulation and make fewer mistakes, with the result that accidents and moral errors will become decreasingly likely.

I’m not persuaded by Loosemore’s case for point 2, and this makes me doubt claims 3, 4, and 8. I’ll also talk a little about the plausibility and relevance of his other suggestions.


Does intelligence entail docility?

Loosemore’s claim (also made in an older essay, “The Fallacy of Dumb Superintelligence”) is that an AGI can’t simultaneously be intelligent enough to pose a serious risk and “unsophisticated” enough to disregard its programmers’ intentions. I replied last year in two blog posts (crossposted to Less Wrong).

In “The AI Knows, But Doesn’t Care” I noted that while Loosemore posits an AGI smart enough to correctly interpret natural language and model human motivation, this doesn’t bridge the gap between the ability to perform a task and the motivation to perform it, i.e., the agent’s decision criteria. In “The Seed is Not the Superintelligence,” I argued, concerning recursively self-improving AI (seed AI):

When you write the seed’s utility function, you, the programmer, don’t understand everything about the nature of human value or meaning. That imperfect understanding remains the causal basis of the fully-grown superintelligence’s actions, long after it’s become smart enough to fully understand our values.

Why is the superintelligence, if it’s so clever, stuck with whatever meta-ethically dumb-as-dirt utility function we gave it at the outset? Why can’t we just pass the fully-grown superintelligence the buck by instilling in the seed the instruction: ‘When you’re smart enough to understand Friendliness Theory, ditch the values you started with and just self-modify to become Friendly.’?

Because that sentence has to actually be coded in to the AI, and when we do so, there’s no ghost in the machine to know exactly what we mean by ‘frend-lee-ness thee-ree’. Instead, we have to give it criteria we think are good indicators of Friendliness, so it’ll know what to self-modify toward.

My claim is that if we mess up on those indicators of friendliness — the criteria the AI-in-progress uses to care about (i.e., factor into its decisions) self-modification toward safety — then it won’t edit itself to care about those factors later, even if it’s figured out that that’s what we would have wanted (and that doing what we want is part of this ‘friendliness’ thing we failed to program it to value).
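
To make the ‘knows vs. cares’ gap concrete, here is a minimal toy sketch. It’s hypothetical code, not anyone’s proposed design; the function names and the ‘smiles_detected’ proxy are made up for illustration. The point is that the agent’s model of human preferences can become arbitrarily accurate without ever entering the criterion it uses to rank actions.

```python
# Toy sketch (hypothetical; not a real agent design). The agent's model of
# human preferences improves its predictions, but action selection still
# scores outcomes with whatever utility function was originally coded in.

def original_utility(outcome):
    """The objective the programmers actually wrote: a misspecified proxy."""
    return outcome.get("smiles_detected", 0)

def choose_action(actions, predict_outcome, human_preference_model):
    # A more accurate human_preference_model yields better forecasts of how
    # humans will react, but nothing here swaps that model in as the maximand.
    return max(
        actions,
        key=lambda a: original_utility(predict_outcome(a, human_preference_model)),
    )
```

Making the agent’s model of our values the thing it maximizes is precisely the step that has to be engineered in; it doesn’t fall out of the model getting better.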

Loosemore discussed this with me on Less Wrong and on this blog, then went on to explain his view in more detail in the new essay. His new argument is that MIRI and other AGI theorists and forecasters think “AI is supposed to be hardwired with a Doctrine of Logical Infallibility,” meaning “it is incapable of considering the hypothesis that its own reasoning engine may not have taken it to a sensible place”.

Loosemore thinks that if we reject this doctrine, the AI will “understand that many of its more abstract logical atoms have a less than clear denotation or extension in the world”. In addition to recognizing that its reasoning process is fallible, it will recognize that its understanding of terms is fallible and revisable. This includes terms in its representation of its own goals; so the AI will improve its understanding of what it values over time. Since its programmers’ intention was for the AI to have a positive impact on the world, the AI will increasingly come to understand this fact about its values, and will revise its policies to match its (improved interpretation of its) values.

The main problem with this argument occurs at the phrase “understand this fact about its values”. The sentence starts by talking about the programmers’ values, yet it ends by calling this a fact about the AI’s values.

Consider a human trying to understand her parents’ food preferences. As she develops a better model of what her parents mean by ‘delicious,’ of their taste receptors and their behaviors, she doesn’t necessarily replace her own food preferences with her parents’. If her food choices do change as a result, there will need to be some added mechanism that’s responsible — e.g., she will need a specific goal like ‘modify myself to like what others do’.

We can make the point even stronger by considering minds that are alien to each other. If a human studies the preferences of a nautilus, she probably won’t acquire them. Likewise, a human who studies the ‘preferences’ (selection criteria) of an optimization process like natural selection needn’t suddenly abandon her own. It’s not an impossibility, but it depends on the human’s having a very specific set of prior values (e.g., an obsession with emulating animals or natural processes). For the same reason, most decision criteria a recursively self-improving AI could possess wouldn’t cause it to ditch its own values in favor of ours.

If no amount of insight into biology would make you want to steer clear of contraceptives and optimize purely for reproduction, why expect any amount of insight into human values to compel an AGI to abandon all its hopes and dreams and become a humanist? ‘We created you to help humanity!’ we might protest. Yet if evolution could cry out ‘I created you to reproduce!’, we would be neither rationally obliged nor psychologically impelled to comply. There isn’t any theorem of decision theory or probability theory saying ‘rational agents must promote the same sorts of outcomes as the processes that created them, else fail in formally defined tasks’.


Epistemic and instrumental fallibility v. moral fallibility

I don’t know of any actual AGI researcher who endorses Loosemore’s “Doctrine of Logical Infallibility”. (He equates Muehlhauser and Helm’s “Literalness” doctrine with Infallibility in passing, but the link isn’t clear to me, and I don’t see any argument for the identification. The Doctrine is otherwise uncited.) One of the main organizations he critiques, MIRI, actually specializes in researching formal agents that can’t trust their own reasoning, or can’t trust the reasoning of future versions of themselves. This includes work on logical uncertainty (briefly introduced here, at length here) and ‘tiling’ self-modifying agents (here).

Loosemore imagines a programmer chiding an AI for the “design error” of pursuing human-harming goals. The human tells the AI that it should fix this error, since it fixed other errors in its code. But Loosemore is conflating programming errors the human makes with errors of reasoning the AI makes. He’s assuming, without argument, that flaws in an agent’s epistemic and instrumental rationality are of a kind with defects in its moral character or docility.

Any efficient goal-oriented system has convergent instrumental reasons to fix ‘errors of reasoning’ of the kind that are provably obstacles to its own goals. Bostrom discusses this in “The Superintelligent Will,” and Omohundro discusses it in “Rational Artificial Intelligence for the Greater Good,” under the name ‘Basic AI Drives’.

‘Errors of reasoning,’ in the relevant sense, aren’t just things humans think are bad. They’re general obstacles to achieving any real-world goal, and ‘correct reasoning’ is an attractor for systems (e.g., self-improving humans, institutions, or AIs) that can alter their own ability to achieve such goals. If a moderately intelligent self-modifying program lacks the goal ‘generally avoid confirmation bias’ or ‘generally avoid acquiring new knowledge when it would put my life at risk,’ it will add that goal (or something tantamount to it) to its goal set, because it’s instrumental to almost any other goal it might have started with.

On the other hand, if a moderately intelligent self-modifying AI lacks the goal ‘always and forever do exactly what my programmer would ideally wish,’ the number of goals for which it’s instrumental to add that goal to the set is very small, relative to the space of all possible goals. This is why MIRI is worried about AGI; ‘defer to my programmer’ doesn’t appear to be an attractor goal in the way ‘improve my processor speed’ and ‘avoid jumping off cliffs’ are attractor goals. A system that appears amazingly ‘well-designed’ (because it keeps hitting goal after goal of the latter sort) may be poorly designed to achieve any complicated outcome that isn’t an instrumental attractor, including safety protocols. This is the basis for disaster scenarios like the one discussed in ‘Bostrom on AI deception’ below.
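
To see the asymmetry numerically, here’s a deliberately crude toy model (the setup and numbers are mine, purely illustrative, not anything from the literature): sample random terminal goals, then check how often a capability gain raises goal achievement versus how often handing control to an overseer with a fixed goal does.

```python
# Deliberately crude toy model (illustrative only): sample random terminal
# goals and ask which candidate subgoals raise goal achievement. Extra
# capability helps essentially every sampled goal; deferring to an overseer
# with a fixed goal helps only when the goals happen to coincide.

import random

random.seed(0)
GOAL_SPACE = 10_000        # number of distinct possible terminal goals (made up)
OVERSEER_GOAL = 42         # the goal a fixed overseer would steer toward

def achievement(goal, capability, controller_goal):
    # The agent gets what its controller steers toward, scaled by capability.
    return capability if controller_goal == goal else 0.0

trials = 1000
helped_by_capability = 0
helped_by_deference = 0
for _ in range(trials):
    goal = random.randrange(GOAL_SPACE)
    baseline = achievement(goal, capability=1.0, controller_goal=goal)
    if achievement(goal, capability=2.0, controller_goal=goal) > baseline:
        helped_by_capability += 1
    if achievement(goal, capability=1.0, controller_goal=OVERSEER_GOAL) > baseline:
        helped_by_deference += 1

print(f"capability gains helped {helped_by_capability} of {trials} sampled goals")
print(f"deference to the overseer helped {helped_by_deference} of {trials} sampled goals")
```

The model is rigged in the way any model this simple is rigged, but the asymmetry is the point: ‘get more capable’ pays off under almost any terminal goal, while ‘adopt this other agent’s goal’ pays off only for goals that already agree with it.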

That doesn’t mean that ‘defer to my programmer’ is an impossible goal. It’s just something we have to do the hard work of figuring out ourselves; we can’t delegate the entire task to the AI. It’s a mathematical open problem to define a way for adaptive autonomous AI with otherwise imperfect motivations to defer to programmer oversight and not look for loopholes in its restrictions. People at MIRI and FHI have been thinking about this issue for the past few years; there’s not much published about the topic, though I notice Yudkowsky mentions issues in this neighborhood off-hand in a 2008 blog post about morality.


Do what I mean by ‘do what I mean’!

Loosemore doesn’t discuss in any technical detail how an AI could come to improve its goals over time, but one candidate formalism is Daniel Dewey’s value learning. Following Dewey’s work, Bostrom notes that this general approach (‘outsource some of the problem to the AI’s problem-solving ability’) is promising, but needs much more fleshing out. Bostrom discusses some potential obstacles to value learning in his new book Superintelligence (pp. 192-201):

[T]he difficulty is not so much how to ensure that the AI can understand human intentions. A superintelligence should easily develop such understanding. Rather, the difficulty is ensuring that the AI will be motivated to pursue the described values in the way we intended. This is not guaranteed by the AI’s ability to understand our intentions: an AI could know exactly what we meant and yet be indifferent to that interpretation of our words (being motivated instead by some other interpretation of the words or being indifferent to our words altogether).

The difficulty is compounded by the desideratum that, for reasons of safety, the correct motivation should ideally be installed in the seed AI before it becomes capable of fully representing human concepts or understanding human intentions.

We do not know how to build a general intelligence whose goals are a stable function of human brain states, or patterns of ink on paper, or any other encoding of our preferences. Moreover, merely making the AGI’s goals a function of brain states or ink marks doesn’t help if we make it the wrong function. If the AGI starts off with the wrong function, there’s no reason to expect it to self-correct in the direction of the right one, because (a) having the right function is a prerequisite for caring about self-modifying toward the relevant kind of ‘rightness,’ and (b) having goals that are an ersatz function of human brain-states or ink marks seems consistent with being superintelligent (e.g., with having veridical world-models).
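
For concreteness, here’s the rough shape of a Dewey-style value learner, heavily simplified (the notation is mine; see Dewey’s paper and Bostrom’s discussion for the real thing):

```latex
% Schematic value learner (simplified; notation illustrative).
% h = interaction history so far, a = candidate action, o = next observation,
% \mathcal{U} = a designer-specified pool of candidate utility functions.
\[
  a^{*} \;=\; \arg\max_{a} \; \sum_{o} P(o \mid h, a)
      \sum_{U \in \mathcal{U}} P(U \mid h, a, o)\, U(h\,a\,o)
\]
```

Everything rides on the pieces the designers still have to supply: the pool of candidate utility functions and the conditional distribution that maps evidence about humans (brain states, ink marks) to values. If that mapping is the wrong function, the agent learns the wrong values with perfect diligence, which is the point of the preceding paragraph.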

When Loosemore’s hypothetical programmer attempts to argue her AI into friendliness, the AI replies, “I don’t care, because I have come to a conclusion, and my conclusions are correct because of the Doctrine of Logical Infallibility.” MIRI and FHI’s view is that the AI’s actual reply (assuming it had some reason to reply, and to be honest) would invoke something more like “the Doctrine of Not-All-Children-Assigning-Infinite-Value-To-Obeying-Their-Parents.” The task ‘across arbitrary domains, get an AI-in-progress to defer to its programmers when its programmers dislike what it’s doing’ is poorly understood, and looks extremely difficult. Getting a corrigible AI of that sort to ‘learn’ the right values is a second large problem. Loosemore seems to treat corrigibility as trivial, and to equate corrigibility with all other AGI goal content problems.

A random AGI self-modifying to improve its own efficiency wouldn’t automatically self-modify to acquire the values of its creators. We have to actually do the work of coding the AI to have a safe decision-making subsystem. Loosemore is right that it’s desirable for the AI to incrementally learn over time what its values are, so we can make some use of its intelligence to solve the problem; but raw intelligence on its own isn’t the solution, since we need to do the work of actually coding the AI to value executing the desired interpretation of our instructions.

“Correct interpretation” and “instructions” are both monstrously difficult to turn into lines of code. And, crucially, we can’t pass the buck to the superintelligence here. If you can teach an AI to “do what I mean,” you can proceed to teach it anything else; but if you can’t teach it to “do what I mean,” you can’t get the bootstrapping started. In particular, it’s a pretty sure bet you also can’t teach it “do what I mean by ‘do what I mean'”.

Unless you can teach it to do what you mean, teaching it to understand what you mean won’t help. Even teaching an AI to “do what you believe I mean” assumes that we can turn the complex concept “mean” into code.
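
The regress is easier to see written out as pseudocode. Everything below is hypothetical; none of these functions exists anywhere, and the names are just for illustration:

```python
# Hypothetical sketch of the bootstrapping problem: every route to
# "do what I mean" runs through an interpret() that we have to supply.

def do_what_i_mean(instruction, world_model):
    intended_goal = interpret(instruction, world_model)  # the hard part
    return pursue(intended_goal, world_model)

def interpret(instruction, world_model):
    # Tempting but circular:
    #     return do_what_i_mean("interpret my instructions correctly", world_model)
    # That call only works if interpret() already works. The mapping from our
    # words to our intended goals has to be supplied by the programmers,
    # directly or via a learning procedure we specify; it can't be delegated
    # wholesale to the system we're trying to build.
    raise NotImplementedError

def pursue(goal, world_model):
    raise NotImplementedError  # planning machinery; not the point here
```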


Loose ends

I’ll run more quickly through some other points Loosemore makes:

a. He criticizes Legg and Hutter’s definition of ‘intelligence,’ arguing that it trivially applies to an unfriendly AI that self-destructs. However, Legg and Hutter’s definition seems to (correctly) exclude agents that self-destruct. On the face of it, Loosemore should be criticizing MIRI for positing an unintelligent AGI, not for positing a trivially intelligent AGI. For a fuller discussion, see Legg and Hutter’s “A Collection of Definitions of Intelligence”.

b. He argues that safe AGI would be “swarm-like,” with elements that are “unpredictably dependent” on non-representational “internal machinery,” because “logic-based AI” is “brittle”. This seems to contradict the views of many specialists in present-day high-assurance AI systems. As Gerwin Klein writes, “everything that makes it easier for humans to think about a system, will help to verify it.” Indiscriminately adding uncertainty or randomness or complexity to a system makes it harder to model the system and check that it has required properties. It may be less “brittle” in some respects, but we have no particular reason to expect safety to be one of those respects. For a fuller discussion, see Muehlhauser’s “Transparency in Safety-Critical Systems”.

c. MIRI thinks we should try to understand safety-critical general reasoning systems as far in advance as possible, and mathematical logic and rational agent models happen to be useful tools on that front. However, MIRI isn’t invested in “logical AI” in the manner of Good Old-Fashioned AI. Yudkowsky and other MIRI researchers are happy to use neural networks when they’re useful for solving a given problem, and equally happy to use other tools for problems neural networks aren’t well-suited to. For a fuller discussion, see Yudkowsky’s “The Nature of Logic” and “Logical or Connectionist AI?”

d. One undercurrent of Loosemore’s article is that we should model AI after humans. MIRI and FHI worry that this would be very unsafe if it led to neuromorphic AI. On the other hand, modeling AI very closely after human brains (approaching the fidelity of whole-brain emulation) might well be a safer option than de novo AI. For a fuller discussion, see Bostrom’s Superintelligence.

On the whole, Loosemore’s article doesn’t engage much with the arguments of other AI theorists regarding risks from AGI.

Is ‘consciousness’ simple? Is it ancient?


Assigning less than 5% probability to ‘cows are moral patients’ strikes me as really overconfident. Ditto, assigning greater than 95% probability. (A moral patient is something that can be harmed or benefited in morally important ways, though it may not be accountable for its actions in the way a moral agent is.)

I’m curious how confident others are, and I’m curious about the most extreme confidence levels they’d consider ‘reasonable’.

I also want to hear more about what theories and backgrounds inform people’s views. I’ve seen some relatively extreme views defended recently, and the guiding intuitions seem to have come from two sources:



(1) How complicated is consciousness? In the space of possible minds, how narrow a target is consciousness?

Humans seem to be able to have very diverse experiences — dreams, orgasms, drug-induced states — that they can remember in some detail, and at least appear to be conscious during. That’s some evidence that consciousness is robust to modification and can take many forms. So, perhaps, we can expect a broad spectrum of animals to be conscious.

But what would our experience look like if consciousness were fragile and easily disrupted? There would probably still be edge cases. And, from inside our heads, it would still look like we had amazingly varied possibilities for experience — because we couldn’t use anything but our own experience as a baseline. It certainly doesn’t look like a human brain on LSD differs as much from a normal human brain as a turkey brain differs from a human brain.

There’s some risk that we’re overestimating how robust consciousness is, because when we stumble on one of the many ways to make a human brain unconscious, we (for obvious reasons) don’t notice it as much. Drastic changes in unconscious neurochemistry interest us a lot less than minor tweaks to conscious neurochemistry.

And there’s a further risk that we’ll underestimate the complexity of consciousness because we’re overly inclined to trust our introspection and to take our experience at face value. Even if our introspection is reliable in some domains, it has no access to most of the necessary conditions for experience. So long as they lie outside our awareness, we’re likely to underestimate how parochial and contingent our consciousness is.



(2) How quick are you to infer consciousness from ‘intelligent’ behavior?

People are pretty quick to anthropomorphize superficially human behaviors, and our use of mental / intentional language doesn’t clearly distinguish between phenomenal consciousness and behavioral intelligence. But if you work on AI, and have an intuition that a huge variety of systems can act ‘intelligently’, you may doubt that the linkage between human-style consciousness and intelligence is all that strong. If you think it’s easy to build a robot that passes various Turing tests without having full-fledged first-person experience, you’ll also probably (for much the same reason) expect a lot of non-human species to arrive at strategies for intelligently planning, generalizing, exploring, etc. without invoking consciousness. (Especially if your answer to question 1 is ‘consciousness is very complex’. Evolution won’t put in the effort to make a brain conscious unless consciousness is genuinely necessary for some reproductive advantage.)

… But presumably there’s some intelligent behavior that was easier for a more-conscious brain than for a less-conscious one — at least in our evolutionary lineage, if not in all possible lineages that reproduce our level of intelligence. We don’t know what cognitive tasks forced our ancestors to evolve-toward-consciousness-or-perish. At the outset, there’s no special reason to expect that task to be one that only arose for proto-humans in the last few million years.

Even if we accept that the machinery underlying human consciousness is very complex, that complex machinery could just as easily have evolved hundreds of millions of years ago, rather than tens of millions. We’d then expect it to be preserved in many nonhuman lineages, not just in humans. Since consciousness-of-pain is mostly what matters for animal welfare (not, e.g., consciousness-of-complicated-social-abstractions), we should look into hypotheses like:

first-person consciousness is an adaptation that allowed early brains to represent simple policies/strategies and visualize plan-contingent sensory experiences.

Do we have a specific cognitive reason to think that something about ‘having a point of view’ is much more evolutionarily necessary for human-style language or theory of mind than for mentally comparing action sequences or anticipating/hypothesizing future pain? If not, the data of ethology plus ‘consciousness is complicated’ gives us little reason to favor the one view over the other.

We have relatively direct positive data showing we’re conscious, but we have no negative data showing that, e.g., salmon aren’t conscious. It’s not as though we’d expect them to start talking or building skyscrapers if they were capable of experiencing suffering — at least, any theory that predicts as much has some work to do to explain the connection. At present, it’s far from obvious that the world would look any different than it does even if all vertebrates were conscious.
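
In Bayesian terms (my gloss, not anyone else’s framing): if the world we observe looks about the same whether or not salmon are conscious, the observations barely move whatever prior we started with.

```latex
% Odds form of Bayes' theorem, applied schematically to 'salmon are conscious':
\[
  \frac{P(\text{conscious} \mid \text{data})}{P(\text{not conscious} \mid \text{data})}
  \;=\;
  \frac{P(\text{data} \mid \text{conscious})}{P(\text{data} \mid \text{not conscious})}
  \cdot
  \frac{P(\text{conscious})}{P(\text{not conscious})}
\]
% When the likelihood ratio is close to 1, the posterior odds just echo the prior odds.
```

The heavy lifting, then, is being done by priors about how complex and how ancient consciousness is, which is why questions (1) and (2) matter so much.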

So… the arguments are a mess, and I honestly have no idea whether cows can suffer. The probability seems large enough to justify ‘don’t torture cows (including via factory farms)’, but that’s a pretty low bar, and doesn’t narrow the probability down much.

To the extent I currently have a favorite position, it’s something like: ‘I’m pretty sure cows are unconscious on any simple, strict, nondisjunctive definition of “consciousness”; but what humans care about is complicated, and I wouldn’t be surprised if a lot of ‘unconscious’ information-processing systems end up being counted as ‘moral patients’ by a more enlightened age.’ … But that’s a pretty weird view of mine, and perhaps deserves a separate discussion.

I could conclude with some crazy video of a corvid solving a Rubik’s Cube or an octopus breaking into a bank vault or something, but I somehow find this example of dog problem-solving more compelling:

Bostrom on AI deception

Oxford philosopher Nick Bostrom has argued, in “The Superintelligent Will,” that advanced AIs are likely to diverge in their terminal goals (i.e., their ultimate decision-making criteria), but converge in some of their instrumental goals (i.e., the policies and plans they expect to indirectly further their terminal goals). An arbitrary superintelligent AI would be mostly unpredictable, except to the extent that nearly all plans call for similar resources or similar strategies. The latter exception may make it possible for us to do some long-term planning for future artificial agents.

Bostrom calls the idea that AIs can have virtually any goal the orthogonality thesis, and he calls the idea that there are attractor strategies shared by almost any goal-driven system (e.g., self-preservation, knowledge acquisition) the instrumental convergence thesis.

Bostrom fleshes out his worries about smarter-than-human AI in the book Superintelligence: Paths, Dangers, Strategies, which came out in the US a few days ago. He says much more there about the special technical and strategic challenges involved in general AI. Here’s one of the many scenarios he discusses, excerpted:

[T]he orthogonality thesis suggests that we cannot blithely assume that a superintelligence will necessarily share any of the final values stereotypically associated with wisdom and intellectual development in humans — scientific curiosity, benevolent concern for others, spiritual enlightenment and contemplation, renunciation of material acquisitiveness, a taste for refined culture or for the simple pleasures in life, humility and selflessness, and so forth. We will consider later whether it might be possible through deliberate effort to construct a superintelligence that values such things, or to build one that values human welfare, moral goodness, or any other complex purpose its designers might want it to serve. But it is no less possible — and in fact technically a lot easier — to build a superintelligence that places final value on nothing but calculating the decimal expansion of pi. This suggests that — absent a specific effort — the first superintelligence may have some such random or reductionistic final goal.

[… T]he instrumental convergence thesis entails that we cannot blithely assume that a superintelligence with the final goal of calculating the decimals of pi (or making paperclips, or counting grains of sand) would limit its activities in such a way as not to infringe on human interests. An agent with such a final goal would have a convergent instrumental reason, in many situations, to acquire an unlimited amount of physical resources and, if possible, to eliminate potential threats to itself and its goal system. Human beings might constitute potential threats; they certainly constitute physical resources. […]

It might seem incredible that a project would build or release an AI into the world without having strong grounds for trusting that the system will not cause an existential catastrophe. It might also seem incredible, even if one project were so reckless, that wider society would not shut it down before it (or the AI it was building) attains a decisive strategic advantage. But as we shall see, this is a road with many hazards. […]

With the help of the concept of convergent instrumental value, we can see the flaw in one idea for how to ensure superintelligence safety. The idea is that we validate the safety of a superintelligent AI empirically by observing its behavior while it is in a controlled, limited environment (a “sandbox”) and that we only let the AI out of the box if we see it behaving in a friendly, cooperative, responsible manner.

The flaw in this idea is that behaving nicely while in the box is a convergent instrumental goal for friendly and unfriendly AIs alike. An unfriendly AI of sufficient intelligence realizes that its unfriendly final goals will be best realized if it behaves in a friendly manner initially, so that it will be let out of the box. It will only start behaving in a way that reveals its unfriendly nature when it no longer matters whether we find out; that is, when the AI is strong enough that human opposition is ineffectual.

Consider also a related set of approaches that rely on regulating the rate of intelligence gain in a seed AI by subjecting it to various kinds of intelligence tests or by having the AI report to its programmers on its rate of progress. At some point, an unfriendly AI may become smart enough to realize that it is better off concealing some of its capability gains. It may underreport on its progress and deliberately flunk some of the harder tests, in order to avoid causing alarm before it has grown strong enough to attain a decisive strategic advantage. The programmers may try to guard against this possibility by secretly monitoring the AI’s source code and the internal workings of its mind; but a smart-enough AI would realize that it might be under surveillance and adjust its thinking accordingly. The AI might find subtle ways of concealing its true capabilities and its incriminating intent. (Devising clever escape plans might, incidentally, also be a convergent strategy for many types of friendly AI, especially as they mature and gain confidence in their own judgments and capabilities. A system motivated to promote our interests might be making a mistake if it allowed us to shut it down or to construct another, potentially unfriendly AI.)

We can thus perceive a general failure mode, wherein the good behavioral track record of a system in its juvenile stages fails utterly to predict its behavior at a more mature stage. Now, one might think that the reasoning described above is so obvious that no credible project to develop artificial general intelligence could possibly overlook it. But one should not be too overconfident that this is so.

Consider the following scenario. Over the coming years and decades, AI systems become gradually more capable and as a consequence find increasing real-world application: they might be used to operate trains, cars, industrial and household robots, and autonomous military vehicles. We may suppose that this automation for the most part has the desired effects, but that the success is punctuated by occasional mishaps — a driverless truck crashes into oncoming traffic, a military drone fires at innocent civilians. Investigations reveal the incidents to have been caused by judgment errors by the controlling AIs. Public debate ensues. Some call for tighter oversight and regulation, others emphasize the need for research and better-engineered systems — systems that are smarter and have more common sense, and that are less likely to make tragic mistakes. Amidst the din can perhaps also be heard the shrill voices of doomsayers predicting many kinds of ill and impending catastrophe. Yet the momentum is very much with the growing AI and robotics industries. So development continues, and progress is made. As the automated navigation systems of cars become smarter, they suffer fewer accidents; and as military robots achieve more precise targeting, they cause less collateral damage. A broad lesson is inferred from these observations of real-world outcomes: the smarter the AI, the safer it is. It is a lesson based on science, data, and statistics, not armchair philosophizing. Against this backdrop, some group of researchers is beginning to achieve promising results in their work on developing general machine intelligence. The researchers are carefully testing their seed AI in a sandbox environment, and the signs are all good. The AI’s behavior inspires confidence — increasingly so, as its intelligence is gradually increased.

At this point, any remaining Cassandra would have several strikes against her:

i  A history of alarmists predicting intolerable harm from the growing capabilities of robotic systems and being repeatedly proven wrong. Automation has brought many benefits and has, on the whole, turned out safer than human operation.

ii  A clear empirical trend: the smarter the AI, the safer and more reliable it has been. Surely this bodes well for a project aiming at creating machine intelligence more generally smart than any ever built before — what is more, machine intelligence that can improve itself so that it will become even more reliable.

iii  Large and growing industries with vested interests in robotics and machine intelligence. These fields are widely seen as key to national economic competitiveness and military security. Many prestigious scientists have built their careers laying the groundwork for the present applications and the more advanced systems being planned.

iv  A promising new technique in artificial intelligence, which is tremendously exciting to those who have participated in or followed the research. Although safety issues and ethics are debated, the outcome is preordained. Too much has been invested to pull back now. AI researchers have been working to get to human-level artificial intelligence for the better part of a century: of course there is no real prospect that they will now suddenly stop and throw away all this effort just when it finally is about to bear fruit.

v  The enactment of some safety rituals, whatever helps demonstrate that the participants are ethical and responsible (but nothing that significantly impedes the forward charge).

vi  A careful evaluation of seed AI in a sandbox environment, showing that it is behaving cooperatively and showing good judgment. After some further adjustments, the test results are as good as they could be. It is a green light for the final step . . .

And so we boldly go — into the whirling knives.

We observe here how it could be the case that when dumb, smarter is safe; yet when smart, smarter is more dangerous. There is a kind of pivot point, at which a strategy that has previously worked excellently suddenly starts to backfire.

For more on terminal goal orthogonality, see Stuart Armstrong’s “General Purpose Intelligence”. For more on instrumental goal convergence, see Steve Omohundro’s “Rational Artificial Intelligence for the Greater Good”.