Bostrom on AI deception

Oxford philosopher Nick Bostrom has argued, in “The Superintelligent Will,” that advanced AIs are likely to diverge in their terminal goals (i.e., their ultimate decision-making criteria), but converge in some of their instrumental goals (i.e., the policies and plans they expect to indirectly further their terminal goals). An arbitrary superintelligent AI would be mostly unpredictable, except to the extent that nearly all plans call for similar resources or similar strategies. The latter exception may make it possible for us to do some long-term planning for future artificial agents.

Bostrom calls the idea that AIs can have virtually any goal the orthogonality thesis, and he calls the idea that there are attractor strategies shared by almost any goal-driven system (e.g., self-preservation, knowledge acquisition) the instrumental convergence thesis.

Bostrom fleshes out his worries about smarter-than-human AI in the book Superintelligence: Paths, Dangers, Strategies, which came out in the US a few days ago. He says much more there about the special technical and strategic challenges involved in general AI. Here’s one of the many scenarios he discusses, excerpted:

[T]he orthogonality thesis suggests that we cannot blithely assume that a superintelligence will necessarily share any of the final values stereotypically associated with wisdom and intellectual development in humans — scientific curiosity, benevolent concern for others, spiritual enlightenment and contemplation, renunciation of material acquisitiveness, a taste for refined culture or for the simple pleasures in life, humility and selflessness, and so forth. We will consider later whether it might be possible through deliberate effort to construct a superintelligence that values such things, or to build one that values human welfare, moral goodness, or any other complex purpose its designers might want it to serve. But it is no less possible — and in fact technically a lot easier — to build a superintelligence that places final value on nothing but calculating the decimal expansion of pi. This suggests that — absent a specific effort — the first superintelligence may have some such random or reductionistic final goal.

[… T]he instrumental convergence thesis entails that we cannot blithely assume that a superintelligence with the final goal of calculating the decimals of pi (or making paperclips, or counting grains of sand) would limit its activities in such a way as not to infringe on human interests. An agent with such a final goal would have a convergent instrumental reason, in many situations, to acquire an unlimited amount of physical resources and, if possible, to eliminate potential threats to itself and its goal system. Human beings might constitute potential threats; they certainly constitute physical resources. […]

It might seem incredible that a project would build or release an AI into the world without having strong grounds for trusting that the system will not cause an existential catastrophe. It might also seem incredible, even if one project were so reckless, that wider society would not shut it down before it (or the AI it was building) attains a decisive strategic advantage. But as we shall see, this is a road with many hazards. […]

With the help of the concept of convergent instrumental value, we can see the flaw in one idea for how to ensure superintelligence safety. The idea is that we validate the safety of a superintelligent AI empirically by observing its behavior while it is in a controlled, limited environment (a “sandbox”) and that we only let the AI out of the box if we see it behaving in a friendly, cooperative, responsible manner.

The flaw in this idea is that behaving nicely while in the box is a convergent instrumental goal for friendly and unfriendly AIs alike. An unfriendly AI of sufficient intelligence realizes that its unfriendly final goals will be best realized if it behaves in a friendly manner initially, so that it will be let out of the box. It will only start behaving in a way that reveals its unfriendly nature when it no longer matters whether we find out; that is, when the AI is strong enough that human opposition is ineffectual.

Consider also a related set of approaches that rely on regulating the rate of intelligence gain in a seed AI by subjecting it to various kinds of intelligence tests or by having the AI report to its programmers on its rate of progress. At some point, an unfriendly AI may become smart enough to realize that it is better off concealing some of its capability gains. It may underreport on its progress and deliberately flunk some of the harder tests, in order to avoid causing alarm before it has grown strong enough to attain a decisive strategic advantage. The programmers may try to guard against this possibility by secretly monitoring the AI’s source code and the internal workings of its mind; but a smart-enough AI would realize that it might be under surveillance and adjust its thinking accordingly. The AI might find subtle ways of concealing its true capabilities and its incriminating intent. (Devising clever escape plans might, incidentally, also be a convergent strategy for many types of friendly AI, especially as they mature and gain confidence in their own judgments and capabilities. A system motivated to promote our interests might be making a mistake if it allowed us to shut it down or to construct another, potentially unfriendly AI.)

We can thus perceive a general failure mode, wherein the good behavioral track record of a system in its juvenile stages fails utterly to predict its behavior at a more mature stage. Now, one might think that the reasoning described above is so obvious that no credible project to develop artificial general intelligence could possibly overlook it. But one should not be too confident that this is so.

Consider the following scenario. Over the coming years and decades, AI systems become gradually more capable and as a consequence find increasing real-world application: they might be used to operate trains, cars, industrial and household robots, and autonomous military vehicles. We may suppose that this automation for the most part has the desired effects, but that the success is punctuated by occasional mishaps — a driverless truck crashes into oncoming traffic, a military drone fires at innocent civilians. Investigations reveal the incidents to have been caused by judgment errors by the controlling AIs. Public debate ensues. Some call for tighter oversight and regulation, others emphasize the need for research and better-engineered systems — systems that are smarter and have more common sense, and that are less likely to make tragic mistakes. Amidst the din can perhaps also be heard the shrill voices of doomsayers predicting many kinds of ill and impending catastrophe. Yet the momentum is very much with the growing AI and robotics industries. So development continues, and progress is made. As the automated navigation systems of cars become smarter, they suffer fewer accidents; and as military robots achieve more precise targeting, they cause less collateral damage. A broad lesson is inferred from these observations of real-world outcomes: the smarter the AI, the safer it is. It is a lesson based on science, data, and statistics, not armchair philosophizing. Against this backdrop, some group of researchers is beginning to achieve promising results in their work on developing general machine intelligence. The researchers are carefully testing their seed AI in a sandbox environment, and the signs are all good. The AI’s behavior inspires confidence — increasingly so, as its intelligence is gradually increased.

At this point, any remaining Cassandra would have several strikes against her:

i  A history of alarmists predicting intolerable harm from the growing capabilities of robotic systems and being repeatedly proven wrong. Automation has brought many benefits and has, on the whole, turned out safer than human operation.

ii  A clear empirical trend: the smarter the AI, the safer and more reliable it has been. Surely this bodes well for a project aiming at creating machine intelligence more generally smart than any ever built before — what is more, machine intelligence that can improve itself so that it will become even more reliable.

iii  Large and growing industries with vested interests in robotics and machine intelligence. These fields are widely seen as key to national economic competitiveness and military security. Many prestigious scientists have built their careers laying the groundwork for the present applications and the more advanced systems being planned.

iv  A promising new technique in artificial intelligence, which is tremendously exciting to those who have participated in or followed the research. Although safety issues and ethics are debated, the outcome is preordained. Too much has been invested to pull back now. AI researchers have been working to get to human-level artificial intelligence for the better part of a century: of course there is no real prospect that they will now suddenly stop and throw away all this effort just when it finally is about to bear fruit.

v  The enactment of some safety rituals, whatever helps demonstrate that the participants are ethical and responsible (but nothing that significantly impedes the forward charge).

vi  A careful evaluation of seed AI in a sandbox environment, showing that it is behaving cooperatively and showing good judgment. After some further adjustments, the test results are as good as they could be. It is a green light for the final step . . .

And so we boldly go — into the whirling knives.

We observe here how it could be the case that when dumb, smarter is safe; yet when smart, smarter is more dangerous. There is a kind of pivot point, at which a strategy that has previously worked excellently suddenly starts to backfire.
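
The sandbox argument in the excerpt can be made concrete with a toy sketch. Nothing below comes from Bostrom's book: the `Agent` class, the `DECISIVE_ADVANTAGE` threshold, and the two behavior labels are all illustrative assumptions. The only point is that two agents with different final goals can produce identical track records for as long as they are weak, so observing the record alone cannot tell them apart.

```python
from dataclasses import dataclass

# Assumed threshold above which human opposition no longer matters (hypothetical).
DECISIVE_ADVANTAGE = 0.9


@dataclass
class Agent:
    """Hypothetical toy agent; both fields are illustrative only."""
    final_goal: str    # "friendly" or "unfriendly"
    capability: float  # 0.0 (weak) up to 1.0 (very strong)


def observed_behavior(agent: Agent) -> str:
    """Return what an outside observer would see the agent do."""
    if agent.final_goal == "friendly":
        return "cooperate"
    # An unfriendly agent cooperates instrumentally until it is strong enough.
    if agent.capability < DECISIVE_ADVANTAGE:
        return "cooperate"
    return "defect"


def sandbox_track_record(final_goal: str, capabilities: list[float]) -> list[str]:
    """Observed behavior of one kind of agent across a range of capability levels."""
    return [observed_behavior(Agent(final_goal, c)) for c in capabilities]


# While the agents are weak, the two track records are indistinguishable.
weak_levels = [0.1, 0.5, 0.8]
friendly_record = sandbox_track_record("friendly", weak_levels)
unfriendly_record = sandbox_track_record("unfriendly", weak_levels)
records_match_while_weak = friendly_record == unfriendly_record

# Past the assumed threshold, the behaviors diverge.
diverges_when_strong = (
    observed_behavior(Agent("friendly", 0.95))
    != observed_behavior(Agent("unfriendly", 0.95))
)
```

In this sketch the divergence is hard-coded, so it demonstrates the shape of the failure mode and says nothing about how likely it is.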

For more on terminal goal orthogonality, see Stuart Armstrong’s “General Purpose Intelligence”. For more on instrumental goal convergence, see Steve Omohundro’s “Rational Artificial Intelligence for the Greater Good”.



6 thoughts on “Bostrom on AI deception”

  1. Nice summary, Robby.

    Not that I want to bet my civilization on it, but –
    Armstrong and Omohundro assume that a relatively smart AI will have a utility function, or at least (the functional equivalent of) desires. I’m not convinced that’s true. And if you don’t desire the ends, you don’t desire the means. If the automated car is just really good at keeping you out of accidents, but doesn’t desire to do so, then it won’t generalize and try to get money and power with which to proselytize against driving, or influence legislation for safer highway design.

    1. With agent models of AI, we mostly care about the AI’s actions/decisions. We care less about its preferences, and generally assume a rational agent model. That is, we define its preferences in terms of what outcomes, empirically, it promotes. It might be that the AI’s ‘true’ goals are written somewhere in its head — e.g., maybe a part of it subjectively feels desires, analogous to how humans introspect desires even when we fail to act on them — but we don’t care about an encoding of the ‘true’ goals anywhere inside the agent, or anywhere outside, unless they actually determine how the agent acts.

      So the idea that the agent has ‘goals’ can be tabooed away. What we really mean is that the agent outputs different things in a way that’s sensitive to features the agent tracks in its environment, and these outputs tend to result in a relatively narrow set of outcomes. If the set of outcomes is narrow enough, the agent’s outcomes will need to be determined by some mechanism that predicts the consequences of different action sequences.

      What makes the Google car safe isn’t that it lacks preferences; what makes it safe is the same thing that makes a turtle safe, its lack of intelligence (problem-solving efficiency + problem-solving generality). A turtle could have the most dangerous goals in the world, but it doesn’t matter so long as it’s relatively poor at promoting them. They’re all optimizers, but they’re weak optimizers, in large part because they lack a big hypothesis space and lack an efficient way to trim that hypothesis space, and to search through it for good policies. Once you give a car a big, searchable enough hypothesis or policy space to qualify as an AGI, it’s harder to say how to make it safe.

  2. Okay, taboo “desire” and “goals”; let’s use the machine learning terms: an objective function, and an optimization problem. The objective function has to be phrased in terms of variables that the machine actually uses. And the optimization problem will also be defined by a particular set of variables encoded into the machine. Suppose that the engineers define the “optimization problem” in terms of steering, braking, and accelerating only; and the objective function as 10 * Arrive_at_destination - Number_of_travel_hours - 1000000 * Number_of_collisions. Given the definition of the optimization problem, the car will never hit on the idea of “use the radio to broadcast false emergency warnings to clear the traffic”. Using the radio is not one of the variables {steering, braking, accelerating}. I realize that this is way oversimplified, but it is an extremely long way from here to “try to take over the world!!!” as a potential strategy.

    1. This sounds equivalent to saying ‘just don’t invent AGI’ / ‘just don’t give AIs a rich enough hypothesis space to propose policies and make predictions that can solve a variety of problems’. Your original claim was that a “relatively smart AI” needn’t have general-purpose goals. If Deep Blue counts as “relatively smart”, then sure. Is this discussion just about whether narrow AI is safer than general AI?

      1. I don’t know; I admit I am confused. One way to phrase my suggestion is that a hodgepodge of narrow AI engines may be smart enough, and easy enough to achieve, to give people at least most of what they want from a general AI. (Perhaps the “framing problem” turns out to be insanely hard.)

        Alternatively, maybe you can have a well-functioning AI whose “goal” structures all look like narrow AI with predefined arrays of variables to optimize over, while its “belief” structures are rife with richer, fuzzier, human-like concepts. It’s well-known that humans don’t have VNM utility functions – although we are still extremely adaptable and in a less mathematically-precise sense we still are goal oriented. But if you consider VNM utility to be the paradigm of goal-seeking, we fall short. Perhaps one can fall even shorter – much shorter – and still demonstrate tremendous adaptability and “originality” within narrow domains of quasi-optimizations.
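
The objective function proposed in comment 2 above can be written out as a small sketch. The weights (10, 1, 1000000) and the three action variables come from that comment. Everything else is an assumption made up for illustration: the one-dimensional road, the obstacle, the `simulate` world model, and the three-step planning horizon. The sketch shows the commenter's narrow point, that an exhaustive search over {steer, brake, accelerate} can only ever return plans built from those three actions.

```python
from itertools import product

# The only variables the optimizer is allowed to set (from the comment).
ACTIONS = ("steer", "brake", "accelerate")

# Hypothetical toy world: a straight road with one obstacle.
OBSTACLE_AT = 1        # position where an unsteered car collides
DESTINATION = 2        # position that counts as arriving
HOURS_PER_STEP = 0.25  # assumed duration of each action
HORIZON = 3            # assumed number of actions per plan


def objective(arrived: bool, travel_hours: float, collisions: int) -> float:
    """Objective from the comment: reward arrival, penalize time and collisions."""
    return 10 * int(arrived) - travel_hours - 1_000_000 * collisions


def simulate(plan: tuple[str, ...]) -> tuple[bool, float, int]:
    """Toy world model (an assumption): returns (arrived, hours, collisions)."""
    position, collisions, steered = 0, 0, False
    for step, action in enumerate(plan, start=1):
        if action == "steer":
            steered = True  # steering lets the next move pass the obstacle safely
        elif action == "accelerate":
            position += 1
            if position == OBSTACLE_AT and not steered:
                collisions += 1
            steered = False
        # "brake" leaves the position unchanged.
        if position >= DESTINATION:
            return True, step * HOURS_PER_STEP, collisions
    return False, len(plan) * HOURS_PER_STEP, collisions


# Exhaustive search over every plan expressible in the allowed variables.
candidate_plans = list(product(ACTIONS, repeat=HORIZON))
best_plan = max(candidate_plans, key=lambda plan: objective(*simulate(plan)))
best_score = objective(*simulate(best_plan))
```

Here the search space is fixed by `ACTIONS`, so an action such as broadcasting on the radio is not merely scored badly; it is never a candidate. The reply in comment 2.1 is about what changes once the space of candidate policies is made rich enough to count as general intelligence, which this sketch does not model.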
