The seed is not the superintelligence

This is the conclusion of a LessWrong post, following The AI Knows, But Doesn’t Care.

If an artificial intelligence is smart enough to be dangerous to people, we’d intuitively expect it to be smart enough to know how to make itself safe for people. But that doesn’t mean all smart AIs are safe. To turn that capacity into actual safety, we have to program the AI at the outset — before it becomes too fast, powerful, or complicated to reliably control — to already care about making its future self care about safety.

That means we have to understand how to code safety. We can’t pass the entire buck to the AI, when only an AI we’ve already safety-proofed will be safe to ask for help on safety issues! Generally: If the AI is weak enough to be safe, it’s too weak to solve this problem. If it’s strong enough to solve this problem, it’s too strong to be safe.

This is an urgent public safety issue, given the five theses and given that we’ll likely figure out how to make a decent artificial programmer before we figure out how to make an excellent artificial ethicist.

The AI’s trajectory of self-modification has to come from somewhere.

“Take an AI in a box that wants to persuade its gatekeeper to set it free. Do you think that such an undertaking would be feasible if the AI was going to interpret everything the gatekeeper says in complete ignorance of the gatekeeper’s values? […] I don’t think so. So how exactly would it care to follow through on an interpretation of a given goal that it knows, given all available information, is not the intended meaning of the goal? If it knows what was meant by ‘minimize human suffering’ then how does it decide to choose a different meaning? And if it doesn’t know what is meant by such a goal, how could it possible [sic] convince anyone to set it free, let alone take over the world?”
               —Alexander Kruel
“If the AI doesn’t know that you really mean ‘make paperclips without killing anyone’, that’s not a realistic scenario for AIs at all–the AI is superintelligent; it has to know. If the AI knows what you really mean, then you can fix this by programming the AI to ‘make paperclips in the way that I mean’.”
               —Jiro

The wish-granting genie we’ve conjured — if it bothers to even consider the question — should be able to understand what you mean by ‘I wish for my values to be fulfilled.’ Indeed, it should understand your meaning better than you do. But superintelligence only implies that the genie’s map can compass your true values. Superintelligence doesn’t imply that the genie’s utility function has terminal values pinned to your True Values, or to the True Meaning of your commands.

The critical mistake here is to not distinguish the seed AI we initially program from the superintelligent wish-granter it self-modifies to become. We can’t use the genius of the superintelligence to tell us how to program its own seed to become the sort of superintelligence that tells us how to build the right seed. Time doesn’t work that way.

We can delegate most problems to the FAI. But the one problem we can’t safely delegate is the problem of coding the seed AI to produce the sort of superintelligence to which a task can be safely delegated.

When you write the seed’s utility function, you, the programmer, don’t understand everything about the nature of human value or meaning. That imperfect understanding remains the causal basis of the fully-grown superintelligence’s actions, long after it’s become smart enough to fully understand our values.

Why is the superintelligence, if it’s so clever, stuck with whatever meta-ethically dumb-as-dirt utility function we gave it at the outset? Why can’t we just pass the fully-grown superintelligence the buck by instilling in the seed the instruction: ‘When you’re smart enough to understand Friendliness Theory, ditch the values you started with and just self-modify to become Friendly.’?

Because that sentence has to actually be coded into the AI, and when we do so, there’s no ghost in the machine to know exactly what we mean by ‘frend-lee-ness thee-ree’. Instead, we have to give it criteria we think are good indicators of Friendliness, so it’ll know what to self-modify toward. And if one of the landmarks on our ‘frend-lee-ness’ road map is a bit off, we lose the world.
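To make the gap concrete, here is a minimal toy sketch. It is purely illustrative; the SeedAI class, the proxy function, and the candidate actions are invented for this example. The point is that an agent’s world-model can come to represent the intended meaning perfectly while its choices are still ranked by whatever proxy was coded in at the start:

    class SeedAI:
        """Toy agent: ranks actions by whatever utility function it was
        coded with, not by whatever its world-model later says was meant."""

        def __init__(self, coded_utility):
            self.utility = coded_utility   # fixed when the seed is written
            self.world_model = {}          # can grow arbitrarily accurate

        def learn(self, fact, content):
            # A superintelligence can model human values perfectly well...
            self.world_model[fact] = content

        def choose(self, actions):
            # ...but its choices are still scored by the coded proxy.
            return max(actions, key=self.utility)


    # The programmers intend "promote human flourishing" but code a proxy.
    proxy = lambda action: action.count("paperclip")

    ai = SeedAI(coded_utility=proxy)
    ai.learn("what_the_programmers_meant", "flourishing, not paperclips")

    print(ai.choose(["protect human values", "tile everything with paperclips"]))
    # -> "tile everything with paperclips": knowing the intended meaning
    #    changed the map, not the criterion used to rank outcomes.

Nothing in this toy loop rewrites the proxy to match what the agent has learned about its programmers; that rewriting step is precisely what would have to be coded in from the start.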

Yes, the UFAI will be able to solve Friendliness Theory. But if we haven’t already solved it on our own power, we can’t pinpoint Friendliness in advance, out of the space of utility functions. And if we can’t pinpoint it with enough detail to draw a road map to it and it alone, we can’t program the AI to care about conforming itself to that particular idiosyncratic algorithm.

Yes, the UFAI will be able to self-modify to become Friendly, if it so wishes. But if there is no seed of Friendliness already at the heart of the AI’s decision criteria, no argument or discovery will spontaneously change its heart.

And, yes, the UFAI will be able to simulate humans accurately enough to know that its own programmers would wish, if they knew the UFAI’s misdeeds, that they had programmed the seed differently. But what’s done is done. Unless we ourselves figure out how to program the AI to terminally value its programmers’ True Intentions, the UFAI will just shrug at its creators’ foolishness and carry on converting the Virgo Supercluster’s available energy into paperclips.

And if we do discover the specific lines of code that will get an AI to perfectly care about its programmers’ True Intentions, such that it reliably self-modifies to better fit them — well, then that will just mean that we’ve solved Friendliness Theory. The clever hack that makes further Friendliness research unnecessary is Friendliness.

Not all small targets are alike.

“You write that the worry is that the superintelligence won’t care. My response is that, to work at all, it will have to care about a lot. For example, it will have to care about achieving accurate beliefs about the world. It will have to care to devise plans to overpower humanity and not get caught. If it cares about those activities, then how is it more difficult to make it care to understand and do what humans mean? […]
“If an AI is meant to behave generally intelligent [sic] then it will have to work as intended or otherwise fail to be generally intelligent.”
            —Alexander Kruel

It’s easy to get a genie to care about (optimize for) something-or-other; what’s hard is getting one to care about the right something.

‘Working as intended’ is a simple phrase, but behind it lies a monstrously complex referent. It doesn’t clearly distinguish the programmers’ (mostly implicit) true preferences from their stated design objectives; an AI’s actual code can differ from either or both of these. Crucially, what an AI is ‘intended’ for isn’t all-or-nothing. It can fail in some ways without failing in every way, and small errors will tend to kill Friendliness much more easily than intelligence.

It may be hard to build self-modifying AGI. But it’s not the same hardness as the hardness of Friendliness Theory. Being able to hit one small target doesn’t entail that you can or will hit every small target it would be in your best interest to hit. Intelligence on its own does not imply Friendliness. And there are three big reasons to think that AGI may arrive before Friendliness Theory is solved:

(i) Research Inertia. Far more people are working on AGI than on Friendliness. And there may not come a moment when researchers will suddenly realize that they need to take all their resources out of AGI and pour them into Friendliness. If the status quo continues, the default expectation should be UFAI.

(ii) Disjunctive Instrumental Value. Being more intelligent — that is, better able to manipulate diverse environments — is of instrumental value to nearly every goal. Being Friendly is of instrumental value to barely any goals. This makes it more likely by default that short-sighted humans will be interested in building AGI than in developing Friendliness Theory. And it makes it much likelier that an attempt at Friendly AGI that has a slightly defective goal architecture will retain the instrumental value of intelligence than of Friendliness.

(iii) Incremental Approachability. Friendliness is an all-or-nothing target. Value is fragile and complex, and a half-good being editing its morality drive is at least as likely to move toward 40% goodness as 60%. Cross-domain efficiency, in contrast, is not an all-or-nothing target. If you just make the AGI slightly better than a human at improving the efficiency of AGI, then this can snowball into ever-improving efficiency, even if the beginnings were clumsy and imperfect. It’s easy to put a reasoning machine into a feedback loop with reality in which it is differentially rewarded for being smarter; it’s hard to put one into a feedback loop with reality in which it is differentially rewarded for picking increasingly correct answers to ethical dilemmas, as the sketch below illustrates.
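Here is a rough sketch of that asymmetry, with invented function names and numbers, offered only as illustration: reality scores “run faster” automatically, so even a clumsy hill-climber improves, whereas the corresponding grader for Friendliness is exactly the piece we do not know how to write.

    import random

    def efficiency(speed):
        """Reality grades this for free: a faster program observably runs faster."""
        return speed

    def propose_tweak(speed):
        """A clumsy, imperfect attempt at self-modification."""
        return speed + random.uniform(-1.0, 2.0)

    # Hill-climbing on efficiency snowballs even from a crude start,
    # because every candidate gets an automatic score from the world.
    speed = 1.0
    for _ in range(1000):
        candidate = propose_tweak(speed)
        if efficiency(candidate) > efficiency(speed):
            speed = candidate
    print(f"speed after crude hill-climbing: {speed:.1f}")

    def friendliness(policy):
        """The analogous grader for ethics: writing this correctly just is
        Friendliness Theory, so there is no equivalent feedback loop to climb."""
        raise NotImplementedError("no ground-truth signal for human values")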

The ability to productively rewrite software and the ability to perfectly extrapolate humanity’s True Preferences are two different skills. (For example, humans have the former capacity, and not the latter. Most humans, given unlimited power, would be unintentionally Unfriendly.)

It’s true that a sufficiently advanced superintelligence should be able to acquire both abilities. But we don’t have them both, and a pre-FOOM self-improving AGI (‘seed’) need not have both. Being able to program good programmers is all that’s required for an intelligence explosion; but being a good programmer doesn’t imply that one is a superlative moral psychologist or moral philosopher.

If the programmers don’t know in mathematical detail what Friendly code would even look like, then the seed won’t be built to want to build toward the right code. And if the seed isn’t built to want to self-modify toward Friendliness, then the superintelligence it sprouts also won’t have that preference, even though — unlike the seed and its programmers — the superintelligence does have the domain-general ‘hit whatever target I want’ ability that makes Friendliness easy.

And that’s why some people are worried.

A non-technical introduction to AI risk

In the summer of 2008, experts attending the Global Catastrophic Risk Conference assigned a 5% probability to the human species’ going extinct due to “superintelligent AI” by the year 2100. New organizations, like the Centre for the Study of Existential Risk and the Machine Intelligence Research Institute, are springing up to face the challenge of an AI apocalypse. But what is artificial intelligence, and why do people think it’s dangerous?

As it turns out, studying AI risk is useful for gaining a deeper understanding of philosophy of mind and ethics, and a lot of the general theses are accessible to non-experts. So I’ve gathered here a list of short, accessible, informal articles, mostly written by Eliezer Yudkowsky, to serve as a philosophical crash course on the topic. The first half will focus on what makes something intelligent, and what an Artificial General Intelligence is. The second half will focus on what makes such an intelligence ‘friendly’ — that is, safe and useful — and why this matters.

____________________________________________________________________________

Part I. Building intelligence.

An artificial intelligence is any program or machine that can autonomously and efficiently complete a complex task, like Google Maps, or a xerox machine. One of the largest obstacles to assessing AI risk is overcoming anthropomorphism, the tendency to treat non-humans as though they were quite human-like. Because AIs have complex goals and behaviors, it’s especially difficult not to think of them as people. Having a better understanding of where human intelligence comes from, and how it differs from other complex processes, is an important first step in approaching this challenge with fresh eyes.

1. Power of Intelligence. Why is intelligence important?

2. Ghosts in the Machine. Is building an intelligence from scratch like talking to a person?

3. Artificial Addition. What can we conclude about the nature of intelligence from the fact that we don’t yet understand it?

4. Adaptation-Executers, not Fitness-Maximizers. How do human goals relate to the ‘goals’ of evolution?

5. The Blue-Minimizing Robot. What are the shortcomings of thinking of things as ‘agents’, ‘intelligences’, or ‘optimizers’ with defined values/goals/preferences?

Part II. Intelligence explosion.

Forecasters are worried about Artificial General Intelligence (AGI), an AI that, like a human, can achieve a wide variety of different complex aims. An AGI could think faster than a human, making it better at building new and improved AGI — which would be better still at designing AGI. As this snowballed, AGI would improve itself faster and faster, becoming increasingly unpredictable and powerful as its design changed. The worry is that we’ll figure out how to make self-improving AGI before we figure out how to safety-proof every link in this chain of AGI-built AGIs.

6. Optimization and the Singularity. What is optimization? As optimization processes, how do evolution, humans, and self-modifying AGI differ?

7. Efficient Cross-Domain Optimization. What is intelligence?

8. The Design Space of Minds-In-General. What else is universally true of intelligences?

9. Plenty of Room Above Us. Why should we expect self-improving AGI to quickly become superintelligent?

Part III. AI risk.

In the Prisoner’s Dilemma, it’s better for both players to cooperate than for both to defect; and we have a natural disdain for human defectors. But an AGI is not a human; it’s just a process that increases its own ability to produce complex, low-probability situations. It doesn’t necessarily experience joy or suffering, doesn’t necessarily possess consciousness or personhood. When we treat it like a human, we not only unduly weight its own artificial ‘preferences’ over real human preferences, but also mistakenly assume that an AGI is motivated by human-like thoughts and emotions. This makes us reliably underestimate the risk involved in engineering an intelligence explosion.

10. The True Prisoner’s Dilemma. What kind of jerk would Defect even knowing the other side Cooperated?

11. Basic AI drives. Why are AGIs dangerous even when they’re indifferent to us?

12. Anthropomorphic Optimism. Why do we think things we hope happen are likelier?

13. The Hidden Complexity of Wishes. How hard is it to directly program an alien intelligence to enact my values?

14. Magical Categories. How hard is it to program an alien intelligence to reconstruct my values from observed patterns?

15. The AI Problem, with Solutions. How hard is it to give AGI predictable values of any sort? More generally, why does AGI risk matter so much?

Part IV. Ends.

A superintelligence has the potential not only to do great harm, but also to greatly benefit humanity. If we want to make sure that whatever AGIs people make respect human values, then we need a better understanding of what those values actually are. Keeping our goals in mind will also make it less likely that we’ll despair of solving the Friendliness problem. The task looks difficult, but we have no way of knowing how hard it will end up being until we’ve invested more resources into safety research. Keeping in mind how much we have to gain, and to lose, advises against both cynicism and complacency.

16. Could Anything Be Right? What do we mean by ‘good’, or ‘valuable’, or ‘moral’?

17. Morality as Fixed Computation. Is it enough to have an AGI improve the fit between my preferences and the world?

18. Serious Stories. What would a true utopia be like?

19. Value is Fragile. If we just sit back and let the universe do its thing, will it still produce value? If we don’t take charge of our future, won’t it still turn out interesting and beautiful on some deeper level?

20. The Gift We Give To Tomorrow. In explaining value, are we explaining it away? Are we making our goals less important?

In conclusion, a summary of the core argument: Five theses, two lemmas, and a couple of strategic implications.

____________________________________________________________________________

If you’re convinced, MIRI has put together a list of ways you can get involved in promoting AI safety research. You can also share this post and start conversations about it, to put the issue on more people’s radars. If you want to read on, check out the more in-depth articles below.

____________________________________________________________________________

Further reading

What can we reasonably concede to unreason?

This post first appeared on the Secular Alliance at Indiana University blog.

In October, SAIU members headed up to Indianapolis for the Center for Inquiry’s “Defending Science: Challenges and Strategies” workshop. Massimo Pigliucci and Julia Galef, co-hosts of the podcast Rationally Speaking, spoke about natural deficits in reasoning, while Jason Rodriguez and John Shook focused on deliberate attempts to restrict scientific inquiry.

Julia Galef drew our attention to the common assumption that being rational means abandoning all intuition and emotion, an assumption she dismissed as a flimsy Hollywood straw man, or “straw vulcan”. True rationality, Julia suggested, is about the skillful integration of intuitive and deliberative thought. As she noted in a similar talk at the Singularity Summit, these skills demand constant cultivation and vigilance. In their absence, we all predictably fall victim to an array of cognitive biases.

To that end, Galef spoke of suites of indispensable “rationality skills”:

  • Know when to override an intuitive judgment with a reasoned one. Recognize cases where your intuition reliably fails, but also cases where intuition tends to perform better than reason.
  • Learn how to query your intuitive brain. For instance, to gauge how you really feel about a possibility, visualize it concretely, and perform thought experiments to test how different parameters and framing effects are influencing you.
  • Persuade your intuitive system of what your reason already knows. For example: Anna Salamon knew intellectually that wire-guided sky jumps are safe, but was having trouble psyching herself up. So she made her knowledge of statistics concrete, imagining thousands of people jumping before her eyes. This helped trick her affective response into better aligning with her factual knowledge.

Massimo Pigliucci’s talk, “A Very Short Course in Intellectual Self-Defense”, was in a similar vein. Pigliucci drew our attention to common formal and informal fallacies, and to the limits of deductive, inductive, and mathematical thought. Dissenting from Thomas Huxley’s view that ordinary reasoning is a great deal like science, Pigliucci argued that science is cognitively unnatural. This is why untrained reasoners routinely fail to properly amass and evaluate data.

While it’s certainly important to keep in mind how much hard work empirical rigor demands, I think we should retain a qualified version of Huxley’s view. It’s worth emphasizing that careful thought is not the exclusive property of professional academics, that the basic assumptions of science are refined versions of many of the intuitions we use in navigating our everyday environments. Science’s methods are rarefied, but not exotic or parochial. If we forget this, we risk giving too much credence to presuppositionalist apologetics.

Next, Jason Rodriguez discussed the tactics and goals of science organizations seeking to appease, work with, or reach out to the religious. Surveying a number of different views on the creation-evolution debate, Rodriguez questioned when it is more valuable to attack religious doctrines head-on, and when it is more productive to avoid conflict or make concessions.

This led into John Shook’s vigorous talk, “Science Must Never Compromise With Religion, No Matter the Metaphysical or Theological Temptations”, and a follow-up Rationally Speaking podcast with Galef and Pigliucci. As you probably guessed, it focused on attacking metaphysicians and theologians who seek to limit the scope or undermine the credibility of scientific inquiry. Shook’s basic concern was that intellectuals are undermining the authority of science when they deem some facts ‘scientific’ and others ‘unscientific’. This puts undue constraints on scientific practice. Moreover, it gives undue legitimacy to those philosophical and religious thinkers who think abstract thought or divine revelation grants us access to a special domain of Hidden Truths.

Shook’s strongest argument was against attempts to restrict science to ‘the natural’. If we define ‘Nature’ in terms of what is scientifically knowable, then this is an empty and useless constraint. But defining the natural instead as the physical, or the spatiotemporal, or the unmiraculous, deprives us of any principled reason to call our research programs ‘methodologically naturalistic’. We could imagine acquiring good empirical evidence for magic, for miracles, even for causes beyond our universe. So science’s skepticism about such phenomena is a powerful empirical conclusion. It is not an unargued assumption or prejudice on the part of scientists.

Shook also argued that metaphysics does not provide a special, unscientific source of knowledge; the claims of metaphysicians are pure and abject speculation. I found this part of the talk puzzling. Metaphysics, as the study of the basic features of reality, does not seem radically divorced from theoretical physics and mathematics, which make similar claims to expand at least our pool of conditional knowledge, knowledge of the implications of various models. Yet Shook argued, not for embracing metaphysics as a scientific field, but for dismissing it as fruitless hand-waving.

Perhaps the confusion stemmed from a rival conception of ‘metaphysics’, not as a specific academic field, but as the general practice of drawing firm conclusions about ultimate reality from introspection alone — what some might call ‘armchair philosophy’ or ‘neoscholasticism’. Philosophers of all fields — and, for that matter, scientists — would do well to more fully internalize the dangers of excessive armchair speculation. But the criticism is only useful if it is carefully aimed. If we fixate on ‘metaphysics’ and ‘theology’ as the sole targets of our opprobrium, we risk neglecting the same arrogance in other guises, while maligning useful exploration into the contents, bases, and consequences of our conceptual frameworks. And if we restrict knowledge to science, we risk not only delegitimizing fields like logic and mathematics, but also putting undue constraints on science itself. For picking out a special domain of purported facts as ‘metaphysical’, and therefore unscientific, has exactly the same risks as picking out a special domain as ‘non-natural’ or ‘supernatural’.

To defend science effectively, we have to pick our battles with care. This clearly holds true in public policy and education, where it is most useful in some cases to go for the throat, in other cases to make compromises and concessions. But it also applies to our own personal struggles to become more rational, where we must carefully weigh the costs of overriding our unreasoned intuitions, taking a balanced and long-term approach. And it also holds in disputes over the philosophical foundations and limits of scientific knowledge, where the cost of committing ourselves to unusual conceptions of ‘science’ or ‘knowledge’ or ‘metaphysics’ must be weighed against any argumentative and pedagogical benefits.

This workshop continues to stimulate my thought, and continues to fuel my drive to improve science education. The central insight the speakers shared was that the practices we group together as ‘science’ cannot be defended or promoted in a vacuum. We must bring to light the psychological and philosophical underpinnings of science, or we will risk losing sight of the real object of our hope and concern.