A non-technical introduction to AI risk

In the summer of 2008, experts attending the Global Catastrophic Risk Conference assigned a 5% probability to the human species’ going extinct due to “superintelligent AI” by the year 2100. New organizations, like the Centre for the Study of Existential Risk and the Machine Intelligence Research Institute, are springing up to face the challenge of an AI apocalypse. But what is artificial intelligence, and why do people think it’s dangerous?

As it turns out, studying AI risk is useful for gaining a deeper understanding of philosophy of mind and ethics, and a lot of the general theses are accessible to non-experts. So I’ve gathered here a list of short, accessible, informal articles, mostly written by Eliezer Yudkowsky, to serve as a philosophical crash course on the topic. The first half will focus on what makes something intelligent, and what an Artificial General Intelligence is. The second half will focus on what makes such an intelligence ‘friendly’ (that is, safe and useful) and why this matters.


Part I. Building intelligence.

An artificial intelligence is any program or machine that can autonomously and efficiently complete a complex task, like Google Maps, or a xerox machine. One of the largest obstacles to assessing AI risk is overcoming anthropomorphism, the tendency to treat non-humans as though they were quite human-like. Because AIs have complex goals and behaviors, it’s especially difficult not to think of them as people. Having a better understanding of where human intelligence comes from, and how it differs from other complex processes, is an important first step in approaching this challenge with fresh eyes.

1. Power of Intelligence. Why is intelligence important?

2. Ghosts in the Machine. Is building an intelligence from scratch like talking to a person?

3. Artificial Addition. What can we conclude about the nature of intelligence from the fact that we don’t yet understand it?

4. Adaptation-Executers, not Fitness-Maximizers. How do human goals relate to the ‘goals’ of evolution?

5. The Blue-Minimizing Robot. What are the shortcomings of thinking of things as ‘agents’, ‘intelligences’, or ‘optimizers’ with defined values/goals/preferences?

Part II. Intelligence explosion.

Forecasters are worried about Artificial General Intelligence (AGI), an AI that, like a human, can achieve a wide variety of different complex aims. An AGI could think faster than a human, making it better at building new and improved AGI, which would be better still at designing AGI. As this snowballed, AGI would improve itself faster and faster, becoming increasingly unpredictable and powerful as its design changed. The worry is that we’ll figure out how to make self-improving AGI before we figure out how to safety-proof every link in this chain of AGI-built AGIs.

6. Optimization and the Singularity. What is optimization? As optimization processes, how do evolution, humans, and self-modifying AGI differ?

7. Efficient Cross-Domain Optimization. What is intelligence?

8. The Design Space of Minds-In-General. What else is universally true of intelligences?

9. Plenty of Room Above Us. Why should we expect self-improving AGI to quickly become superintelligent?

Part III. AI risk.

In the Prisoner’s Dilemma, it’s better for both players to cooperate than for both to defect; and we have a natural disdain for human defectors. But an AGI is not a human; it’s just a process that increases its own ability to produce complex, low-probability situations. It doesn’t necessarily experience joy or suffering, doesn’t necessarily possess consciousness or personhood. When we treat it like a human, we not only unduly weight its own artificial ‘preferences’ over real human preferences, but also mistakenly assume that an AGI is motivated by human-like thoughts and emotions. This makes us reliably underestimate the risk involved in engineering an intelligence explosion.

10. The True Prisoner’s Dilemma. What kind of jerk would Defect even knowing the other side Cooperated?

11. Basic AI drives. Why are AGIs dangerous even when they’re indifferent to us?

12. Anthropomorphic Optimism. Why do we think things we hope happen are likelier?

13. The Hidden Complexity of Wishes. How hard is it to directly program an alien intelligence to enact my values?

14. Magical Categories. How hard is it to program an alien intelligence to reconstruct my values from observed patterns?

15. The AI Problem, with Solutions. How hard is it to give AGI predictable values of any sort? More generally, why does AGI risk matter so much?

Part IV. Ends.

A superintelligence has the potential not only to do great harm, but also to greatly benefit humanity. If we want to make sure that whatever AGIs people make respect human values, then we need a better understanding of what those values actually are. Keeping our goals in mind will also make it less likely that we’ll despair of solving the Friendliness problem. The task looks difficult, but we have no way of knowing how hard it will end up being until we’ve invested more resources into safety research. Keeping in mind how much we have to gain, and to lose, advises against both cynicism and complacency.

16. Could Anything Be Right? What do we mean by ‘good’, or ‘valuable’, or ‘moral’?

17. Morality as Fixed Computation. Is it enough to have an AGI improve the fit between my preferences and the world?

18. Serious Stories. What would a true utopia be like?

19. Value is Fragile. If we just sit back and let the universe do its thing, will it still produce value? If we don’t take charge of our future, won’t it still turn out interesting and beautiful on some deeper level?

20. The Gift We Give To Tomorrow. In explaining value, are we explaining it away? Are we making our goals less important?

In conclusion, a summary of the core argument: Five theses, two lemmas, and a couple of strategic implications.


If you’re convinced, MIRI has put together a list of ways you can get involved in promoting AI safety research. You can also share this post and start conversations about it, to put the issue on more people’s radars. If you want to read on, check out the more in-depth articles below.


Further reading


Moral theory is for moral practice

Sam Harris has argued that we should treat situations as morally desirable in proportion to their share of experiential well-being. In a debate, William Lane Craig objected:

On the next-to-last page of his book, Dr. Harris makes the telling admission that if people like rapists, liars, and thieves could be just as happy as good people, then his “moral landscape” would no longer be a moral landscape. Rather, it would just be a continuum of well-being whose peaks are occupied by good and bad people, or evil people, alike. […] The peaks of well-being could be occupied by evil people. But that entails that in the actual world, the continuum of well-being and the moral landscape are not identical either. For identity is a necessary relation.

I think the real problem here isn’t that it could be moral to make evil people happy. Harris and I gladly bite that bullet. The deeper worry is that, in a world teeming with pathological sadists, torturing a minority might well increase aggregate psychological welfare. Yet it would be absurd to conclude that torturing an innocent in such a world is moral.

This is a perfectly fair argument. But Harris simply responds, “Not a realistic concern.”

Why the lack of interest? Because, I think, any claim that the English-language word ‘good’ means ‘well-being’, picking it out across all possible worlds, is beside the point for Harris.

A world of sociopaths or sadists would be trapped in a valley of the moral landscape. Fixating on a few tiny hills at the bottom of that valley is missing the big picture, which is that the truly moral act would be to cure the world of its antisocial tendencies, not to indulge them. It’s sort of ‘moral’ for a doctor to spend most of her time making delicious pies for her rapidly deteriorating patients. I mean, baking for others is a good deed, right? But it’s immoral on a deeper level if it distracts the doctor from diagnosing or treating her patients. Craig’s example is alien enough to do some violence to an exact identification of ‘good’ with ‘well-being’, but it does nothing to undermine the enterprise of improving psychological welfare, because it misses the landscape for the hills in much the way the baker-doctor does.

So what is Harris’ goal in The Moral Landscape? He seems to want to establish four main theses:

1. Positive experience is what we value.
All the things we care about are instances of experiential well-being.

2. So we should value all positive experience.
Our strongest unreflective desires will be furthered if we come to value such experience in general, however and wherever it manifests. For this binds all of our values together, encouraging us to work together on satisfying them.

3. Morality is about satisfying that universal value.
Since this is the most inclusive normative project we could all legitimately collaborate on, and since it overlaps a great deal with our most rationally defensible moral intuitions, it makes consummate sense to call this project ‘morality’.

4. So science is essential for getting morality right.
The best way to fulfill this valuing-of-experienced-value is to empirically study the conditions for strongly valenced experience.

I’m very skeptical about 1 on any strong interpretation, but I’ll talk about that another time. (EDIT: See Loving the merely physical.) Though Harris places a lot of emphasis on 1, I don’t think it is needed to affirm 2, 3, or 4. Suppose we learn that some people really do value living outside the Matrix, keeping natural wonders intact, promoting ‘purity’, obeying Yahweh, or doing the right thing for its own sake, and not solely the possible experiential effects of those things. Still Harris could argue that, say…

  • … those goals form a much less consistent whole than do the experiential ones. Perhaps, for instance, subjective projects come into conflict less often than objective ones because we have separate mental lives, but only one shared physical world.
  • … education or philosophical reflection tends to make those goals less appealing.
  • … those goals make dubious metaphysical assumptions, in a way experiential goals don’t.
  • … those goals depend for their justification on experiential ones.
  • … those goals causally depend on experiential ones.
  • … those goals are somehow defective variants on, or limiting cases of, experiential ones.
  • … those goals are unusually rare, unusually temporally unstable, or unusually low-intensity.
  • … those goals are so different from experiential ones that they can’t all reasonably be lumped into a single category.

Some combination of the above conclusions could establish that experience-centered goals form a natural group that should, for pragmatic or theoretical reasons, be discussed in isolation. Once we’ve got such a group, we can then argue that our most prized goals will be furthered if we generically endorse the entire category (2), and that these goals will be further furthered if we reserve ethical language for this category (3). 4 will then fall out of 2 and 3 easily, as an empirical conclusion about the usefulness of empiricism itself.

On my view, then, the real action is in the case for 2 and 3. What is that case?

Why value value?

It’s important to highlight here that Harris doesn’t think everyone already generically values all positive experience. It would be a fallacy to deduce ‘everyone values every positive experience’ from ‘everything that’s valued by anyone is a positive experience’.

[I]n the moral sphere, it is safe to begin with the premise that it is good to avoid behaving in such a way as to produce the worst possible misery for everyone. I am not claiming that most of us personally care about the experience of all conscious beings; I am saying that a universe in which all conscious beings suffer the worst possible misery is worse than a universe in which they experience well-being. This is all we need to speak about “moral truth” in the context of science.

So Harris is proposing that we change our priorities. They should change in pretty much the same way our ancestors’ linguistic, political, and intellectual practices changed to affirm the scientific character and universal value of health.

Why change? Because it will allow us to better collaborate on the things we already care about most. Again, why should we prize health in general, as opposed to caring specifically about the health of certain groups of people, or certain body parts? Why not have medicine focus disproportionately on our right legs, disregarding our left legs almost completely? Well, I suppose there are no unconditional, metaphysically fundamental reasons to value health in general, or to build sciences and social institutions dedicated to understanding and improving it. But it’s simpler that way, and it benefits us both individually and collectively, so… why not?

Valuing every experienced value, in proportion to its intensity and frequency, is egalitarian in spirit. Practically democratic. That doesn’t make it ‘objective’ in any mysterious cosmic sense. But it does make it an extraordinarily useful Schelling point, a slightly arbitrary but stable and fair-minded convention for resolving disputes.

Of course, if we just think of it as an arbitrary convention, without ascribing it any importance (if we ‘mere’ it), then the whole point of the convention will be lost. If no one had any respect for democracy, democracy would dissolve overnight. It may be very important for the practice of valuing value that we adopt moral realism or consequentialism as an absolute law, even if the justification for doing so isn’t so much philosophical first principles or linguistic definitions as our lived, pragmatic concern for our own and others’ actual welfare. Good conventions save lives.

It’s because we do in fact have conflicting desires that it’s important to have a general framework for resolving disputes, and Harris’ is a surprisingly flexible yet sturdy one. On Harris’ view, we do factor values like nepotism and egoism into our calculus, and try to help even sociopaths live a joyful, fulfilling, beautiful life — within limits.

What limits? Simply that it come at no cost to everyone’s joy, fulfillment, and beauty. In that respect, the system is more fair than a democracy, since unpopular values get equal weight; and at the same time less exploitable than one, since that weight is determined by psychological fact, not by popular opinion.

So most malign values are squelched or stymied not because they’re intrinsically Evil but because they don’t scale well. They don’t interact in such a way that they form sustainable ecosystems of positively valenced experience. On Harris’ view, we shouldn’t block or assist sadists and war criminals merely because it pre-reflectively ‘feels righteous’ to do so; for our sense of righteousness can go horribly astray. Rather, we should do so because an ecumenical ‘value all values’ project demands it, and because abandoning this meta-value means abandoning our best hope for fully general cooperation between sentients.

What’s on the table is less a moral theory than a humanitarian superproject. Harris reinterprets our language of ‘ought’ and ‘should’ not with the goal of solving Kantian paradoxes but with the goal of defining and motivating a long-term civilizational research program, all while bringing our intellectual drives and traditions into a more intimate conversation with our moral drives and traditions, at the individual as well as the societal scale.

Why call this ‘morality’?

For a person who wrote a book about meta-ethics, Harris is remarkably unconcerned with meta-ethics. He takes note of it only to do a bit of conceptual and rhetorical tidying up. At all times, his sights remain firmly fixed on applied ethics, on politics, on, well, real life.

[T]he fact that millions of people use the term “morality” as a synonym for religious dogmatism, racism, sexism, or other failures of insight and compassion should not oblige us to merely accept their terminology until the end of time.

But if there’s real disagreement here, why speak in terms of ‘ought’ and ‘bad’ at all?

The problem isn’t that those are univocal, clearly defined terms whose entrenched meanings Harris is flouting. The more realistic worry, rather, is that they’re horribly confused terms with only a limited amount of consistency within and across linguistic communities. Folk morality is a mess. Heck, academic morality is a mess. And folk meta-ethics and folk normative ethics (and their academic counterparts) are particularly confused and divergent, far more so than object-level morality. So if Harris’ goal is to inject some clarity and points of basic consensus into this conceptual cacophony, why enter the fray we call ‘ethics’, with its centuries of accumulated obscurity, at all? Why not just invent a new set of terms for what he has in mind, like ‘flought’ and ‘flad’? Then, stipulatively, we could have our flobligation cake and eat it too. If he did that, you can be sure that you’d see fewer people treating ‘but you’re just defining morality as “the maximization of well-being”’ as an objection.

Although it’s tempting to reboot ethics and start over with a clean slate, I think the risks of completely forsaking the moral conversation are too dire. Moral language is just a language. (What’s ethical remains ethical, whether we call it ‘ethical’ or ‘flethical’, or ‘unethical’, or ‘linoleum’.) But language matters. Our intuitions are language-shaped. Even if we say that ‘florality’ or ‘neuro-eudaimonics’ is far more humanly important and conceptually deep than traditional ‘morality’, people raised on the ‘morality’ lexicon will still reliably misconstrue how high the stakes are, and even misconstrue their own preferences, if we toss out moral language.

Many [highly educated men and women …] claim that a scientific foundation for morality would serve no purpose in any case. They think we can combat human evil all the while knowing that our notions of “good” and “evil” are completely unwarranted. It is always amusing when the same people then hesitate to condemn specific instances of patently abominable behavior. I don’t think one has fully enjoyed the life of the mind until one has seen a celebrated scholar defend the “contextual” legitimacy of the burqa, or of female genital mutilation, a mere thirty seconds after announcing that moral relativism does nothing to diminish a person’s commitment to making the world a better place.

Moreover, our traditional talk of goodness and badness has some very useful features, like its correlation with our deepest concerns and its built-in universality. Certainly we could redefine morality in, say, egoist terms. ‘Justice’ and ‘ought’ could be made to refer to the speaker’s interests, as opposed to the overall interests of sentient beings. But then it would be less useful as a language, since the meanings of the terms would vary from person to person, like pronouns do, and since we already have adequate ways to express personal preferences.

Ethical discourse is our only established way to concisely refer to aggregate preference satisfaction. So streamlining the expression-conditions of this discourse, stripping it of the parochial or metaphysically dubious associations it has in certain linguistic communities, may be a very valuable project if we have a sufficiently important candidate meaning to adopt. Harris thinks that psychological well-being meets that condition.

I’ve emphasized the revisionary nature of Harris’ project, because I want to make it clear why objections like Craig’s are beside the point. Harris’ goal is to provide a framework for thinking and talking clearly about humanity’s most important (i.e., most widely and deeply valued) problems and possibilities. His goal isn’t to provide a novel theory that can ground all our naïve normative intuitions, ordinary prescriptive language, or sophisticated ethical theories, because he thinks that all three of these are frequently useless, internally inconsistent, even outright contentless.

Everyone has an intuitive “physics,” but much of our intuitive physics is wrong (with respect to the goal of describing the behavior of matter). Only physicists have a deep understanding of the laws that govern the behavior of matter in our universe. I am arguing that everyone also has an intuitive “morality,” but much of our intuitive morality is clearly wrong (with respect to the goal of maximizing personal and collective well-being).

At the same time, I don’t want to suggest that Harris’ framework is all that ethically novel or strange. We really do care with unparalleled ferocity about suffering, rapture, beauty, tranquility, and all the other qualities of experience Harris is interested in. And our everyday moral intuitions and conventions really do orbit the distribution of extreme forms of these experiences.

My qualification is that that’s a contingent fact, and it’s not the core reason Harris is so interested in this project. If our moral intuitions had turned out to be consistently detrimental to our psychological welfare, Harris would have advocated the destruction of morality, not its reconceptualization! But, for all that, the conservatism of Harris’ proposal is very much worth keeping in mind. If nothing else, it shows that Harris’ project isn’t as difficult as it might seem. All we need is a small but vocal pool of intellectuals and public figures on our side, just large enough to reverse the current cultural trend towards blind relativism and lame nihilism.

Harris’ aim, then, isn’t to give a fully general semantic theory of what the word ‘good’ means in English, or to provide metaphysical truth-conditions for all our intuitive judgments. It’s to recommend a simple framework for collaborating on issues of deep humanistic import. It’s to repurpose an increasingly unproductive discourse to express the urgency of scientifically inquiring into the nature of anything and everything that matters to us. And then actually doing something about it.

Regimenting our concept of “morality” with simplicity will make it easy to teach and explain the value of value, regimenting it with elegance will make it easy to theoretically and pragmatically defend the value of value, and regimenting it with egalitarianism will ensure that we do not disregard any of the core concerns of any of the beings capable of having concerns. If Harris’ own proposal is not ideal for this aim, still it seems clear that something has to fill the void that is modern ethical thought, lest this void continue to encroach upon the things we love.

Further reading: