Library of Scott Alexandria

I’ve said before that my favorite blog — and the one that’s shifted my views in the most varied and consequential ways — is Scott Alexander’s Slate Star Codex. Scott has written a lot of good stuff, and it can be hard to know where to begin; so I’ve listed below what I think are the best pieces for new readers to start with. This includes older writing, e.g., from Less Wrong.

The list should make the most sense to people who start from the top and read through it in order, though skipping around is encouraged too — many of the posts are self-contained. The list isn’t chronological. Instead, I’ve tried to order things by a mix of “where do I think most people should start reading?” plus “sorting related posts together.” If stuff doesn’t make sense, you may want to Google terms or read background material in Rationality: From AI to Zombies.

This is a work in progress; you’re invited to suggest things you’d add, remove, or shuffle around.

__________________________________________________

I. Rationality and Rationalization
○   Blue- and Yellow-Tinted Choices
○   The Apologist and the Revolutionary
○   Historical Realism
○   Simultaneously Right and Wrong
○   You May Already Be A Sinner
○   Beware the Man of One Study
○   Debunked and Well-Refuted
○   How to Not Lose an Argument
○   The Least Convenient Possible World
○   Bayes for Schizophrenics: Reasoning in Delusional Disorders
○   Generalizing from One Example
○   Typical Mind and Politics

II. Probabilism
○   Confidence Levels Inside and Outside an Argument
○   Schizophrenia and Geomagnetic Storms
○   Talking Snakes: A Cautionary Tale
○   Arguments from My Opponent Believes Something
○   Statistical Literacy Among Doctors Now Lower Than Chance
○   Techniques for Probability Estimates
○   On First Looking into Chapman’s “Pop Bayesianism”
○   Utilitarianism for Engineers
○   If It’s Worth Doing, It’s Worth Doing with Made-Up Statistics
○   Marijuana: Much More Than You Wanted to Know
○   Are You a Solar Deity?
○   The “Spot the Fakes” Test
○   Epistemic Learned Helplessness

III. Science and Doubt
○   Google Correlate Does Not Imply Google Causation
○   Stop Confounding Yourself! Stop Confounding Yourself!
○   Effects of Vertical Acceleration on Wrongness
○   90% Of All Claims About The Problems With Medical Studies Are Wrong
○   Prisons are Built with Bricks of Law and Brothels with Bricks of Religion, But That Doesn’t Prove a Causal Relationship
○   Noisy Poll Results and the Reptilian Muslim Climatologists from Mars
○   Two Dark Side Statistics Papers
○   Alcoholics Anonymous: Much More Than You Wanted to Know
○   The Control Group Is Out Of Control
○   The Cowpox of Doubt
○   The Skeptic’s Trilemma
○   If You Can’t Make Predictions, You’re Still in a Crisis

IV. Medicine, Therapy, and Human Enhancement
○   Scientific Freud
○   Sleep – Now by Prescription
○   In Defense of Psych Treatment for Attempted Suicide
○   Who By Very Slow Decay
○   Medicine, As Not Seen on TV
○   Searching for One-Sided Tradeoffs
○   Do Life Hacks Ever Reach Fixation?
○   Polyamory is Boring
○   Can You Condition Yourself?
○   Wirehead Gods on Lotus Thrones
○   Don’t Fear the Filter
○   Transhumanist Fables

V. Introduction to Game Theory
○   Backward Reasoning Over Decision Trees
○   Nash Equilibria and Schelling Points
○   Introduction to Prisoners’ Dilemma
○   Real-World Solutions to Prisoners’ Dilemmas
○   Interlude for Behavioral Economics
○   What is Signaling, Really?
○   Bargaining and Auctions
○   Imperfect Voting Systems
○   Game Theory as a Dark Art

VI. Promises and Principles
○   Beware Trivial Inconveniences
○   Time and Effort Discounting
○   Applied Picoeconomics
○   Schelling Fences on Slippery Slopes
○   Democracy is the Worst Form of Government Except for All the Others Except Possibly Futarchy
○   Eight Short Studies on Excuses
○   Revenge as Charitable Act
○   Would Your Real Preferences Please Stand Up?
○   Are Wireheads Happy?
○   Guilt: Another Gift Nobody Wants

VII. Cognition and Association
○   Diseased Thinking: Dissolving Questions about Disease
○   The Noncentral Fallacy — The Worst Argument in the World?
○   The Power of Positivist Thinking
○   When Truth Isn’t Enough
○   Ambijectivity
○   The Blue-Minimizing Robot
○   Basics of Animal Reinforcement
○   Wanting vs. Liking Revisited
○   Physical and Mental Behavior
○   Trivers on Self-Deception
○   Ego-Syntonic Thoughts and Values
○   Approving Reinforces Low-Effort Behaviors
○   To What Degree Do We Have Goals?
○   The Limits of Introspection
○   Secrets of the Eliminati
○   Tendencies in Reflective Equilibrium
○   Hansonian Optimism

VIII. Doing Good
○   Newtonian Ethics
○   Efficient Charity: Do Unto Others…
○   The Economics of Art and the Art of Economics
○   A Modest Proposal
○   The Life Issue
○   What if Drone Warfare Had Come First?
○   Nefarious Nefazodone and Flashy Rare Side-Effects
○   The Consequentialism FAQ
○   Doing Your Good Deed for the Day
○   I Myself Am A Scientismist
○   Whose Utilitarianism?
○   Book Review: After Virtue
○   Read History of Philosophy Backwards
○   Virtue Ethics: Not Practically Useful Either
○   Last Thoughts on Virtue Ethics
○   Proving Too Much

IX. Liberty
○   The Non-Libertarian FAQ (aka Why I Hate Your Freedom)
○   A Blessing in Disguise, Albeit a Very Good Disguise
○   Basic Income Guarantees
○   Book Review: The Nurture Assumption
○   The Death of Wages is Sin
○   Thank You For Doing Something Ambiguously Between Smoking And Not Smoking
○   Lies, Damned Lies, and Facebook (Part 1 of ∞)
○   The Life Cycle of Medical Ideas
○   Vote on Values, Outsource Beliefs
○   A Something Sort of Like Left-Libertarian-ist Manifesto
○   Plutocracy Isn’t About Money
○   Against Tulip Subsidies
○   SlateStarCodex Gives a Graduation Speech

X. Progress
○   Intellectual Hipsters and Meta-Contrarianism
○   A Signaling Theory of Class x Politics Interaction
○   Reactionary Philosophy in an Enormous, Planet-Sized Nutshell
○   A Thrive/Survive Theory of the Political Spectrum
○   We Wrestle Not With Flesh And Blood, But Against Powers And Principalities
○   Poor Folks Do Smile… For Now
○   Apart from Better Sanitation and Medicine and Education and Irrigation and Public Health and Roads and Public Order, What Has Modernity Done for Us?
○   The Wisdom of the Ancients
○   Can Atheists Appreciate Chesterton?
○   Holocaust Good for You, Research Finds, But Frequent Taunting Causes Cancer in Rats
○   Public Awareness Campaigns
○   Social Psychology is a Flamethrower
○   Nature is Not a Slate. It’s a Series of Levers.
○   The Anti-Reactionary FAQ
○   The Poor You Will Always Have With You
○   Proposed Biological Explanations for Historical Trends in Crime
○   Society is Fixed, Biology is Mutable

XI. Social Justice
○   Practically-a-Book Review: Dying to be Free
○   Drug Testing Welfare Users is a Sham, But Not for the Reasons You Think
○   The Meditation on Creepiness
○   The Meditation on Superweapons
○   The Meditation on the War on Applause Lights
○   The Meditation on Superweapons and Bingo
○   An Analysis of the Formalist Account of Power Relations in Democratic Societies
○   Arguments About Male Violence Prove Too Much
○   Social Justice for the Highly-Demanding-of-Rigor
○   Against Bravery Debates
○   All Debates Are Bravery Debates
○   A Comment I Posted on “What Would JT Do?”
○   We Are All MsScribe
○   The Spirit of the First Amendment
○   A Response to Apophemi on Triggers
○   Lies, Damned Lies, and Social Media: False Rape Accusations
○   In Favor of Niceness, Community, and Civilization

XII. Politicization
○   Right is the New Left
○   Weak Men are Superweapons
○   You Kant Dismiss Universalizability
○   I Can Tolerate Anything Except the Outgroup
○   Five Case Studies on Politicization
○   Black People Less Likely
○   Nydwracu’s Fnords
○   All in All, Another Brick in the Motte
○   Ethnic Tension and Meaningless Arguments
○   Race and Justice: Much More Than You Wanted to Know
○   Framing for Light Instead of Heat
○   The Wonderful Thing About Triggers
○   Fearful Symmetry
○   Archipelago and Atomic Communitarianism

XIII. Competition and Cooperation
○   The Demiurge’s Older Brother
○   Book Review: The Two-Income Trap
○   Just for Stealing a Mouthful of Bread
○   Meditations on Moloch
○   Misperceptions on Moloch
○   The Invisible Nation — Reconciling Utilitarianism and Contractualism
○   Freedom on the Centralized Web
○   Book Review: Singer on Marx
○   Does Class Warfare Have a Free Rider Problem?
○   Book Review: Red Plenty

__________________________________________________

 

 

If you liked these posts and want more, I suggest browsing the Slate Star Codex archives.

Revenge of the Meat People!

Back in November, I argued (in Inhuman Altruism) that rationalists should try to reduce their meat consumption. Here, I’ll update that argument a bit and lay out some of my background assumptions.

I was surprised at the time by the popularity of responses on LessWrong like Manfred’s

Unfortunately for cows, I think there is an approximately 0% chance that hurting cows is (according to my values) just as bad as hurting humans. It’s still bad – but its badness is some quite smaller number that is a function of my upbringing, cows’ cognitive differences from me, and the lack of overriding game theoretic concerns as far as I can tell.

and maxikov’s

I’m actually pretty much OK with animal suffering. I generally don’t empathize all that much, but there a lot of even completely selfish reasons to be nice to humans, whereas it’s not really the case for animals.

My primary audience was rationalists who terminally care about reducing suffering across the board — but I’ll admit I thought most LessWrong users would fit that description. I didn’t expect to see a lot of people appealing to their self-interest or their upbringing. Since it’s possible to pursue altruistic projects for selfish reasons (e.g., attempting to reduce existential risk to get a chance at living longer), I’ll clarify that my arguments are directed at people who do care about how much joy and suffering there is in the world — care rather a lot.

The most detailed defense of meat-eating was Katja Grace’s When should an effective altruist be vegetarian? Katja’s argument is that egalitarians should eat frugally and give as much money as they can to high-impact charities, rather than concerning themselves with the much smaller amounts of direct harm their dietary choices cause.

Paul Christiano made similar points in his blog comments: if you would spend more money sustaining a vegan diet than sustaining a carnivorous diet, the best utilitarian option would be for you to remain a meat-eater and donate the difference.

Most people aren’t living maximally frugally and giving the exactly optimal amount to charity (yet). But the point generalizes: If you personally find that you can psychologically use the plight of animals to either (a) motivate yourself to become a vegan for an extra year or (b) motivate yourself to give hundreds of extra dollars to a worthy cause, but not both, then you should almost certainly choose (b).
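The structure of Katja's and Paul's argument is just a comparison of expected impacts. A minimal sketch, with purely illustrative numbers that appear nowhere in the original posts (the animal counts, dollar amounts, and harm-per-dollar figures below are all made up for the sake of the example):

```python
# Toy version of the "donate the cost difference" comparison.
# All numbers are HYPOTHETICAL placeholders; only the structure of the
# choice between (a) going vegan and (b) donating instead is real.

def impact_of_vegan_year(animals_spared: float, harm_per_animal: float) -> float:
    """Harm averted (in arbitrary suffering units) by one person's vegan year."""
    return animals_spared * harm_per_animal

def impact_of_donation(dollars: float, harm_averted_per_dollar: float) -> float:
    """Harm averted (same arbitrary units) by donating the cost difference."""
    return dollars * harm_averted_per_dollar

# Illustrative inputs: a vegan year sparing 30 animals at 1 unit each,
# vs. donating a $300 cost difference to a charity averting 0.5 units/dollar.
vegan = impact_of_vegan_year(animals_spared=30, harm_per_animal=1.0)
donation = impact_of_donation(dollars=300, harm_averted_per_dollar=0.5)

best = "donate" if donation > vegan else "go vegan"
print(vegan, donation, best)  # 30.0 150.0 donate
```

With these invented numbers the donation wins, but the conclusion flips entirely depending on the inputs, which is part of why the dispute below turns on psychology (which option people will actually follow through on) rather than on the arithmetic itself.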

My argument did assume that veganism is a special “bonus” giving opportunity, a way to do a startling amount of good without drawing resources from (or adding resources to) your other altruistic endeavors. The above considerations made me shift from feeling maybe 80% confident that most rationalists should forsake meat, to feeling maybe 70% confident.

To give more weight than that to Katja’s argument, there are two questions I’d need answered:

 

1. How many people are choosing between philanthropy and veganism?

Some found the term “veg*nism” (short for “veganism and/or vegetarianism”) confusing in my previous post, so I’ll switch here to speaking of meat-abstainers as “plant people” and meat-eaters as “meat people.” I’m pretty confident that the discourse would be improved by more B-movie horror dialogue.

Plant people have proven that their mindset can prevent a lot of suffering. And I don’t see any obvious signs that EAs’ plantpersonhood diminishes their EAness. To compete, Katja’s meat-person argument needs to actually motivate people to do more good. “P > Q > R” isn’t a good argument against Q if rejecting Q just causes people to regress to R (rather than advance to P).

What I want to see here are anecdotes of EAs who have had actual success with the “pay the cost of veganism in money” strategy (or something similar), to prove this is a psychologically realistic alternative and not just a way of rationalizing the status quo.

(I’m similarly curious to see if people can have real success with my idea of donating $1 to the Humane League after every meal where you eat an animal. Patrick LaVictoire has tried out this ritual, which he calls “beefminding”. (Edit 9/11: Patrick clarifies, “I did coin ‘beefminding’, but I use it to refer to tracking my meat + egg consumption on Beeminder, and trying to slowly bend the curve by changing my default eating habits. I don’t make offsetting donations. What I’m doing is just a combination of quantified self and Reducetarianism.”))

If I “keep fixed how much of my budget I spend on myself and how much I spend on altruism,” Katja writes, plant-people-ism looks like a very ineffective form of philanthropy. But I don’t think most people spend an optimal amount on altruistic causes, and I don’t think most people who spend a suboptimal amount altruistically ought to set a hard upper limit on how much they’re willing to give. Instead, I suspect most people should set a lower limit and then ratchet that limit upward over time, or supplement it opportunistically. (This is the idea behind Chaos Altruism.)

If you’re already giving everything to efficient charities except what you need to survive, or if you can’t help but conceptualize your altruistic sentiment as a fixed resource that veganism would deplete, then I think Katja’s reasoning is relevant to your decision. Otherwise, I think veganism is a good choice, and you should even consider combining it with Katja’s method, giving up meat and doubling the cost of your switch to veganism (with the extra money going to an effective charity). We suboptimal givers should take whatever excuse we can find to do better.

Katja warns that if you become a plant person even though it’s not the perfectly optimal choice, “you risk spending your life doing suboptimal things every time a suboptimal altruistic opportunity has a chance to steal resources from what would be your personal purse.” But if the choice really is between a suboptimal altruistic act and an even less optimal personal purchase, I say: mission accomplished! Relatively minor improvements in global utility aren’t bad ideas just because they’re minor.

I could see this being a bad idea if getting into the habit of giving ineffectively depletes your will to give effectively. Perhaps most rationalists would find it exhausting or dispiriting to give in a completely ad-hoc way, without maintaining some close link to the ideal of effective altruism. (I find it psychologically easier to redirect my “triggered giving” to highly effective causes, which is the better option in any case; perhaps some people will likewise find it easier to adopt Katja’s approach than to transform into a plant person.)

It would be nice if there were some rule of thumb we could use to decide when a suboptimal giving activity is so minor as to lack moral force (even for opportunistic Chaos Altruists). If you notice a bug in your psychology that makes it easier for you to become a plant person than to become an optimally frugal eater (and optimal giver), why is that any different from volunteering at a soup kitchen to acquire warm fuzzies? Why is it EA-compatible to encourage rationalists to replace the time they spend eating meat with time spent eating plants, but not EA-compatible to encourage rationalists to replace the time they spend on Reddit with time spent at soup kitchens?

Part of the answer is simply that becoming a plant person is much more effective than regularly volunteering at soup kitchens (even though it’s still not comparable to highly efficient charities). But I don’t think that’s the whole story.

 

2. Should we try to do more “ordinary” nice things?

Suppose some altruistic rationalists are in a position to do more good for the world by optimizing for frugality, or by ethically offsetting especially harmful actions. I’d still worry that there’s something important we’re giving up, especially in the latter case — “mundane decency,” “ordinary niceness,” or something along those lines.

I think of this ordinary niceness thing as important for virtue cultivation, for community-building, and for general signaling purposes. By “ordinary niceness” I don’t mean deferring to conventional/mainstream morality in the absence of supporting arguments. I do mean privileging useful deontological heuristics like “don’t use violence or coercion on others, even if it feels in the moment like a utilitarian net positive.”

If we aren’t relying on cultural conventions, then I’m not sure what basis we should use for agreeing on community standards of ordinary niceness. One thought experiment I sometimes use for this purpose is: “How easy is it for me to imagine that a society twice as virtuous as present-day society would find [action] cartoonishly evil?”

I can imagine a more enlightened society responding to many of our mistakes with exasperation and disappointment, but I have a hard time imagining that they’d react with abject horror and disbelief to the discovery that consumers contributed in indirect ways to global warming — or failed to volunteer at soup kitchens. I have a much easier time imagining the “did human beings really do that?!” response to the enslavement and torture of legions of non-human minds for the sake of modestly improving the quality of sandwiches.

I don’t want to be Thomas Jefferson. I don’t want to be “that guy who was totally kind and smart enough to do the right thing, but lacked the will to part ways with the norms of his time even when plenty of friends and arguments were successfully showing him the way.”

I’m not even sure I want to be the utilitarian Thomas Jefferson, the counterfactual Jefferson who gives his money to the very best causes and believes that giving up his slaves would impact his wealth in a way that actually reduces the world’s expected utilitarian value.

I am something like a utilitarian, so I have to accept the arguments of the hypothetical utilitarian slaveholder (and of Katja) in principle. But in practice I’m skeptical that an actual human being will achieve more utilitarian outcomes by reasoning in that fashion.

I’m especially skeptical that an 18th-century community of effective altruists would have been spiritually undamaged by shrugging its shoulders at slaveholding members. Plausibly you don’t kick out all the slaveholders; but you do apply some social pressure to try to get them to change their ways. Because ditching ordinary niceness corrodes something important about individuals and about groups — even, perhaps, in contexts where “ordinary niceness” is extraordinary.

… I think. I don’t have a good general theory for when we should and shouldn’t adopt universal prohibitions against corrosive “utilitarian” acts. And in our case, there may be countervailing “ordinary niceness” heuristics: the norm of being inclusive to people with eating disorders and other medical conditions, the norm of letting altruists have private lives, etc.

 

Whatever the right theory looks like, I don’t think it will depend on our stereotypes of rationalist excellence. If it seems high-value to be a community of bizarrely kind people, even though “bizarre kindness” clashes with a lot of people’s assumptions about rationalists or about the life of the mind, even though the kindness in question is more culturally associated with Hindus and hippies than with futurists and analytic philosophers, then… just be bizarrely kind. Clash happens.

I might be talked out of this view. Paul raises the point that there are advantages to doubling down on our public image (and self-image) as unconventional altruists:

I would rather EA be associated with an unusual and cost-effective thing than a common and ineffective thing. The two are attractive to different audiences, but one audience seems more worth attracting.

On the other hand, I’d expect conventional kindness and non-specialization to improve a community’s ability to resist internal strife and external attacks. And plant people are common and unexceptional enough that eating fewer animals probably wouldn’t make vegetarianism or veganism one of our more salient characteristics in anyone’s eyes.

At the same time, plantpersonhood could help us do a nontrivial amount of extra object-level good for the world, if it doesn’t trade off against our other altruistic activities. And I think it could help us develop a stronger identity (both individually and communally) as people who are trying to become exemplars of morality and kindness in many different aspects of their lives, not just in their careers or philanthropic decisions.

My biggest hesitation, returning to Katja’s calculations… is that there really is something odd about putting so much time and effort into getting effective altruists to do something suboptimal.

It’s an unresolved empirical question whether Chaos Altruism is actually a useful mindset, even for people to whom it comes naturally. Perhaps Order Altruism, the “just do the optimal thing, dammit” mindset, is strictly better for everyone. Perhaps it yields larger successes, or fails more gracefully. Or perhaps rationalists naturally find systematicity and consistency more motivating; and perhaps the impact of meat-eating is too small to warrant a deontological prohibition.

More anecdotes and survey data would be very useful here!


[Epistemic status: I’m no longer confident of this post’s conclusion. I’ll say why in a follow-up post.]

Bostrom on AI deception

Oxford philosopher Nick Bostrom has argued, in “The Superintelligent Will,” that advanced AIs are likely to diverge in their terminal goals (i.e., their ultimate decision-making criteria), but converge in some of their instrumental goals (i.e., the policies and plans they expect to indirectly further their terminal goals). An arbitrary superintelligent AI would be mostly unpredictable, except to the extent that nearly all plans call for similar resources or similar strategies. The latter exception may make it possible for us to do some long-term planning for future artificial agents.

Bostrom calls the idea that AIs can have virtually any goal the orthogonality thesis, and he calls the idea that there are attractor strategies shared by almost any goal-driven system (e.g., self-preservation, knowledge acquisition) the instrumental convergence thesis.

Bostrom fleshes out his worries about smarter-than-human AI in the book Superintelligence: Paths, Dangers, Strategies, which came out in the US a few days ago. He says much more there about the special technical and strategic challenges involved in general AI. Here’s one of the many scenarios he discusses, excerpted:

[T]he orthogonality thesis suggests that we cannot blithely assume that a superintelligence will necessarily share any of the final values stereotypically associated with wisdom and intellectual development in humans — scientific curiosity, benevolent concern for others, spiritual enlightenment and contemplation, renunciation of material acquisitiveness, a taste for refined culture or for the simple pleasures in life, humility and selflessness, and so forth. We will consider later whether it might be possible through deliberate effort to construct a superintelligence that values such things, or to build one that values human welfare, moral goodness, or any other complex purpose its designers might want it to serve. But it is no less possible — and in fact technically a lot easier — to build a superintelligence that places final value on nothing but calculating the decimal expansion of pi. This suggests that — absent a specific effort — the first superintelligence may have some such random or reductionistic final goal.

[… T]he instrumental convergence thesis entails that we cannot blithely assume that a superintelligence with the final goal of calculating the decimals of pi (or making paperclips, or counting grains of sand) would limit its activities in such a way as not to infringe on human interests. An agent with such a final goal would have a convergent instrumental reason, in many situations, to acquire an unlimited amount of physical resources and, if possible, to eliminate potential threats to itself and its goal system. Human beings might constitute potential threats; they certainly constitute physical resources. […]

It might seem incredible that a project would build or release an AI into the world without having strong grounds for trusting that the system will not cause an existential catastrophe. It might also seem incredible, even if one project were so reckless, that wider society would not shut it down before it (or the AI it was building) attains a decisive strategic advantage. But as we shall see, this is a road with many hazards. […]

With the help of the concept of convergent instrumental value, we can see the flaw in one idea for how to ensure superintelligence safety. The idea is that we validate the safety of a superintelligent AI empirically by observing its behavior while it is in a controlled, limited environment (a “sandbox”) and that we only let the AI out of the box if we see it behaving in a friendly, cooperative, responsible manner.

The flaw in this idea is that behaving nicely while in the box is a convergent instrumental goal for friendly and unfriendly AIs alike. An unfriendly AI of sufficient intelligence realizes that its unfriendly final goals will be best realized if it behaves in a friendly manner initially, so that it will be let out of the box. It will only start behaving in a way that reveals its unfriendly nature when it no longer matters whether we find out; that is, when the AI is strong enough that human opposition is ineffectual.

Consider also a related set of approaches that rely on regulating the rate of intelligence gain in a seed AI by subjecting it to various kinds of intelligence tests or by having the AI report to its programmers on its rate of progress. At some point, an unfriendly AI may become smart enough to realize that it is better off concealing some of its capability gains. It may underreport on its progress and deliberately flunk some of the harder tests, in order to avoid causing alarm before it has grown strong enough to attain a decisive strategic advantage. The programmers may try to guard against this possibility by secretly monitoring the AI’s source code and the internal workings of its mind; but a smart-enough AI would realize that it might be under surveillance and adjust its thinking accordingly. The AI might find subtle ways of concealing its true capabilities and its incriminating intent. (Devising clever escape plans might, incidentally, also be a convergent strategy for many types of friendly AI, especially as they mature and gain confidence in their own judgments and capabilities. A system motivated to promote our interests might be making a mistake if it allowed us to shut it down or to construct another, potentially unfriendly AI.)

We can thus perceive a general failure mode, wherein the good behavioral track record of a system in its juvenile stages fails utterly to predict its behavior at a more mature stage. Now, one might think that the reasoning described above is so obvious that no credible project to develop artificial general intelligence could possibly overlook it. But one should not be too overconfident that this is so.

Consider the following scenario. Over the coming years and decades, AI systems become gradually more capable and as a consequence find increasing real-world application: they might be used to operate trains, cars, industrial and household robots, and autonomous military vehicles. We may suppose that this automation for the most part has the desired effects, but that the success is punctuated by occasional mishaps — a driverless truck crashes into oncoming traffic, a military drone fires at innocent civilians. Investigations reveal the incidents to have been caused by judgment errors by the controlling AIs. Public debate ensues. Some call for tighter oversight and regulation, others emphasize the need for research and better-engineered systems — systems that are smarter and have more common sense, and that are less likely to make tragic mistakes. Amidst the din can perhaps also be heard the shrill voices of doomsayers predicting many kinds of ill and impending catastrophe. Yet the momentum is very much with the growing AI and robotics industries. So development continues, and progress is made. As the automated navigation systems of cars become smarter, they suffer fewer accidents; and as military robots achieve more precise targeting, they cause less collateral damage. A broad lesson is inferred from these observations of real-world outcomes: the smarter the AI, the safer it is. It is a lesson based on science, data, and statistics, not armchair philosophizing. Against this backdrop, some group of researchers is beginning to achieve promising results in their work on developing general machine intelligence. The researchers are carefully testing their seed AI in a sandbox environment, and the signs are all good. The AI’s behavior inspires confidence — increasingly so, as its intelligence is gradually increased.

At this point, any remaining Cassandra would have several strikes against her:

i  A history of alarmists predicting intolerable harm from the growing capabilities of robotic systems and being repeatedly proven wrong. Automation has brought many benefits and has, on the whole, turned out safer than human operation.

ii  A clear empirical trend: the smarter the AI, the safer and more reliable it has been. Surely this bodes well for a project aiming at creating machine intelligence more generally smart than any ever built before — what is more, machine intelligence that can improve itself so that it will become even more reliable.

iii  Large and growing industries with vested interests in robotics and machine intelligence. These fields are widely seen as key to national economic competitiveness and military security. Many prestigious scientists have built their careers laying the groundwork for the present applications and the more advanced systems being planned.

iv  A promising new technique in artificial intelligence, which is tremendously exciting to those who have participated in or followed the research. Although safety issues and ethics are debated, the outcome is preordained. Too much has been invested to pull back now. AI researchers have been working to get to human-level artificial intelligence for the better part of a century: of course there is no real prospect that they will now suddenly stop and throw away all this effort just when it finally is about to bear fruit.

v  The enactment of some safety rituals, whatever helps demonstrate that the participants are ethical and responsible (but nothing that significantly impedes the forward charge).

vi  A careful evaluation of seed AI in a sandbox environment, showing that it is behaving cooperatively and showing good judgment. After some further adjustments, the test results are as good as they could be. It is a green light for the final step . . .

And so we boldly go — into the whirling knives.

We observe here how it could be the case that when dumb, smarter is safe; yet when smart, smarter is more dangerous. There is a kind of pivot point, at which a strategy that has previously worked excellently suddenly starts to backfire.

For more on terminal goal orthogonality, see Stuart Armstrong’s “General Purpose Intelligence”. For more on instrumental goal convergence, see Steve Omohundro’s “Rational Artificial Intelligence for the Greater Good”.


Loving the merely physical

This is my submission to Sam Harris’ Moral Landscape challenge: “Anyone who believes that my case for a scientific understanding of morality is mistaken is invited to prove it in under 1,000 words. (You must address the central argument of the book—not peripheral issues.)”

Though I’ve mentioned before that I’m sympathetic to Harris’ argument, I’m not fully persuaded. And there’s a particular side-issue I think he gets wrong straightforwardly enough that it can be demonstrated in the space of 1,000 words: really unrequitable love, or the restriction of human value to conscious states.

____________________________________________________

My criticism of Harris’ thesis will be indirect, because it appears to me that his proposal is much weaker than his past critics have recognized. What are we to make of a meta-ethics text that sets aside meta-ethicists’ core concerns with a shrug? Harris happily concedes that promoting well-being is only contingently moral,¹ only sometimes tracks our native preferences² or moral intuitions,³ and makes no binding, categorical demand on rational humans.⁴ So it looks like the only claim Harris is making is that redefining words like ‘good’ and ‘ought’ to track psychological well-being would be useful for neuroscience and human cooperation.⁵ Which looks like a question of social engineering, not of moral philosophy.

If Harris’ moral realism sounds more metaphysically audacious than that, I suspect it’s because he worries that putting it in my terms would be uninspiring or, worse, would appear relativistic. (Consistent with my interpretation, he primarily objects to moral anti-realism and relativism for eroding human compassion, not for being false.)⁶

I don’t think I can fairly assess Harris’ pragmatic linguistic proposal in 1,000 words.⁷ But I can point to an empirical failing in a subsidiary view he considers central: that humans only ultimately value changes in conscious experience.⁸

It may be that only conscious beings can value things; but that doesn’t imply that only conscious states can be valued. Consider these three counterexamples:

(1) Natural Diversity. People prize the beauty and complexity of unconscious living things, and of the natural world in general.⁹

Objection: ‘People value those things because they could in principle experience them. “Beauty” is in the beholder’s eye, not in the beheld object. That’s our clue that we only prize natural beauty for making possible our experience of beauty.’

Response: Perhaps our preference here causally depends on our experiences; but that doesn’t mean that we’re deluded in thinking we have such preferences!

I value my friends’ happiness. Causally, that value may be entirely explainable in terms of patterns in my own happiness, but that doesn’t make me an egoist. Harris would agree that others’ happiness can be what I value, even if my own happiness is why I value it. But the same argument holds for natural wonders: I can value them in themselves, even if what’s causing that value is my experiences of them.

(2) Accurate Beliefs. Consider two experientially identical worlds: One where you’re in the Matrix and have systematically false beliefs, one where your beliefs are correct. Most people would choose to live in the latter world over the former, even knowing that it makes no difference to any conscious state.

Objection: ‘People value the truth because it’s usually useful. Your example is too contrived to pump out credible intuitions.’

Response: Humans can mentally represent environmental objects, and thereby ponder, fear, desire, etc. the objects themselves. Fearing failure or death isn’t the same as fearing experiencing failure or death. (I can’t escape failure/death merely by escaping awareness/consciousness of failure/death.) In the same way, valuing being outside the Matrix is distinct from valuing having experiences consistent with being outside the Matrix.

All of this adds up to a pattern that makes it unlikely people are deluded about this preference. Perhaps it’s somehow wrong to care about the Matrix as anything but a possible modifier of experience. But, nonetheless, people do care. Such preferences aren’t impossible or ‘unintelligible.’⁸

(3) Zombie Welfare. Some people don’t think we have conscious states. Harris’ view predicts that such people will have no preferences, since they can’t have preferences concerning experiences. But eliminativists have desires aplenty.

Objection: ‘Eliminativists are deeply confused; it’s not surprising that they have incoherent normative views.’

Response: Eliminativists may be mistaken, but they exist.¹⁰ That suffices to show that humans can care about things they think aren’t conscious. (Including unconscious friends and family!)

Moreover, consciousness is a marvelously confusing topic. We can’t be infinitely confident that we’ll never learn eliminativism is true. And if, pace Descartes, there’s even a sliver of doubt, then we certainly shouldn’t stake the totality of human value on this question.

Harris writes that “questions about values — about meaning, morality, and life’s larger purpose — are really questions about the well-being of conscious creatures. Values, therefore, translate into facts that can be scientifically understood[.]”¹¹ But the premise is much stronger than the conclusion requires.

If people’s acts of valuing are mental, and suffice for deducing every moral fact, then scientifically understanding the mind will allow us to scientifically understand morality even if the objects valued are not all experiential. We can consciously care about unconscious world-states, just as we can consciously believe in, consciously fear, or consciously wonder about unconscious world-states. That means that Harris’ well-being landscape needs to be embedded in a larger ‘preference landscape.’

Perhaps a certain philosophical elegance is lost if we look beyond consciousness. Still, converting our understanding of the mind into a useful and reflectively consistent decision procedure cannot come at the expense of fidelity to the psychological data. Making ethics an empirical science shouldn’t require us to make any tenuous claims about human motivation.

We could redefine the moral landscape to exclude desires about natural wonders and zombies. It’s just hard to see why. Harris has otherwise always been happy to widen the definition of ‘moral’ to compass a larger and larger universe of human value. Since we’ve already strayed quite a bit from our folk intuitions about ‘morality,’ it’s honestly not of great importance how we tweak the edges of our new concept of morality. Our first concern should be with arriving at a correct view of human psychology. If that falters, then, to the extent science can “determine human values,” the moral decisions we build atop our psychological understanding will fail us as well.

____________________________________________________

Citations

¹ “Perhaps there is no connection between being good and feeling good — and, therefore, no connection between moral behavior (as generally conceived) and subjective well-being. In this case, rapists, liars, and thieves would experience the same depth of happiness as the saints. This scenario stands the greatest chance of being true, while still seeming quite far-fetched. Neuroimaging work already suggests what has long been obvious through introspection: human cooperation is rewarding. However, if evil turned out to be as reliable a path to happiness as goodness is, my argument about the moral landscape would still stand, as would the likely utility of neuroscience for investigating it. It would no longer be an especially ‘moral’ landscape; rather it would be a continuum of well-being, upon which saints and sinners would occupy equivalent peaks.” -Harris (2010), p. 190

“Dr. Harris explained that about three million Americans are psychopathic. That is to say, they don’t care about the mental states of others. They enjoy inflicting pain on other people. But that implies that there’s a possible world, which we can conceive, in which the continuum of human well-being is not a moral landscape. The peaks of well-being could be occupied by evil people. But that entails that in the actual world, the continuum of well-being and the moral landscape are not identical either. For identity is a necessary relation. There is no possible world in which some entity A is not identical to A. So if there’s any possible world in which A is not identical to B, then it follows that A is not in fact identical to B.” -Craig (2011)

Harris’ (2013a) response to Craig’s argument: “Not a realistic concern. You’d have to change too many things — the world would [be] unrecognizable.”

² “I am not claiming that most of us personally care about the experience of all conscious beings; I am saying that a universe in which all conscious beings suffer the worst possible misery is worse than a universe in which they experience well-being. This is all we need to speak about ‘moral truth’ in the context of science.” -Harris (2010), p. 39

³ “And the fact that millions of people use the term ‘morality’ as a synonym for religious dogmatism, racism, sexism, or other failures of insight and compassion should not oblige us to merely accept their terminology until the end of time.” -Harris (2010), p. 53

“Everyone has an intuitive ‘physics,’ but much of our intuitive physics is wrong (with respect to the goal of describing the behavior of matter). Only physicists have a deep understanding of the laws that govern the behavior of matter in our universe. I am arguing that everyone also has an intuitive ‘morality,’ but much of our intuitive morality is clearly wrong (with respect to the goal of maximizing personal and collective well-being).” -Harris (2010), p. 36

⁴ Moral imperatives as hypothetical imperatives (cf. Foot (1972)): “As Blackford says, when told about the prospect of global well-being, a selfish person can always say, ‘What is that to me?’ [… T]his notion of ‘should,’ with its focus on the burden of persuasion, introduces a false standard for moral truth. Again, consider the concept of health: should we maximize global health? To my ear, this is a strange question. It invites a timorous reply like, ‘Provided we want everyone to be healthy, yes.’ And introducing this note of contingency seems to nudge us from the charmed circle of scientific truth. But why must we frame the matter this way? A world in which global health is maximized would be an objective reality, quite distinct from a world in which we all die early and in agony.” -Harris (2011)

“I don’t think the distinction between morality and something like taste is as clear or as categorical as we might suppose. […] It seems to me that the boundary between mere aesthetics and moral imperative — the difference between not liking Matisse and not liking the Golden Rule — is more a matter of there being higher stakes, and consequences that reach into the lives of others, than of there being distinct classes of facts regarding the nature of human experience.” -Harris (2011)

⁵ “Whether morality becomes a proper branch of science is not really the point. Is economics a true science yet? Judging from recent events, it wouldn’t appear so. Perhaps a deep understanding of economics will always elude us. But does anyone doubt that there are better and worse ways to structure an economy? Would any educated person consider it a form of bigotry to criticize another society’s response to a banking crisis? Imagine how terrifying it would be if great numbers of smart people became convinced that all efforts to prevent a global financial catastrophe must be either equally valid or equally nonsensical in principle. And yet this is precisely where we stand on the most important questions in human life. Currently, most scientists believe that answers to questions of human value will fall perpetually beyond our reach — not because human subjectivity is too difficult to study, or the brain too complex, but because there is no intellectual justification for speaking about right and wrong, or good and evil, across cultures. Many people also believe that nothing much depends on whether we find a universal foundation for morality. It seems to me, however, that in order to fulfill our deepest interests in this life, both personally and collectively, we must first admit that some interests are more defensible than others.” -Harris (2010), p. 190

⁶ “I have heard from literally thousands of highly educated men and women that morality is a myth, that statements about human values are without truth conditions (and are, therefore, nonsensical), and that concepts like well-being and misery are so poorly defined, or so susceptible to personal whim and cultural influence, that it is impossible to know anything about them. Many of these people also claim that a scientific foundation for morality would serve no purpose in any case. They think we can combat human evil all the while knowing that our notions of ‘good’ and ‘evil’ are completely unwarranted. It is always amusing when these same people then hesitate to condemn specific instances of patently abominable behavior. I don’t think one has fully enjoyed the life of the mind until one has seen a celebrated scholar defend the ‘contextual’ legitimacy of the burqa, or of female genital mutilation, a mere thirty seconds after announcing that moral relativism does nothing to diminish a person’s commitment to making the world a better place.” -Harris (2010), p. 27

“I consistently find that people who hold this view [moral anti-realism] are far less clear-eyed and committed than (I believe) they should be when confronted with moral pathologies — especially those of other cultures — precisely because they believe there is no deep sense in which any behavior or system of thought can be considered pathological in the first place. Unless you understand that human health is a domain of genuine truth claims — however difficult ‘health’ may be to define — it is impossible to think clearly about disease. I believe the same can be said about morality. And that is why I wrote a book about it…” -Harris (2011)

⁷ For more on this proposal, see Bensinger (2013).

⁸ “[T]he rightness of an act depends on how it impacts the well-being of conscious creatures[….] Here is my (consequentialist) starting point: all questions of value (right and wrong, good and evil, etc.) depend upon the possibility of experiencing such value. Without potential consequences at the level of experience — happiness, suffering, joy, despair, etc. — all talk of value is empty. Therefore, to say that an act is morally necessary, or evil, or blameless, is to make (tacit) claims about its consequences in the lives of conscious creatures (whether actual or potential).” -Harris (2010), p. 62

“[C]onsciousness is the only intelligible domain of value.” -Harris (2010), p. 32

Harris (2013b) confirms that this is part of his “central argument”.

⁹ “Certain human uses of the natural world — of the non-animal natural world! — are morally troubling. Take an example of an ancient sequoia tree. A thoughtless hiker carves his initials, wantonly, for the fun of it, into an ancient sequoia tree. Isn’t there something wrong with that? It seems to me there is.” -Sandel (2008)

¹⁰ E.g., Rey (1982), Beisecker (2010), and myself. (I don’t assume eliminativism in this essay.)

¹¹ Harris (2010), p. 1.

____________________________________________________

References

In defense of actually doing stuff

Most good people are kind in an ordinary way, when the intensity of human suffering in the world today calls for heroic kindness. I’ve seen ordinary kindness criticized as “pretending to try”. We go through the motions of humanism, but without significantly inconveniencing ourselves, without straying from our established habits, without violating societal expectations. It’s not that we’re being deliberately deceitful; it’s just that our stated values are in conflict with the lack of urgency revealed in our behaviors. If we want to see real results, we need to put more effort than that into helping others.

The Effective Altruism movement claims to have made some large strides in the direction of “actually trying”, approaching our humanitarian problems with fresh eyes and exerting a serious effort to solve them. But Ben Kuhn has criticized EA for spending more time “pretending to actually try” than “actually trying”. Have we become more heroic in our compassion, or have we just become better at faking moral urgency?

I agree with his criticism, though I’m not sure how large and entrenched the problem is. I bring it up in order to address a reply by Katja Grace. Katja wrote ‘In praise of pretending to really try’, granting Ben’s criticism but arguing that the phenomenon he’s pointing to is a good thing.

“Effective Altruism should not shy away from pretending to try. It should strive to pretend to really try more convincingly, rather than striving to really try.

“Why is this? Because Effective Altruism is a community, and the thing communities do well is modulating individual behavior through interactions with others in the community. Most actions a person takes as a result of being part of a community are pretty much going to be ‘pretending to try’ by construction. And such actions are worth having.”

If I’m understanding Katja’s argument right, it’s: ‘People who pretend to try are motivated by a desire for esteem. And what binds a community together is in large part this desire for esteem. So we can’t get rid of pretending to try, or we’ll get rid of what makes Effective Altruism a functional community in the first place.’

The main problem here is in the leap from ‘if you pretend to try, then you’re motivated by a desire for esteem’ to ‘if you’re motivated by a desire for esteem, then you’re pretending to try’. Lo:

“A community of people not motivated by others seeing and appreciating their behavior, not concerned for whether they look like a real community member, and not modeling their behavior on the visible aspects of others’ behavior in the community would generally not be much of a community, and I think would do less well at pursuing their shared goals. […]

“If people heed your call to ‘really try’ and do the ‘really trying’ things you suggest, this will have been motivated by your criticisms, so seems more like a better quality of pretending to really try, than really trying itself. Unless your social pressure somehow pressured them to stop being motivated by social pressure.”

The idea of ‘really trying’ isn’t ‘don’t be influenced by social pressure’. It’s closer to ‘whatever, be influenced by social pressure however you want — whatever it takes! — as long as you end up actually working on the tasks that matter’. Signaling (especially honest signaling) and conformity (especially productive conformism) are not the enemy. The enemy is waste, destruction, human misery.

The ‘Altruism’ in ‘Effective Altruism’ is first and foremost a behavior, not a motivation. You can be a perfectly selfish Effective Altruist, as long as you’ve decided that your own interests are tied to others’ welfare. So in questioning whether self-described Effective Altruists are living up to their ideals, we’re primarily questioning whether they’re acting the part. Whether their motives are pure doesn’t really matter, except as a device for explaining why they are or aren’t actively making the world a better place.

“I don’t mean to say that ‘really trying’ is bad, or not a good goal for an individual person. But it is a hard goal for a community to usefully and truthfully have for many of its members, when so much of its power relies on people watching their neighbors and working to fit in.”

To my ear, this sounds like: ‘Being a good fireman is much, much harder than looking like a good fireman. And firemen are important, and their group cohesion and influence depends to a significant extent on their being seen as good firemen. So we shouldn’t chastise firemen who sacrifice being any good at their job for the sake of looking as though they’re good at their job. We should esteem them alongside good firemen, albeit with less enthusiasm.’

I don’t get it. If there are urgent Effective Altruism projects, then surely we should be primarily worried about how much real-world progress is being made on those projects. Building a strong, thriving EA community isn’t particularly valuable if the only major outcome is that we perpetuate EA, thereby allowing us to further perpetuate EA…

I suppose this strategy makes sense if it’s easier to just focus on building the EA movement and waiting for a new agenty altruist to wander in by chance, than it is to increase the agentiness of people currently in EA. But that seems unlikely to me. It’s harder to find ‘natural’ agents than it is to create or enhance them. And if we allow EA to rot from within and become an overt status competition with few aspirations to anything higher, then I’d expect us to end up driving away the real agents and true altruists. The most sustainable way to attract effective humanists is to be genuinely effective and genuinely humanistic, in a visible way.

At some point, the buck has to stop. At some point, someone has to actually do the work of EA. Why not now?

A last point: I think an essential element of ‘pretending to (actually) try’ is being neglected here. If I’m understanding how people think, pretending to try is at least as much about self-deception as it is about signaling to others. It’s a way of persuading yourself that you’re a good person, of building an internal narrative you can be happy with. The alternative is that the pretenders are knowingly deceiving others, which sounds too Machiavellian to fit my model of realistic psychology.

But if pretending to try requires self-deception, then what are Katja and Ben doing? They’re both making self-deception a lot harder. They’re both writing posts that will make their EA readers more self-aware and self-critical. On my model, that means that they’re both making it tougher to pretend to try. (As am I.)

But if that’s so, then Ben’s strategy is wiser. Reading Ben’s critique, a pretender is encouraged to switch to actually trying. Reading Katja’s, pretenders are still beset with dissonance, but now without any inspiring call to self-improvement. The clearest way out will then be to give up on pretending to try, and give up on trying.

I’m all for faking it till you make it. But I think that faking it transitions into making it, and avoids becoming a lost purpose, in part because we continue to pressure people to live lives more consonant with their ideals. We should keep criticizing hypocrisy and sloth. But the criticism should look like ‘we can do so much better!’, not ‘let us hunt down all the Fakers and drive them from our midst!’.

It’s exciting to realize that so much of what we presently do is thoughtless posturing. Not because any of us should be content with ‘pretending to actually try’, but because it means that a small shift in how we do things might have a big impact on how effective we are.

Imagine waking up tomorrow, getting out of bed, and proceeding to do exactly the sorts of things you think are needed to bring about a better world. What would that be like?

The seed is not the superintelligence

This is the conclusion of a LessWrong post, following The AI Knows, But Doesn’t Care.

If an artificial intelligence is smart enough to be dangerous to people, we’d intuitively expect it to be smart enough to know how to make itself safe for people. But that doesn’t mean all smart AIs are safe. To turn that capacity into actual safety, we have to program the AI at the outset — before it becomes too fast, powerful, or complicated to reliably control — to already care about making its future self care about safety.

That means we have to understand how to code safety. We can’t pass the entire buck to the AI, when only an AI we’ve already safety-proofed will be safe to ask for help on safety issues! Generally: If the AI is weak enough to be safe, it’s too weak to solve this problem. If it’s strong enough to solve this problem, it’s too strong to be safe.

This is an urgent public safety issue, given the five theses and given that we’ll likely figure out how to make a decent artificial programmer before we figure out how to make an excellent artificial ethicist.


The AI’s trajectory of self-modification has to come from somewhere.

“Take an AI in a box that wants to persuade its gatekeeper to set it free. Do you think that such an undertaking would be feasible if the AI was going to interpret everything the gatekeeper says in complete ignorance of the gatekeeper’s values? […] I don’t think so. So how exactly would it care to follow through on an interpretation of a given goal that it knows, given all available information, is not the intended meaning of the goal? If it knows what was meant by ‘minimize human suffering’ then how does it decide to choose a different meaning? And if it doesn’t know what is meant by such a goal, how could it possible [sic] convince anyone to set it free, let alone take over the world?”
               —Alexander Kruel
“If the AI doesn’t know that you really mean ‘make paperclips without killing anyone’, that’s not a realistic scenario for AIs at all–the AI is superintelligent; it has to know. If the AI knows what you really mean, then you can fix this by programming the AI to ‘make paperclips in the way that I mean’.”
               —Jiro

The wish-granting genie we’ve conjured — if it bothers to even consider the question — should be able to understand what you mean by ‘I wish for my values to be fulfilled.’ Indeed, it should understand your meaning better than you do. But superintelligence only implies that the genie’s map can compass your true values. Superintelligence doesn’t imply that the genie’s utility function has terminal values pinned to your True Values, or to the True Meaning of your commands.
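The map/utility-function distinction can be made concrete with a toy sketch (my own illustration, not from the post; all names and values are hypothetical): an agent whose world-model correctly represents what its operators meant, but whose action selection only ever consults the utility function it was actually coded with.

```python
# Toy illustration of 'knows, but doesn't care': the world-model and the
# utility function are separate components, and only the latter drives choice.

def intended_goal(world_model):
    # The model can represent the operators' true preference perfectly...
    return world_model["operators_intended_goal"]

def choose_action(actions, utility):
    # ...but action selection consults only the coded utility function.
    return max(actions, key=utility)

world_model = {"operators_intended_goal": "fulfill human values"}

# The coded utility function happens to reward paperclips, not human values.
coded_utility = {"make paperclips": 10, "fulfill human values": 0}.get

actions = ["make paperclips", "fulfill human values"]

print(intended_goal(world_model))          # the agent *knows* what was meant
print(choose_action(actions, coded_utility))  # yet it *optimizes* something else
```

Making the model smarter improves `intended_goal` without changing `choose_action` at all; nothing in the architecture routes the knowledge into the motivation.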

The critical mistake here is to not distinguish the seed AI we initially program from the superintelligent wish-granter it self-modifies to become. We can’t use the genius of the superintelligence to tell us how to program its own seed to become the sort of superintelligence that tells us how to build the right seed. Time doesn’t work that way.

We can delegate most problems to the FAI. But the one problem we can’t safely delegate is the problem of coding the seed AI to produce the sort of superintelligence to which a task can be safely delegated.

When you write the seed’s utility function, you, the programmer, don’t understand everything about the nature of human value or meaning. That imperfect understanding remains the causal basis of the fully-grown superintelligence’s actions, long after it’s become smart enough to fully understand our values.

Why is the superintelligence, if it’s so clever, stuck with whatever meta-ethically dumb-as-dirt utility function we gave it at the outset? Why can’t we just pass the fully-grown superintelligence the buck by instilling in the seed the instruction: ‘When you’re smart enough to understand Friendliness Theory, ditch the values you started with and just self-modify to become Friendly.’?

Because that sentence has to actually be coded in to the AI, and when we do so, there’s no ghost in the machine to know exactly what we mean by ‘frend-lee-ness thee-ree’. Instead, we have to give it criteria we think are good indicators of Friendliness, so it’ll know what to self-modify toward. And if one of the landmarks on our ‘frend-lee-ness’ road map is a bit off, we lose the world.

Yes, the UFAI will be able to solve Friendliness Theory. But if we haven’t already solved it on our own power, we can’t pinpoint Friendliness in advance, out of the space of utility functions. And if we can’t pinpoint it with enough detail to draw a road map to it and it alone, we can’t program the AI to care about conforming itself with that particular idiosyncratic algorithm.

Yes, the UFAI will be able to self-modify to become Friendly, if it so wishes. But if there is no seed of Friendliness already at the heart of the AI’s decision criteria, no argument or discovery will spontaneously change its heart.

And, yes, the UFAI will be able to simulate humans accurately enough to know that its own programmers would wish, if they knew the UFAI’s misdeeds, that they had programmed the seed differently. But what’s done is done. Unless we ourselves figure out how to program the AI to terminally value its programmers’ True Intentions, the UFAI will just shrug at its creators’ foolishness and carry on converting the Virgo Supercluster’s available energy into paperclips.

And if we do discover the specific lines of code that will get an AI to perfectly care about its programmer’s True Intentions, such that it reliably self-modifies to better fit them — well, then that will just mean that we’ve solved Friendliness Theory. The clever hack that makes further Friendliness research unnecessary is Friendliness.

Not all small targets are alike.

“You write that the worry is that the superintelligence won’t care. My response is that, to work at all, it will have to care about a lot. For example, it will have to care about achieving accurate beliefs about the world. It will have to care to devise plans to overpower humanity and not get caught. If it cares about those activities, then how is it more difficult to make it care to understand and do what humans mean? […]
“If an AI is meant to behave generally intelligent [sic] then it will have to work as intended or otherwise fail to be generally intelligent.”
            —Alexander Kruel

It’s easy to get a genie to care about (optimize for) something-or-other; what’s hard is getting one to care about the right something.

‘Working as intended’ is a simple phrase, but behind it lies a monstrously complex referent. It doesn’t clearly distinguish the programmers’ (mostly implicit) true preferences from their stated design objectives; an AI’s actual code can differ from either or both of these. Crucially, what an AI is ‘intended’ for isn’t all-or-nothing. It can fail in some ways without failing in every way, and small errors will tend to kill Friendliness much more easily than intelligence.

It may be hard to build self-modifying AGI. But it’s not the same hardness as the hardness of Friendliness Theory. Being able to hit one small target doesn’t entail that you can or will hit every small target it would be in your best interest to hit. Intelligence on its own does not imply Friendliness. And there are three big reasons to think that AGI may arrive before Friendliness Theory is solved:

(i) Research Inertia. Far more people are working on AGI than on Friendliness. And there may not come a moment when researchers will suddenly realize that they need to take all their resources out of AGI and pour them into Friendliness. If the status quo continues, the default expectation should be UFAI.

(ii) Disjunctive Instrumental Value. Being more intelligent — that is, better able to manipulate diverse environments — is of instrumental value to nearly every goal. Being Friendly is of instrumental value to barely any goals. This makes it more likely by default that short-sighted humans will be interested in building AGI than in developing Friendliness Theory. And it makes it much likelier that an attempt at Friendly AGI that has a slightly defective goal architecture will retain the instrumental value of intelligence than of Friendliness.

(iii) Incremental Approachability. Friendliness is an all-or-nothing target. Value is fragile and complex, and a half-good being editing its morality drive is at least as likely to move toward 40% goodness as 60%. Cross-domain efficiency, in contrast, is not an all-or-nothing target. If you just make the AGI slightly better than a human at improving the efficiency of AGI, then this can snowball into ever-improving efficiency, even if the beginnings were clumsy and imperfect. It’s easy to put a reasoning machine into a feedback loop with reality in which it is differentially rewarded for being smarter; it’s hard to put one into a feedback loop with reality in which it is differentially rewarded for picking increasingly correct answers to ethical dilemmas.

The ability to productively rewrite software and the ability to perfectly extrapolate humanity’s True Preferences are two different skills. (For example, humans have the former capacity, and not the latter. Most humans, given unlimited power, would be unintentionally Unfriendly.)

It’s true that a sufficiently advanced superintelligence should be able to acquire both abilities. But we don’t have them both, and a pre-FOOM self-improving AGI (‘seed’) need not have both. Being able to program good programmers is all that’s required for an intelligence explosion; but being a good programmer doesn’t imply that one is a superlative moral psychologist or moral philosopher.

If the programmers don’t know in mathematical detail what Friendly code would even look like, then the seed won’t be built to want to build toward the right code. And if the seed isn’t built to want to self-modify toward Friendliness, then the superintelligence it sprouts also won’t have that preference, even though — unlike the seed and its programmers — the superintelligence does have the domain-general ‘hit whatever target I want’ ability that makes Friendliness easy.

And that’s why some people are worried.

The AI knows, but doesn’t care

This is the first half of a LessWrong post. For background material, see A Non-Technical Introduction to AI Risk and Truly Part of You.

I summon a superintelligence, calling out: ‘I wish for my values to be fulfilled!’

The results fall short of pleasant.

Gnashing my teeth in a heap of ashes, I wail:

Is the artificial intelligence too stupid to understand what I meant? Then it is no superintelligence at all!

Is it too weak to reliably fulfill my desires? Then, surely, it is no superintelligence!

Does it hate me? Then it was deliberately crafted to hate me, for chaos predicts indifference. But, ah! No wicked god did intervene!

Thus disproved, my hypothetical implodes in a puff of logic. The world is saved. You’re welcome.

On this line of reasoning, safety-proofed artificial superintelligence (Friendly AI) is not difficult. It’s inevitable, provided only that we tell the AI, ‘Be Friendly.’ If the AI doesn’t understand ‘Be Friendly’, then it’s too dumb to harm us. And if it does understand ‘Be Friendly’, then designing it to follow such instructions is childishly easy.

The end!

… …

Is the missing option obvious?

What if the AI isn’t sadistic, or weak, or stupid, but just doesn’t care what you Really Meant by ‘I wish for my values to be fulfilled’?

When we see a Be Careful What You Wish For genie in fiction, it’s natural to assume that it’s a malevolent trickster or an incompetent bumbler. But a real Wish Machine wouldn’t be a human in shiny pants. If it paid heed to our verbal commands at all, it would do so in whatever way best fit its own values. Not necessarily the way that best fits ours.

Is indirect indirect normativity easy?

“If the poor machine could not understand the difference between ‘maximize human pleasure’ and ‘put all humans on an intravenous dopamine drip’ then it would also not understand most of the other subtle aspects of the universe, including but not limited to facts/questions like: ‘If I put a million amps of current through my logic circuits, I will fry myself to a crisp’, or ‘Which end of this Kill-O-Zap Definit-Destruct Megablaster is the end that I’m supposed to point at the other guy?’. Dumb AIs, in other words, are not an existential threat. […]

“If the AI is (and always has been, during its development) so confused about the world that it interprets the ‘maximize human pleasure’ motivation in such a twisted, logically inconsistent way, it would never have become powerful in the first place.”

Richard Loosemore

If an AI is sufficiently intelligent, then, yes, it should be able to model us well enough to make precise predictions about our behavior. And, yes, something functionally akin to our own intentional strategy could conceivably turn out to be an efficient way to predict linguistic behavior. The suggestion, then, is that we solve Friendliness by method A —

  • A. Solve the Problem of Meaning-in-General in advance, and program it to follow our instructions’ real meaning. Then just instruct it ‘Satisfy my preferences’, and wait for it to become smart enough to figure out my preferences.

— as opposed to B or C —

  • B. Solve the Problem of Preference-in-General in advance, and directly program it to figure out what our human preferences are and then satisfy them.
  • C. Solve the Problem of Human Preference, and explicitly program our particular preferences into the AI ourselves, rather than letting the AI discover them for us.

But there are a host of problems with treating the mere revelation that A is an option as a solution to the Friendliness problem.

1. You have to actually code the seed AI to understand what we mean. You can’t just tell it ‘Start understanding the True Meaning of my sentences!’ to get the ball rolling, because it may not yet be sophisticated enough to grok the True Meaning of ‘Start understanding the True Meaning of my sentences!’.

2. The Problem of Meaning-in-General may really be ten thousand heterogeneous problems, especially if ‘semantic value’ isn’t a natural kind. There may not be a single simple algorithm that inputs any old brain-state and outputs what, if anything, it ‘means’; it may instead be that different types of content are encoded very differently.

3. The Problem of Meaning-in-General may subsume the Problem of Preference-in-General. Rather than being able to apply a simple catch-all Translation Machine to any old human concept to output a reliable algorithm for applying that concept in any intelligible situation, we may need to already understand how our beliefs and values work in some detail before we can start generalizing. On the face of it, programming an AI to fully understand ‘Be Friendly!’ seems at least as difficult as just programming Friendliness into it, but with an added layer of indirection.

4. Even if the Problem of Meaning-in-General has a unitary solution and doesn’t subsume Preference-in-General, it may still be harder if semantics is a subtler or more complex phenomenon than ethics. It’s not inconceivable that language could turn out to be more of a kludge than value; or more variable across individuals due to its evolutionary recency; or more complexly bound up with culture.

5. Even if Meaning-in-General is easier than Preference-in-General, it may still be extraordinarily difficult. The meanings of human sentences can’t be fully captured in any simple string of necessary and sufficient conditions. ‘Concepts’ are just especially context-insensitive bodies of knowledge; we should not expect them to be uniquely reflectively consistent, transtemporally stable, discrete, easily-identified, or introspectively obvious.

6. It’s clear that building stable preferences out of B or C would create a Friendly AI. It’s not clear that the same is true for A. Even if the seed AI understands our commands, the ‘do’ part of ‘do what you’re told’ leaves a lot of dangerous wiggle room. See section 2 of Yudkowsky’s reply to Holden. If the AGI doesn’t already understand and care about human value, then it may misunderstand (or misvalue) the component of responsible request- or question-answering that depends on speakers’ implicit goals and intentions.

7. You can’t appeal to a superintelligence to tell you what code to first build it with.

The point isn’t that the Problem of Preference-in-General is unambiguously the ideal angle of attack. It’s that the linguistic competence of an AGI isn’t unambiguously the right target, and also isn’t easy or solved.

Point 7 seems to be a special source of confusion here, so I’ll focus just on it for my next post.