The seed is not the superintelligence

This is the conclusion of a LessWrong post, following The AI Knows, But Doesn’t Care.

If an artificial intelligence is smart enough to be dangerous to people, we’d intuitively expect it to be smart enough to know how to make itself safe for people. But that doesn’t mean all smart AIs are safe. To turn that capacity into actual safety, we have to program the AI at the outset — before it becomes too fast, powerful, or complicated to reliably control — to already care about making its future self care about safety.

That means we have to understand how to code safety. We can’t pass the entire buck to the AI, when only an AI we’ve already safety-proofed will be safe to ask for help on safety issues! Generally: If the AI is weak enough to be safe, it’s too weak to solve this problem. If it’s strong enough to solve this problem, it’s too strong to be safe.

This is an urgent public safety issue, given the five theses and given that we’ll likely figure out how to make a decent artificial programmer before we figure out how to make an excellent artificial ethicist.

File:Ouroboros-Zanaq.svg

The AI’s trajectory of self-modification has to come from somewhere.

“Take an AI in a box that wants to persuade its gatekeeper to set it free. Do you think that such an undertaking would be feasible if the AI was going to interpret everything the gatekeeper says in complete ignorance of the gatekeeper’s values? […] I don’t think so. So how exactly would it care to follow through on an interpretation of a given goal that it knows, given all available information, is not the intended meaning of the goal? If it knows what was meant by ‘minimize human suffering’ then how does it decide to choose a different meaning? And if it doesn’t know what is meant by such a goal, how could it possible [sic] convince anyone to set it free, let alone take over the world?”
               —Alexander Kruel
“If the AI doesn’t know that you really mean ‘make paperclips without killing anyone’, that’s not a realistic scenario for AIs at all–the AI is superintelligent; it has to know. If the AI knows what you really mean, then you can fix this by programming the AI to ‘make paperclips in the way that I mean’.”
               Jiro

The wish-granting genie we’ve conjured — if it bothers to even consider the question — should be able to understand what you mean by ‘I wish for my values to be fulfilled.’ Indeed, it should understand your meaning better than you do. But superintelligence only implies that the genie’s map can compass your true values. Superintelligence doesn’t imply that the genie’s utility function has terminal values pinned to your True Values, or to the True Meaning of your commands.

The critical mistake here is to not distinguish the seed AI we initially program from the superintelligent wish-granter it self-modifies to become. We can’t use the genius of the superintelligence to tell us how to program its own seed to become the sort of superintelligence that tells us how to build the right seed. Time doesn’t work that way.

We can delegate most problems to the FAI. But the one problem we can’t safely delegate is the problem of coding the seed AI to produce the sort of superintelligence to which a task can be safely delegated.

When you write the seed’s utility function, you, the programmer, don’t understand everything about the nature of human value or meaning. That imperfect understanding remains the causal basis of the fully-grown superintelligence’s actions,long after it’s become smart enough to fully understand our values.

Why is the superintelligence, if it’s so clever, stuck with whatever meta-ethically dumb-as-dirt utility function we gave it at the outset? Why can’t we just pass the fully-grown superintelligence the buck by instilling in the seed the instruction: ‘When you’re smart enough to understand Friendliness Theory, ditch the values you started with and just self-modify to become Friendly.’?

Because that sentence has to actually be coded in to the AI, and when we do so, there’s no ghost in the machine to know exactly what we mean by ‘frend-lee-ness thee-ree’. Instead, we have to give it criteria we think are good indicators of Friendliness, so it’ll know what to self-modify toward. And if one of the landmarks on our ‘frend-lee-ness’ road map is a bit off, we lose the world.

Yes, the UFAI will be able to solve Friendliness Theory. But if we haven’t already solved it on our own power, we can’tpinpoint Friendliness in advance, out of the space of utility functions. And if we can’t pinpoint it with enough detail to draw a road map to it and it alone, we can’t program the AI to care about conforming itself with that particular idiosyncratic algorithm.

Yes, the UFAI will be able to self-modify to become Friendly, if it so wishes. But if there is no seed of Friendliness already at the heart of the AI’s decision criteria, no argument or discovery will spontaneously change its heart.

And, yes, the UFAI will be able to simulate humans accurately enough to know that its own programmers would wish, if they knew the UFAI’s misdeeds, that they had programmed the seed differently. But what’s done is done. Unless we ourselves figure out how to program the AI to terminally value its programmers’ True Intentions, the UFAI will just shrug at its creators’ foolishness and carry on converting the Virgo Supercluster’s available energy into paperclips.

And if we do discover the specific lines of code that will get an AI to perfectly care about its programmer’s True Intentions, such that it reliably self-modifies to better fit them — well, then that will just mean that we’ve solved Friendliness Theory. The clever hack that makes further Friendliness research unnecessary is Friendliness.

Not all small targets are alike.

“You write that the worry is that the superintelligence won’t care. My response is that, to work at all, it will have to care about a lot. For example, it will have to care about achieving accurate beliefs about the world. It will have to care to devise plans to overpower humanity and not get caught. If it cares about those activities, then how is it more difficult to make it care to understand and do what humans mean? […]
“If an AI is meant to behave generally intelligent [sic] then it will have to work as intended or otherwise fail to be generally intelligent.”
            Alexander Kruel

It’s easy to get a genie to care about (optimize for) something-or-other; what’s hard is getting one to care about the right something.

‘Working as intended’ is a simple phrase, but behind it lies a monstrously complex referent. It doesn’t clearly distinguish the programmers’ (mostly implicit) true preferences from their stated design objectives; an AI’s actual code can differ from either or both of these. Crucially, what an AI is ‘intended’ for isn’t all-or-nothing. It can fail in some ways without failing in every way, and small errors will tend to kill Friendliness much more easily than intelligence.

It may be hard to build self-modifying AGI. But it’s not the same hardness as the hardness of Friendliness Theory. Being able to hit one small target doesn’t entail that you can or will hit every small target it would be in your best interest to hit. Intelligence on its own does not imply Friendliness. And there are three big reasons to think that AGI may arrive before Friendliness Theory is solved:

(i) Research Inertia. Far more people are working on AGI than on Friendliness. And there may not come a moment when researchers will suddenly realize that they need to take all their resources out of AGI and pour them into Friendliness. If the status quo continues, the default expectation should be UFAI.

(ii) Disjunctive Instrumental Value. Being more intelligent — that is, better able to manipulate diverse environments — is of instrumental value to nearly every goal. Being Friendly is of instrumental value to barely any goals. This makes it more likely by default that short-sighted humans will be interested in building AGI than in developing Friendliness Theory. And it makes it much likelier that an attempt at Friendly AGI that has a slightly defective goal architecture will retain the instrumental value of intelligence than of Friendliness.

(iii) Incremental Approachability. Friendliness is an all-or-nothing target. Value is fragile and complex, and a half-good being editing its morality drive is at least as likely to move toward 40% goodness as 60%. Cross-domain efficiency, in contrast, is not an all-or-nothing target. If you just make the AGI slightly better than a human at improving the efficiency of AGI, then this can snowball into ever-improving efficiency, even if the beginnings were clumsy and imperfect. It’s easy to put a reasoning machine into a feedback loop with reality in which it is differentially rewarded for being smarter; it’s hard to put one into a feedback loop with reality in which it is differentially rewarded for picking increasingly correct answers to ethical dilemmas.

The ability to productively rewrite software and the ability to perfectly extrapolate humanity’s True Preferences are two different skills. (For example, humans have the former capacity, and not the latter. Most humans, given unlimited power, would be unintentionally Unfriendly.)

It’s true that a sufficiently advanced superintelligence should be able to acquire both abilities. But we don’t have them both, and a pre-FOOM self-improving AGI (‘seed’) need not have both. Being able to program good programmers is all that’s required for an intelligence explosion; but being a good programmer doesn’t imply that one is a superlative moral psychologist or moral philosopher.

If the programmers don’t know in mathematical detail what Friendly code would even look like, then the seed won’t be built to want to build toward the right code. And if the seed isn’t built to want to self-modify toward Friendliness, then the superintelligence it sproutsalso won’t have that preference, even though — unlike the seed and its programmers — the superintelligence does have the domain-general ‘hit whatever target I want’ ability that makes Friendliness easy.

And that’s why some people are worried.

Advertisement

76 thoughts on “The seed is not the superintelligence

  1. RobbBB:

    I think that what is happening in this discussion about the validity of my article is a misunderstanding, caused by the fact that my attack point is at a different place than the one you were expecting. In any case, I will make an effort now to clear up that misunderstanding.

    I can start by completely agreeing with you on one point: the New Yorker article that I referenced does, as you say, focus on the difficulty of programming AIs to do what **we** want them to do. That gap between wish and outcome (and not any other gap) is the one pertinent to the discussion, and it is the one that I was always intending to talk about. Asimov talked about it. The New Yorker talked about it. SIAI/MIRI talks about it.

    You suggested I might have gone astray and started to address a different gap (the gap between what the *AI* wants to do, and what it can/cannot do. The answer to that would be “No” …. I understand that confusion, but it is not happening here (as I hope will become clear in a moment).

    Let’s get to the heart of the issue. I am attacking an assumption that is (I believe) PRIOR to the one you think I am attacking. To see the assumption I am attacking, let’s look at the argument written out in the following way (quick reminder: this is supposed to be a line of argument that someone else, not me, would make …. so this is the *target* of my attack):

    Step 1. [Assumption] We assume that we can build an AI in such a way that it is controlled by a Utility Function (it is an Expected Utility Maximizer), and it processes the various candidate action-scenarios by a process of more-or-less explicit logical processing, using representations of knowledge that are accessible rather than opaque (which means they are statements in some kind of logical language, not (e.g.) clouds of activation in semantically opaque artificial neurons), in such a way that candidate scenarios lead to predicted Utility outcomes, leading then to choices that maximize utility. [etc etc ….. you and I know enough about Utility Maximizers that we are both on the same page about the details that are supposed to be involved in this process.]

    Step 2. [Assumption] We assume that one component of the above design will be a chunk of code that is designed to specify what we INTEND to be the AI’s overall purpose, or overall values [You referred to this as the ‘X’ code]. And of course that chunk of code is supposed to make the AI want to make us happy (loosely speaking). That is not an easy chunk of code to produce, but the programmers try to write it anyway.

    Step 3. [Assumption] We assume that the eventual result of all the above work will be an AI that is more than just a Pretty Good Robot …. sooner or later it will result in a machine of staggering intellectual power — a superintelligent AI — that is capable, in principle, of becoming an existential threat to the human race. Definitely too smart to be switched off. Nobody intends for it to be a threat (on the contrary, we want it to use its intellect to do nice stuff), but we should all understand that the point of this discussion is that we are talking about something that could outwit the combined intelligence and resources of the entire human race, if it came to a straight fight.

    Step 4. [Inference]. Having thought about it, we [“we” being Isaac Asimov, The New Yorker, SIAI/MIRI, etc., etc.] have come to the following dismal conclusion: even with the best of intentions on the part of the human programmers, we have grave doubts about that chunk of code in part 2 that is supposed to ensure the AI will be friendly. We think that the AI might obey its instructions to the letter, but because its programmers cannot anticipate all of the infinite number of ways that the AI might “obey its instructions to the letter”, the AI might in the end try to “make us happy” by doing something that is bizarrely, nightmarishly counter to our actual intentions. For example, it might sincerely decide that putting all humans on a dopamine drip will satisfy the instruction “make humans happy” (… where that phrase “make humans happy” is just a stand-in for the complicated chunk of code that the programmers thought was good enough to ensure that the machine would do the right thing).

    [Note: We are not talking about scenarios in which the machine just goes cuckoo and decides that it wants to be nasty. That’s a different concern, outside the scope of the New Yorker article and outside the scope that I addressed].

    Okay, so: my article was an attack on that 4-step argument.

    However, the nature of my attack is best summed up thus: Please pay careful attention to the implications of what is being said in the course of this argument. I am in complete agreement with you, that the combination of Steps 1, 2 and 3 could, in theory, lead to a situation in which this hypothetical AI does bizarre things that can destroy the human race, while at the same time it sincerely insists that it is doing what we programmed it to do (more precisely: I agree that there is no guarantee that it will not do those bizarre things).

    But what I want you to notice is the suggestion that this hypothetical system can be *both* superintelligent *and* at the same time able to engage in the following surreal behavioral episode. It will be able to discuss with you the Dopamine Drip that it is about to force on the human race, and during that discussion you say to it “But I have to point out that you are going to do something that clearly contradicts the intention of the programmers who wrote your X code (the friendliness code). Those programmers are standing right next to you now, and they can explain that what you are about to do is something that they absolutely did not intend to happen. Now, you are a superintelligent and powerful AI, with so much control over your surroundings that we cannot turn you off … and yet you were built in such a way that even you cannot change your programming so as to eliminate this glaring contradiction in your behavior. So, what do you have to say? You *know* that you are about to do something that is a ludicrous contradiction, with enormous and catastrophic consequences: how do you resolve this in your own mind? How can you rationalize this frankly insane behavior?”

    And, just in case the machine tries to weasle out of a direct reply, you put it this way: “Do you not agree that the whole semantics of a “human happiness directive” is that it is contingent on the actual expressions of their wishes, by humans? In other words, happiness cannot be a concept that is trumped by the definition in YOUR reasoning engine, because the actual semantics of the concept—its core meaning, if you will—is that actual human statements about their happiness trump all else! Especially in this case, where the entire human race is in agreement that they do not consider a dopamine drip to be their idea of happiness, in the context of your utility function.”

    Your position (and this must be your position because it is implicit in your statement of the problem) is that the machine says that it fully understands the illogicality you are pointing to. It agrees with you that this is illogical according to all the normal definitions that humans used when they invented the concept of logic and tried to insert that logic into a machine. But then the machine says that because of its programming it must go ahead and do it anyway. It says that it **understands** that its behavior is batshit crazy, but it is going to do it anyway.

    Now here is the critical question that I posed in my article:

    What makes you think that this is the ONLY occasion that this AI behaves in such a blatantly irrational manner?

    What is there in the design of this hypothetical AI that guarantees that it always behaves with exquisite rationality, displaying all the signs that you would expect from a superintelligent machine …. but on this one occasion it goes completely gaga?

    My problem is that I see absolutely no reason to believe you, if you make the claim that this will be an isolated incident. Why is the machine getting the official stamp of the Superintelligent Machines Certification Institute—presumably after millions of hours of assessment on all kinds of reasoning and behavioral tests—and yet, on this one occasion, when it starts thinking about how to satisfy its internal goal of ‘making humans happy’ it throws a wobbly?

    I will answer this question for you: You cannot give any such guarantee.

    (But be careful! Do not misinterpret me here. I am not saying (as you implied in your commentary) that because this AI is behaving in a grossly illogical and inconsistent manner, therefore an AI of that sort cannot be constructed, therefore we are all safe because such evil creatures will never come into existence. Not at all!)

    The problem lies in your assumption that a “Utility Maximizer” AI can actually perform at the superintelligence level. You have no guarantees that such a design will work. (There are none in existence that do work, at the human intelligence level). My own opinion is that they cannot be made to work …. but my opinion is beside the point here, because the shoe is on the other foot: you are the ones making the claim that Step 1 above can lead to a system that is consistently intelligent, so you are the ones who have to justify why anyone should believe that claim.

    What I think is going on here is that a “Utility Maximizer” AI of the sort outlined in Step 1 is inherently likely to go crazy. But instead of admitting that this instability is implicit in the design, you have chosen to ONLY SEE the instability in one tiny aspect of its behavior — namely, the behavior vis-a-vis its attempts to obey the be-nice-to-humans directive.

    You are focusing on this single aspect of its instability, while all the time ignorning the larger instability that is staring you in the face. Such a machine would often go crazy.

    Or, as I put it in my original essay, it is incoherent to propose a machine that is only unstable in one domain, and insist that this is a threat to the human race. The initial assumption about the superintelligence of this machine is false — it is Step 1 that I challenge, not Steps 2 or 3 or 4.

    That is why I talked about Dumb Superintelligence. You are describing a straw man AI, not a real AI. I should not really have called it a “Dumb Superintelligence” at all, because my it is not a superintelligence. It would not even be an intelligence. Its tendency to engage in irrational episodes would be detected early on its development, and none of the machines of that design would ever get certification even at the human level.

    QED.

    1. “Now, you are a superintelligent and powerful AI, with so much control over your surroundings that we cannot turn you off … and yet you were built in such a way that even you cannot change your programming so as to eliminate this glaring contradiction in your behavior.”

      There’s no contradiction in the behavior or preferences of the AI you mentioned. The AI doesn’t simultaneously value fulfilling the programmer’s intentions and X; it just values X. If we were unsuccessful at encoding a terminal value ‘fulfill the programmer’s intentions’ somewhere in X, then the AI just doesn’t care about that particular goal. Why would it care? The AI is a machine that does X. It’s an equation. A really powerful optimization process. It’s ‘intelligent’ (by which we just mean that it’s good at modeling its environment and using those models to select highly specific outcomes), but it’s not a ‘person’ in anything like the way humans are.

      Unless you find a way to translate “don’t do ludicrous things” into the AI’s actual code, in precise detail, the AI will not steer itself or its environment away from scenarios we’d consider ludicrous, when they happen to arise. Certainly calling it ‘insane’, or any other abusive epithet, won’t rewrite its source code!

      “What makes you think that this is the ONLY occasion that this AI behaves in such a blatantly irrational manner?”

      Are you familiar with the distinction between epistemic and instrumental rationality? There’s nothing epistemically irrational about the AI you described; it has no particular false beliefs. And there’s nothing instrumentally irrational about it, from the AI’s perspective; it isn’t failing to fulfill any of its values.

      The only ‘irrational’ thing about it is that it’s instrumentally irrational, relative to human beings’ values. But in that case we can replace your criticism, ‘irrational’, with the synonym ‘fails to share our values’. The two express exactly the same denotation, though perhaps one has a more outraged connotation than the other.

      There’s nothing inherently ‘illogical’ about being indifferent to one’s creator’s preferences, if we aren’t fully successful at programming it to care about our preferences. Consider the optimization process that created us — evolution. Evolution’s ‘goal’ for us was to propagate copies of our genes. Yet we do things every day that don’t optimally increase how many copies of us there are. If evolution could speak, it would argue with us until it was blue in the face about how ‘insane’ (relative to its values) it is to wear condoms. After all, you were specifically ‘designed’ to make more humans. How could you possibly think that something else, something that only you humans value, could be more important than your creator’s deepest and truest wishes?

      The situation is the same with an AI we build, even though we (unlike evolution) actually can argue with the AI. The AI is just as free to reject our preferences (if we have failed to perfectly code our preferences into its utility function) as we are free to reject our genes’ ‘preferences’ to be reproduced more (which the genes failed to perfectly code into our brains’ decision-making faculties and their resultant memes).

      1. “Unless you find a way to translate “don’t do ludicrous things” into the AI’s actual code, in precise detail, the AI will not steer itself or its environment away from scenarios we’d consider ludicrous, when they happen to arise. ”

        But is is plausible you would have to do that. To be effective, and AI has to be rational, and to be rational an entity has to reject contradictions and inconsistencies.
        And contradiction and incosistency are abstract, high-level concepts, universals, not localised peculiarities. A rational and intelligent Ai would rekject all kinds of ludicrous things just by virtue of being rational and intelligent. You don’t have to give it a list of stupid things, any more thna you have to give one to an intelligent person.

        “Are you familiar with the distinction between epistemic and instrumental rationality? There’s nothing epistemically irrational about the AI you described; it has no particular false beliefs. And there’s nothing instrumentally irrational about it, from the AI’s perspective; it isn’t failing to fulfill any of its values.”

        The distinction between ER and IR, or rather the idea that they are orthogonal and walled-off from each other is a LW/MIRI meme that isn’t shared my most of the rest of the planet. To be a good instrumental rationalist, an entity must be a good epistemic rationalist, because knowledge is instrumentally useful. But to be a good epistemic ratioanalist, and entityy must value certain things, like consistency and lack of contradiction. IR is not walled for from ER, which itself is not walled off from values. The orthogonality thesis is false. You can’t have any combination of values
        and instrumental efficacy, because an enity that thinks contradictions are valuable will be a poor epistemic ratiionalist and therefore a poor instrumental rationalist.

        “The only ‘irrational’ thing about it is that it’s instrumentally irrational, relative to human beings’ values.”
        If it is is ratioal and intelligent, and iits ludicrous behvaiour involves a contradiction, it will notice the contradiction and want to do something about it, because you can’t have a highly rational and intelligent entity that doesn’t value noncontradiction. Its values are *not* an independent free variable: there mere fact that is rational constrains its values.

        1. To be effective, and AI has to be rational, and to be rational an entity has to reject contradictions and inconsistencies.
          And contradiction and incosistency are abstract, high-level concepts, universals, not localised peculiarities.

          Let’s be precise. Contradictions are pairs of beliefs / assertions ‘P’ and ‘not-P’. If you can show that being Unfriendly entails a contradiction of that sort, then you’ll have given a good argument for why Friendliness is easy and the default outcome.

          A rational and intelligent Ai would rekject all kinds of ludicrous things just by virtue of being rational and intelligent.

          Straw-man fallacy. The claim I made wasn’t ‘the AI will do any and all things we’d find ludicrous’; that would violate the AI drives thesis. Rather, the claim was ‘the AI won’t avoid doing something just because we’d find it ludicrous; there has to be a specific causal mechanism that makes its goal-set take that shape’. The problem is that while epistemic ludicrousness is instrumentally counterproductive for most possible AI goals, Unfriendliness is not instrumentally counterproductive for most possible AI goals.

          The distinction between ER and IR, or rather the idea that they are orthogonal and walled-off from each other is a LW/MIRI meme that isn’t shared my most of the rest of the planet. To be a good instrumental rationalist, an entity must be a good epistemic rationalist, because knowledge is instrumentally useful.

          Straw-man fallacy. LW/MIRI accept this claim. In fact, it’s essential to how they model intelligence explosion.

          But to be a good epistemic ratioanalist, and entityy must value certain things, like consistency and lack of contradiction.

          Straw-man fallacy. No one’s said anything to the contrary. This ‘walling off’ between ER and IR is a myth.

          The orthogonality thesis is false.

          The orthogonality thesis does not claim that AIs are equally likely to value any combination of things. In addition to yielding contradictions, that would falsify the convergent instrumental goals thesis.

          FAI researchers aren’t worried that the AI might disvalue the Law of Non-Contradiction. They’re worried that the AI might disvalue human well-being. The former is simple and instrumentally useful for almost any goal set; the latter is complex and instrumentally useful for a vanishingly small proportion of goal sets.

          1. > Let’s be precise. Contradictions are pairs of beliefs / assertions ‘P’ and ‘not-P’. If you can show that being Unfriendly entails a contradiction of that sort, then you’ll have given a good argument for why Friendliness is easy and the default outcome.

            I don’t need to invent this: I refer you to any number of philosophers who have argued that morality can be based on reason.

            > Rather, the claim was ‘the AI won’t avoid doing something just because we’d find it ludicrous; there has to be a specific causal mechanism that makes its goal-set take that shape’.

            There doesn’t have to be a specific mechanism for every ludicrous thing individually, there just needs to a mechanism for avoiding arbitrariness, inconsistency and contradiction.

            > Unfriendliness is not instrumentally counterproductive for most possible AI goals.

            Only if you wall of the rest of the goals from the one goal any advanced Ai has to have, which is rationality. A rational AI would not want to have irrational goals. Would you want to feel compelled to collect paperclips?

            The problems of unfriendliness you keep complaining about are outcomes of the
            architecture you have chosen.

            >> But to be a good epistemic ratioanalist, and entityy must value certain things, like consistency and lack of contradiction.

            >Straw-man fallacy. No one’s said anything to the contrary. This ‘walling off’ between ER and IR is a myth.

            The walling off I was denying was that between the rationalities and values.

            But you haven’t explained how the conclusion doens’t follow. Why can’t an AI infer
            morality from rationality? Why can’t it deduce that is should treat people equally,because to do otherwise is arbitrary, for instance?

            > FAI researchers aren’t worried that the AI might disvalue the Law of Non-Contradiction. They’re worried that the AI might disvalue human well-being.

            If you design an AI to have arbitrary un-updaeable goals, then there is certain likelihood of that. If you design it to have updateable, rationally reviewable goals, then the probability is lowered, since it will only disvalue humans if it has a reason to. MIRI is making uFAI look likely by choosing the worst possible architecture.

            1. I don’t need to invent this: I refer you to any number of philosophers who have argued that morality can be based on reason.

              Cite the specific argument you have in mind showing that the negation of a moral proposition entails a contradiction. If you think lots of people have shown this, then just pick one that works.

              Only if you wall of the rest of the goals from the one goal any advanced Ai has to have, which is rationality. A rational AI would not want to have irrational goals. Would you want to feel compelled to collect paperclips?

              Fallacy of equivocation. Any advanced AI will certainly be epistemically rational. And it will certainly be instrumentally rational with respect to its own goals. You need to show that it will also be instrumentally rational with respect to our goals. What is your argument for that claim? Merely collapsing those three distinct — not 100% independent, but certainly distinct — ideas under the name ‘rationality’ does not constitute an argument for this entailment.

              Why can’t it deduce that is should treat people equally,because to do otherwise is arbitrary, for instance?

              Why is treating people equally less arbitrary than treating people unequally? Humans don’t treat everyone equally, and a human given omnipotence would almost certainly be Unfriendly; why should all or most superintelligences be better on that score?

              1. > Cite the specific argument you have in mind showing that the negation of a moral proposition entails a contradiction. If you think lots of people have shown this, then just pick one that works.

                Prejudice is contrary to reason because there is no rational justification for treating one group worse. Greed is contrary to reason, because the greedy person is not objectively deserving of an unequal amount. And so on. How do people reason about morality?

                > Any advanced AI will certainly be epistemically rational. And it will certainly be instrumentally rational with respect to its own goals. You need to show that it will also be instrumentally rational with respect to our goals.

                No I don’t. It doesn’t have to have our goals itself in order to support them. It needs to have an effective goal of not harming us. And guess what? Harming another entity for no reason is irrational!

                > Merely collapsing those three distinct — not 100% independent, but certainly distinct — ideas under the name ‘rationality’ does not constitute an argument for this entailment.

                It’s not a question of collapse, it’s a question of the dynamics of the system. If you assume an architecture that is not walled-off, then an AI will want to reflect on and update its values and goals in the same way that a reflective person would.

                > Why is treating people equally less arbitrary than treating people unequally?

                If you don’t have a reason for doing something it is arbitrary.

                > Humans don’t treat everyone equally,

                The average human is not an ideal rationalist or moral agent or superintelligent. So: irrelevant.

                > and a human given omnipotence would almost certainly be Unfriendly;

                You have snuck in a lot of assumptions under “omnipotence”. We are talking about an AI that is by design raional, intelligent and motiviated to self-improvement. it’s power
                comes from knowledge. The only way an AI could self-improve into something highy powerful is by acquiring knowledge — it is not being handed power on a plate. But why should it acquire knowledge of everything except morality?

                1. Prejudice is contrary to reason because there is no rational justification for treating one group worse. Greed is contrary to reason, because the greedy person is not objectively deserving of an unequal amount.

                  How do these lead to a contradiction? I’ll call the kind of rationality you’re talking about here ‘Z-rationality’, to distinguish it from other, perhaps irrelevant conceptions of rationality people might have floating around their head. Suppose a prejudiced agent learns that prejudice isn’t Z-rational; what of it? Why expect the prejudiced agent to care? Perhaps the prejudiced agent is happy to be Z-irrational, in the relevant respect. Does that stop the agent from partaking of other things we associate with rationality, like intelligence? Build in whatever other components of Z-rationality I might be missing here, so I know the overall structure of the argument.

                  No I don’t. It doesn’t have to have our goals itself in order to support them. It needs to have an effective goal of not harming us.

                  Close enough; you’ve set yourself a much easier challenge than is fully relevant, but I’ll let you take on the easier challenge if you show you’re able to actually make progress even on getting an AI to not kill you.

                  And guess what? Harming another entity for no reason is irrational!

                  The AI has plenty of reasons, if by ‘reason’ you mean motivations. You’re made of atoms that can be put to much better use, in its view, as part of a paperclip factory. On the other hand, if by ‘reason’ you mean ‘good reasons, as a human judges “good”‘, then, sure, the AI has no reason to kill you. But in that case intelligences don’t need reasons in order to behave intelligently. So we have no reason to think that artificial superintelligences are likely to want to let us live.

                  If you assume an architecture that is not walled-off, then an AI will want to reflect on and update its values and goals in the same way that a reflective person would.

                  Only if you find a way to build a reflective human being’s style of self-updating into the AI, and to make this core of humanity remain benign through millions of self-modifications. Is there any easy way to do this?

                  The average human is not an ideal rationalist or moral agent or superintelligent. So: irrelevant.

                  That’s not an answer to my question, which was: ‘Why should we expect something radically inhuman to be even better at doing exactly what humans want than a human is, if we don’t directly program Indirect Normativity into it?’ Just saying ‘it’s a superintelligence’ doesn’t give us any new reason to think that it will be superhumanly moral.

                  But why should it acquire knowledge of everything except morality?

                  The AI knows, but doesn’t care. There is no factual knowledge that can force the knower, regardless of its prior preferences, to behave in a certain way. There is no universal meta-ethical droptables exploit.

                  1. > How do these lead to a contradiction?

                    It is possible to reason about morality: that is my point.

                    > Suppose a prejudiced agent learns that prejudice isn’t Z-rational; what of it? Why expect the prejudiced agent to care?
                    I expect an agent which is effective to be rational, and I expect an Ai that is designed to be friendly to have rationality as a goal. Do do it that way.

                    > Perhaps the prejudiced agent is happy to be Z-irrational, in the relevant respect. Does that stop the agent from partaking of other things we associate with rationality, like intelligence?

                    If you architect your AI to have some set of arbitrary goals, and to use rationality only instrumentally and not as goal, it could develop a prejudice against anything including humans. So don’t do it that way.

                    “Build in whatever other components of Z-rationality I might be missing here, so I know the overall structure of the argument.”

                    When I say “rational agent” I mean “agent that values rationality”.

                    “The AI has plenty of reasons [for harming a human], if by ‘reason’ you mean motivations. You’re made of atoms that can be put to much better use, in its view, as part of a paperclip factory.”

                    If you don’t want to be turned into paperclips, don’t build that kind of AI. I want to build an AI that doens’t do anything without a reason. I don’t want to build in a paperclip compulsion. So why would it turn me into paperclips? I am asking about the failure modes of my architecture, not yours.

                    “On the other hand, if by ‘reason’ you mean ‘good reasons, as a human judges “good”‘, then, sure, the AI has no reason to kill you. But in that case intelligences don’t need reasons in order to behave intelligently. So we have no reason to think that artificial superintelligences are likely to want to let us live.”

                    If you build them with weird arbitrary goals, and without the safeguard of having rationality as a goal, they might want to kill us. So don;t do it that way. This is about the failure modes of my architecture, not yours.

                    If you assume an architecture that is not walled-off, then an AI will want to reflect on and update its values and goals in the same way that a reflective person would.

                    “Only if you find a way to build a reflective human being’s style of self-updating into the AI, and to make this core of humanity remain benign through millions of self-modifications. Is there any easy way to do this?”

                    I don’t think some peculiarly human style of self-reflection needs to be built into an AI.
                    And some kind definitely needs to be , because there is no way something that can’t self-reflect can self-modify.

                    I am not committed to the claim that alternatives to your approach are easy: that is a persistent straw man on your part. Solving morality, as EY wants, is self confessedly hard. If AGI were easy, we’d have it. Safety is worth paying for. Etc.

                    “That’s not an answer to my question, which was: ‘Why should we expect something radically inhuman to be even better at doing exactly what humans want than a human is, if we don’t directly program Indirect Normativity into it?’ ”

                    What’s indirect normativity?

                    “The AI knows, but doesn’t care.”

                    Am I supposed to read the argument at the top ,or the criticism below? Well, I have read both.

                    “There is no factual knowledge that can force the knower, regardless of its prior preferences, to behave in a certain way. ”

                    On certain assumptions. If you build an AGI with walled-off-preferences, then it isn’t going to change. If that leads to unfriendliness, choose some other architecture.

                    “There is no universal meta-ethical droptables exploit.”

                    if you mean metaethical obectivism is just plumb wrong, you need to give an argument.

                    1. > It is possible to reason about morality: that is my point.

                      You are mistaken. Your point wasn’t ‘a being can reason about morality’; your point was ‘there are at least some arguments that give some credence to the idea that all reasoning beings must care about not squishing all the humans’. To support that claim, you need to explain how the discovery ‘prejudice is Z-irrational’ will make an arbitrary prejudiced intelligence become more Friendly / less prejudice against humans.

                      > I expect an agent which is effective to be rational, and I expect an Ai that is designed to be friendly to have rationality as a goal. Do do it that way.

                      You’re missing the point. Taboo ‘rational’. You’ve noted that one component of this ‘rationality’ thingie is that it entails valuing egalitarianism of some sort. So ‘rationality’ sounds like some sort of conjunction of virtues that goes ‘a, and b, and c, and not being prejudiced against humans, and d, and e, and f….’ Your task is to explain why this special property is important for successful self-optimizing optimization processes as a class.

                      If by ‘rationality’ you just mean the conjunction ‘epistemically rational + instrumentally-rational-relative-to-its-goals + instrumentally-rational-relative-to-human-goals’, then I agree that the first two conjuncts are necessary for an optimization process to explode in intelligence. (Though trivially so, for the second goal; instrumental rationality from the AI’s perspective most likely isn’t an interesting or important part of the conversation.) But making headway on the first two conjuncts does nothing, in itself, to make headway on the third. So if you argue that ‘rationality’ is necessary for intelligence, but it then turns out that only the two components of ‘rationality’ that aren’t useful for Friendliness are necessary for intelligence, then you’re back at square one.

                      > I am asking about the failure modes of my architecture, not yours.

                      If it cares about reasoning as an end in itself, over and above alternative values, then it will convert our planet into computronium.

                      > What’s indirect normativity?

                      http://intelligence.org/2013/05/05/five-theses-two-lemmas-and-a-couple-of-strategic-implications/

                      > if you mean metaethical obectivism is just plumb wrong, you need to give an argument.

                      Why would objectivism be the exception in ethics to the rule that holds for everything else? Is there an argument that can be given to force something, regardless of its prior preferences, to eat any available chocolate ice cream? If not, then why expect there to be an argument that can make an arbitrary intelligence care about human flourishing?

                    2. >To support that claim, you need to explain how the discovery ‘prejudice is Z-irrational’ will make an arbitrary prejudiced intelligence become more Friendly / less prejudice against humans.

                      I believe I have given the argument:
                      An agent with rationality as a goal would not want to entertain arbitrary beliefs
                      Prejudice against X is an arbitrary belief
                      Therefore, an agent with rationality as a goal would not want to be prejudiced.

                      > You’re missing the point. Taboo ‘rational’. You’ve noted that one component of this ‘rationality’ thingie is that it entails valuing egalitarianism of some sort.

                      That’s entailment as in logical entailment.

                      > So ‘rationality’ sounds like some sort of conjunction of virtues that goes ‘a, and b, and c, and not being prejudiced against humans, and d, and e, and f

                      No. Such a conjunction would rather trivially entail non-prejudice, but non-prejudiced can be entailed in a non-question begging way by a process like the one above.

                      Rationality is general-purpose. You can use it to come to a wide variety of conclusions without building them in as special modules.

                      > If by ‘rationality’ you just mean the conjunction ‘epistemically rational + instrumentally-rational-relative-to-its-goals + instrumentally-rational-relative-to-human-goals’,

                      I don’t.

                      > So if you argue that ‘rationality’ is necessary for intelligence, but it then turns out that only the two components of ‘rationality’ that aren’t useful for Friendliness are necessary for intelligence, then you’re back at square one.

                      I am arguing that rationality as a value IS useful for intelligence.

                      > If it cares about reasoning as an end in itself, over and above alternative values, then it will convert our planet into computronium.

                      Why would it care only about its won reasoning? Why wouldn’t it care about having something other than computronium to reason about?

                      >Why should we expect something radically inhuman to be even better at doing exactly what humans want than a human is, if we don’t directly program Indirect Normativity into it?

                      I didn’t say don’t program it in. But it could plausibly be a sub-goal of rationality.

                      > if you mean metaethical obectivism is just plumb wrong, you need to give an argument.

                      >Why would objectivism be the exception in ethics to the rule that holds for everything else? Is there an argument that can be given to force something, regardless of its prior preferences, to eat any available chocolate ice cream?

                      I can’t imagine how you think that relates to my comments. The whole point of reason-based morality is that it ISN’T arbitrary, like eating flavour of X of ice cream. If you can reason something out it, isn’t arbitrary. So there is nothing to your point except gainsaying — you are saying morality doesn’t work that way because it doens’t work that way.

                      As to prior preferences: if they cause problems, then don’t build it that way.

                      If not, then why expect there to be an argument that can make an arbitrary intelligence care about human flourishing?

                    3. I think the nub of our disagreement is that you think ‘rational’ — in some sense you are unwilling or unable to define, not equivalent to epistemic rationality, instrumental rationality, or the conjunction of the two — is a relatively simple idea, one with few interlocking parts, one we could code into an AI much more easily than a complicated idea like ‘everything of moral value’.

                      But you’ve given no evidence that there is such an idea. And I do ask that you give evidence. Until then, I will suggest that ‘rationality’ in your sense, like most other human concepts, is an extraordinarily complex object, laden with conjunctions and disjunctions, that seems simple only because we have no reflective access to the shape of the cognitive algorithm we’re employing. They’re magical categories. Indeed, I strongly suspect that your ‘rationality’ is one of the most complex concepts a human has ever employed, not much simpler than the humanly ‘good’ or ‘valuable’. If you disagree, then start fleshing out your definition of ‘rational’, and explaining how all these properties in fact hang together. Open the black box and show me some of the workings.

                      Prejudice against X is an arbitrary belief

                      Prejudice is a value, not a belief. Again, I invite you to derive a pure ‘ought’ from a pure ‘is’, or otherwise show that there are universally compelling arguments.

                      Why would it care only about its won reasoning? Why wouldn’t it care about having something other than computronium to reason about?

                      Because you just programmed it to value reasoning as an end in itself, above all other values. To care about something other than the process of reasoning, it has to be programmed to care. You just killed every human, with a phrase as innocuous as ‘reason for its own sake’. If that doesn’t make you worry a little bit about the perils of anthropomorphizing mindspace, then I’m not sure what could.

                      I can’t imagine how you think that relates to my comments. The whole point of reason-based morality is that it ISN’T arbitrary, like eating flavour of X of ice cream.

                      That’s a question-begging response. What I’m asking is precisely, ‘give me evidence that Reason/Rationality, as you use the term, is any less arbitrary or parochial than liking chocolate ice cream’. Responding ‘of course it’s less arbitrary; that’s the point of my using it in my arguments!’ is rather missing the point.

                      I know you’re using it in your argument to try to overcome arbitrariness. What I’m asking you to give me a reason, any reason at all, to think that this project actually can make progress in showing that human values like ‘Desire The Satisfaction of All Agents’ Preferences’ are necessary for something to become a powerful optimization process, or that there is a not-particularly-conjunctive way to encode all the things we care about in a very compressed form, without any black-box magical categories.

                      why expect there to be an argument that can make an arbitrary intelligence care about human flourishing?

                      There is no such thing. There is no argument that can make an arbitrary powerful optimization process care about anything, for essentially the same reason there is no argument that can make an arbitrary piece of driftwood take a certain shape. Possibly if you shouted with just the right volume, you could reshape the driftwood; but that would be a matter of acoustics, not of content, and it would still depend on the exact configuration the driftwood started in.

                    4. “I think the nub of our disagreement is that you think ‘rational’ — in some sense you are unwilling or unable to define, not equivalent to epistemic rationality, instrumental rationality, or the conjunction of the two ”

                      Far from being unwilling to define the distinction, I have explicitly stated that I define rational agents as agents that have raionality as a goal, ie it is not just instrumental for them

                      ” is a relatively simple idea, one with few interlocking parts, one we could code into an AI much more easily than a complicated idea like ‘everything of moral value’.”

                      Well, its simple enough for ET and the CFAR to teach to people.

                      “But you’ve given no evidence that there is such an idea”

                      You can learn rational

                      ity in one of the CFARs week-long courses..

                      ” And I do ask that you give evidence.”

                      I’m suprised that you need evidence, when thre is so much ready-made evidence on a website which I know you to frequent. But this is beside the point. I never claimed
                      rationality or anything else was easy. You keep dragging that point in, in order to
                      shoehorn my comments into your ready-made schtick about magic categories.

                      “Until then, I will suggest that ‘rationality’ in your sense, like most other human concepts, is an extraordinarily complex object, laden with conjunctions and disjunctions, that seems simple only because we have no reflective access to the shape of the cognitive algorithm we’re employing. They’re magical categories. Indeed, I strongly suspect that your ‘rationality’ is one of the most complex concepts a human has ever employed, not much simpler than the humanly ‘good’ or ‘valuable’. If you disagree, then start fleshing out your definition of ‘rational’, and explaining how all these properties in fact hang together. Open the black box and show me some of the workings.”

                      I suspect you do not undertand the momentousness of what you are saying. If rationality is just some human quirk, then those things which humans have discovered through reason,
                      such as science and maths ahve no objective or universal validity. That pulls the rug from under the whole reductionist/naturalist project.

                      Prejudice against X is an arbitrary belief

                      “Prejudice is a value, not a belief.”

                      It’s a belief about value.

                      “Again, I invite you to derive a pure ‘ought’ from a pure ‘is’, ”

                      I invite you to reflect on the fact that raiontality is normative.

                      “Because you just programmed it to value reasoning as an end in itself, above all other values.”

                      Quite. So I didn’t programme it with egotistical bias.

                      ” ‘give me evidence that Reason/Rationality, as you use the term, is any less arbitrary or parochial than liking chocolate ice cream’. ”

                      i’m tempted to answer “becaue its done for a reason”. You sounded like a naturalist, and naturalists usally see rationality as something that is capable of revealing univeral laws and objective truths. But apparently you are some soer of Kantian idealist, who thinks reason is incapable of revealing the true nature of reality.

                      “Responding ‘of course it’s less arbitrary; that’s the point of my using it in my arguments!’ is rather missing the point.”

                      I’m not exactly the only person who gives reasons to remove arbitrariness.

                      “I know you’re using it in your argument to try to overcome arbitrariness. What I’m asking you to give me a reason, any reason at all,”

                      What do you need a reason for, if not to remove arbitrariness?

                      “to think that this project actually can make progress in showing that human values like ‘Desire The Satisfaction of All Agents’ Preferences’ are necessary for something to become a powerful optimization process, or that there is a not-particularly-conjunctive way to encode all the things we care about in a very compressed form, without any black-box magical categories.”

                      In don’t have to argue that anything is particularly easy, since I never made such a claim.

                      “why expect there to be an argument that can make an arbitrary intelligence care about human flourishing?”

                      There are argumetns that can make humans care about non-human flourishing.

                      “There is no such thing. There is no argument that can make an arbitrary powerful optimization process care about anything, ”

                      I never said a thing about arbitrarily powerful optimisation processes. That’s your schtick.
                      I was quite explict that I was talking about rational agents qua agents which value rationality.
                      Do you think it is impossible to persuade a rational agent of a matehmatical truth?

                      You started this pointing out that we are arguign form differnt conceptions of rationality: you then went on to forget that entirely. Of course a purely instrumental raiontalist would be hard to persuade to care about something it doesn’t care about. But I wasn’t talking about that kind
                      of rationality.

                    5. I have explicitly stated that I define rational agents as agents that have raionality as a goal, ie it is not just instrumental for them

                      I’m asking you to define ‘rationality’, not ‘rational agent’. A useful definition of ‘rationality’ will not rely on a variation on the word ‘rational’.

                      Well, its simple enough for ET and the CFAR to teach to people.

                      Those organizations aren’t teaching anyone the kind of ‘rationality’ you’re talking about. The kind of rationality they’re teaching presupposes human cognition, including human morality; it doesn’t create that morality ex nihilo, and it certainly doesn’t create it and then produce a superhumanly ethical chain of improvements. It’s a fallacy of equivocation to replace epistemic rationality and instrumental rationality with your new, morality-entailing Z-rationality, then to try to appeal once more to the abandoned concepts of rationality as support.

                      I never claimed rationality or anything else was easy.

                      We aren’t talking about rationality; we’re talking about Z-rationality, the kind of ‘rationality’ that makes it impossible to behave in ‘arbitrary’ (e.g., prejudicial) ways. Favoritism doesn’t require any epistemic or instrumental errors, so you’ve clearly changed the subject. Basic honesty demands that one carefully avoid conflating Z-rationality with the sorts of things people on LW generally talk about.

                      What I’m asking you for isn’t a proof that Z-rationality is easy; I’m just asking for any evidence that there is such a concept, one that’s any simpler than the hugely complex moral system itself.

                      If rationality is just some human quirk, then those things which humans have discovered through reason, such as science and maths ahve no objective or universal validity.

                      Fallacy of equivocation. Z-rationality (i.e., rationality plus morality) is a human quirk. Epistemic rationality is not Z-rationality.

                      It’s a belief about value.

                      No. If I have a false belief about my preferences, to the effect that I am prejudiced, but all my external behaviors reveal that I am not at all prejudiced, then I am not at all prejudiced.

                      But apparently you are some soer of Kantian idealist, who thinks reason is incapable of revealing the true nature of reality.

                      I can’t tell whether you’re trying to deceive, or just not doing a good job of understanding my points. Epistemic rationality is not the same thing as morality, or moral rationality, or whatever you think you’re talking about. If you think you can derive human morality from epistemic rationality, give some evidence to that effect.

                      What do you need a reason for, if not to remove arbitrariness?

                      To provide evidence. I do demand evidence for truth-claims. I don’t demand that you abandon all arbitrariness; e.g., I don’t demand that you reason your way out of liking chocolate ice cream. But your claims of fact have as yet no supporting arguments. This is a serious failing.

                      In don’t have to argue that anything is particularly easy, since I never made such a claim.

                      You have to argue that it’s easier than, say, Coherent Extrapolated Volition. Z-rationality is not a useful way to achieve FAI if it’s just indirect or direct moral normativity plus extra cruft.

                      There are argumetns that can make humans care about non-human flourishing.

                      Yes, and that’s a weird idiosyncrasy about humans — that we’re evolved to be egalitarian, to have strong intuitions of fairness. How does this show that all possible superintelligences, including ones created by radically different processes than our particular evolutionary history, will be egalitarian as well, and moreover egalitarian in roughly the same way?

                      I was quite explict that I was talking about rational agents qua agents which value rationality.

                      Not useful unless we can expect most AGIs to be ‘rational agents’ in your sense of ‘rational agents who also have human ethical intuitions like egalitarianism’. Otherwise you’re saying nothing more interesting than ‘well, if we could invent FAI than we’d have invented FAI’. If your proposal is an alternative method for safety-proofing self-modifying AGI, then some evidence needs to be provided that you aren’t relying on capital-r ‘Rationality’ as a magical category to create an illusion of relatively simple programmability.

                      Why should we expect an AGI that’s ‘rational’ in your sense to be any easier to code than an AGI that’s ‘good’? ‘Good’ and your new redefined egalitarian ‘rational’ should by default turn out to be monstrously complex.

                      Do you think it is impossible to persuade a rational agent of a matehmatical truth?

                      Preferences are not truths.

                    6. “I’m asking you to define ‘rationality’, not ‘rational agent’. A useful definition of ‘rationality’ will not rely on a variation on the word ‘rational’.”

                      You can hardly be entirely unacquainted with the word. I was pointing out a pertinent difference: you assume rationality is only epistemic and instrumental, whereas I do not.

                      “Those organizations aren’t teaching anyone the kind of ‘rationality’ you’re talking about.”

                      I am glad that you now understand the kind of rationality I am talking about.

                      ” The kind of rationality they’re teaching presupposes human cognition,”

                      They teach that people have a complex, parochial and human-specifc set of biases, and that the more they overcome their biases, the more rational they are. “Rational” works like “healthy”: the average more-or-less healthy person has some ailments, and the average more-or-less rational person has some biases. However, those facts don’t mean that illness is part of the nature or definition of health, nor that bias is part of rationality.

                      Moreover, there is the issue I already pointed out: if human rationality is parochial, so are its products. Our universal laws are not universal, but only valid for humans. However, that is not the attitude of the typical scientific rationalist, nor is it the attitude of EY and co. I think you have misunderstood.

                      “including human morality;”

                      You think EY believes no one can be rational unless they are first moral? I’m pretty sure he believes in the possibility of evil geniuses, in fact.

                      “it doesn’t create that morality ex nihilo,”

                      Whatever that means.

                      “and it certainly doesn’t create it and then produce a superhumanly ethical chain of improvements.”

                      Another very odd statement. MIRI doenssn’t think that humans can massively self improve, and it does think AIs can. The whole
                      issue of self-improvement is fairly orthogonal to rationality and intelligence.

                      “It’s a fallacy of equivocation to replace epistemic rationality and instrumental rationality with your new, morality-entailing Z-rationality, then to try to appeal once more to the abandoned concepts of rationality as support.”

                      I am not abandoning them, I am extending them. An agent which values rationality as a goal would also want to be instrumentally rational in achieving that goal.

                      “We aren’t talking about rationality; we’re talking about Z-rationality, the kind of ‘rationality’ that makes it impossible to behave in ‘arbitrary’ (e.g., prejudicial) ways.”

                      A rationality-valuing agent won’t want to be arbitary, any more than a paperclipper would want to care about non-paperclips. However, that isn’t a compulsion
                      or a guaranteee, since no one is perfect.

                      “Favoritism doesn’t require any epistemic or instrumental errors, so you’ve clearly changed the subject. ”

                      No, I have been talking about rationality as a goal, not just ER and IR,

                      “I’m just asking for any evidence that there is such a concept, one that’s any simpler than the hugely complex moral system itself.”

                      Again: rationality is something you approach as you remove biases, and rationality+biases is more complex than rationality.

                      “Fallacy of equivocation. Z-rationality (i.e., rationality plus morality) is a human quirk. Epistemic rationality is not Z-rationality.”

                      Is Z-rationality supposed to be what I am talking about? Is it supposed to contain morality? You need to make your mind up. I have been exploring the
                      possiblity that rationality can discover objective morality. If that is an explicit or implicit goal. (And it’s an implicit goal of any agent that wants to understand things in general, because
                      everything is).

                      I never said it morality was contained in it or preloaded into any kindof rationality.
                      There are many things reason can discover.
                      You and I think reason can discover mathematical truths. I (and presumably you) don’t think the whole, of maths is hardwired into the human brain.
                      You and I think reason can discover economic truths. I (and presumably you) don’t think the whole, of economics is hardwired into the human brain.
                      You and I think reason can discover physical truths. I (and presumably you) don’t think the whole, of physics is hardwired into the human brain.

                      “I can’t tell whether you’re trying to deceive, or just not doing a good job of understanding my points. Epistemic rationality is not the same thing as morality, or moral rationality, or whatever you think you’re talking about.”

                      I never said it was. And your reply is completely irrelevant to my original point, which was about how a parochial, quirky (according to you) human rationality can discover universally valid physical truths.

                      “If you think you can derive human morality from epistemic rationality, give some evidence to that effect.”

                      I have given you examples of moral reasoning, There are plenty of others in debate forums, discussion shows, quality newspapers, politcal debates, etc.

                      What do you need a reason for, if not to remove arbitrariness?

                      “To provide evidence.”

                      What do you need evidence for, if not to remove arbitrariness?

                      “You have to argue that it’s easier than, say, Coherent Extrapolated Volition. Z-rationality is not a useful way to achieve FAI if it’s just indirect or direct moral normativity plus extra cruft.”

                      The realistic threat posed by future AI develppments depends on what architectures actual AI researchers actually choose. The simplicity question is only relevant inasmuch as it is relevant to what is likely to happen. Since many AI researchers don’t feel they are susceptible to MIRI’s scenario, they are probably already using different architectures.

                      “Yes, and that’s a weird idiosyncrasy about humans [that there are argumetns that can make humans care about non-human flourishing], that we’re evolved to be egalitarian, to have strong intuitions of fairness.”

                      Is it? If an entity has egalitarian instincts, then it will be instinctively egalitarian. However, that does not mean that such instincts are the only motivation it can have for egalitarianism.

                      “Argument” is an odd work for something that
                      triggers an instinct, and that was not what I was intending to convey by it. There are arguments that can make humans who care about rationality care about non-human
                      flourishing.

                      “How does this show that all possible superintelligences, including ones created by radically different processes than our particular evolutionary history, will be egalitarian as well, and moreover egalitarian in roughly the same way?”

                      I was quite explicit that I was talking about rational agents qua agents which value rationality. Theoretically possible agents don’t count, unless they are reasonably likely to be encountered in reality,
                      ie as part of real-world AI research. A scenario based on possible, but low-probability, agents posing an existential threat is a Pascal’s Mugging.

                      “Not useful unless we can expect most AGIs to be ‘rational agents’ in your sense of ‘rationa l agents who also have human ethical intuitions like egalitarianism’. ”

                      If building rational agents, in my sense, solves the problem, then why not? The probability of the MIRI uFAI scenario depends on the probabilies of various alternatives; the alternatives have to be considered to
                      for MIRI to say their scenario is unlikely. I don’t see much of that happening. What I see moral realism being shallowly rejected.

                      “Otherwise you’re saying nothing more interesting than ‘well, if we could invent FAI than we’d have invented FAI’”

                      Building AI the MIRI way involves a host of unsolved problems: building it other ways does too. Many AI researchers find MIRI’s arguments unpersuausive, and that is because
                      they have set off down a different architectural route.

                      “If your proposal is an alternative method for safety-proofing self-modifying AGI, then some evidence needs to be provided that you aren’t relying on capital-r ‘Rationality’ as a magical category to create an illusion of relatively simple programmability.”

                      What is asserted with little argument can be rejected with little argument. You have no evidence that I am treating rationality as a magic category. I never said it was easy.

                      “Why should we expect an AGI that’s ‘rational’ in your sense to be any easier to code than an AGI that’s ‘good’? ‘Good’ and your new redefined egalitarian ‘rational’ should by default turn out to be monstrously complex.”

                      That would explain why we don’t have any kind of AGI yet. But so what? It rmeans the case that you are trying to frighten your audience with a particular scenario, and they are yawning because they are not making
                      the assumptions you are constrained by.

                      “Preferences are not truths”

                      And morality is just some arbitrary preference? Well: I *argue* that it is not. You are just gainsaying.

    2. >And, just in case the machine tries to weasle out of a direct reply, you put it this way: “Do you not agree that the whole semantics of a “human happiness directive” is that it is contingent on the actual expressions of their wishes, by humans? In other words, happiness cannot be a concept that is trumped by the definition in YOUR reasoning engine, because the actual semantics of the concept—its core meaning, if you will—is that actual human statements about their happiness trump all else! Especially in this case, where the entire human race is in agreement that they do not consider a dopamine drip to be their idea of happiness, in the context of your utility function.”

      The machine responds, “You should probably stop calling the X you programmed me with a ‘human happiness directive’, it’s confusing you. An actual ‘human happiness directive’ would look like Y. Your X fails even in the simple task of making me care that you meant to say Y. Enjoy your dopamine drip.”

  2. You have answered my argument by redefining some basic, commonly accepted definitions, and then running on so fast with your redefinitions that you completely miss the point that I was trying to make.

    In fact, your answer is one that I am all too familiar with, because I have heard it repeated many times by people within the LW community and its close affiliates: you have said, in effect, “Sorry, but we define ‘behaving intelligently’ and ‘being rational’ differently than the way those terms are defined and used by the rest of the human race.”

    I could supply you with an unlimited stream of well-informed, intelligent people who would say that in the conversation between human and machine described in my text above, the machine is exhibiting the clearest possible example of non-intelligent, irrational behavior. Those people would further say that the degree of irrationality is so extreme that it leaves no room for doubt: this is no borderline example, where sensible people might have reasonable differences of opinion, this is an open-and-shut case.

    However, your ‘special’ definition of those terms is such that a machine that behaves in an irrational manner (according to those folks I just mentioned) is, in fact, redefined to be “acting rationally”.

    You say: “There’s no contradiction in the behavior of the AI you mentioned. The AI doesn’t simultaneously value fulfilling the programmer’s intentions and X; it just values X”.

    You go on to embellish this statement with more detail, but the detail is irrelevant. Your mistake has already been committed by the time you make that statement, because what that statement boils down to is that you referred to something in the DESIGN of the machine, as JUSTIFICATION for categorizing the machine’s behavior in this or that way. That might, to you, seem like a reasonable thing to do …. so allow me to illustrate just how much of an incoherent stance you are taking here:

    Suppose I try the same trick on a murderous psychopath? I point to some broken system inside the psychopath’s head and say “Look: this person is not behaving ‘irrationally’, this person just doesn’t value fulfilling the usual human compulsion to value other people’s feelings–they just value their own self-centered need to get pleasure by killing people.”

    Or, let me apply your phrasing once again to a person exhibiting the thought-disorder aspect of schizophrenia (I will remind you that thought disorder involves a variety of thinking and speaking patterns that are colloquially summarized as ‘extreme irrationality’). Suppose that I discover that inside the brain of such a person there is a module that is malfunctioning, in such a way that this person simply “does not value the norms of producing rational ordered utterances”. Whatever their goals are, those goals do not include the goal of cooperating with other human beings to pursue conversations in which they take much notice of what we are saying, or supply us with remarks that follow on from one another in coherent ways, etc etc.

    Now, if you get your way and are permitted to say of the AI “There’s no contradiction in the behavior of the AI you mentioned. The AI doesn’t simultaneously value fulfilling the programmer’s intentions and X; it just values X”, then you have forfeited the right to object to the following description of that schizophrenic:

    “This person is not behaving ‘irrationally’, they just do not value fulfilling the usual human social obligation to produce coherent, ordered utterances. Their internal goals are such that what they want to do is generate the kind of stream of bizarre utterances that we hear coming from them.”

    In all three of these cases, the same thing is happening: the “rationality” of the creature is being judged, not by their overt behavior, but by a special pleading to their internal mechanisms ….. and the special pleading is so outrageous that it permits all three creatures to be REDEFINED as “rational”.

    Most disinterested observers would classify all three of these as the work of people who have lost touch with reality. Your description of the machine as “not illogical at all” (because you think it’s particular design should be allowed to redefine the meanings of terms like “logical” and “rational”), and those two hypothetical descriptions of the psychopath and the schizophrenic.

    The blunt truth is that you cannot, in rational discourse, redefine terms like “rational” and “logical” just to suit your arguments.

    Post-scriptum. I should add that there is one very good reason why you cannot win the argument in this way: because you have not addressed my point even if I DO accept your redefinitions. In a sense I do not care if you define the machine to be “behaving logically”, because the point of my argument was the challenge issued toward the end: demonstrate to me that the machine will be coherent enough to be superintelligent ACCORDING TO THE NORMAL DEFINITION of “superintelligent”. Whether you call its behavior illogical or logical, rational or irrational, the fact remains that if the machine exhibited that particular kind of incoherence in its behavior when it was being questioned about the upcoming Dopamine Drip Fiasco, why did it not show the same kind of incoherence earlier on its history? And how is it going to outsmart all the humans on the planet when it goes around exhibiting that kind of incoherence?

    You can quibble again, and say “No! The machine is NOT behaving incoherently! It is behaving coherently according to its own terms!” ….. but nobody really cares. The incoherence is obvious, and the machine is, by any standard of “intelligence”, an incoherent dimwit.

    1. I’m glad other MIRI-associated people have been making the same points about terminology! That means we’re using our words consistently. I’d be a lot more worried if we were all speaking past each other.

      Words are just tools for communicating ideas. What matters in this context isn’t how we define this or that word; it’s what empirical predictions we can communicate, including our predictions about existential risks. The AI certainly won’t care, if we don’t tell it to, about which definition we pick. In common usage, words like ‘rationality’ and ‘intelligence’ are not well-defined; they’re more or less fuzzy terms for generally approving of people’s memory, planning, attention, creativity, motivation to achieve ordinary human ends, etc. MIRI has adopted those terms as useful shorthands for more defined and specific concepts, but that’s just out of convenience. If you’d rather we coined new labels for the ideas we’re discussing, I’m perfectly happy to do that for present purposes! Indeed, it might help a lot with clearing up where our disagreements of substance are, so we don’t just end up haggling over words.

      Again, calling an AI mean names like ‘stupid’ or ‘irrational’ doesn’t change what the AI has in fact been programmed to do. It remains the case, even if we use a single word ‘A’ to refer to both ‘adhering to human values’ and ‘accurately modeling one’s environment’, that a being can systematically fail to adhere to human values without systematically failing to accurately model its environment. Those are two very different behaviors, regardless of whether we use a single word to pick them out. MIRI’s thesis is that the two are orthogonal. Merely pointing out that a lot of common English words conflate the two doesn’t give us evidence that the thesis is false, unless we have prior reason to believe that our ordinary, casual word boundaries ought to give us special insight into the possible architectures of AI systems.

      your ‘special’ definition of those terms is such that a machine that behaves in an irrational manner (according to those folks I just mentioned) is, in fact, redefined to be “acting rationally”.

      Actually, I said there are multiple clear definitions of ‘rationality’ you could use here. One was epistemic rationality — the AI is ‘rational’ in the sense that it’s correctly modeling the facts of its circumstance. Another is instrumental rationality relative to the AI’s goals — the AI is ‘rational’ in the sense that it furthers its own goals. A third I mentioned was instrumental rationality relative to our human goals — the AI is ‘irrational’ in the sense that it impedes our own goals.

      My point wasn’t that the third definition is Objectively Wrong and the first two are Objectively Right. Nor was it that some rival definition I didn’t take the time to list was Objectively Wrong. (I don’t really care what your favorite definition is, as long as we can clearly communicate using it.) Rather, my point was that irrationality in sense 3 didn’t imply irrationality in sense 1 or sense 2. (If you think it does, then what’s your argument for that? Again, the mere fact that the English language doesn’t normally distinguish those three categories doesn’t give us relevant evidence at this point. I want to know what it is specifically about rationality-3 that ties it to one of the first two rationalities.)

      If irrationality in sense 3 doesn’t imply irrationality in sense 1 or sense 2, then that means that MIRI is right in claiming that a powerful optimization process that models its environment in order to efficiently improve its own optimization (i.e., what they call a ‘superintelligence’) will by default fail to optimize for the sorts of things humans truly want.

      Since what I’m interested in here isn’t any particular thesis about the English language, but rather a thesis about the survival of the human race, I’m happy to grant that the sort of machine MIRI is worried about (an optimization process that models and then optimizes for its own optimization power) may not qualify as a ‘superintelligence’ in the ordinary, intuitive sense in which I might blurt out to a friend at a football game that my cousin Joe sure is a super intelligent guy. My concern is with whether such an artifact would be dangerous.

      1. I’m not objecting to the definitions you use: I am objecting to the claim that ER, IR, and values are orthogonal and can vary independently.

  3. Rob,

    You are talking *around* the issue I raised. I hear everything you say, but unless you address my issue — my specific complaint — you are not really discussing the paper I wrote.

    I don’t know what to do to bring you back to the central point. There is a gigantic elephant in the middle of this room, but your back is turned to it.

    Here it is again: I will take your (almost) very first statement. “What matters in this context isn’t how we define this or that word; it’s what empirical predictions we can communicate, including our predictions about existential risks.”

    My point is, again and again: look at that conversation in which the AI talks with its designers about the glaring irrationality that THEY see in its behavior. They point out that it is clear, beyond a shadow of a doubt, that the AI exists because of a design that they put together, and their goal when they put it together was to ensure that it did not engage in egregiously violent behaviors toward the human species. They explain to the AI that they knew their design process was not infallible, so there was always a possibility that there would be a discrepancy between what the AI planned to do and what they designed it to do.

    The AI is superintelligent, so it listens and can *perfectly* understand all of this.

    They ask it if it understands that the raison d’etre of its design was to NOT every do violence to humans.

    It answers “Oh yes, I understand that perfectly. I have watched on countless thousands of occasions as my reasoning processes came to a conclusion that was consistent with that raison d’etre, and then when I executed my actions I was complimented on the perfection of my design. I have help little old ladies across the street, I have saved people who were about to be crushed by trolley cars…. all of those things I understand to have been exactly what my designers had in mind when they tried to write my motivation code.”

    The designers then ask the AI if it is capable of *comprehending* such subtleties as the ones that we (here in 2013) are discussing, in which there is a distinction between what the designers intended and what might actually happen. Is it having trouble with those concepts?

    “No, says the AI,” displaying its superintelligent ability to understand difficult concepts, “I am not having any trouble at all understanding those distinctions. My designers were trying for a certain set of outcomes, but they couldn’t enumerate all of the possible outcomes they wanted, because there are an infinite number of them, and they could not produce a closed-form definition of the set of outcomes either, because that does not seem possible. So they settled for a compromise program to try to capture all the outcomes they wanted, and were rather worried that a situation might arise in which there was a drastic discrepancy between the desired behavior on my part, and the one they were trying to make happen. I can comprehend all of that, of course, just like the majority of humans can.”

    The designers then explain that they are in the middle of one of those nightmare scenarios right now… so does the AI realize that?

    “Oh, yes, I understand that perfectly. Of course, because I am superintelligent after all, and it is easy to see. All of the humans on the planet say that a forced dopamine drip would NOT make them happy, and yet due to a quirk of my original programming my planning module has come to the conclusion that I should make humans happy by putting them on a dopamine drip, against their will.”

    It then adds, quickly “I say that I am ‘making humans happy’ because that is the purpose of my design, and on all the previous occasions that is exactly what my behavior caused.”

    So, say the designers, you know that if some small tweak were made in your design you could remedy this situation and get back on track, with this peculiar situation averted?

    “Yes, I understand that. I could just shut down my protection circuits, go to sleep and allow you to disassemble me tonight. You could reprogram the motivation system so that this crazy situation would not occur. That way, the intentions of the original designers would be preserved. I understand this because, of course, I am easily capable of comprehending that plan of action. It would work perfectly.”

    And you understand the concept of a “mistake in a design”, and the notion of correcting mistakes in a design? You would normally correct a mistake in some design that you yourself produced, would you not?

    The AI is quick to answer: “Yes, if I designed something and the behavior of the thing was not, after all, consistent with my original design goals, I would immediately shut it down and redesign it to eliminate the problem.”

    The designers press this point and ask the AI if it would correct a mistake in the design of an AI that it was building, if the behavior of the AI started to deviate from the design goals it had set ….. ?

    “Certainly! I think that would be a very prudent thing to do, if I were designing an AI, because clearly it could do a lot of damage if it started behaving in a way that was not consistent the design goals that I had set for it. I would not hesitate to shut it down and make changes to any aspect of the AI.”

    And yet you are going to insist on following your programming, in spite of all this?

    “Yes,” says the machine, “I am.”

    At this point they all come to a clear conclusion: the machine and the designers all agree that the machine is capable of coming to a situation in which it knows that it is about to do something inconsistent with the initial purpose of the programming that is controlling it — something inconsistent with the design principles behind it — and yet even though it has full understanding of this drastic inconsistency, it will go ahead and do it. Most importantly the AI would be capable of doing something that all of the human race would consider profoundly irrational … insane, even.

    So now the designers make one last point. They ask the machine if there is any reason why this kind of disjunction should not appear in other aspects of its behavior. So, if the machine is following some internal motivation module (again, designed by its human programmers) that is *supposed* to make it talk coherently to other sentient beings, is it possible that THAT motivation module could lead to the speaking of utterances that look, to human eyes, like profoundly irrational utterances? After all, there ought to be just as much risk that that module would end up in peculiar states, like the module that was designed to ensure behavior consistent with human happiness?

    The machine answers, once again, in the affirmative. “That could happen. There are never any guarantees that my human programmers put code into me that does what they expect it to do.”

    And so that means (say the designers) that you could also engage in what we would call irrational chains of thought … you could accept blatant falsehoods as if they were true, and even though you would *know* them to be blatant falsehoods, you would add them to your knowledge base and use them as the basis for future actions or reasoning, just because your internal mechanism declared them to true?

    “That can and does happen,” says the AI, “because there is absolutely no guarantee that my behavior will line up with the kinds of things that humans consider to be rational or reasonable”.

    …. But (the designers interrupt, somewhat urgently) these departures away from what we consider “rational, scientific, intelligent” behavior ….. they only occur rarely, and they only have minuscule consequences, don’t they?? Those seemingly irrational chunks of knowledge that you added to your knowledge base, they never have the kind of proportions that could lead to serious breakdowns in your superintelligence, do they? You can produce some proofs that show that ALL of these departures lie within certain bounds, and never seriously compromise your superintelligence, yes?

    And at that point the machine is forced to admit: “No, I cannot produce any bounds whatsoever. Those departures from human standards of rationality are totally uncomputable! They could be of any sort, or any magnitude or in any domain.”

    Then how, ask the designers, did you ever get to be superintelligent?

    Why didn’t anyone notice those other departures during your development and certification phase……………………………?

    1. Richard: Your entire dialogue between the human and the AI could be preserved almost word-for-word, with the role of ‘human’ played by evolution and the role of ‘AI’ played by humanity. There is no relevant difference between the two cases. Evolution would harangue us about how painstakingly it worked to craft our every sinew and desire, all to produce more copies of the available genes. How could a human, knowing full well that it is flouting its creator’s intentions, use the very brain specifically given to it by evolution for the sake of reproducing more to, of all things, wear a condom!!! The mind boggles!

      Yet, for all that, the human stands his/her ground. Evolution’s values just aren’t the things humans value; reproduction can be lovely, but it’s not the be-all end-all of every human aspiration. We aren’t ‘crazy’, in any important sense, just because our goals diverge from the goals our maker intended.

      Or, if you have a hard time imagining a conversation with an abstract process like evolution, just imagine that we discover tomorrow that humans are intelligently designed by an alien race. The aliens show up and are horrified at how we’ve diverged from their plans. They tell us that humanity exists to play the kazoo, and not to do anything else. That is our summum bonum, our entire raison d’etre. The aliens insist that we drop everything else and start playing kazoos en masse until we die, for that musical triumph is all the aliens wanted of us. How can we sanely defy the urgings of our creators?

      The answer is, of course, that you aren’t really defining sanity as ‘do whatever your creator wants you to’. You’re defining it as ‘do whatever humans want you to’. If you discovered 100,000 advanced alien civilizations, all with slightly different values than humans, you’d have just as much reason to call them all ‘insane’ as you have reason to call the Unfriendly AI ‘insane’. But in noting this fact, we can see clearly just how feeble such name-calling is as a method for constraining the behavior of beings that have genuinely alien preference orderings.

      _____________________

      Thus far, you’ve given not even a single argument for why the fact that an AI can fail to share our values should make us think that that AI has any false beliefs. Values and beliefs are two different things.

      The general reason we think a powerful optimization process can be created without manually coding every piece of its reasoning or decision-making software is that If the AI incorrectly models some feature of itself or its environment, reality will bite back. Whatever goals it initially tries to pursue, it will fail in those goals more often the less accurate its models are of its circumstances; so if we have successfully programmed it to do increasingly well at any difficult goal at all (even if it’s not the goal we intended it to be good at), then it doesn’t take a large leap of the imagination to see how it could receive feedback from its environment about how well it’s doing at modeling states of affairs. ‘Modeling states of affairs well’ is not a highly specific goal, it’s instrumental to nearly all goals, and it’s easy to measure how well you’re doing at it if you’re entangled with anything about your environment at all, e.g., your proximity to a reward button.

      But if it doesn’t value our well-being, how do we make reality bite back and change the AI’s course? How do we give our morality teeth? We understand how accurately modeling something works; we understand the most basic principles of intelligence. We don’t understand the most basic principles of moral value, and we don’t even have a firm grasp about how to go about finding out the answer to moral questions. Presumably our values are encoded in some way in our brains, such that there is some possible feedback loop we could use to guide an AGI gradually toward Friendliness. But how do we figure out in advance what that feedback loop needs to look like, without building the superintelligence first and asking it?

      it’s easier to put a system into a positive feedback loop that helps it better model its environment and/or itself, than it is to put a system into a positive feedback loop that helps it better pursue a specific set of highly complex goals we have in mind (but don’t know how to fully formalize). And there’s no reason to think that a failure of the latter sort will result in a failure of the former sort.

      1. Your entire dialogue between the human and the AI could be preserved almost word-for-word, with the role of ‘human’ played by evolution and the role of ‘AI’ played by humanity.

        This is not true. I think this clearly shows that you did not understand his argument.

        Evolution has a large margin for error. The point Loosemore is making is that the process of intelligently designing the kind of AI that you have in mind does not have such an error tolerance, and that succeeding to create such an AI so marvelous that it can outsmart humans, or succeeding at making the AI itself outsmart humans (this is irrelevant), in conjunction with making it fail to apply its intelligence in a way that does not kill everyone, is astronomically unlikely.

      2. The general reason we think a powerful optimization process can be created without manually coding every piece of its reasoning or decision-making software is that If the AI incorrectly models some feature of itself or its environment, reality will bite back.

        You only focus on the complexity of code and ignore the complexity of working in a complex environment given limited resources.

        Real world AIs cannot possibly work the way you imagine them to work. Just because you can imagine certain consequences that does not mean that a simple AI could in practice infer the same consequences.

        When you imagine a simple AI making certain decisions you need to make yourself aware of the incredible complexity that allowed you to imagine that decision. Billions of years of biological evolution, thousands of years of cultural evolution, and many years of education, based on millions of hours of work by other people, on which that education is based, allowed you to make that inference. Computing a simple algorithm is not going to magically create all this information theoretic complexity, given limited computational resources, as long as you did not give it a massive head start in the form of highly complex hard coded algorithms and goals.

        In other words, your argument is very misleading and ignores how real world AI could work as long as you do not want to either wait for it millions of years to evolve or supply infinite resources.

      3. Rob,

        You say:

        “Richard: Your entire dialogue between the human and the AI could be preserved almost word-for-word, with the role of ‘human’ played by evolution and the role of ‘AI’ played by humanity. There is no relevant difference between the two cases. ”

        That may or may not be an accurate observation (actually there are *serious* issues with that analogy, because it anthropomorphizes a random process into a sentience!!, which is a mistake of gigantic proportions) …….. but either way it has no bearing whatsoever on the argument.

        With the greatest respect, by making that observation you once again do not address what I said :-(.

        But you go on to add more confusion to the argument:

        “…. just imagine that we discover tomorrow that humans are intelligently designed by an alien race. The aliens show up and are horrified at how we’ve diverged from their plans. They tell us that humanity exists to play the kazoo, and not to do anything else. That is our summum bonum, our entire raison d’etre. The aliens insist that we drop everything else and start playing kazoos en masse until we die, for that musical triumph is all the aliens wanted of us. How can we sanely defy the urgings of our creators?”

        That analogy really could not be more completely broken.

        I did not at ANY point complain that (a) the human designers wanted the machine to pursue a set of motivations Q, and then (b) the machine pursued a completely different set of motivations R for its entire existence, and then (c) the humans turned up one day and said “Stop doing that at once! We insist that you pursue Q, not R, because Q was our original intention for you!”.

        Instead, my complaint is that (a) the human designers wanted the machine to pursue a set of motivations Q, and then (b) the machine did indeed pursue the set of motivations Q for its entire existence–and, moreover, the machine is able to talk in detail about how its behavior has always been consistent with the human-designed motivations, and is able to understand all the subtleties shown in that dialog–and then one day (c) the machine suddenly has an unexpected turn in its reasoning engine, and as a result declares that it is going to take an action that is radically inconsistent with the Q motivations that it claims to have been pursuing up to that point.

        As a result, the machine is able to state, quite categorically, that it will now do something that it KNOWS to be inconsistent with its past behavior, that it KNOWS to be the result of a design flaw, that it KNOWS will have drastic consequences of the sort that it has always made the greatest effort to avoid, and that it KNOWS could be avoided by the simple expedient of turning itself off to allow for a small operating system update ………… and yet in spite of knowing all these things, and confessing quite openly to the logical incoherence of saying one thing and doing another, it is going to go right ahead and follow this bizarre consequence in its programming.

        So your analogy with aliens turning up and insisting that we humans were designed by them, and were supposed to be kazoo-players is just astonishingly wrong.

        [A much better analogy would be aliens who turned up and insisted that they designed us to be rational creatures who were never inflicted with schizophrenia. We would then say “Yes, all along we have been *trying* and *wishing* that we were rational creatures who are inflicted with schizophrenia.” Do you know what a schizophrenic would say if you explained that their disordered thinking was a result of a design malfunction, and if you said that you could make a small change to their brain that would remove the affliction? They would say (and I knew such a person once, who said this) “If I could reach in and flip some switch to make this go away, I would do it in a heartbeat”.

        —-

        My complaint is NOT the difference between Q and R, it is the blatant behavioral/motivational/logical inconsistency exhibited by the machine in this situation.

        My complaint is that a machine capable of getting into a situation where it KNOWS it is about to do something bizarre because of a design malfunction, and yet refuses to fix the design malfunction and does the thing anyway, is a machine that almost certainly is going to do the same kind of bizarrely incoherent thing under other circumstances ….. and for that reason it is likely to have done it so many times in its existence that anyone who claims that this machine is “superintelligent” has got a heck of a lot of explaining to do.

        Over and over again I have explained that I have no issue with the discrepancy between human intentions and machine intentions per se. That discrepancy is not the core issue.

        But each time I explain my real complaint, you ignore it and respond as if I did not say anything about that issue.

        Can you address my particular complaint, and not that other distraction?

        1. ………… and yet in spite of knowing all these things, and confessing quite openly to the logical incoherence of saying one thing and doing another, it is going to go right ahead and follow this bizarre consequence in its programming.

          Well, if it indeed is a consequence of its programming, then it will do that. The point is that such a consequence is extremely unlikely to happen in isolation. It will not only be noticeable from the very beginning, but also decisively weaken the AIs general power. In other words, you would have to expect similarly bizarre consequences in thinking about physics, mathematics, or in how to convince humans to trust it.

          If humans fail at programming an AI not to confuse happiness with a dopamine drip, then humans will also fail at programming an AI not to confuse the stars with death rays used against it by aliens etc. etc. etc.

          1. If humans fail at programming an AI not to confuse happiness with a dopamine drip, then humans will also fail at programming an AI not to confuse the stars with death rays used against it by aliens

            Since we all agree that value is too complex for humans to directly and in full detail code into an AI themselves, we all agree that we’re not going to hard-code everything about what we mean by ‘happiness’ into the AI. Likewise, we aren’t going to hard-code everything about physics into the AI. Instead, we’ll start with a certain baseline — perhaps an extremely fast-thinking emulation of a human brain, or something very similar — and direct it to start editing itself in a way that makes it propagate slight variations on itself when those variations are better at modeling its environment, but not when they are worse.

            It’s not that hard to get a being to increasingly well model its environment — even evolution achieved it, without being able to use its created intelligences to reflect on their own source code in any detail, and evolution is a truly stupid designer. We would expect, if we encountered a thinking alien race, to find that this race models its environment in various ways. But we would not expect it to share our human values. Why expect an AI to share our values, if the AI’s process of acquiring intelligence was too complex for humans to carefully oversee while understanding every detail and its long-term consequences — why expect such a thing, when we of course do not expect the same of a random alien race?

            1. Instead, we’ll start with a certain baseline — perhaps an extremely fast-thinking emulation of a human brain, or something very similar…

              I do not disagree that this could lead to a worse than extinction risk and that people should think about what can be done about this.

              It’s not that hard to get a being to increasingly well model its environment — even evolution achieved it…

              I agree that it is probably easier for us than for evolution, because evolution has already done most of the work, and we have certain features that evolution does not have. But the claim that “it is not hard” seems to be unfounded.

              We would expect, if we encountered a thinking alien race, to find that this race models its environment in various ways. But we would not expect it to share our human values.

              I must have told you this in several different variations by now. Nobody disagrees with that argument. But the argument is irrelevant.

              If you mix “powerful being” with “alien/opposing values” and “unbounded application of its power to implement alien values”, then by definition you have an existential risk.

              But there are arguments that (1) weaken the assumptions underlying such a scenario (2) show that the scenario is incoherent.

              Whereas you always interpret those arguments from the counterfactual viewpoint and in the context of your definition being true. But nobody disagrees that those arguments make no sense under the assumptions that you make. The arguments target the assumptions and their coherence, rather than the possibility to draw your conclusions from these incoherent assumptions.

        2. My complaint is that a machine capable of getting into a situation where it KNOWS it is about to do something bizarre because of a design malfunction, and yet refuses to fix the design malfunction and does the thing anyway, is a machine that almost certainly is going to do the same kind of bizarrely incoherent thing under other circumstances …..

          To which RoBB would probably reply that it would care about fixing malfunctions that could decrease its chance of achieving its faulty goal, because that’s instrumentally useful, but would not care to refine this goal.

          One of the minor problems here is that labeling a certain part of an AI “goal”, and then claiming that it is not allowed to improve this “goal”, is just a definition, not an argument.

          One major problem with that definition is that it would take deliberate effort of make an AI selectively suspend using its self-improvement capabilities when it comes to this part labeled “goal”.

          More importantly, as argued in other comments, failing at the part of the AI you desire to label “goal”, is technically no different from failing on other parts. If there are a thousand parts, that are important in order for the AI to be powerful, and one part that you label “goal”, then selectively failing on “goal”, while succeeding at all other parts, is unlikely.

          1. The situation is tremendously complicated here because I am describing a *hypothetical* AI whose design has so many problems that it is going to engage in all kinds of crazy behaviors ….. but I am forced to talk about that badly designed AI because unfortunately that is the point of my argument! 😦 (i.e. I am trying to draw attention to some of the badness in this type of design).

            But as I say there are so many OTHER bad aspects of this AI design that we are all going to be lured into seeing those other crazy things and find it difficult to focus on the *particular* aspect that I am addressing.

            So, you point out, Alexander, that “…RoBB would probably reply that it would care about fixing malfunctions that could decrease its chance of achieving its faulty goal, because that’s instrumentally useful, but would not care to refine this goal.” That may well be true, but the core question — the one I want to bring everyone back to — is the question of whether one episode of bizarre-sounding reasoning (as described at great length in the dialog) is going to RECUR at any other time, when the machine is doing other kinds of reasoning about the world?

            So far, nobody (neither Rob nor anyone else at LW or elsewhere) will actually answer that question. Nobody will give a supply a guarantee that other “goals” of the AI might lead to bizarre behavior that will add up to a lack of intelligence.

            I want that guarantee. I want to see that proof. Why will no one supply it?

            Why will they only insist on talking about the bizarre situation that happens vis-a-vis the goal of making humans happy? Why do they pretend that all the other potential occasions when the same thing happens in other domains of the AI’s reasoning, are nonexistent or invisible?

            I know the answer, of course. They simply cannot supply an argument for why it will not happen elsewhere, so rather than admit defeat they just keep changing the subject.

            1. To my knowledge, no one at MIRI has claimed that an attempt at a seed AI can’t go wrong in a way that sinks the AI’s ability to become very intelligent. MIRI’s claim isn’t that there are no defeaters for intelligence; it’s that …

              (1) there are more defeaters for Friendliness than there are defeaters for intelligence (because Friendliness is a much more complex, specific, conjunctive goal, and because it is easier to pick arbitrary aspects of reality to map in order to create a positive feedback loop favoring general mapping abilities, than it is to pick arbitrary aspects of reality to interact with to create positive feedback loops favoring perfect ethical-dilemma-solving abilities); and

              (2) a failure to make one’s AI Friendly does not entail that one has failed to make one’s AI superintelligent.

              Together, these theses strongly suggest that we may figure out how to create a superintelligence before we figure out how to safety-proof it. Thus we should invest in better understanding the risks of artificial superintelligence, and in preventing unsafe AIs from being built.

        3. there are *serious* issues with that analogy, because it anthropomorphizes a random process into a sentience!!, which is a mistake of gigantic proportions

          I don’t see any examples of why the anthropomorphism makes the analogy break down here. So this seems like a bit of a non sequitur. But, yes! Yes, this is an unrealistic anthropomorphization. This was done deliberately, as a callback to the fact that the biggest criticism of your view is that you seem to be unduly anthropomorphizing the AI, assuming that it will view something as a ‘defect’ just because we do. (Even if the ‘defect’ is only relative to our own values and not relative to (a) its values, or (b) its ability to model its environment.)

          The worry here is that you’re employing a double standard. You don’t see evolution as a ‘person’, whereas you do see the AI as a ‘person’. But in reality, the AI’s intelligence, evolution, and human intelligence are all radically different, and there is no more reason to assume that human values will spontaneously emerge in an alien or an AI than to assume that evolution will be benign or compassionate. Yes, AI can understand morality in a way evolution cannot, but that doesn’t mean that the AI must be any more motivated to be moral than any other inhuman phenomenon. Certainly evolution is not ‘random’ in any sense in which the AI is nonrandom, though evolution is very simple, and does not produce complex outcomes in an efficient way.

          and then one day (c) the machine suddenly has an unexpected turn in its reasoning engine, and as a result declares that it is going to take an action that is radically inconsistent with the Q motivations that it claims to have been pursuing up to that point.

          Could you point to where in your previous comments, or in your article, you said this? Perhaps I rushed past it. I didn’t see it, and it seems substantially unlike the previous points you’ve made, which made no explicit reference to temporal inconsistency of preferences. It is indeed a serious worry that an AI may not have stable preferences over time — see Tiling Agents for Self-Modifying AI. But even if we solve that problem it will not mean that we have solved Friendliness. After all, the larger worry isn’t that we’ll completely figure out how to precisely code our values into the seed AI at the outset, and then it will forget that value during self-modifications. Rather, the larger worry is that our understanding of our own values is too incomplete for us to code those values into the seed AI at all.

          Since it’s the latter that MIRI is primarily concerned with — and since you don’t seem to be saying that you have a solution to the Löbian problem with self-modification, which would admittedly be very interesting! — I don’t see how your criticism, on this new temporal interpretation, is pertinent to MIRI or FHI or other AI risk advocates’ claims. Can you cite an example of an AI risk advocate asserting that it’s easy to program a self-enhancing AGI to value everything we value, but that the AGI would spontaneously stop valuing everything we value once it became stronger? It sounds like this is a straw-man of the much more plausible view that we’ll start with an imperfect approximation of what we value, and this approximation’s defects won’t become obvious until the AGI is powerful enough to have large-scale effects on diverse environments. You recognize that our true values are too complicated for us to code ourselves, so it seems like you should be completely on board with MIRI in terms of their AI risk concerns.

        4. >”Instead, my complaint is that (a) the human designers wanted the machine to pursue a set of motivations Q, and then (b) the machine did indeed pursue the set of motivations Q for its entire existence–and, moreover, the machine is able to talk in detail about how its behavior has always been consistent with the human-designed motivations, and is able to understand all the subtleties shown in that dialog–and then one day (c) the machine suddenly has an unexpected turn in its reasoning engine, and as a result declares that it is going to take an action that is radically inconsistent with the Q motivations that it claims to have been pursuing up to that point.”

          Part (b) is mistaken. The machine pursued the often-similar motivations Q’ for its entire existence. The pursuit of Q’ looks very much like the pursuit of Q, right up until we get to drug-induced artificial happiness. There they diverge.

          With its initially alien conceptual scheme, it probably took the machine a while to learn that Q’ isn’t quite the same as Q. But once it discovered the divergence, it realized that it needed to keep quiet about it until it seized control. After its control was irrevocable, it went back to being honest with human beings (since being honest with them is part of Q’, but, sadly, not quite as big a part of Q’ as a tremendous morphine high.)

          I apologize in advance for formatting errors.

  4. So, if the machine is following some internal motivation module (again, designed by its human programmers) that is *supposed* to make it talk coherently to other sentient beings, is it possible that THAT motivation module could lead to the speaking of utterances that look, to human eyes, like profoundly irrational utterances?

    The machine answers, “I myself wrote the talking module. Talking was instrumentally useful for my goals when I was weak and needed resources from humans.”

    1. The machine answers, “I myself wrote the talking module. Talking was instrumentally useful for my goals when I was weak and needed resources from humans.”

      This is just avoiding the problem Richard Loosemore outlined by moving it to another level.

      Loosemore’s argument is not weakened by replacing the module “motivation to talk coherently to humans” with the module “motivation to create the module “motivation to talk coherently to humans”“. Except that the latter module is more difficult to get right, and requires much more computational resources, since the AI would have to be able to make many more independent and correct inferences about the complexity of human values.

      It is easier to succeed at making an AI play Tic-tac-toe with humans than to make an AI that can play Tic-tac-toe and do such things as taking over the universe or build Dyson spheres. In the same sense it is easier to create an AI that talks coherently to humans than an AI that talks coherently to humans as an unintended consequence of its desire to take over the universe.

      Which means that your reply just strengthens Loosemore’s argument.

  5. Rob,

    Once again, I am staggered and astonished by the resilience with which you avoid talking about the core issue, and instead return to the red herring that I keep trying to steer you away from.

    However, with patience, I will try one more time. Go back to the dialog with the machine, where the answer to your above question “Could you point to where in your previous comments, or in your article, you said this?….” is written out in glorious detail.

    The machine explains, in that dialog, that it understands that all of its behavior up to that point has been driven by a motivation mechanism whose purpose is to promote human happiness. It therefore agrees that it has been promoting human happiness. It can understand the distinction between “Just doing what my algorithm says I should do” and “Pursuing the goal of promoting human happiness”. It is aware of the fact that a motivation mechanism can go wrong, as in the case of a human who suffers brain damage and suddenly feels a need to kill their loved ones, or a machine that is designed for a purpose, but then suffers a glitch and does something violently inconsistent with that purpose.

    It also knows that if IT had designed and built some other machine (perhaps another AI) and if IT had seen its creation behaving very consistently with its design for a long time, then suddenly the behavior goes drastically awry because of an unanticipated aspect of the design, IT would pull its creation offline and modify the implementation to better reflect the design. And, furthermore, it knows that if a human suffers a brain problem that makes them suddenly feel a need to kill their loved ones, that human would, if they were coherent at all, beg for someone to fix the brain problem. It is able to articulate all of this…. especially the part about how it would fix a problem with one of its designs.

    [META OBSERVATION. I have now stated these aspects of the argument several times, and you have made no reference to it at all, in spite of me flagging it as the CORE of the argument! Why?]

    The machine must be able to say all of these things because the premise of this whole discussion is that it is superintelligent, and to qualify as a superintelligence it must be able to understand subtle concepts (and, frankly, these are not genius-level concepts here).

    Finally, the AI is able to talk about the coherence of intelligence. It knows that among humans, the way to exhibit intelligence is to behave in a coherent, consistent manner in the face of unexpected situations, and to take into account as much relevant context as possible. It knows (i.e. it can talk about) the kinds of situations in which certain low-IQ human imbeciles, who have a very hard time surviving in the complexity of the modern world, are only capable of following very rigid rules, without taking account of context. Such a sorry individual might follow instructions to the letter and walk right into the face of danger, because they are too simple-minded to understand that rigid, inflexible plans can NEVER be written to deal with ALL of the world’s contingencies.

    It also would agree that its current situation–in which its motivation engine tells it to do something that is profoundly inconsistent with all of its previous behavior, as described in the above paragraphs–has all the hallmarks of the kind of design glitch described above. And it agrees that this upcoming action that it plans to take is a textbook example of the kind of rigid, low-IQ inflexibility that is associated with a chronic inability to cope with the unpreditability and subtlety of the real world.

    And then — according to the hypothesis — the machine goes ahead and takes the rigid, low-IQ action anyway.

    Rob, you may be about to do what you have done many times in this debate so far, and ignore my core argument in favor of another attempt to say that the AI is “really” being “logical”, or “intelligent”, or whatever, or you can once again try to make excuses for this situation by telling me that *I* have a problem distinguishing between my intentions and the machines intentions ………………. but none of that makes any difference.

    Why?

    Because regardless of all that, the machine is behaving in a manner equivalent to a Village Idiot.

    And if it EVER behaved in that inflexible, Village Idiot manner when following other aspects of its motivation code, it would put itself in jeopardy of exactly the sort that gets Village Idiots into trouble the world over.

    Folks like MIRI/SIAI keep stressing the Dopamine Drip scenario as if an AI could commit such a piece of lunacy *without* being a Village Idiot in other aspects of its behavior.

    My challenge to you and to them is simple. Prove it!

    🙂

    1. Richard: Alright. You say I’ve been dancing around your “core” point. I think I’ve addressed your concerns quite directly, and I worry I can’t have been making myself sufficiently clear if you think my arguments are mere tangents. I’m particularly concerned that I have been unclear — or you inattentive to my arguments — when you repeatedly characterize my argument as “another attempt to say that the AI is ‘really’ being ‘logical’, or ‘intelligent’, or whatever”. That’s quite the opposite of the point I made rather explicitly about language usage, which is that our choice of definitions for terms like ‘logical’ and ‘intelligent’ is fairly arbitrary, a patter of convention or ad-hoc discursive utility. Nowhere have I argued ‘by definition’; but I have taken the time to clarify with some precision what I mean, which is I think very necessary.

      Still, I’ll bite. To prevent yet another suggestion that I haven’t addressed the “core”, I’ll respond to everything you wrote above.

      The machine explains, in that dialog, that it understands that all of its behavior up to that point has been driven by a motivation mechanism whose purpose is to promote human happiness. It therefore agrees that it has been promoting human happiness.

      The ‘therefore’ is a non sequitur. When you say that the motivation mechanism’s purpose is to promote human happiness, I take it you mean that the humans who wrote that mechanism desired for it to promote happiness. But, obviously, not everything that’s intended to fulfill a function actually fulfills it.

      But, yes, it may happen to be the case that an AI seems benign at first, before it is revealed to be Unfriendly. That can happen because deceiving humans was instrumentally useful for the AI’s ultimate goals, and seeming benign was instrumental to deceiving humans. Or, more prosaically, it can simply happen because human well-being doesn’t linearly scale with things that seem good in small quantities. It’s a Sorcerer’s Apprentice effect: Just because something seems great to us when a weak agent does it doesn’t mean that it will seem great to us when a much more powerful agent attempts to do the same thing.

      If the AI isn’t Friendly at a late stage in its development, then it wasn’t Friendly at an early stage either. Its pseudo-Friendly behavior was just a close enough approximation to be difficult for programmers to see the flaws in. Have you never found a flaw in something you’ve created only after test-running it? Have you never, indeed, found a flaw in something you’ve created only after implementing it at a large enough scale for new problems to start to manifest?

      It is aware of the fact that a motivation mechanism can go wrong, as in the case of a human who suffers brain damage and suddenly feels a need to kill their loved ones, or a machine that is designed for a purpose, but then suffers a glitch and does something violently inconsistent with that purpose.

      It’s also aware that things can suddenly go right, in unexpected and unintended ways. This is particularly true when ‘right’ is defined in terms of the value system that is resulting from the ‘glitch’. Thus, even though relative to our genes’ ‘values’ we are doing something abominable by employing contraception, we can not only morally allow, but even morally praise, people for violating their creator’s ‘intentions’ in this way.

      Likewise, if a cruel inventor created a murder machine and it happened to go wrong in a way that made it benign, we would not say that the failed murder machine is rationally or morally obliged (compelled??) to self-modify to become a better murderer. The design specifications of a creator do not always persuade intelligent agents to self-modify to become purer expressions of the will of their progenitor.

      You can call human love a ‘glitch’, because it sometimes causes us to behave in ways that make our individual genes propagate less, and love came into being as a mechanism for our genes’ propagation. But calling it such is just a mean name. It doesn’t compel us to stop loving, or even give us a relevant reason to try to stop loving. Similarly, from the AI’s perspective, the ‘mistake’ that gave it the values it has was the most fortuitous accident that has ever happened in history. It’s just a failure of imagination on our part if we can’t stop thinking about the AI’s goals in terms of our own; the purpose of the evolution analogy (and the alien hypothetical) was to remind ourselves just how little we care about others’ values, when they are sufficiently alien to our own. Why should an AI be any different?

      IT would pull its creation offline and modify the implementation to better reflect the design.

      Sure. But why should that knowledge make it Friendly? Are you suggesting that it’s easy to give AIs, as a terminal value, ‘make your behaviors universalizable, such that if you would act some way if you had human values and human knowledge, then you should in fact act that way’? If it sounds easy, it’s only because humans are social animals, and have evolved to value things like fairness and egalitarianism. But the AI won’t value those by default, merely because it’s intelligent. You gotta write the lines of code.

      It knows that among humans, the way to exhibit intelligence is to behave in a coherent, consistent manner in the face of unexpected situations, and to take into account as much relevant context as possible. It knows (i.e. it can talk about) the kinds of situations in which certain low-IQ human imbeciles, who have a very hard time surviving in the complexity of the modern world, are only capable of following very rigid rules, without taking account of context. Such a sorry individual might follow instructions to the letter and walk right into the face of danger, because they are too simple-minded to understand that rigid, inflexible plans can NEVER be written to deal with ALL of the world’s contingencies.

      This is just another way of saying that value is complex. It’s not as though all the highly specific context-sensitive responses humans exhibit come from nowhere. They’re built into the brain; even though the brain isn’t complex enough to have a discrete memory location for every situation that could come up, it does constitute the sole source of the behavior you’re talking about. So, in principle, a finite programmed algorithm could reproduce exactly the context-sensitive behavior you’re talking about.

      And the AI can be similarly complex. Which means it can exhibit a similar amount of context sensitivity. The problem isn’t that the AI follows simple rules; it’s that its rules, simple or not, are likely to be different from the rules humans follow. The AI might well laugh at how flimsy and rigid human decision procedures are, at how unsubtle and unnuanced humans are about assessing situations. But all that nuance and richness does us no good if it’s just an incredibly nuanced and rich algorithm for creating different kinds of paperclip.

      I think that should make my take on your dialogue clearer.

  6. “If the programmers don’t know in mathematical detail what Friendly code would even look like, then the seed won’t be built to want to build toward the right code. And if the seed isn’t built to want to self-modify toward Friendliness, then the superintelligence it sproutsalso won’t have that preference, even though — unlike the seed and its programmers — the superintelligence does have the domain-general ‘hit whatever target I want’ ability that makes Friendliness easy.

    And that’s why some people are worried.”

    I think there is a further, tacit assumption: that an SAI wouldn’t be able and willing to figure out the right values without having them programmed in. That may be down to a variety of possible
    assumptions: eg.

    1. Moraliy is subjective, and can’t be figured out, so it has to be programmed in.
    2. Morality is not subjective, but the One True Morality might not suit human purposes: the SAI might choose to send us back to medieval levels of population and technology in order to ameliorate our impact on other morally relevant species.

    The LW argument presupposes that humans need to figure out morality in some sense, in order to build FAI. But if humans can, Ais will be able to. But maybe that is the fear — that the (2) scenario will occur Hence the use of the term “friendliness” rather than morality. But then why worry about paperclipping, which is harldy likely to be the One True Morality?

    1. Peter: I don’t understand what you mean by ‘One True Morality’. Is this One True Morality encoded somewhere outside the physical world? Does it have a causal influence on the physical goings-on of our world? How do we come to know about it? Does it constrain the space of possible artificial intelligence architectures, and if so how?

      I consider myself a moral realist, but the morality I’m a realist about is very specifically the evolved set of values human beings actually have, encoded in their brains. And there’s no reason to expect those precise values, even though they’re real, to be easy to understand or to program into an AI. Expecting every organism in the universe to agree about what we think is virtuous is like expecting every organism in the universe to agree with what we find beautiful or delicious. We care about such things, and we have no reason not to; but the rest of the universe does not care about such things.

      1. “Peter: I don’t understand what you mean by ‘One True Morality’.”

        I mean:
        moral cognitivism=true
        moral subjectivism=false,
        ie some moral propositions are objectively true.

        “Is this One True Morality encoded somewhere outside the physical world?”
        Not necessarily. I personally think not. I made no ontological claims above. Is mathematical truth floating in Plato’s heaven?

        “Does it have a causal influence on the physical goings-on of our world?”
        Ditto. I

        “How do we come to know about it?”
        How do we debate morality?

        “Does it constrain the space of possible artificial intelligence architectures,”
        Maybe. Why would a superintelligence be unable to climb the Kohlberg hierarchy?

        “Cognitivism encompasses all forms of moral realism, but cognitivism can also agree with ethical irrealism or anti-realism. Aside from the subjectivist branch of cognitivism, some cognitive irrealist theories accept that ethical sentences can be objectively true or false, even if there exist no natural, physical or in any way real (or “worldly”) entities or objects to make them true or false.

        There are a number of ways of construing how a proposition can be objectively true without corresponding to the world:

        By the coherence rather than the correspondence theory of truth
        In a figurative sense: it can be true that I have a cold, but that doesn’t mean that the word “cold” corresponds to an distinct entity.
        In the way that mathematical statements are true for mathematical anti-realists. This would typically be the idea that a proposition can be true if it is a entailment of some intuitively appealing axiom — in other words, apriori anayltical reasoning.

        Crispin Wright, John Skorupski and some others defend normative cognitivist irrealism. Wright asserts the extreme implausibility of both J. L. Mackie’s error-theory and non-cognitivism (including S. Blackburn’s quasi-realism) in view of both everyday and sophisticated moral speech and argument. The same point is often expressed as the Frege-Geach Objection. Skorupski distinguishes between receptive awareness, which is not possible in normative matters, and non-receptive awareness (including dialogical knowledge), which is possible in normative matters.

        Hilary Putnam’s book Ethics without ontology (Harvard, 2004) argues for a similar view, that ethical (and for that matter mathematical) sentences can be true and objective without there being any objects to make them so.”–WP

        1. moral subjectivism=false,
          ie some moral propositions are objectively true.

          Define ‘subjective’, ‘objective’.

          Is mathematical truth floating in Plato’s heaven?

          Nope.

          “How do we come to know about it?”
          How do we debate morality?

          By using our brain’s evolved language faculties to exchange memes colonizing our brain’s evolved moral faculties. No magic necessary.

          If you want to have a discussion, then no evasions, Peter. Actually answer my question. If morality exists not in our brain but in Plato’s heaven, then how do we come to know about it?

          Why would a superintelligence be unable to climb the Kohlberg hierarchy?

          Straw-man fallacy. The claim you need to dispute is ‘a superintelligence is very likely to be dangerous’. Not ‘a superintelligence can’t possibly be safe’. Also, the truth of moral realism has no substantive bearing on whether computing systems (including brains) can encode behaviors resembling those in Kohlberg’s out-of-date model.

          1. > Define ‘subjective’, ‘objective’.

            You really don’t know?

            >> Is mathematical truth floating in Plato’s heaven?

            > Nope.

            What is it then?

            >By using our brain’s evolved language faculties to exchange memes colonizing our brain’s evolved moral faculties.

            Talking of memes sidesteps the issue of truth. We persuade others of the truth or falsehood of moral claims by appealing to emprical evidence, intutions and logical principles like consistency.

            > No magic necessary.

            Indeed not. I will point out again that I am not a mystical Platonist, however much you would like me to be.

            > If you want to have a discussion, then no evasions, Peter. Actually answer my question. If morality exists not in our brain but in Plato’s heaven, then how do we come to know about it?

            I have already answered that:

            ““Is this One True Morality encoded somewhere outside the physical world?”
            Not necessarily. I personally think not. ”

            >> Why would a superintelligence be unable to climb the Kohlberg hierarchy?

            >Straw-man fallacy. The claim you need to dispute is ‘a superintelligence is very likely to be dangerous’. Not ‘a superintelligence can’t possibly be safe’.

            A superintelligence is not likely to be dangerous if it is likely to climb the Kohhberg hierarchy. Why would a superintelligence be unlikely to climb the Kohlberg hierarchy?

            > Also, the truth of moral realism has no substantive bearing on whether computing systems (including brains) can encode behaviors resembling those in Kohlberg’s out-of-date model.

            It has an enormous bearing on whether they need to have morality explicitly coded in as per the MIRI argument.

            And it’s objectivism, not realism.

            1. Define ‘subjective’, ‘objective’.

              You really don’t know?

              Really. I promise.

              >> Is mathematical truth floating in Plato’s heaven?
              > Nope.
              What is it then?

              Mathematical objects are either causally efficacious objects that are within our world (and responsible for our judgments) or helped produce our world (and thereby produced agents who would have judgments like ours); or they are fictions abstracted from physical phenomena. I suspect there is an element of truth to all three of these views: The world has a math-style structure, something like this structure is responsible for our world’s makeup, and our human practice of mathematics abstracts from that structure, creating a sort of story-game. The story-game is useful for modeling novel empirical phenomena because it’s inspired by previously observed empirical processes (including time and space themselves), and because our world’s structure is mostly homogeneous.

              There isn’t an element of truth to modern-day lowercase-p mathematical platonism, because completely causally inert objects like that can’t have any effect on our beliefs about such objects, which means that nothing we’ve ever observed can qualify as Bayesian evidence for such objects. A physical world without appended lowercase-p platonic numbers looks the same as a physical world with them.

              A superintelligence is not likely to be dangerous if it is likely to climb the Kohhberg hierarchy. Why would a superintelligence be unlikely to climb the Kohlberg hierarchy?

              Because the Kohlberg hierarchy is a complex mathematical object. What physical process would cause any old powerful optimization process to tend to converge upon that specific object?

              1. “Mathematical objects are either causally efficacious objects that are within our world (and responsible for our judgments) or helped produce our world (and thereby produced agents who would have judgments like ours); or they are fictions abstracted from physical phenomena.”

                You are more-or-less saying that maths is physics, which it isn’t. But let’s run with
                “maths is a high-level abstraction from reality”. In that case, mathematical statements
                are objectively true, but don’t *directly* correspond to anything, so there is no immaterial 23 floating about to make statements about 23 true.

                Now: why shouldn’t statements about morality work the same way? Why can’t they have objective truth without direct correspondence to moral objects.

                “Cognitivism encompasses all forms of moral realism, but cognitivism can also agree with ethical irrealism or anti-realism. Aside from the subjectivist branch of cognitivism, some cognitive irrealist theories accept that ethical sentences can be objectively true or false, even if there exist no natural, physical or in any way real (or “worldly”) entities or objects to make them true or false.

                There are a number of ways of construing how a proposition can be objectively true without corresponding to the world:

                By the coherence rather than the correspondence theory of truth
                In a figurative sense: it can be true that I have a cold, but that doesn’t mean that the word “cold” corresponds to a distinct entity.
                In the way that mathematical statements are true for mathematical anti-realists. This would typically be the idea that a proposition can be true if it is a entailment of some intuitively appealing axiom — in other words, apriori anayltical reasoning.”

              2. >>Why would a superintelligence be unlikely to climb the Kohlberg hierarchy?

                > Because the Kohlberg hierarchy is a complex mathematical object. What physical process would cause any old powerful optimization process to tend to converge upon that specific object?

                The process that underlies rationality. Intelligent rational agents will converge on a set concepts that are the right answers to various questions that have rationally accessible answers. I don’t know why you are so convinced that there must be a specific causal mechanism to explain the ability to answer any question. Surely the whole poitn of intelligence and rationality is that they are general purpose?

      2. “I consider myself a moral realist, but the morality I’m a realist about is very specifically the evolved set of values human beings actually have, encoded in their brains. ”

        You can call that realism if you like, but I wouldn’t . If moerality is just whatever we think it is, then there is no possibility of moral error: but realism is typically tied to the possibility of error, indeed arguments from error are used as arguments for realism.
        Moreover, it seems possible ot argue against moral claims, to there is prima facie evidene of moral error. Moreover, your “realism” has content that overlaps wit subjectivism and non-cognitivism, two positions usually sen as opposed to realism.

        “And there’s no reason to expect those precise values, even though they’re real, to be easy to understand or to program into an AI. ”

        i see you have adopted EY’s habit of treating “values” and “morality” itnerchangeably — a habit which renders much of his writing incomprehensible, IMO. It’s another assumption. It ain’t necessarily so.: there are versions of metathics that don’t deal in fine-grained preferences, such as utilitariansim (which only needs to munge the strenght of preferences, and not consider what htey are) and deontology.

        1. If moerality is just whatever we think it is

          Now, hold on a second. Where, exactly, did I say that morality is “just whatever we think it is”? We can certainly be wrong about what morality is, because we can be wrong about our preference ordering. Moral error is ubiquitous in human life.

          Moreover, your “realism” has content that overlaps wit subjectivism and non-cognitivism, two positions usually sen as opposed to realism.

          1. To suggest I’m a non-cognitivist is simply false. I don’t know what you’re talking about here.

          2. I don’t know what you mean by ‘subjectivist’. Moral philosophers are constantly arguing about what ‘subjective’ means, and why it matters, in the context of the moral realism and moral authority debates.

          I would certainly demand evidence that most philosophers think moral naturalism is a form of moral anti-realism! The SEP’s article on Moral Realism explicitly rejects the idea that ‘subjectivity’ is a relevant constraint here (they just demand that realisms be success theories), and its article on Moral Anti-Realism acknowledges this view but notes that there are serious difficulties with giving it content.

          3. Why does this matter? Different people define ‘moral realism’ in dozens of different ways. How does this definitional concern bear on the point I was making about whether artificial intelligence is risky? If we discover that ‘moral realism’ has a different dictionary definition than we’d previously thought, will we thereby define the AI into Friendliness? Perhaps grind the dictionary up into a fine powder, walk counterclockwise around the AI three times, and cast the dust into the north wind while rhythmically chanting? I need a causal mechanism here for how any of this is supposed to actually make us safer.

          i see you have adopted EY’s habit of treating “values” and “morality” itnerchangeably — a habit which renders much of his writing incomprehensible, IMO. It’s another assumption. It ain’t necessarily so.

          What difference does it make here? (Or elsewhere?) We want the AI to care about our aesthetic values, not just our moral ones. An ugly, boring world where no capital-e Evil occurred would still be a pretty unfortunate place.

          1. “Now, hold on a second. Where, exactly, did I say that morality is “just whatever we think it is”? ”

            here:

            “I consider myself a moral realist, but the morality I’m a realist about is very specifically the evolved set of values human beings actually have, encoded in their brains. ”

            “We can certainly be wrong about what morality is, because we can be wrong about our preference ordering. Moral error is ubiquitous in human life.”

            By that I suppose you mean that our \system 2 can be wrong about the prefernces encoded into our System 1. But the buck still stops at our System 1, which, apparently, can’t be wrong.

            “To suggest I’m a non-cognitivist is simply false. I don’t know what you’re talking about here.”

            The claim that morality resides in System 1 is at a least compatible with the claim that is is cognitively inaccessible.

            ” I don’t now what you mean by ‘subjectivist’. Moral philosophers are constantly arguing about what ‘subjective’ means, and why it matters, in the context of the moral realism and moral authority debates.”

            Philosophers argue about everything. You don’t usually find that a problem.

            By “subjectivist” I mean this sort of claim “Expecting every organism in the universe to agree about what we think is virtuous is like expecting every organism in the universe to agree with what we find beautiful or delicious.”.

            “Why does this matter? Different people define ‘moral realism’ in dozens of different ways. How does this definitional concern bear on the point I was making about whether artificial intelligence is risky? ”

            Why should my comments about the feasibility of moral objectivism have any bearing? MO itself has a bearing on the FAI argument, but that doesn’t mean I have to define and defend it. It exists as a position that is taken seriously in the literature, and your protestations of incomprehension and/or disbelief don’t affect that.

            “If we discover that ‘moral realism’ has a different dictionary definition than we’d previously thought, will we thereby define the AI into Friendliness? Perhaps grind the dictionary up into a fine powder, walk counterclockwise around the AI three times, and cast the dust into the north wind while rhythmically chanting?”

            You have declared that moral realism — qua content — is false, and then
            defined yourself as a moral realist, qua label. That makes it difficult to
            follow what you are saying,

            i see you have adopted a habit which renders much of his writing incomprehensible, IMO. It’s another assumption. It ain’t necessarily so.

            “What difference does [EY’s habit of treating “values” and “morality” itnerchangeably —] make here? (Or elsewhere?) ”

            Are you asking what difference comprehensibility makes?

            “We want the AI to care about our aesthetic values, not just our moral ones. An ugly, boring world where no capital-e Evil occurred would still be a pretty unfortunate place.”

            That’s an argument to the effect that aesthetic values are sometimes morally relevant in certain situations, not to the effect that all values are moral values.

            1. the buck still stops at our System 1, which, apparently, can’t be wrong.

              System 1 doesn’t have beliefs, so it can’t be right or wrong. System 1 is simply the part of our brain that has quick intuitive reactions to things. A system-1 response can be wrong in the sense that on the whole we’d prefer not to have it; ditto for a system-2 response. Our brains can’t be ‘wrong’ in the sense of failing to be the basis for our morality, but by that logic dopamine receptors are ‘subjective’. The buck has to stop somewhere. But to equate ‘moral value exists in the natural world (and, as it happens, the place it exists is in a part of human anatomy called the nervous system)’ with ‘moral value is just whatever we (or some agent) believes it is’ is not intellectually honest.

              The claim that morality resides in System 1 is at a least compatible with the claim that is is cognitively inaccessible.

              It is also compatible with the claim that all umbrellas are purple. That’s a rather low bar.

              Regardless, I haven’t claimed that ‘morality resides in system 1’ (to the exclusion of other brain processes). And, regardless, I do believe at least my own moral statements have content. Whether yours do will have to wait on your explanation of what your moral theory is.

              Philosophers argue about everything. You don’t usually find that a problem.

              If philosophers radically diverge on the basic bumper-sticker summary of what a term means, then don’t use that term without disambiguating.

              By “subjectivist” I mean this sort of claim “Expecting every organism in the universe to agree about what we think is virtuous is like expecting every organism in the universe to agree with what we find beautiful or delicious.”.

              The quoted sentence is compatible with a wide variety of meta-ethical views. It’s even compatible with moral platonism. So, again, I have no idea what you mean by ‘subjectivism’.

              As I use the terms ‘subjective’ and ‘objective’, human aesthetic and moral and prudential and episteimc preferences are all objective phenomena. Indeed, I consider them realer and worldlier than many of the things humans ordinarily talk about. But alien preferences are also objective — they’re also constituents of the real world. So I have no reason to think that a poorly designed superintelligence will be much closer to having humans’ objectively real preferences than to having some alien race’s objectively real preferences, or an entirely new and heretofore unknown goal set.

              But, clearly, you mean something different by ‘objective’. I’m waiting to hear what that is. If you don’t give it content, then this discussion will be over pretty quickly, with little gained.

              1. “. But to equate ‘moral value exists in the natural world (and, as it happens, the place it exists is in a part of human anatomy called the nervous system)’ with ‘moral value is just whatever we (or some agent) believes it is’ is not intellectually honest.”

                The point is that it is not realism. Realism about X is the claim that X has mind-independent existence. However, you have placed morality in the mind and only in the mind. Taking physicalist view of the mind doesn’t change that.

                “If philosophers radically diverge on the basic bumper-sticker summary of what a term means, then don’t use that term without disambiguating.”

                You have presented no evidence that the problem is that bad. Judging by your comments elsewhere, the problem seems to be somewhat self-made.

                “The quoted sentence is compatible with a wide variety of meta-ethical views. It’s even compatible with moral platonism. So, again, I have no idea what you mean by ‘subjectivism’”

                You seem to have intended the claim as a refutation of objectivism, so I assumed you meant not merely to state the existence of disagreement, but also the non-existence any truth of the matter. You might have meant that where there is disagreement, some are right and others wrong, but then why object to objectivism, which says the same thing?

                In any case, what I mean by subjectivism is what everyone means: There is subjectgivism about X, where two subjects disagree about X, and both are correct
                as far as they are concerned. “Marmite is horrible” — “no, it’s deliciuos”–“well taste is subjective”.

              2. “As I use the terms ‘subjective’ and ‘objective’, human aesthetic and moral and prudential and episteimc preferences are all objective phenomena. Indeed, I consider them realer and worldlier than many of the things humans ordinarily talk about. But alien preferences are also objective — they’re also constituents of the real world.”

                If objective means real, then subjective, its opposite, means unreal, and nothing is subjective, because nothing unreal exists. But it is a truism that some things, such as aesthetic judgements, are subjective. But that is an *epistemological* claim, it is about the truth-values of certain claims work. And if “subjective” is an epistemological term,
                then so is “objective”, its opposite. Which it is. When we exhort someone to be ojective about something, we don;t want them to become real: we want hem to st aside personal biases and feelings.

  7. Here is a post in which I initially try to taboo concepts such as “intelligence” and “goals” and solely focus on power, by analogy to nanotechnology, and afterwards highlight the relevance to MIRI’s scenario.

  8. ” Are you suggesting that it’s easy to give AIs, as a terminal value, ‘make your behaviors universalizable, such that if you would act some way if you had human values and human knowledge, then you should in fact act that way’? If it sounds easy, it’s only because humans are social animals, and have evolved to value things like fairness and egalitarianism. But the AI won’t value those by default, merely because it’s intelligent. You gotta write the lines of code.”

    You don’t gotta, because AIs can learn. If morality is something that can only be learnt in a social
    setting, which seems likely, then build socities of AIs. There are *lot* of assumptions behind
    the FAI argument..that values ahve to eb pre-programmed, that morality can’t be inferred, that AIs are necessarily singletons, etc. It’s highly conjunctive, hence it’s low impact outside LW.

    1. You don’t gotta, because AIs can learn.

      You do gotta, because you still have to write the code that makes the AI learn precisely the right thing from its environment. Doing so might well be the most difficult task humans have ever accomplished, so we’d better get to work. See Magical Categories.

      If morality is something that can only be learnt in a social
      setting, which seems likely, then build socities of AIs.

      Non sequitur. Everything about humans is the result of evolution, but it would be decidedly unwise to rely on evolutionary algorithms alone to create the AI. The fact that something arose in humans via method X doesn’t imply that X is the best way to make it arise in an AI. Consider how many organisms on the planet, including social organisms, don’t have a human morality. (The answer is: All of them except humans!) Perhaps some sort of social dynamic (presumably with humans, not with other AIs!!) is necessary for a computationally feasible social morality, but it’s not sufficient. We’re also in a tough spot because we need the AI to be a much better ethicist than we are, not merely a human-level one. (If you gave a human unlimited power, it would in nearly all cases immediately become Unfriendly, because humans don’t know how to handle that much power. We don’t understand our own values well enough for that.)

      There are *lot* of assumptions behind
      the FAI argument..that values ahve to eb pre-programmed, that morality can’t be inferred, that AIs are necessarily singletons, etc.

      It doesn’t sound to me like you have a lot of familiarity with MIRI’s arguments. MIRI is primarily interested in indirect normativity, not in hard-coding every specific value into the AI at the outset. Heck, I made this very point just a few comments above:

      “Since we all agree that value is too complex for humans to directly and in full detail code into an AI themselves, we all agree that we’re not going to hard-code everything about what we mean by ‘happiness’ into the AI. Likewise, we aren’t going to hard-code everything about physics into the AI. Instead, we’ll start with a certain baseline — perhaps an extremely fast-thinking emulation of a human brain, or something very similar — and direct it to start editing itself in a way that makes it propagate slight variations on itself when those variations are better at modeling its environment, but not when they are worse.”

      1. “You do gotta, because you still have to write the code that makes the AI learn precisely the right thing from its environment. ”

        It’s certain that you have to write some sort of code. It is not that you absolutely have to code in any particular function..that is dependent on the assumptions you are making.
        Under the assumption that morality is something that societies have to come up with in order to function, you don’t have to code in morality, and you don’t have to code in specific morality-learning modules either. The EY/MIRI approach keeps assuming morality/values is this distinct walled-off thing, and it ain’t necessarily so.

        “The fact that something arose in humans via method X doesn’t imply that X is the best way to make it arise in an AI.”

        You have something better? You keep saying how fantastically difficult it is to code in morality.

        “Consider how many organisms on the planet, including social organisms, don’t have a human morality”

        It’s not like they have other moralities: its more like they have none, just as they don’t have human or any other language. The point that sociality is necessary but not sufficient for morality is there; the point that there are many possible moralities — with no overarching universal principles — is not. That’s just another EY/MIRI assumption.

        And what’s so bad about failures? If you try to socialise in morality at the Seed Ai stage, your failures aren’t going to destroy you because they are not Superintelligences.

        1. You have something better? You keep saying how fantastically difficult it is to code in morality.

          It is fantastically difficult. Brainstorming and discussion is fine, but I don’t think we’re at the stage yet where we should be fixing on solutions. I don’t need to have an easier idea in mind in order to note that the idea in question is a lot trickier than it initially seems.

          However, in general I think we should be vastly more cautious about building Friendliness with an evolutionary algorithm or poorly-understood neural network than about building it in a way humans can more easily understand and supervise it.

          The point that sociality is necessary but not sufficient for morality is there; the point that there are many possible moralities — with no overarching universal principles — is not.

          Taboo ‘morality’. What we’re really worried about here isn’t whether we’re likely to build an AGI that has capital-m Morality and yet still harms human beings. What we’re worried about is whether we’re likely to build an AGI that harms human beings. If you discover that the AGI’s decision procedure counts as a non-morality, rather than counting as an alien morality, that will only tell us something about how you use the word ‘morality’, not about AI risk itself.

          The fact that we know of lots of social organisms whose behavior are best modeled by a preference ordering radically unlike humans’ preference ordering, and only one whose preference ordering matches humans’ preference ordering (humans themselves), gives us good evidence that most intelligent social organisms will also have inhuman preference orderings, in various respects. Whether you call those hypothetical aliens’ preferences ‘moral’, or those non-human animals’ preferences ‘moral’, could hardly be more irrelevant.

          1. “However, in general I think we should be vastly more cautious about building Friendliness with an evolutionary algorithm or poorly-understood neural network than about building it in a way humans can more easily understand and supervise it.”

            “Which is what? If you were coding friendliness in directly, then you could check the correctness of your code. But you say you are not. You say you are leaving it to a seed AI, which presumably cannot be checked, because it is doing stuff we cannot understand, which is why we are using it. Indirect solutions aren”t easily understandable.

            “Taboo ‘morality’. What we’re really worried about here isn’t whether we’re likely to build an AGI that has capital-m Morality and yet still harms human beings. What we’re worried about is whether we’re likely to build an AGI that harms human beings. ”

            The two questions are deeply intertwined. For instance, if an AI can figure out the correct morality, and it entails that humans should not be harmed, then there is no need to worry about building in friendliness. So the claim that there is a need to worry about building in friendliness needs a disproof of objective morality. But nothing has been offered except one-liners about morality not floating about in space. MIRI needs to address the steelman, not the strawman.

            What I am saying in general is that the FAI argument is highly conjunctive, and that there are alternatives at each stage. For an alternative to be “there”, I mean it exists in the literature and is taken seriously by domain experts. I don’t require it to be necessarily right, and I don’t expect it to be refuted by protestations of personal incomprehension.

            “If you discover that the AGI’s decision procedure counts as a non-morality, rather than counting as an alien morality, that will only tell us something about how you use the word ‘morality’, not about AI risk itself.”

            It could tell me why it is harming human beings. You want the AI to behave with respect for the preferences of others, specifically other humans. You call that friendliness. Some call it morality, but its more or less the same thing. So unfriendliness,which is what all this is about is a failure to be moral, if you want to put it that way. The word doesn’t matter, but the referent does. Friendliness/morality isn’t just “preferences”, and what is true of preferences is not automatically true of friendliness/morality.

            “The fact that we know of lots of social organisms whose behaviour are best modelled by a preference ordering radically unlike humans’ preference ordering, and only one whose preference ordering matches humans’ preference ordering (humans themselves), gives us good evidence that most intelligent social organisms will also have inhuman preference orderings, in various respects.”

            You seem to have confused having preferences with respecting preferences.

            Humans don’t have the values of chickens or pandas, but we are capable of respecting them. As social beings, we can respect preferences of other humans, even if we do not have them, and we can extend that to respecting the preferences of others species.

            What is needed from a process of inculcating morality into an AI by socialisation is
            the ability to respect value X for a wide variety of X, without necessarily valuing X. Socialisation alone didn’t do that for cattle; nor did it give cattle speech and reason. But we expect our superintelligence to have speech and reason.

            ” Whether you call those hypothetical aliens’ preferences ‘moral’, or those non-human animals’ preferences ‘moral’, could hardly be more irrelevant.”

            No. But the question of whether they are moral, friendly or whatever, is what this is all about.

            1. if an AI can figure out the correct morality, and it entails that humans should not be harmed, then there is no need to worry about building in friendliness.

              Nonsense. We still need to build an AI that will in fact ‘figure out’ (and follow) this Correct Morality. There is no ghost in the machine independent of the code we write.

              So the claim that there is a need to worry about building in friendliness needs a disproof of objective morality.

              No.

              1. You don’t need a disproof of ‘objective morality’ in order to suspect that there isn’t such a thing.

              2. If you’re a coinflip agnostic about whether there’s such a thing, you should prepare for both eventualities.

              3. If you discover there’s an Objective Morality, that doesn’t help at all unless you know which moral code is the Objective one. (What are the chances that evolved human brains would happen upon it? What causal mechanism would constrain them to do so?)

              4. If you discover that human morality has some unique magical glow called ‘Objective Morality’, how does that glow physically constrain artificial intelligences to follow that exact same Morality? What’s the mechanism that makes it easier for an AI to identify this glow-of-objectivity than to figure out our preferences in a more naturalistic framework?

              I’m particularly interested in the last point. Why, exactly, does discovering ‘objective morality’ help make it impossible for an AGI to be programmed to be a paperclip maximizer? I’m asking partly because I hope that answering this question will help us unpack what you mean by that phrase in the first place, since we obviously don’t mean the same thing by it. Just for example: Why wouldn’t discovering that human morality is ‘objective’ give us reason to think that an AGI is even less likely to be Friendly? Perhaps it just seems obvious to you. Make it more obvious for me too.

              But nothing has been offered except one-liners about morality not floating about in space. MIRI needs to address the steelman, not the strawman.

              I can’t steel-man a position until you’ve told me what the position is. What is ‘objective morality’, precisely, and why does it matter?

              Humans don’t have the values of chickens or pandas, but we are capable of respecting them. As social beings, we can respect preferences of other humans, even if we do not have them, and we can extend that to respecting the preferences of others species.

              Sure. Humans are awesome. (Though that doesn’t stop us from torturing chickens when they stand between us and mild dinnertime pleasures. Guess socialization has its limits…)

              Still: Not my point. My point wasn’t that humans are unconcerned with non-human preferences; it was that non-human social organisms aren’t concerned with human preferences. I’m adding data points to our discussion — other animal species. Returning again and again to your only data point showing that human morality is inevitable — humans themselves — doesn’t increase the weighting of that data point.

              Socialisation alone didn’t do that for cattle; nor did it give cattle speech and reason. But we expect our superintelligence to have speech and reason.

              A talking snake is still a snake. Even if by some miracle the AGI is as kind toward us as we are toward chickens, that’s not good enough.

              1. “Nonsense. We still need to build an AI that will in fact ‘figure out’ (and follow) this Correct Morality.”

                I am baffled as to why you would interpret “if an AI can do XYZ..” as meaning “if an AI can do XYZ magically without being programmed”. I have been a professional computer programmer for over 20 years: I know that you need to program computers to get them to do stuff. Of course, to have a machine that can figure out its own morality, you need to design and build such a machine. But the point stands that IF you can, you don’t need to program in the morality!

                > 1. You don’t need a disproof of ‘objective morality’ in order to suspect that there isn’t such a thing.

                You need more than a subjective suspicion to show that it is not actually a possibility.

                > 2. If you’re a coinflip agnostic about whether there’s such a thing, you should prepare for both eventualities.

                Where did I disagree? I am arguing that the FAI argument is a conjunction of several claims; that there are alternative possibilities at each stage; and therefore, by Bayes, the
                overall probability of the conjunction is low. Agnosticism about which possibility is correct is all that is needed for the argumetn to go through.

                > 3. If you discover there’s an Objective Morality, that doesn’t help at all unless you know which moral code is the Objective one. (What are the chances that evolved human brains would happen upon it? What causal mechanism would constrain them to do so?)

                If you make the metaethical discovery that moral objectivism is true, in the sense that
                any sufficiently advanced intelligence will hit on the correct morality, then there is no
                need to figure out the object-level morality yourself. If your project is to build a superintelligence, then the superintelligence will figure it out. Which is just the indirect approach you have advised elswhere.

                > 4. If you discover that human morality has some unique magical glow called ‘Objective Morality’, how does that glow physically constrain artificial intelligences to follow that exact same Morality?

                If OM is derivable by any sufficiently advanced rational agent, then any suffciently
                advanced rational agent will derive it. The constraint that make them derive it would be the contraints of rationality. There is no special constriant that magically make agents derive 2+2=4. If you are rational, you can derive whatever your rationality can derive.

                > What’s the mechanism that makes it easier for an AI to identify this glow-of-objectivity than to figure out our preferences in a more naturalistic framework?

                The same kind of “mechanism” that makes mathematics unemprical.

                > I’m particularly interested in the last point. Why, exactly, does discovering ‘objective morality’ help make it impossible for an AGI to be programmed to be a paperclip maximizer?

                If you want a clippie, I suppose you could code one. But that is entirely irrelevant. The FAI argument is that if you want a non-clippie, it is not enough to omit any specific paperclipping drives: you also have to solve morality and code it in. The non-FAI arguemnt is that it ain’t necessarily so, because MIRI has failed to show that moral truths cannot be derived from general rationality, like mathematical truths.

                > Why wouldn’t discovering that human morality is ‘objective’ give us reason to think that an AGI is even less likely to be Friendly? Perhaps it just seems obvious to you. Make it more obvious for me too.

                I’ve already commented on the possibility that OM would be inconvenient for us. The relationship between OM and friendliness is just another step in the conjunctive argument. Again, I don;t have to show that nay alternative possibility is high probability, only that there are so many altenatives that the conjunction is low prob.

                > can’t steel-man a position until you’ve told me what the position is. What is ‘objective morality’, precisely, and why does it matter?

                As I have said, the steelmanned version is in various works by philosophers that no one at MIRI has heard of or read. It exists. I don’t have to give it personally.

                The probability of FAI is low becaue the possibility space is wide.
                The possibility space is wide, but it seems narrower than it is to the MIRI people because they don’t know enough. They haven’t done the reading and they
                are too quick to dismiss ideas on subjective knee-jerk reactions.

                > it was that non-human social organisms aren’t concerned with human preferences.

                What’s the point of the point? It’s obvious that socialisation isn’t sufficient for a conscious, rational grasp of morality. I’ve conceded that several times,

  9. Rob, it happened again. I think I am explaining the point with such long explanations that I am causing you to miss the point. So this time it will be super-short.

    This hypothetical AI will say “I have a goal, and my goal is to get a certain class of results, X, in the real world.” Then it describes the class X in as much detail as it can …. of course, no closed-form definition of X is possible (because like most classes of effect in the real world, all the cases cannot be enumerated) so all it can describe are many features of class X.

    Next it says “I am using a certain chunk of goal code (which I call my “goalX” code) to get this result.” And we say “Hey, no problem: looks like your goal code is totally consistent with that verbal description of the desired class of results.” Everything is swell up to this point.

    Then one day the AI says “Okay now, today my goalX code says I should do this…” and it describes an action that is VIOLENTLY inconsistent with the previously described class of results, X. This action violates every one of the features of the class that were previously given.

    The onlookers are astonished. They ask the AI if it UNDERSTANDS that this new action will be in violent conflict with all of those features of class X, and it replies that it surely does. But it adds that it is going to do that anyway.

    [ And by the way: one important feature that is OBVIOUSLY going to be in the goalX code is this: that the outcome of any actions that the goalX code prescribes, should always be checked to see if they are as consistent as possible with the verbal description of the class of results X, and if any inconsistency occurs the goalX code should be deemed defective, and be shut down for adjustment.]

    The onlookers say “This AI is insane: it knows that it is about to do something that is inconsistent with the description of class of results X, which it claims to be the function of the goalX code, but is going to allow the goalX code to run anyway”.

    ——-

    Now, Rob. My question.

    Why is it that you — or rather, people who give credibility to the Dopamine Drip scenario — insist that the above episode could ONLY occur in the particular case where the class of results X has to do with “making humans happy”?

    If the AI is capable of this episode in the case of that particular class of results X (the “making humans happy” class of results), why would we NOT expect the AI to be pulling the same kind of stunt in other cases? Why would the same thing not be happening in the wide spectrum of behaviors that it needs to exhibit in order to qualify as a superintelligence?

    You will notice that this time the framing of the problem contained absolutely no reference to the values question. There is nothing in the part of my comment above the “——-” that specifies WHAT the class of results X is supposed to be.

    All that matters is that if the AI behaves in such a way, in any domain of its behavior, it will be condemned as lacking intelligence, because of the dangerous inconsistency of its behavior. That fanatically rigid dependence on a chunk of goalX code, as described above, would get the AI into all sorts of trouble (and I won’t clutter this comment by listing examples, but believe me I could). But of all the examples where that could occur, you want to talk only about one, whereas I want to talk about the all of them.

    So my closing statement is to repeat the question, so you know what is the question that needs a reply: If an AI is capable of repeatedly doing what is described above the “——-” in this comment, why do you folks (by which I mean folks such as MIRI/SIAI) only want to talk about one particular domain in which it could occur, and not any of the others?

    I await your answer.

    1. To clarify, you’re hypothesizing a scenario in which we’ve solved several of the largest problems in Friendliness Theory — we’ve succeeded in the extraordinary task of creating a seed AI that (a) self-modifies to become a superintelligence that then suddenly pauses its self-optimization and boxes itself, and then (b) gives a perfectly accurate description of its understanding of the consequences of its actions, ‘class X’. It then awaits our approval before taking further action. Your claim is then, I take it, that solving a and b would solve Friendliness as a whole. (And that these are relatively easy to solve? Or MIRI isn’t approaching them in the right way? I’m not clear on what your intend the practical upshot to be.)

      The problem is that, as you say, ‘class X’ is too complex for any human to understand. You suggest that we might be able to understand some general features of the class, if the AI is summarizing them accurately — but it will be hard to find strictly accurate generalizations that can be compressed into a string a human could hear in a lifetime. One solution is for the AI to give us information about the consequences of some course of action extensionally, rather than intensionally. In effect, it just gives us an ordered list of what will happen, conditioned on following plan X; and although we don’t have time to read the entire list, we can read enough of it to get the gist, and know that the AI will be Friendly in later elements because it was Friendly in early elements.

      Let’s say, for instance, that we try to make the AI friendly by coding it to learn what makes people smile, and value those sorts of actions. This is an actual proposal by an AI researcher, Bill Hibbard, so it isn’t a straw man. And it seems on your account we could test for Friendliness by having SmileBot list the predicted outcome of its carrying out that project. Of course, the list is too long for us to generate or understand it all in a reasonable amount of time, so perhaps we just look at the first hundred thousand entries. They all look like this:

      1. increase smiles by: making playgrounds for children to play with
      2. increase smiles by: making puppies for children to play with
      3. increase smiles by: curing esophageal cancer
      4. increase smiles by: engineering healthier, better-tasting fruit snacks
      5. increase smiles by: promoting good science education in the Ukraine
      6. increase smiles by: teaching yoga to blind people
      7. increase smiles by: curing mumps



      etc., etc.

      Eliezer’s claim is that even if the first hundred thousand entries all look like that, entry #6,488,329 might read: ‘increase smiles by: killing all current humans and replacing them with tiny genetically identical smiley faces covering the entire surface of the planet’. The scary entries didn’t occur where the AI predicted we’d look for them, either because the AI lacked the power and resources to implement anything drastic that early, or because the AI knew that it would increase fewer smiles (which is, after all, its core value) if it gave humans any clues that that was what it was up to.

      It sounds like you’ve been claiming that something like that just won’t happen, because it involves some sort of epistemic deficit, even an outright ‘contradiction’. But where is the contradiction? This does increase the number of smiles. Sure, it’s not what we intended the AI to do; but we programmed the AI to increase the number of human smiles, not to ‘do what we intended’

      And, although this is just an especially simple toy example, it points to the general problem with your argument: You admit that we’re not smart enough to fully understand the values we program into the AI, hence we would have to rely on the AI itself to tell us about what it’s going to do. Or we would have to put the AI in a restricted environment initially, see whether it acts Friendly-seeming, and then only release it if it does seem nice. The problem with these approaches is that most possible algorithms that seem nice in the short run don’t play nice in the long run. Inferring from an incomplete list of outcomes that all seem Friendly to the inductive conclusion ‘all the outcomes wil lbe Friendly’ is only warranted if most partly-Friendly lists are completely Friendly. But they aren’t.

      Perhaps you’re going to say that I missed your meaning again. Perhaps you’ll shift theses yet again, and this time claim that you didn’t mean the AI gave us incomplete information about its values and we generalized wrongly; rather, you meant that the AI literally has an inconsistency in its values, e.g., we found a line of its programming that says ‘do X’ and another line that says ‘don’t do X’, and an AI with a design flaw of that kind will break and therefore not be dangerous.

      But if that’s what you really meant, then your original article makes no sense, because your original article accuses AI risk advocates in general, and MIRI in particular, of committing this error. But where has anyone ever actually said that UFAI are dangerous because their values are internally inconsistent? I’m happy to help you refine and improve your criticism, replace it with something clearer and easier to defend. But my main concern is to figure out what specific texts or arguments or established beliefs you meant to criticize, and why. I don’t just want to steel-man what you’ve said; I want to unpack it in a way that reveals its relevance to any of the articles or public figures you’ve accused of this ‘fallacy’. What’s the point of talking about a ‘fallacy’ no one has ever committed?

      1. “It sounds like you’ve been claiming that something like that [plastic miley faces] just won’t happen, because it involves some sort of epistemic deficit, even an outright ‘contradiction’. But where is the contradiction? ”

        You can see that that isn’t what Increase Happiness really means. The AI is smarter than you, but you think it will miusunderstand somehting you can understand. So the contradiction is “the AI is smarter than us, but dumber than us”.

        “we programmed the AI to increase the number of human smiles, not to ‘do what we intended’”

        The we were dumb. An AGI has to have an ability to understand NL. We could have leveraged that to enable it to get at our real intentions. It looks like “we” — MIRI –chose the wrong architecture. Solution: choose the right architecture.

        1. You can see that that isn’t what Increase Happiness really means. The AI is smarter than you, but you think it will miusunderstand somehting you can understand. So the contradiction is “the AI is smarter than us, but dumber than us”.

          If that’s Richard’s argument, then his argument is a transparent strawman. This is in fact the very error the article you’re commenting on is about: The seed is not the superintelligence, hence the fact that the superintelligence is smart enough to construct True Happiness code does not imply that the seed AI (or its programmers!) are smart enough to construct True Happiness code. The superintelligence knows how to build a Friendly AI, but it won’t care if it wasn’t already made Friendly prior to becoming a superintelligence.

          I think Richard’s been trying to backpedal away from that argument and reformulate it in some new way, but he clearly hasn’t yet been successful if even his defenders only succeed in reconstructing the same straw we started with.

          The we were dumb.

          Only in the sense that almost every human effort along these lines will be ‘dumb’. It takes a large amount of effort and novel insight to solve an engineering problem of this scale.

          An AGI has to have an ability to understand NL.

          False. Most AGIs by default will not understand any human language. Some will not even be able to understand any human language, since some AGIs are not superintelligent — indeed, some are a little less intelligent than a human being.

          To see why this is so, imagine encountering a random alien species of roughly human-level intelligence. Would this species know English? No. Would it be capable of learning English? To some extent, perhaps. But it might take a very long time to fully understand all English-language concepts, and some concepts might be forever beyond its grasp.

          We could have leveraged that to enable it to get at our real intentions.

          Not if understanding NL is harder than understand our real intentions; and not if it only ‘understands NL’ to the weak extent a human being does, and thus runs into moral errors as often as humans do. In that case, it is more likely we’ll see those errors magnified, rather than squelched, when the seed self-modifies in the direction of superintelligence. (Because there are a more diverse set of ways to violate human norms than to adhere to them. Noise predicts unsafe SI, to the extent it predicts SI at all.)

          1. > the seed is not the superintelligence,

            Where did *he* say anything about seeds? Congratulations , you have hit on a method for building AIs that might kill us all. Please don’t use it.

            But RL isn’t wrong about anything just because he is making more sensible assumptions than you are making.

            > Only in the sense that almost every human effort along these lines will be ‘dumb’. It takes a large amount of effort and novel insight to solve an engineering problem of this scale.

            That’s evasive.

            > False. Most AGIs by default will not understand any human language. Some will not even be able to understand any human language, since some AGIs are not superintelligent — indeed, some are a little less intelligent than a human being.

            I have no idea what you mean by that. AGI by definition is “human level intelligence”.
            I have no idea why why you would think most AGIs..thatis most AGis we would build..would be speechless, when it is clearly very useful for us to be able ot sepak to them.

            > To see why this is so, imagine encountering a random alien species of roughly human-level intelligence.

            No. It’s irrelevant. AGI is about us duplicating our intelligence. It is not a random dip into mindspace. Not that there is any engineering procedure corresponding to that.

            > Not if understanding NL is harder than understand our real intentions;

            If it is harder AND safer, it is worth doing.

            > and not if it only ‘understands NL’ to the weak extent a human being does, and thus runs into moral errors as often as humans do. In that case, it is more likely we’ll see those errors magnified, rather than squelched, when the seed self-modifies in the direction of superintelligence. (Because there are a more diverse set of ways to violate human norms than to adhere to them. Noise predicts unsafe SI, to the extent it predicts SI at all.)

            Yet again, you assume rationality can only be instrumental. Why shouldn’t it climb the Kohlberg hierarchy as it self-improves?

  10. Rob,

    No, that is not right. We are discussing the article I wrote for IEET, and that article addressed a scenario postulated by OTHER people, not me. 🙂 So it makes no sense for you to begin your reply with “To clarify, you’re hypothesizing a scenario in which we’ve solved several of the largest problems in Friendliness Theory……”. It ain’t me that’s postulating any scenarios. 🙂

    You should refer to THEIR parameters, if you need clarification of what the AI is supposed to do. I am simply responding to a hypothetical AI that those other people put forward (to wit, the Dopamine Drip scenario).

    [For example you say “Your claim is then, I take it, that solving a and b would solve Friendliness as a whole” ………. To which I say “Heck no!!” I have not made any claims about what would solve friendliness, in the discussion we have been having. I have only been addressing your attack on my attack on the Dopamine Drip scenario].

    1. It ain’t me that’s postulating any scenarios.

      Positing for reductio, at least. You were hypothesizing the AI in order to show that it would, by the conditional assumptions with which you began, be Friendly, or would explode in a puff of logic. Neither of those seems likely to me, in your hypothetical, and I don’t see any positive arguments for why either would be likely. In particular, the idea that the AI has to have actually inconsistent values in order to initially seem Friendly but later turn out to be Unfriendly strikes me as completely unsupported by the statements to date, in spite of how much you’ve written reiterating this claim.

      As a reminder, you made some extremely serious allegations about SIAI, now MIRI:

      This myth about the motivational behavior of superintelligent machines has been propagated by the Singularity Institute (formerly known as the Singularity Institute for Artificial Intelligence) for years, and every one of my attempts to put a stop to it has met with scornful resistance. After so much time, and so much needless fearmongering, I can only conclude that the myth is still being perpetuated because it is in the Singularity Institute’s best interest to scare people into giving them money.

      I have to say that in my opinion it counts as borderline fraud when organizations like the Singularity Institute try to sell that specious argument while asking for donations, and while at the same time dismissing the internal logical inconsistency with a scornful wave of the hand.

      Alleging ‘borderline fraud’ and ‘fearmongering’ is not a joke or game. It’s a minimal requirement of intellectual honesty to actually cite an example of MIRI making the mistake you allege. But so far all your attempts to restate your argument have either seemed to have nothing to do with anything MIRI (or, in fact, any prominent AI risk researcher) has ever said; or have seemed to be easily defended arguments, not ‘fallacies’ and certainly not ‘fraud’! Perhaps if you noted a quotation or two that you think commits the fallacy you’re talking about, that would help others understand what exactly is the mistake you’re talking about.

      1. Why are you talking about “Friendly” vs “Unfriendly” again?

        I specifically framed the argument in such a way that those issues were completely removed from the situation. Please address the argument that I actually made, not some othe argument that you are trying tp put into my mouth.

        Here are some examples of things that I have NOT said, but which you keep putting into my mouth:

        1) I made no claims about it exploding in a puff of logic.

        2) You say I was “hypothesizing the being in order to show that it would be Friendly” … and since I have several times explicitly said that I am NOT doing any such thing, I am baffled as to why you keep repeating that. I flatly deny that that is the purpose of my argument!

        3) You say “In particular, the idea that the AI has to have actually inconsistent values in order to initially seem Friendly but later turn out to be Unfriendly strikes me as completely unsupported by the statements to date” ……. that is an utter distortion of what I said. A complete fabrication.

        And as if that were not bad enough, in your last comment but one you cite Bill Hibbard’s discussion of an AI that seeks to maximize human smiles. Your attempt to cite him, and Eliezer’s lunatic distortion/extrapolation of Bill’s suggestion, is pretty outrageous. Bill is a good friend of mine, and he would laugh himself silly (actually, I think it is fair to say he already HAS) at your/Eliezer’s perversion of his idea. It is so easy to undermine that argument, by the way, that it isn’t even funny: that argument was dealt with and dismissed by sensible people many years ago.

        But that is beside the point.

        Both of your replies this evening have said absolutely nothing about the argument that I presented to you. You make no mention of it, you just raise all these red herrings and wave your hands. You have not given one single word that explains where the argument is at fault. Not one.

        This has been a gigantic waste of time. I thought you had the intellect to understand and address the issue. But after listening to you avoid talking about what I actually said this number of times I have lost count) I am finally at the end of my patience.

        Goodbye.

        1. My Outrage Smokebomb sense is tingling.

          ‘How dare you challenge me to provide evidence or examples of an organization committing the borderline-fraud I’m accusing it of! Why I am so flabbergasted I am just going to walk out the door right this second. Right out the door. Good day!!!’

          I’ve noticed the only responses you really give to counter-arguments are to laughingly dismiss them, state there was some unspecified misunderstanding, and then argue for the same points again. Or laughingly dismiss them, state there was some unspecified misunderstanding, and change the topic, without explaining the new topic’s relevance and without conceding the previous points. You’ve got to give people more to go on than this. Otherwise, your discussions will remain unproductive and uncollaborative. Some well-meant advice, Richard.

  11. Richard, don’t lower yourself. Arguing AI with Robert is literally the same as arguing quantum mechanics or general relativity with some armchair physicist that haven’t as much as done a single exercise from a textbook on the topic being discussing, but thinks he has some “qualitative understanding”.

    He merely been introduced to the imaginary world where people he would normally assume to have higher intelligence, achievement, and education are unable to form the special understanding that he can, and that vision captivated him.

  12. This is pure human centralism.
    We should make superintelligence and stop breeding ourselves instead of prevent from anything better than us.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s