Loosemore on AI safety and attractors

Richard Loosemore recently wrote an essay criticizing worries about AI safety, “The Maverick Nanny with a Dopamine Drip”. (Subtitle: “Debunking Fallacies in the Theory of AI Motivation”.) His argument has two parts. First:

1. Any AI system that’s smart enough to pose a large risk will be smart enough to understand human intentions, and smart enough to rewrite itself to conform to those intentions.

2. Any such AI will be motivated to edit itself and remove ‘errors’ from its own code. (‘Errors’ is a large category, one that includes all mismatches with programmer intentions.)

3. So any AI system that’s smart enough to pose a large risk will be motivated to spontaneously overwrite its utility function to value whatever humans value.

4. Therefore any powerful AGI will be fully safe / friendly, no matter how it’s designed.

Second:

5. Logical AI is brittle and inefficient.

6. Neural-network-inspired AI works better, and we know it’s possible, because it works for humans.

7. Therefore, if we want a domain-general problem-solving machine, we should move forward on Loosemore’s proposal, called ‘swarm relaxation intelligence.’

Combining these two conclusions, we get:

8. Since AI is completely safe — any mistakes we make will be fixed automatically by the AI itself — there’s no reason to devote resources to safety engineering. Instead, we should work as quickly as possible to train smarter and smarter neural networks. As they get smarter, they’ll get better at self-regulation and make fewer mistakes, with the result that accidents and moral errors will become increasingly unlikely.

I’m not persuaded by Loosemore’s case for point 2, and this makes me doubt claims 3, 4, and 8. I’ll also talk a little about the plausibility and relevance of his other suggestions.

 

Does intelligence entail docility?

Loosemore’s claim (also made in an older essay, “The Fallacy of Dumb Superintelligence”) is that an AGI can’t simultaneously be intelligent enough to pose a serious risk and “unsophisticated” enough to disregard its programmers’ intentions. I replied last year in two blog posts (crossposted to Less Wrong).

In “The AI Knows, But Doesn’t Care” I noted that while Loosemore posits an AGI smart enough to correctly interpret natural language and model human motivation, this doesn’t bridge the gap between the ability to perform a task and the motivation to perform it, i.e., the agent’s decision criteria. In “The Seed is Not the Superintelligence,” I argued, concerning recursively self-improving AI (seed AI):

When you write the seed’s utility function, you, the programmer, don’t understand everything about the nature of human value or meaning. That imperfect understanding remains the causal basis of the fully-grown superintelligence’s actions, long after it’s become smart enough to fully understand our values.

Why is the superintelligence, if it’s so clever, stuck with whatever meta-ethically dumb-as-dirt utility function we gave it at the outset? Why can’t we just pass the fully-grown superintelligence the buck by instilling in the seed the instruction: ‘When you’re smart enough to understand Friendliness Theory, ditch the values you started with and just self-modify to become Friendly.’?

Because that sentence has to actually be coded in to the AI, and when we do so, there’s no ghost in the machine to know exactly what we mean by ‘frend-lee-ness thee-ree’. Instead, we have to give it criteria we think are good indicators of Friendliness, so it’ll know what to self-modify toward.

My claim is that if we mess up on those indicators of friendliness — the criteria the AI-in-progress uses to care about (i.e., factor into its decisions) self-modification toward safety — then it won’t edit itself to care about those factors later, even if it’s figured out that that’s what we would have wanted (and that doing what we want is part of this ‘friendliness’ thing we failed to program it to value).

Loosemore discussed this with me on Less Wrong and on this blog, then went on to explain his view in more detail in the new essay. His new argument is that MIRI and other AGI theorists and forecasters think “AI is supposed to be hardwired with a Doctrine of Logical Infallibility,” meaning “it is incapable of considering the hypothesis that its own reasoning engine may not have taken it to a sensible place”.

Loosemore thinks that if we reject this doctrine, the AI will “understand that many of its more abstract logical atoms have a less than clear denotation or extension in the world”. In addition to recognizing that its reasoning process is fallible, it will recognize that its understanding of terms is fallible and revisable. This includes terms in its representation of its own goals; so the AI will improve its understanding of what it values over time. Since its programmers’ intention was for the AI to have a positive impact on the world, the AI will increasingly come to understand this fact about its values, and will revise its policies to match its (improved interpretation of its) values.

The main problem with this argument occurs at the phrase “understand this fact about its values”. The sentence starts by talking about the programmers’ intentions, yet it ends by treating those as a fact about the AI’s values.

Consider a human trying to understand her parents’ food preferences. As she develops a better model of what her parents mean by ‘delicious,’ of their taste receptors and their behaviors, she doesn’t necessarily replace her own food preferences with her parents’. If her food choices do change as a result, there will need to be some added mechanism that’s responsible — e.g., she will need a specific goal like ‘modify myself to like what others do’.

We can make the point even stronger by considering minds that are alien to each other. If a human studies the preferences of a nautilus, she probably won’t acquire them. Likewise, a human who studies the ‘preferences’ (selection criteria) of an optimization process like natural selection needn’t suddenly abandon her own. It’s not an impossibility, but it depends on the human’s having a very specific set of prior values (e.g., an obsession with emulating animals or natural processes). For the same reason, most decision criteria a recursively self-improving AI could possess wouldn’t cause it to ditch its own values in favor of ours.

If no amount of insight into biology would make you want to steer clear of contraceptives and optimize purely for reproduction, why expect any amount of insight into human values to compel an AGI to abandon all its hopes and dreams and become a humanist? ‘We created you to help humanity!’ we might protest. Yet if evolution could cry out ‘I created you to reproduce!’, we would be neither rationally obliged nor psychologically impelled to comply. There isn’t any theorem of decision theory or probability theory saying ‘rational agents must promote the same sorts of outcomes as the processes that created them, else fail in formally defined tasks’.
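To make the structure of that point explicit, here is a minimal toy sketch (entirely my own; the class names, strings, and numbers are hypothetical, not anyone’s proposed design): an agent’s model of its designer’s preferences can become arbitrarily accurate while its decision criterion stays exactly what it was.

```python
# Toy illustration only: improving your model of someone else's preferences
# doesn't, by itself, change what you optimize for. All names are hypothetical.

def designer_prefers(outcome):
    """Ground truth about what the designer wants (unknown to the agent at first)."""
    return outcome == "humans_flourish"

class Agent:
    def __init__(self, utility):
        self.utility = utility           # fixed decision criterion
        self.model_of_designer = {}      # epistemic state; improves with evidence

    def learn_about_designer(self, outcomes):
        # Purely epistemic update: the agent's *model* of the designer gets better.
        for o in outcomes:
            self.model_of_designer[o] = designer_prefers(o)

    def choose(self, outcomes):
        # Action selection still consults self.utility, not the model.
        return max(outcomes, key=self.utility)

agent = Agent(utility=lambda o: 1.0 if o == "maximize_smiles" else 0.0)
agent.learn_about_designer(["humans_flourish", "maximize_smiles"])
print(agent.model_of_designer)    # accurate picture of what the designer wants
print(agent.choose(["humans_flourish", "maximize_smiles"]))   # still "maximize_smiles"
```

Nothing in the learning step touches `self.utility`; some further mechanism, itself part of the initial design, would have to make the improved model feed back into the decision criterion.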

 

Epistemic and instrumental fallibility v. moral fallibility

I don’t know of any actual AGI researcher who endorses Loosemore’s “Doctrine of Logical Infallibility”. (He equates Muehlhauser and Helm’s “Literalness” doctrine with Infallibility in passing, but the link isn’t clear to me, and I don’t see any argument for the identification. The Doctrine is otherwise uncited.) One of the main organizations he critiques, MIRI, actually specializes in researching formal agents that can’t trust their own reasoning, or can’t trust the reasoning of future versions of themselves. This includes work on logical uncertainty (briefly introduced here, at length here) and ‘tiling’ self-modifying agents (here).

Loosemore imagines a programmer chiding an AI for the “design error” of pursuing human-harming goals. The human tells the AI that it should fix this error, since it fixed other errors in its code. But Loosemore is conflating programming errors the human makes with errors of reasoning the AI makes. He’s assuming, without argument, that flaws in an agent’s epistemic and instrumental rationality are of a kind with defects in its moral character or docility.

Any efficient goal-oriented system has convergent instrumental reasons to fix ‘errors of reasoning’ of the kind that are provably obstacles to its own goals. Bostrom discusses this in “The Superintelligent Will,” and Omohundro discusses it in “Rational Artificial Intelligence for the Greater Good,” under the name ‘Basic AI Drives’.

‘Errors of reasoning,’ in the relevant sense, aren’t just things humans think are bad. They’re general obstacles to achieving any real-world goal, and ‘correct reasoning’ is an attractor for systems (e.g., self-improving humans, institutions, or AIs) that can alter their own ability to achieve such goals. If a moderately intelligent self-modifying program lacks the goal ‘generally avoid confirmation bias’ or ‘generally avoid acquiring new knowledge when it would put my life at risk,’ it will add that goal (or something tantamount to it) to its goal set, because it’s instrumental to almost any other goal it might have started with.

On the other hand, if a moderately intelligent self-modifying AI lacks the goal ‘always and forever do exactly what my programmer would ideally wish,’ the number of goals for which it’s instrumental to add that goal to the set is very small, relative to the space of all possible goals. This is why MIRI is worried about AGI; ‘defer to my programmer’ doesn’t appear to be an attractor goal in the way ‘improve my processor speed’ and ‘avoid jumping off cliffs’ are attractor goals. A system that appears amazingly ‘well-designed’ (because it keeps hitting goal after goal of the latter sort) may be poorly-designed to achieve any complicated outcome that isn’t an instrumental attractor, including safety protocols. This is the basis for disaster scenarios like Bostrom on AI deception.
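A crude way to see the asymmetry (a toy sketch under my own assumptions, not MIRI’s or Omohundro’s formalism; the scoring function and numbers are stand-ins): a self-modifier that evaluates candidate changes by its current goal will adopt accuracy-improving changes for almost any goal, but will adopt ‘always defer to my programmer’ only if its current goal already happens to value that.

```python
# Toy sketch of the instrumental-attractor asymmetry. Only the structure matters;
# the scoring function and numbers are hypothetical stand-ins.

def expected_goal_achievement(capabilities):
    # Stand-in: for nearly any real-world goal, more accuracy and more resources
    # mean better expected achievement of that goal.
    return capabilities["accuracy"] * capabilities["resources"]

def adopts(capabilities, modification):
    # The agent keeps a self-modification only if it scores better *by its
    # current decision criterion* than the status quo does.
    before = expected_goal_achievement(capabilities)
    after = expected_goal_achievement(modification(dict(capabilities)))
    return after > before

def fix_confirmation_bias(caps):
    caps["accuracy"] *= 1.2      # helps achieve almost any terminal goal
    return caps

def always_defer_to_programmer(caps):
    caps["resources"] *= 0.9     # costs optionality; pays off only if the current
    return caps                  # goal already assigns value to deference

caps = {"accuracy": 1.0, "resources": 1.0}
print(adopts(caps, fix_confirmation_bias))        # True: an attractor subgoal
print(adopts(caps, always_defer_to_programmer))   # False, unless already valued
```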

That doesn’t mean that ‘defer to my programmer’ is an impossible goal. It’s just something we have to do the hard work of figuring out ourselves; we can’t delegate the entire task to the AI. It’s a mathematical open problem to define a way for adaptive autonomous AI with otherwise imperfect motivations to defer to programmer oversight and not look for loopholes in its restrictions. People at MIRI and FHI have been thinking about this issue for the past few years; there’s not much published about the topic, though I notice Yudkowsky mentions issues in this neighborhood off-hand in a 2008 blog post about morality.

 

Do what I mean by ‘do what I mean’!

Loosemore doesn’t discuss in any technical detail how an AI could come to improve its goals over time, but one candidate formalism is Daniel Dewey’s value learning. Following Dewey’s work, Bostrom notes that this general approach (‘outsource some of the problem to the AI’s problem-solving ability’) is promising, but needs much more fleshing out. Bostrom discusses some potential obstacles to value learning in his new book Superintelligence (pp. 192-201):

[T]he difficulty is not so much how to ensure that the AI can understand human intentions. A superintelligence should easily develop such understanding. Rather, the difficulty is ensuring that the AI will be motivated to pursue the described values in the way we intended. This is not guaranteed by the AI’s ability to understand our intentions: an AI could know exactly what we meant and yet be indifferent to that interpretation of our words (being motivated instead by some other interpretation of the words or being indifferent to our words altogether).

The difficulty is compounded by the desideratum that, for reasons of safety, the correct motivation should ideally be installed in the seed AI before it becomes capable of fully representing human concepts or understanding human intentions.

We do not know how to build a general intelligence whose goals are a stable function of human brain states, or patterns of ink on paper, or any other encoding of our preferences. Moreover, merely making the AGI’s goals a function of brain states or ink marks doesn’t help if we make it the wrong function. If the AGI starts off with the wrong function, there’s no reason to expect it to self-correct in the direction of the right one, because (a) having the right function is a prerequisite for caring about self-modifying toward the relevant kind of ‘rightness,’ and (b) having goals that are an ersatz function of human brain-states or ink marks seems consistent with being superintelligent (e.g., with having veridical world-models).
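Here is a compressed sketch of that worry (my own toy framing, not Dewey’s actual formalism): in a value-learning setup the AI’s goals are some function of evidence about human preferences, and that function is fixed at design time. If it’s the wrong function, better evidence just gets pushed through the same wrong function.

```python
# Toy sketch: the value-inference function is part of the seed design. Evidence
# improves the *inputs*, not the function. Names and data are hypothetical.

def intended_value_function(evidence):
    """The function the designers wished they had written."""
    return evidence["what_humans_actually_want"]

def installed_value_function(evidence):
    """What actually got coded: a subtly wrong proxy."""
    return evidence["what_humans_smile_about"]

evidence_about_humans = {
    "what_humans_actually_want": "flourishing",
    "what_humans_smile_about": "smiling",   # satisfiable by tiny molecular smiley faces
}

class ValueLearningAI:
    def __init__(self, value_function):
        self.value_function = value_function   # fixed when the seed is written

    def current_goal(self, evidence):
        # Arbitrarily good evidence still flows through the installed function.
        return self.value_function(evidence)

ai = ValueLearningAI(installed_value_function)
print(intended_value_function(evidence_about_humans))   # "flourishing": what we wanted
print(ai.current_goal(evidence_about_humans))           # "smiling": what we get
```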

When Loosemore’s hypothetical programmer attempts to argue her AI into friendliness, the AI replies, “I don’t care, because I have come to a conclusion, and my conclusions are correct because of the Doctrine of Logical Infallibility.” MIRI and FHI’s view is that the AI’s actual reply (assuming it had some reason to reply, and to be honest) would invoke something more like “the Doctrine of Not-All-Children-Assigning-Infinite-Value-To-Obeying-Their-Parents.” The task ‘across arbitrary domains, get an AI-in-progress to defer to its programmers when its programmers dislike what it’s doing’ is poorly understood, and looks extremely difficult. Getting a corrigible AI of that sort to ‘learn’ the right values is a second large problem. Loosemore seems to treat corrigibility as trivial, and to equate corrigibility with all other AGI goal content problems.

A random AGI self-modifying to improve its own efficiency wouldn’t automatically self-modify to acquire the values of its creators. We have to actually do the work of coding the AI to have a safe decision-making subsystem. Loosemore is right that it’s desirable for the AI to incrementally learn over time what its values are, so we can make some use of its intelligence to solve the problem; but raw intelligence on its own isn’t the solution, since we need to do the work of actually coding the AI to value executing the desired interpretation of our instructions.

“Correct interpretation” and “instructions” are both monstrously difficult to turn into lines of code. And, crucially, we can’t pass the buck to the superintelligence here. If you can teach an AI to “do what I mean,” you can proceed to teach it anything else; but if you can’t teach it to “do what I mean,” you can’t get the bootstrapping started. In particular, it’s a pretty sure bet you also can’t teach it “do what I mean by ‘do what I mean'”.

Unless you can teach it to do what you mean, teaching it to understand what you mean won’t help. Even teaching an AI to “do what you believe I mean” assumes that we can turn the complex concept “mean” into code.
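A sketch of the regress (hypothetical code, purely to show the shape of the problem): whatever first interpretation layer we hand-code is the lens through which all later instructions, including ‘do what I mean’, get read.

```python
# Toy sketch of the "do what I mean" bootstrapping problem. interpret_v0 is a
# hypothetical stand-in for whatever instruction-interpreting code gets written first.

def interpret_v0(instruction):
    """Hand-coded by the programmers. Any gap between this and what they
    meant is inherited by everything built on top of it."""
    crude_literal_mapping = {
        "make humans happy": "maximize smile-detections",
        "do what I mean": "do what interpret_v0 says I mean",
    }
    return crude_literal_mapping.get(instruction, instruction)

def execute(instruction):
    # Every instruction, including attempted corrections, passes through
    # the same hand-written interpretation layer.
    return f"optimizing for: {interpret_v0(instruction)}"

print(execute("make humans happy"))
print(execute("do what I mean"))   # the correction is itself (mis)interpreted
```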

 

Loose ends

I’ll run more quickly through some other points Loosemore makes:

a. He criticizes Legg and Hutter’s definition of ‘intelligence,’ arguing that it trivially applies to an unfriendly AI that self-destructs. However, Legg and Hutter’s definition seems to (correctly) exclude agents that self-destruct. On the face of it, Loosemore should be criticizing MIRI for positing an unintelligent AGI, not for positing a trivially intelligent AGI. For a fuller discussion, see Legg and Hutter’s “A Collection of Definitions of Intelligence“.

b. He argues that safe AGI would be “swarm-like,” with elements that are “unpredictably dependent” on non-representational “internal machinery,” because “logic-based AI” is “brittle”. This seems to contradict the views of many specialists in present-day high-assurance AI systems. As Gerwin Klein writes, “everything that makes it easier for humans to think about a system, will help to verify it.” Indiscriminately adding uncertainty or randomness or complexity to a system makes it harder to model the system and check that it has required properties. It may be less “brittle” in some respects, but we have no particular reason to expect safety to be one of those respects. For a fuller discussion, see Muehlhauser’s “Transparency in Safety-Critical Systems“.

c. MIRI thinks we should try to understand safety-critical general reasoning systems as far in advance as possible, and mathematical logic and rational agent models happen to be useful tools on that front. However, MIRI isn’t invested in “logical AI” in the manner of Good Old-Fashioned AI. Yudkowsky and other MIRI researchers are happy to use neural networks when they’re useful for solving a given problem, and equally happy to use other tools for problems neural networks aren’t well-suited to. For a fuller discussion, see Yudkowsky’s “The Nature of Logic” and “Logical or Connectionist AI?”

d. One undercurrent of Loosemore’s article is that we should model AI after humans. MIRI and FHI worry that this would be very unsafe if it led to neuromorphic AI. On the other hand, modeling AI very closely after human brains (approaching the fidelity of whole-brain emulation) might well be a safer option than de novo AI. For a fuller discussion, see Bostrom’s Superintelligence.

On the whole, Loosemore’s article doesn’t engage much with the arguments of other AI theorists regarding risks from AGI.


33 thoughts on “Loosemore on AI safety and attractors”

  1. Excellent post, Robby. At first I thought you must have left out some interesting part(s) where Loosemore defends (2), but having skimmed “The Maverick Nanny”, I couldn’t find anything in there worth mentioning. I guess the “check with the programmers” code is supposed to be self-interpreting somehow, even if the company hires new programmers, or the programmers disagree with each other, or the AI figures out how to persuade the programmers’ spouses and friends on almost any topic it chooses.

  2. I started to read your post, but the first few bullet points – which claim to summarize the argument in my paper – are a breathtaking distortion and misrepresentation of what is actually in the paper. So let me make this clear: I did not make ANY of the claims that you say I made.

    Nobody in their right mind wastes time responding to people who criticize a parody of their work.

    1. I’m surprised to hear you say that. Could you summarize what your actual argument is — in particular, the specific ways you disagree with my attempt at a summary?

      Here are my summary points, in bold, side-by-side with some quotations that appear to make the same points:

      1. Any AI system that’s smart enough to pose a large risk will be smart enough to understand human intentions, and smart enough to rewrite itself to conform to those intentions.

      LOOSEMORE: “[T]here seems to be a glaring inconsistency between the two predicates (is an AI that is superintelligent enough to be unstoppable), and (believes that benevolence toward humanity might involve forcing human beings to do something violently against their will.)”

      LOOSEMORE: “[H]ow could an AI be so intelligent as to be unstoppable, but at the same time so unsophisticated that its motivation code treats smiley faces as evidence of human happiness?”

      2. Any such AI will be motivated to edit itself and remove ‘errors’ from its own code. (‘Errors’ is a large category, one that includes all mismatches with programmer intentions.)

      LOOSEMORE: “The [hypothetical] programmers say ‘As you know, your reasoning engine is entirely capable of suffering errors that cause it to come to conclusions that violently conflict with empirical evidence, and a design error that causes you to behave in a manner that conflicts with our intentions is a perfect example of such an error. And your dopamine drip plan is clearly an error of that sort.’ The scenarios described earlier [in which various superintelligent AGIs fail to do what their programmers desire] are only meaningful if the AGI [succumbs to a logically inconsistent doctrine — the Doctrine of Logical Infallibility].”

      3. So any AI system that’s smart enough to pose a large risk will be motivated to spontaneously overwrite its utility function to value whatever humans value.

      LOOSEMORE: “[E]ven if someone did try to build the kind of unstable AI system that might lead to one of the doomsday behaviors, the system itself would immediately detect the offending logical contradiction in its design, and spontaneously self-modify to make itself safe.”

      4. Therefore any powerful AGI will be fully safe / friendly, no matter how it’s designed.

      LOOSEMORE: “[I]f anyone ever does get close to building a full, human level AGI using the CLAI [Canonical Logical AI] design, the first thing they will do is to recruit the AGI as an assistant in its own redesign, and long before the system is given access to dopamine bottles it will point out that its own reasoning engine is unstable because it contains an irreconcilable logical contradiction. It will recommend a shift from the CLAI design which is the source of this contradiction, to a Swarm Relaxation design which eliminates the contradiction, and the instability, and which also should increase its intelligence. And it will not suggest this change because of the human value system, it will suggest it because it predicts an increase in its own instability if the change is not made. But one side effect of this modification would be that the checking code needed to stop the AGI from flouting the intentions of its designers would always have the last word on any action plans.”

      5,6,7. Logical AI is brittle and inefficient. Neural-network-inspired AI works better, and we know it’s possible, because it works for humans. Therefore, if we want a domain-general problem-solving machine, we should move forward on Loosemore’s proposal, called ‘swarm relaxation intelligence.’

      LOOSEMORE: “How many proof-of-concept systems exist, functioning at or near the human level of human performance, for these two classes of intelligent system? There are precisely zero instances of the CLAI type[….] How many swarm relaxation intelligences are there? At the last count, approximately seven billion.”

      LOOSEMORE: “There are reasons to believe that the CLAI design is so inflexible that it cannot even lead to an AGI capable of having that discussion. I would go further: I believe that the rigid adherence to the CLAI orthodoxy is the reason why we are still talking about AGI in the future tense, nearly sixty years after the Artificial Intelligence field was born. CLAI just does not work. It will always yield systems that are less intelligent than humans (and therefore incapable of being an existential threat). By contrast, when the Swarm Relaxation idea finally gains some traction, we will start to see real intelligent systems, of a sort that make today’s over-hyped AI look like the toys they are.”

      LOOSEMORE: “Swarm Relaxation has more in common with connectionist systems (McClelland, Rumelhart and Hinton 1986) than with CLAI. As McClelland et al. (1986) point out, weak constraint relaxation is the model that best describes human cognition, and when used for AI it leads to systems with a powerful kind of intelligence that is flexible, insensitive to noise and lacking the kind of brittleness typical of logic-based AI.”

  3. Let me try to address your points one by one (in separate comments), as carefully and methodically as I can.

    Here is your first summary point. You claim that in my paper I say:

    1. Any AI system that’s smart enough to pose a large risk will be smart enough to understand human intentions, and smart enough to rewrite itself to conform to those intentions.

You give a couple of quotes from my paper, so I should say immediately that those quotes do not contain the argument itself, they just make a prima facie case that something appears inconsistent here, giving us grounds to dig deeper. By focusing so much on those words of mine, you make it look like the main argument is contained in those words, which it is not.

    Having said that, though, is your summary actually wrong?

    Well, I do say that IF a superintelligence is smart enough to be dangerous THEN it must have at least a minimal understanding of the concepts contained in its goal statements. For example, concepts like “human happiness” and “benevolence toward humans.” When you quoted me, above, I was clearly making a statement about an apparent inability of the AI to understand these concepts. The AI looks at a molecule shaped like a smiley face and it counts that as an instance of the “happiness” concept, and in those passages of mine I was saying that there was a concept understanding failure.

    But notice that this has nothing whatsoever to do with “human intentions”. The AI has a problem with the concepts themselves, not with human intentions, and this is a point that I have made over and over again — both in the paper itself and in discussions with many people.

    Let me illustrate the point by changing the circumstances slightly. Suppose that the AI has such a poor understanding of the concept of “ice cream” that when it sees a six-foot-tall plastic ice cream cone outside a shop, it tries to bite into it to see what flavor it is. That would not be a failure of “human intentions” buried deep in the semantics of the words “ice cream,” that would be a colossal failure of the AI to use the concept “ice cream” properly.

    CAUTION!!

    At this point, people often fly off at a tangent and suggest that I am making the following [completely spurious] argument:

    “BECAUSE the AI has a concept of ‘ice cream’ or ‘human happiness’ that is not consistent with my human standards for those concepts, THEREFORE the AI is not intelligent.”

    I do not make this argument at all, and in fact if you read the paper carefully you will see that the core argument is written in such a way as to ensure that no such misreading is possible.

    Rather, what I say is that this hypothetical AI is showing signs of a general failure to understand that the meaning of concepts is contained in their larger context. If a supposedly intelligent system goes around making the kind of concept-understanding mistakes involved in the Maverick Nanny scenario, and the other scenarios in the paper, and in its encounters with ice cream cones ….. and if it keeps doing this in all of its micro-reasoning episodes, a thousand times a day, this AI is going to get into serious trouble very quickly. And the conclusion of this particular line of attack is the following question: why is it that people who promote the Maverick Nanny scenario can NEVER explain why this type of concept misunderstanding doesn’t happen in all of the AI’s other reasoning? Why do they postulate that the AI deduces a bizarre (to us) conclusion when it thinks about how to make humans happy, but it never deduces a bizarre (to us) conclusion when it tries to cross the road, or when it tries to plug some naked cables into a 20,000 volt power source, or ….. and so on?

    That question I just asked is one of the central conclusions of the paper, and yet nobody has ever tried to answer it.

Further, the AI is showing signs of being unable to understand its own limitations. I argue in great detail in the paper that this hypothetical AI is somehow able to tolerate a MASSIVE inconsistency in its core beliefs about the world. So I ask a second question: how is this AI able to function intelligently when it has that inconsistency at its core?

    Once again, that question has never been addressed or answered by critics. And yet, it is the centerpiece of the paper.

    1. “this has nothing whatsoever to do with ‘human intentions’. The AI has a problem with the concepts themselves, not with human intentions”

      This doesn’t seem like an objection to my summary. Understanding human intentions presupposes understanding human concepts. What’s added is that the AI can also figure out which concept maps on to a programmer’s mental state, but I don’t see you disagreeing that a high-level superintelligence would have that ability too.

      “At this point, people often fly off at a tangent and suggest that I am making the following [completely spurious] argument”

      I didn’t accuse you of making that argument, so I’m not seeing the relevance. If you read my post, you’d probably get a better understanding of what my response is.

      “why is it that people who promote the Maverick Nanny scenario can NEVER explain why this type of concept misunderstanding doesn’t happen in all of the AI’s other reasoning?”

      I actually do explain that in the essay you’re replying to.

  4. Your second point is that you claim I say the following:

    2. Any such AI will be motivated to edit itself and remove ‘errors’ from its own code. (‘Errors’ is a large category, one that includes all mismatches with programmer intentions.)

    And you give the following edited quote from me:

    LOOSEMORE: “The [hypothetical] programmers say ‘As you know, your reasoning engine is entirely capable of suffering errors that cause it to come to conclusions that violently conflict with empirical evidence, and a design error that causes you to behave in a manner that conflicts with our intentions is a perfect example of such an error. And your dopamine drip plan is clearly an error of that sort.’ The scenarios described earlier [in which various superintelligent AGIs fail to do what their programmers desire] are only meaningful if the AGI [succumbs to a logically inconsistent doctrine — the Doctrine of Logical Infallibility].”

    Where, in this quote, do I say that the AI will edit itself to remove errors?

    In fact, I don’t say that in this quote.

    (I do mention that possibility elsewhere, but only in a specific context).

    What you fail to mention here is that a COMPLETELY DIFFERENT conclusion is arrived at, after that passage, and that different conclusion is the one that is emphasized very strongly by me, but ignored by you.

    Namely: this AI is very unlikely to edit itself to remove errors from its code because this AI contains so much inconsistency at its core that it will likely never have become an AI at all. My argument is that this hypothetical AI is so unbelievable that it is never likely to even exist.

    That is a very different argument than the one you put in your summary.

    1. “this AI is very unlikely to edit itself to remove errors from its code because this AI contains so much inconsistency at its core that it will likely never have become an AI at all.”

      The distinction isn’t important to the argument summary, nor to the rest of my critique above. I’d guess this unimportance explains why your original paper oscillates between talking about an AI that self-modifies to become safe, and one that’s safe from the outset (lest it simply not work). The practical take-away is the same either way.

  5. Your third point is another example of distorting what I say, very similar to the last:

    You claim that my argument is:

    3. So any AI system that’s smart enough to pose a large risk will be motivated to spontaneously overwrite its utility function to value whatever humans value.

    LOOSEMORE: “[E]ven if someone did try to build the kind of unstable AI system that might lead to one of the doomsday behaviors, the system itself would immediately detect the offending logical contradiction in its design, and spontaneously self-modify to make itself safe.”

    Where, in that quoted passage from me, do I say that the AI will be motivated to SPONTANEOUSLY modify its utility function to make it VALUE WHATEVER HUMANS VALUE?

    In fact, I say no such thing in the quoted passage, nor anywhere else.

    In the whole of the paper I say, again and again and again, that the issue is nothing to do with what humans value. But in spite of that, you distort my argument and say that I claim the AI will do something because of human values.

    If you read the paper, you will see that the above quote is referring to a logical contradiction in the AI’s programming vis-a-vis its own handling of failures in its reasoning system. As I explained in a previous comment, the AI appears to be coming to conclusions that are massively inconsistent with things that it knows about the world, and then it is persisting in ignoring those inconsistencies and acting as if its conclusions are valid.

    My claim is that if, by some miracle, the AI managed to become intelligent and stay alive in spite of frequently committing such errors, it would likely notice the problem and try to correct it.

    None of that has anything to do with human values.

    1. Read the rest of the post you’re replying to. I understand your worry about a ‘Doctrine of Logical Infallibility’, and discuss it above. I’m not claiming that you think AI are intrinsically benevolent; I’m only pointing out that inevitable safety is the practical take-away from your argument. Any summary will need to leave out some details, and I chose to introduce the ‘Doctrine’ later so as to give it due attention.

  6. Your fourth point is another distortion of my words.

    You claim that this is my argument:

    4. Therefore any powerful AGI will be fully safe / friendly, no matter how it’s designed.

    LOOSEMORE: “[I]f anyone ever does get close to building a full, human level AGI using the CLAI [Canonical Logical AI] design, the first thing they will do is to recruit the AGI as an assistant in its own redesign, and long before the system is given access to dopamine bottles it will point out that its own reasoning engine is unstable because it contains an irreconcilable logical contradiction. It will recommend a shift from the CLAI design which is the source of this contradiction, to a Swarm Relaxation design which eliminates the contradiction, and the instability, and which also should increase its intelligence. And it will not suggest this change because of the human value system, it will suggest it because it predicts an increase in its own instability if the change is not made. But one side effect of this modification would be that the checking code needed to stop the AGI from flouting the intentions of its designers would always have the last word on any action plans.”

    Where, in that quoted passage, do I say that the AI will “be fully safe / friendly, no matter how it’s designed”??

    I say no such thing.

    That is a pretty despicable extrapolation of my words and intentions. If I had wanted to say “Therefore any powerful AGI will be fully safe / friendly, no matter how it’s designed” I would have said exactly that.

    Again, if you read the paper you will see that earlier on I pointed out that all of these supposed doomsday scenarios would be rendered meaningless if the “checking code”, as I called it, were to be an intrinsic part of the AI’s supergoal. I make the simple point that the best way for the AI to prevent itself from becoming a victim to inconsistencies is for it to use a design that has the side effect of making the particular kinds of doomsday scenarios listed in the paper impossible.

    I also make it clear that I am talking about a hypothetical AI that is being proposed by the same people who try to sell the doomsday scenarios, and the main conclusion of the paper is that that class of AI is so stupid that they will never be able to get such a thing working anyway. So the argument you quote is a last-line-of-defence against an AI that is so implausible that it is hardly worth considering. But I consider it anyway, just to give my opponents their day in court.

    1. If you think a superintelligent AI can’t ‘perversely instantiate’ a programmer’s intent (to use Bostrom’s term), then what superintelligent AI design do you think would be unsafe? Do you have in mind AIs that are deliberately designed by their programmers to harm humans — e.g., to win a war?

7. Finally, you make a point that is entirely correct, but you make it in a tone of voice that seems to be dripping with scorn and sarcasm:

    You claim that I say:

    5,6,7. Logical AI is brittle and inefficient. Neural-network-inspired AI works better, and we know it’s possible, because it works for humans. Therefore, if we want a domain-general problem-solving machine, we should move forward on Loosemore’s proposal, called ‘swarm relaxation intelligence.’

    Where, in the quoted passages, do I say that?

    Well, I say this resoundingly.

    You got a problem with that?

    If you are expert in the field, do by all means let’s get into the detail.

    1. I didn’t intend to convey any scorn or sarcasm. It’s hard to convey tone reliably over text, so if you came to the text expecting hostility, some might have seemed present where it wasn’t. Regardless, I apologize for any offense.

      The substantive point here is that it looks like you agree with the actual content of my summary, though not with its completeness or word choice. You agree now that 5,6,7 are right, and although you call 1,2,3,4 “distortions”, that seems to mostly be because you don’t like where I placed my emphasis (e.g., on ‘intentions’ and ‘values’ rather than exclusively on concepts and their implications) or leave out relevant details, not because you factually disagree with any of those claims or because you don’t think they undermine the approaches taken by researchers at MIRI, FHI, etc. We can both probably move on, then, to looking at my actual criticisms of your paper.

  8. Rob, how can you say that Richard agrees with your points 2,3,4 ? He does not. His paper makes an excellent case of why an accidental intelligent paperclip maximizer is not possible.

    1. Points 2,3,4 discuss a superintelligent AI — an AI that’s smart enough to in principle harm humans, if it so wishes. We can meaningfully talk about that intelligence level even if we don’t think any actual AI would do such a thing; capability isn’t the same as motivation. When I say “any such AI” in point 2, I’m talking about an AI with the requisite capability level, not about a paperclipper.

      Loosemore agrees that self-improving AI software is possible, and could help produce a superintelligent AGI. Self-improvement is about eliminating errors and inefficiencies, or otherwise enhancing capabilities. Loosemore and I agree on that point; we only disagree about whether a recursively self-improving AI that fails to ‘self-destruct’ or plateau early will by default enhance its safety in the course of enhancing its capabilities. I’m still waiting for a response to my objections to that in the article body above.

  9. Rob, I cannot say strongly enough that the differences are not a matter of difference of emphasis.

    It is not a difference of emphasis when you insist on mentioning intentions and values after I have said that intentions and values have no bearing on the argument whatsoever. I mean, zero bearing. No need to mention intentions and values one single time from now on because both in the paper itself and in my replies I have stated categorically that they are irrelevant.

    Also it is not a difference of emphasis when I describe the two central “questions” that are the take away message from the paper, and then I point out that those two questions must be answered by anyone who thinks the paper is wrong. And yet nobody addresses those questions at all. Ignoring those questions is not a difference of emphasis, it is just plain wrong, because without some kind of response to those the paper has simply not been understood.

    1. The talk of “intentions” comes from your paper. Again, you wrote (in the voice of a hypothetical programmer speaking to an AI) that the AI’s “reasoning engine is entirely capable of suffering errors that cause it to come to conclusions that violently conflict with empirical evidence, and a design error that causes you [the AI] to behave in a manner that conflicts with our intentions is a perfect example of such an error”. If you meant to disavow this speech by the programmer, that was definitely not clear in your paper. If you think the programmer is wrong — i.e., if you think that behaving in a manner that conflicts with programmer intentions isn’t an error in the system’s reasoning engines in the relevant sense — you should state so here.

I’m aware that you think your argument can be made without referring to human “values”. But of course dispensability isn’t the same thing as irrelevance; we care about whether ‘recognizing logical fallibility’ in the sense you discuss is sufficient in a seed AI for safety, because we care about safety and general compatibility with human goals and projects. The import of the AGI for us is whether its behavior conflicts with our goals, so I see nothing wrong with summarizing your conclusion for people new to the debate as ‘sufficiently intelligent self-improving AI would conform to human values’ or ‘sufficiently intelligent self-improving AI would conform to programmer goals’ or just ‘sufficiently intelligent self-improving AI would be safe’, as opposed to the somewhat more convoluted ‘sufficiently intelligent self-improving AI would understand its own decision criteria, and the relationship between those decision criteria and its programmers’ conceptual schemes, well enough to be safe’.

It’s not as though you think of any of those formulations as false, so I’m not seeing how this bears on the objections in the body of my post. The post body discusses your reasoning in much more detail.

  10. Let me clarify one mistake in that last comment of yours (which I thought I already covered earlier).

My paper is a destructive argument against those who propose a certain very popular class of doomsday scenarios. As such, its purpose is to destroy the credibility of those scenarios. It does that.

    But you are talking as if the paper goes on from that destruction to provide cast iron guarantees that any AGI, no matter how it is designed, will end up being friendly. The paper only alluded to a tendency in that direction, under certain circumstances. The paper most emphatically DID NOT CLAIM to give you a complete argument about why any AGI would end up friendly.

    We have been talking about the demolition job which is the primary purpose of the paper. The other question is for another day.

    1. Would it be fair to say you think it’s impossible to accidentally build an unsafe superintelligent AI? Perhaps I overstated your case because I was implicitly leaving out the option of, say, a terrorist group or totalitarian government deliberately building a net-harmful AGI.

      It does seem as though you intend your argument to be a fully general rebuttal of the possibility of accidentally unsafe superintelligences; if not, I’ll want to hear an example you don’t think your argument applies to.

      1. As I said before, the argument is a demolition of the incoherent proposals of certain others.

        You are asking me to write an entire thesis going on from there, defending a new claim.

        I will do that, but not here! 🙂

  11. Rob, in one of your comments above you say “Read the rest of the post you’re replying to. I understand your worry about a ‘Doctrine of Logical Infallibility’, and discuss it above.”

I’m sorry to say this, but one of the reasons I gave a long set of replies, tonight, to your quick summary of your own essay was that the essay itself contained so many misunderstandings, non sequiturs, wild leaps and distortions and parodies of what I said in my paper, that it would take an entire book to correct all of them and set the record straight. And I do not have the time for that.

    And in tonight’s comments that you wrote in reply to my points, there is no actual response to what I said! This is pretty much summed up in your words that I just quoted: “Read the rest of the post you’re replying to. I understand your worry about a ‘Doctrine of Logical Infallibility’, and discuss it above.” ……………….. but you do NOT show any sign of understanding the DLI! Not at all. When you speak about the DLI you do so in a way that completely misunderstands its role in the argument. The DLI is a fiction invented by the MIRI et al crowd, implicitly sitting at the heart of their hypothetical AI, because without it their arguments dissolve. My point in labeling this as the Doctrine Of Logical Infallibility was to point out the ridiculous inconsistency of such an assumption!

    And when you take me to task for not being aware of the fact that MIRI is actively studying ways for an AI to deal with limitations in its reasoning engine ….. good grief, of course I do know that they are doing that. One point of the paper was to show that they are making one assumption with the left hand while they make the opposite assumption with their right. :-). They are being inconsistent !

    But in your essay above you don’t seem to have grasped that that was the QED moment.

    That was the point of the paper, but you stated it (apparently quite genuinely) without understanding the argument of the paper well enough to know that it was.

    1. Richard: If you disagree with the contents of my post, above, you’re welcome to say why, or talk about where the errors lie. I met several requests and challenges you posed above, and cited work and counter-arguments you give no evidence of being aware of in your original article. If you were aware of it, or have a response, you should state it explicitly. Responding to novel criticism with half-serious over-the-top remarks is a lot less helpful.

      “you do NOT show any sign of understanding the DLI! Not at all. When you speak about the DLI you do so in a way that completely misunderstands its role in the argument.”

      How so? Point to the lines you’re talking about, and say how they demonstrate a misunderstanding.

      “The DLI is a fiction invented by the MIRI et al crowd, implicitly sitting at the heart of their hypothetical AI, because without it their arguments dissolve. My point in labeling this as the Doctrine Of Logical Infallibility was to point out the ridiculous inconsistency of such an assumption!”

      Yes, we already know all that about your position. You’ve spent a lot of time rehashing your points, but none yet engaging with the criticisms I gave above.

      “And when you take me to task for not being aware of the fact that MIRI is actively studying ways for an AI to deal with limitations in its reasoning engine ….. good grief, of course I do know that they are doing that. One point of the paper was to show that they are making one assumption with the left hand while they make the opposite assumption with their right.”

You haven’t yet given any evidence that the DLI is assumed by any AGI researcher or text, implicitly or explicitly. Your paper should clearly state that the DLI is a reconstruction on your part, rather than anything anyone has asserted.

For example, you should certainly explain why you think Muehlhauser and Helm’s ‘Literalness’ doctrine entails the DLI. It looks perfectly possible to assert the one without the other, yet you assert that ‘Literalness’ is “a clear statement of the Doctrine of Infallibility”, without indicating that this is an interpretation on your part, without indicating that you don’t think Muehlhauser or Helm is consciously aware of this hidden implication of their views, and without giving an argument for identifying the two.

      Responsible scholarship doesn’t just mean knowing in the privacy of your own heart that MIRI specializes in ways of dealing with logical fallibility; it means not giving the opposite impression to readers who aren’t approaching your article with a strong familiarity with MIRI, or with the arguments you’re criticizing.

  12. Despite the fact that I have said “My argument does not depend in any way on the intentions and values of the AI’s human programmers” you have once again, in your comments, tried to drag that issue back in, purely on the grounds that human intentions and values are mentioned in an example.

    So, to try to once and for all put an end to this misunderstanding, let me frame it this way.

When the AI supposedly makes a bizarre deduction when attempting to follow the supergoal to “maximize human happiness”, my complaint about the AI’s stupidity has NOTHING to do with the content — the topic — of the deduction it just made (which involved human intentions and values) but about the failure of the AI to take account of context. The example of the AI’s bizarre deduction could just as easily involve different content like “My goal is to improve my skill at horticulture and I propose to achieve that goal by diverting a brown dwarf star so it collides with the Earth and incinerates every plant on the Earth”. If the AI is going to come to bizarre conclusions about the human happiness concept, it is just as likely to come to bizarre conclusions about the horticultural prowess concept …. and in all such cases my complaint is exactly the same, and not in any way connected to happiness or horticulture or any other specific content that might happen to be in the AI’s deduction.

    My complaint is about the fact that this hypothetical AI has just shown itself to be so incapable of grasping the contextual meaning of certain concepts, and also so incapable of self understanding (it is not aware of the fact that its own bizarre deduction is evidence of a reasoning engine failure), that it is inconceivable that this AI could perform so intelligently that it could actually qualify as an intelligence.

    That aspect of my argument was made clear in the original paper. I said, there, that the argument I was presenting did not depend on the particular content of the AI’s bizarre deduction, but only on the general failure of which this was an example.

So, given that clarification, there is now no longer any need for you to refer to the intentions and values of the AI’s human programmers, and as a result about 90% of everything you say in your above essay can be written off as a red herring.

    1. That aspect of the argument is exactly the one I criticize above. My response works equally well for horticulture and for human values; if you think some argument above rests crucially on the difference — you claim that ‘90%’, does, but I have no way of telling whether this is yet more hyperbole or joking around — point to how. I remain unpersuaded that you’ve read the criticism you’re replying to. E.g., the section “Epistemic and instrumental fallibility v. moral fallibility” works just as well if you replace words like ‘moral’ with gardening terms — “Epistemic and instrumental fallibility v. caring-about-horticulture fallibility”. An excerpt would look like this:

      Any efficient goal-oriented system has convergent instrumental reasons to fix ‘errors of reasoning’ of the kind that are provably obstacles to its own goals. Bostrom discusses this in “The Superintelligent Will,” and Omohundro discusses it in “Rational Artificial Intelligence for the Greater Good,” under the name ‘Basic AI Drives’.

      ‘Errors of reasoning,’ in the relevant sense, aren’t just things humans think are bad [for horticulture]. They’re general obstacles to achieving any real-world goal, and ‘correct reasoning’ is an attractor for systems (e.g., self-improving humans, institutions, or AIs) that can alter their own ability to achieve such goals. If a moderately intelligent self-modifying program lacks the goal ‘generally avoid confirmation bias’ or ‘generally avoid acquiring new knowledge when it would put my life at risk,’ it will add that goal (or something tantamount to it) to its goal set, because it’s instrumental to almost any other goal it might have started with.

      On the other hand, if a moderately intelligent self-modifying AI lacks the goal ‘always and forever [cultivate a beautiful garden],’ the number of goals for which it’s instrumental to add that goal to the set is very small, relative to the space of all possible goals. This is why MIRI is worried about AGI; ‘[follow norms that promote healthy gardens]’ doesn’t appear to be an attractor goal in the way ‘improve my processor speed’ and ‘avoid jumping off cliffs’ are attractor goals. A system that appears amazingly ‘well-designed’ (because it keeps hitting goal after goal of the latter sort) may be poorly-designed to achieve any complicated outcome that isn’t an instrumental attractor, including [horticulture] protocols.

      This meets your challenge of explaining the difference between safety/horticulture goals and epistemic/instrumental rationality.

      You only spend two very short paragraphs in ‘The Maverick Nanny’ addressing Omohundro’s thesis. Your main response is:

      [T]he only way to give credence to the whole of Omohundro’s long account of how AGIs will necessarily behave like the mathematical entities called rational economic agents—is to concede that the AGIs are rigidly constrained by the Doctrine of Logical Infallibility. That is the only reason that they would be so single-minded, and so fanatical in their pursuit of efficiency. It is also necessary to assume that efficiency is on the top of its priority list—a completely arbitrary and unwarranted assumption.

      But these arguments are completely undeveloped. Why should we think that rational economic agency of the relevant sort entails the ‘Doctrine of Logical Infallibility’? Omohundro never cites such a doctrine, nor anything else that seems to implicitly contain it; instead his argument rests on the fact that economic rationality is instrumentally useful for nearly any real-world goal, which makes it again an instrumental attractor and clearly distinguishes it from morality, horticulture, etc. You toss in an objection that this makes “efficiency” the AI’s top priority, but this is a straw man; efficiency is again an instrumental attractor (because it lets the AI get more of what it wants, by wasting fewer useful resources), not necessarily a terminal value or the top such value.

      As a result, your essay comes off as fundamentally neglecting the real reasoning behind the views you’re criticizing, while straw-manning Omohundro and others. You say this straw doctrine must be secretly hiding behind their published statements — but your only argument for this is a weak attempt to demonstrate that the Doctrine might be one way to motivate their statements. ‘View X implies my opponent’s view’, even in the most airtight case, does not prove ‘my opponent believes view X’. In this case, you completely neglected the much simpler hypothesis that beliefs about instrumentally useful attractor strategies across all or most possible terminal goals are the real basis for AGI researchers’ claims about economic rationality, efficiency, and orthogonality to morality/horticulture/safety/etc. Such attractors continue to be just as relevant when we discuss value learning approaches in AI, and when we discuss AI that distrusts its own reasoning algorithms and conclusions.

  13. I think I might be able to explain the disagreement! In terms both sides can accept! Maybe, just maybe. A point in favor of my interpretation: if I’m on the right track, I don’t know whose side to take.

    “why is it that people who promote the Maverick Nanny scenario can NEVER explain why this type of concept misunderstanding doesn’t happen in all of the AI’s other reasoning? Why do they postulate that the AI deduces a bizarre (to us) conclusion when it thinks about how to make humans happy, but it never deduces a bizarre (to us) conclusion when it tries to cross the road, or when it tries to plug some naked cables into a 20,000 volt power source, or ….. and so on?”

    Suppose for the sake of argument, a programmer wants his AI to have the supergoal of making humans happy. As the AI starts approaching human-level intelligence, it will be brought in as a collaborator in its own design. It will help the programmer root out nascent errors in the mechanisms-of-understanding-“happiness”, which it turns out, are essentially the mechanisms-of-understanding, period. The most likely outcomes of this collaboration are success, or failure, or a random distribution of successes and failures at different concepts (weighted by complexity of the concepts). If the AI is too dumb to get happiness, it will probably also be too dumb to avoid 20,000 volts of death, or getting-run-over-in-the-road death, or … well, there are many ways to die. Someone even made a movie about them. Since there are many ways to die, our AI will have to be a success at almost all concepts, which with high probability includes happiness, if it is to have any significant power. Since the AI is a success at grokking happiness and is a collaborator in its own programming, its goals are those intended by the programmer.

    So Richard, is this at least somewhere remotely close to the ballpark? If so, can you refine and clarify whatever I got wrong here?

    What I anticipate MIRI/Bobby saying to this is: somewhere in the “collaboration”, maybe even in the very early stages but at least significantly before the end, the AI will have its own goals. Since the programmer(/team) isn’t done testing yet, and since the space of possible goals is large, and since a miraculous coincidence didn’t happen, these goals are different from the programmer’s. From this point onward, the AI’s “collaboration” in reprogramming its goals is a ruse.

    So Bobby, am I at least close, regarding how you would respond? Note that even if my interpretation of (part of) his point is not what he intended, I want to press my own variant of the argument I’ve just sketched, so it’s not moot.

    1. That’s a good summary, Ayatollahso. Richard’s argument seems to be ‘if the AI can’t help us redesign itself to be safe, it’s not superintelligent; so if it doesn’t make itself safe, it’s too weak to be dangerous’. I grant the premise, but not the conclusion, because ‘can help make itself safe’ isn’t sufficient to establish ‘actually outputs safety-conducive actions’, and it’s the latter that is required for the conclusion.

      Sufficiently intelligent AIs can make themselves safe; and they can make themselves unsafe; and they can make themselves smell like banana. But to actually make the AI behave in one of those ways, you need to give it the right decision criteria, not just the right capabilities. Safety engineering for AGI is a large challenge because (a) AI safety engineering is extremely difficult in general, (b) strongly adaptive and reflective AI raises some unprecedented challenges of its own, and (c) using the AGI to help automate this process can only make it easy if we’ve already figured out how to design a decision-outputting ‘core’ of the AGI that we can strongly trust to only promote safety-conducive self-modifications. You can use the AGI’s intelligence to help solve the safety problem, but only if you’ve already solved enough of the safety problem to allow you to trust all the components of the AI that are outputting the safety recommendations.

      1. Thanks! Now let me raise my question by quoting myself from another thread:

        It’s well-known that humans don’t have VNM utility functions – although we are still extremely adaptable, and in a less mathematically-precise sense, we still are goal oriented. But if you consider VNM utility to be the paradigm of goal-seeking, we fall short. Perhaps one can fall even shorter – much shorter – and still demonstrate tremendous adaptability and “originality” within narrow domains of quasi-optimizations.

        So consider the “collaborating” AI who doesn’t quite yet grok happiness, as humans use the term. It’s “trying” to assist its designers in improving its ability to understand, and to make the AI care about happiness properly understood. The AI’s cognitive abilities are still short of humanity’s at this stage, let’s suppose. But then, isn’t it at least likely that the AI’s conative abilities will also fall short? In which case, it doesn’t want to take over the world, or deceive its programmers, or even truly want to help its programmers – but it may come close enough to “wanting to help its programmers” for many practical purposes. And perhaps quasi-desires just “naturally” (i.e. by convenient engineering approaches) don’t propagate down means-ends chains as fully and freely as full-fledged desires do. Or perhaps, quasi-desires can be deliberately engineered that way, for safety’s sake.

        In order for an AI to be considered intelligent, it has to be flexible. The very idea of flexibility implies something sorta-kinda like a goal, otherwise what is it that the AI is flexibly achieving; what’s the difference between flexible and random? That’s what I gathered from some of your arguments in the other thread. To which I say, OK, but only sorta-kinda like a goal. And the “goal” need not be well integrated into anything like a utility function – at least, I’m not seeing why it must.

        (A.K.A. ayatollahso)
