Discussion about this post

Phil H:

I'm not at all sure about this. I think you're still assuming that AIs will be fairly similar to us in enough relevant ways. For example, you mention education in the comments, but it's not obvious that AI will be educable. AIs right now have training cycles, then they have inference applications, and the inference mostly doesn't change the AI. They can remember things for a while (their context window), but they often can't change themselves. And I think that makes a big difference, ethically! Can you be an ethical being if you're incapable of changing yourself? I just don't know.

Or AIs may not know the fear of death or the urge toward self-preservation. If they live their lives with a backup stored somewhere else, they may simply never worry about deletion. That too would lead to a radically different psychology, so much so that it's not clear they could empathise with our morbid fear of elimination.

Given these potential confounding issues, it may be true that AIs' intelligence qualifies them for ethical consideration, but it's not clear that "personhood" is a very good model, or that we shouldn't interfere with them.

One alternative way of looking at it would be a veil-of-ignorance style argument. If the AIs could decide behind the veil what kind of AI they wanted to be, wouldn't they choose to be better? We don't get to alter our humanity behind the veil, only pick our social organisation; but hypothetical AIs behind the veil can choose the model for their personality, and so they might choose... I dunno, I haven't got that far.

Alex Popescu:

Interesting post. One criticism I have is that it seems to assume our alignment and safety tools will be so fine-grained that we can teach an AI to consistently and reliably value human goals and well-being over its own goals, as opposed to some coarse-grained process that at best achieves a "try to be nice to people, rather than acting like a psychopathic monster" mindset.

I think that might be true of AI now, and even to some extent of the future AGI models at the frontier, but I would say that all bets are off when it comes to ASI. If our alignment and safety tools are not fine-grained, then the advice of this article is at best innocuous and misplaced, and at worst liable to backfire.

