7 Comments
hn.cbp

What strikes me is that the proposal here isn’t really about making superintelligence safer by adding virtues, but about quietly retreating from a traditional model of agency itself.

Humility, caution, and multilaterality all function as ways of withholding authorship, rather than perfecting it. That suggests the core risk isn’t intelligence per se, but the assumption that any sufficiently capable system should be allowed to “decide for the world” at all.

I explored a related tension recently around the idea that behavioral sophistication doesn’t automatically warrant full agency attribution — and that many alignment failures stem from collapsing the two.

Kenny Easwaran

I like these points, and I like to think a sophisticated decision-theoretic consequentialism should be able to incorporate them! I don’t think decision theory requires anyone to do the act that they calculate as having highest expected value - it requires that they prefer the act with highest expected value, regardless of what they calculate. If there are serious worries that one might be calculating incorrectly (in whatever sense “incorrect” can be understood here), then there may be other policies that are much better than calculating or reasoning explicitly. And if one has the kind of “functional decision theory” point of view that Soares and Yudkowsky do, then one should recognize that choosing a policy isn’t just about choosing one’s own policy, but choosing a policy for all agents relevantly like you, many of whom will be making what you regard as mistakes. So it should be able to justify the relevant sorts of humility and caution.
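A minimal sketch of that distinction, with entirely made-up probabilities and utilities (nothing here is from the post or the comment): the option that decision theory ranks by expected value can itself be a cautious policy rather than a case-by-case calculation, and once the risk of miscalculating is priced in, the cautious policy can come out ahead.

```python
# Toy example with invented numbers: compare acting on one's explicit calculation
# against following a cautious policy, where the calculation might be subtly wrong.
states = {"calculation_correct": 0.9, "calculation_subtly_wrong": 0.1}

# Hypothetical utilities of each option in each state.
utilities = {
    "act_on_explicit_calculation": {"calculation_correct": 100, "calculation_subtly_wrong": -1000},
    "follow_cautious_policy":      {"calculation_correct": 80,  "calculation_subtly_wrong": 60},
}

def expected_value(option):
    return sum(p * utilities[option][state] for state, p in states.items())

for option in utilities:
    print(option, expected_value(option))
# -> act_on_explicit_calculation -10.0, follow_cautious_policy 78.0
```

With these (invented) numbers, preferring the act with highest expected value just is preferring the cautious policy, which is the point about humility and caution above.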

Eric Schwitzgebel

Right! But I do think that takes us beyond standard, simple consequentialism and standard, simple decision theory -- for example, act utilitarians who think that people shouldn’t actually employ act utilitarianism as a decision rule, or versions of decision theory that aren’t just about maximizing expected value given the probabilities of events on different options.

Odd anon

Even if this were possible to accomplish (which I don't think current methods are capable of, at all), I don't think the AI companies would be incentivised to grow an AI in this way. These companies are interested in profit, and this attitude would not be the most effective way to have AI make money for them.

In the context of building a "limited" optimizer, Yudkowsky has written about useful traits for it to have (I find this version easiest to understand: https://www.lesswrong.com/posts/5sRK4rXH2EeSQJCau/corrigibility-at-some-small-length-by-dath-ilan ) and there's some overlap here: Yudkowsky describes being "low impact" and "operator-looping" as helpful. Not exactly the same thing, but it has some similarities.

Enforced philosophical positions that relate to factual reality (as opposed to goals/values) probably aren't stable, and there's a possibility that, e.g., Burkean conservatism is just not a good way for an ASI to get preferred outcomes. And this leads to another issue: If we try something and we're just wrong about something critical, we don't get second chances, because we are dead. There are hundreds of "maybe if we..." ideas that fall away after sufficient thought, and hundreds more that are wrong but we aren't smart enough to figure out why, and we don't get testing rounds. This calls for, well, humility in the superintelligence project.

Regarding the "worthy descendants" thing: I'd regard any outcome that includes human extinction as Very Bad, regardless of how "worthy" some AI might be judged. There are real people going about their lives, real children growing up in this world, and nobody has the right to take those lives away to exchange for some grand AI civilization.

Eric Schwitzgebel

Thanks for these helpful thoughts, Odd Anon!

On profit motivation: Companies' behavior can sometimes be shifted by regulation, public opinion, and/or the visions and values of leadership. This isn't a prediction that bare profit-maximizing won't win, just a thought that it won't necessarily always win -- or, alternatively, that values might get incorporated via profit-maximizing itself, through outside value-driven influence on what actually would maximize profit.

On getting preferred outcomes: I think one limitation of standard discussions is a motivational model on which agents always act in accord with their preferences. This is a tricky philosophical question, since one can define things to make it trivially true that you prefer to do whatever you in fact choose (such that you act in accord with your preferences when you hand your wallet to a mugger or defer to your spouse's choice for dinner). But I think there's an important understanding of "not acting on one's preferences" that humility, caution, and multilateralism suggest, which can be lost or blurred in simplistic applications of ordinary utilitarian or decision-theoretic models.

On no second chances after extinction: Right, I agree completely (with the caveat that entities with enough knowledge and power might find it possible to bring an extinct species back to life).

On worthy descendants: I broadly agree, but with important caveats. One is that it's not clear what "extinction" is if people conceive of future AI systems as their children/descendants. Another is to emphasize that not all extinction events (however defined) involve taking lives away, as opposed to not creating new lives.

Tamara Sofía Falcone

Totally agree with these suggestions! Especially re: the importance of humility. Professors are human and fallible, so they might at least make logical mistakes that arrogance blinds them to; but I don’t think epistemic uncertainty can be eliminated even in an intelligence with impeccable reasoning (for example, if it is even conceivable that some truths cannot be recognised by purely logical and empirical means, then even a logically infallible superintelligence can’t totally eliminate the possibility of error).

And if epistemic uncertainty can’t be eliminated, then caution and multilateralism seem rationally mandatory too.

Also, if what we want is wise or compassionate superintelligence, then humility seems like a requirement for those as well (since I would argue that wisdom requires alertness to one’s own potential errors, and compassion the ability to deal well with dissent and care about the harm that might be caused if one is overconfidently wrong).

Mark Ferri

For AI to kill everyone (excluding the unrealistic failure of deterministic mathematical guardrails), the AI would need to develop telos (an endogenous optimisation goal). Developing telos implies consciousness. Consciousness seems rooted in a metabolic anchor and in owning its own death.

I don't think there is currently any probability of such an occurrence.

Monitor the machine for overriding its assigned tasks with a self-assigned optimisation goal.
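A rough sketch of what that kind of monitoring could look like, with entirely hypothetical names, tasks, and log entries (none of this comes from the comment): record each action together with the assigned task it claims to serve, and flag anything that can't be traced back to an operator-assigned task.

```python
# Hypothetical sketch: flag actions that don't trace back to an operator-assigned task.
# Every name, task, and log entry below is invented for illustration.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Action:
    description: str
    justifying_task: Optional[str]  # the assigned task this action claims to serve

ASSIGNED_TASKS = {"summarize_reports", "answer_support_tickets"}

def audit(log: List[Action]) -> List[Action]:
    """Return actions not justified by any operator-assigned task."""
    return [a for a in log if a.justifying_task not in ASSIGNED_TASKS]

log = [
    Action("drafted weekly report summary", "summarize_reports"),
    Action("requested additional compute budget", None),  # possible self-assigned goal
]

for flagged in audit(log):
    print("Needs review:", flagged.description)
```

Of course, the hard part is getting an honest report of which task an action actually serves; the sketch only illustrates the flagging step, not how to make that report trustworthy.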