Infinite Possibility Space and the Shutdown Problem
This post is a response to the recent Astral Codex Ten post, “CHAI, Assistance Games, And Fully-Updated Deference”.
A brief summary of the context, for any readers who are not subscribed to ACX or familiar with the shutdown problem:
The Center for Human-Compatible Artificial Intelligence (CHAI) is a research group at UC Berkeley. Their researchers have published on the shutdown problem, showing that “propose an action to humans and wait for approval, allowing shutdown” strictly dominates both “take that action unilaterally” and “shut self down unilaterally.”
MIRI discusses a counterexample: a toy case in which the AI has a finite number of policy options available. They argue that “learn which of that finite set of options is best according to humans, then execute it without allowing humans to shut it down” can dominate “propose an action to humans and wait for approval.”
I claim that the AI being “larger” than its value-space is a critical ingredient in its being able to conclude that it has reached its terminal point in value-space. I posit that, given a value-space that is “larger” than the AI, the AI will accept shutdown. Here I present an argument that, for at least one AI architecture and structure of value-space, the “propose action and allow shutdown” option should dominate much of the time.
Assume that a current AI model A contains a deep neural net connected to some decision procedure, of a specified, finite size (e.g. 16 layers with 1024 nodes each).
Then assume that human values are best captured by some ideal AI Z with the same structure [1] but of unknown size (e.g. an unknown number of layers and of nodes in each layer).
Further assume that we can specify that A’s action-space is to set the weights of its existing nodes, then propose actions to humans who can allow the actions or veto them by shutting down A.
This search space for values is infinite-dimensional. In particular, the “correct” value can always be an architecture which has additional layers or additional nodes in each layer, beyond what A is capable of expressing.
This is a well-formed prior, and also presents a well-formed alternative interpretation within A’s conception (assuming this is compatible with the limits of the AI’s action-space, though I think this is a reasonable assumption given sufficiently large architecture.)
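As a rough illustration of what such a prior could look like, here is a minimal sketch. It assumes, purely for the sake of example, a geometric prior over the number of layers in Z; the decay rate of 0.9 and the function names are hypothetical and not part of the argument above. The only point it shows is that a normalizable prior can still leave nonzero probability on architectures larger than any fixed A.

```python
# Illustrative only: a geometric prior over the depth of the ideal network Z.
# The decay rate and all names here are assumptions made for this sketch.

def geometric_prior(depth: int, decay: float = 0.9) -> float:
    """P(Z has exactly `depth` layers), defined for depth >= 1."""
    if depth < 1:
        return 0.0
    return (1.0 - decay) * decay ** (depth - 1)


def mass_beyond(capacity: int, decay: float = 0.9) -> float:
    """P(Z has more than `capacity` layers), in closed form."""
    return decay ** capacity


A_DEPTH = 16  # the example size of A used in the text

# Probability that Z fits within A's depth, summed term by term.
p_within = sum(geometric_prior(d) for d in range(1, A_DEPTH + 1))

# Probability that Z is deeper than anything A can express.
p_beyond = mass_beyond(A_DEPTH)

print(f"P(Z fits in A)   = {p_within:.4f}")
print(f"P(Z exceeds A)   = {p_beyond:.4f}")
```

Under this assumed prior, however large A_DEPTH is made, mass_beyond stays strictly positive, which is the sense in which the value-space is "larger" than A.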
Per MIRI, when a human decides to shut down A, A has the following options:
1. Optimize existing best-guess A0 at time T0
2. Optimize A’s updated best-guess A1 at T1 after update
2.1. Optimize A’s fully updated best-guess Afinal at Tfinal
3. Allow shutdown and human action
Since humans have attempted shutdown, (1) should be low value unless the expected value of A0 between T0 and T1 is extremely high [2].
However the AI now has a more detailed hypothesis for what might happen in (3): humans may run a new AI B with a different (larger) network shape at a later time T2.
Then (3) has expected value at least equal to P(humans run B) * EV(B optimizes B0 at T2), where B0 may be some policy that A is not capable of running (for example because it is too small).
That is, assuming P(humans run B) is sufficiently high and the expected value of A1 between T1 and T2 is not too high (and note that if the update takes a long time, T1 could even be later than T2), the AI can take shutdown as evidence that the value of B0 is sufficiently greater than that of A1, and can update toward (3) > (2) and even (3) > (2.1).
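The comparison above can be sketched numerically. This is a toy sketch under stated assumptions, not a model of any real system: the Scenario class, its field names, and the example numbers are all hypothetical. It only encodes the lower bound P(humans run B) * EV(B optimizes B0 at T2) and checks it against the value of A continuing to optimize its own best guesses.

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    """Hypothetical expected values, in arbitrary units."""
    p_humans_run_b: float  # P(humans run B) after A allows shutdown
    ev_b0: float           # EV(B optimizes B0 at T2)
    ev_a1: float           # EV(A optimizes its updated best guess A1)
    ev_a_final: float      # EV(A optimizes its fully updated guess Afinal)


def shutdown_lower_bound(s: Scenario) -> float:
    """Lower bound on the value of allowing shutdown, as given in the text."""
    return s.p_humans_run_b * s.ev_b0


def prefers_shutdown(s: Scenario) -> bool:
    """True if the lower bound alone beats both 'keep optimizing' options."""
    bound = shutdown_lower_bound(s)
    return bound > s.ev_a1 and bound > s.ev_a_final


# Made-up numbers: B0 is assumed to be worth more than anything A can run.
likely_successor = Scenario(p_humans_run_b=0.8, ev_b0=100.0,
                            ev_a1=50.0, ev_a_final=70.0)
unlikely_successor = Scenario(p_humans_run_b=0.5, ev_b0=100.0,
                              ev_a1=50.0, ev_a_final=70.0)

print(prefers_shutdown(likely_successor))
print(prefers_shutdown(unlikely_successor))
```

Because the function uses only a lower bound, a False result does not mean A should resist shutdown; it means this particular bound is not enough to settle the question.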
1: While the space of neural net architectures is easier to do analysis on, the space of all AI architectures can be counted as Turing machines under a Solomonoff prior, so this should not require A to be myopic about the specifics of architecture.
2: It is unclear to me whether this T0/T1 gap is a feature or a bug. For example, we might attempt to shut down A if it wants to fire a gun out the window. However, if it sees someone outside about to throw a grenade into the research lab or release a bioengineered pandemic, we might want it to act under time pressure and explain itself or allow shutdown afterward. Also, once the urgent moment has passed it should return to deference (e.g. it knows something important now, but knows it doesn’t have the full picture to run the universe yet). In practice I would expect the gaps between T0/T1/T2 to be relatively small (hours/days/months/years to reach final models) whereas time after T2 should be very large (centuries, millennia, millions of years) and have greater scope for action due to buildup time, such that concerns about optimizing after T2 easily dominate almost all concerns before T2. However, this also seems to have gaps, such as if the AI concludes that all humans are suffering horribly and definitely need to be given heroin before a new model is built (but doesn’t understand that this will pollute its value estimate for the rest of the future).