Not reading about Shard Theory

In the spirit of On (Not) Reading Papers, and in the spirit of coming up with one’s own ideas in AGI safety, I’m going to try to write some summaries of ideas I’m not super familiar with, to work out my own thinking before/as I dive in to understand what others have written.


I’m going to start this with Shard Theory, which was mentioned to me by another attendee at EAGxVirtual last week. The description I got was something like, “it’s the idea of agents who attempt to satisfy different functions depending on their environment,” along with a caveat that they weren’t actually super familiar with shard theory themselves.


I then built a model in my head of a type of agent which would be well described by shard theory.


A “sharded agent” in my model consists of:

  • An ontology of types of situations

  • Perceptions which attempt to place the agent’s current situation in the ontology

  • A map from the ontology of situations to goals


So a “sharded agent” might consist of a group of agents with utilitarian goals, each of which sometimes gets a “turn” at being in control.

This could be decided mechanistically (e.g. agent 1 gets an hour, then agent 2 gets an hour, etc.), holistically (agent 1 gets a turn when it’s a nice indoor temperature and there are no emergencies, agent 2 gets a turn when perceptions are at the limits of expected distributions, etc.), politically (each agent votes for a representative agent depending on the circumstances and the winner gets control), or in any number of other ways.
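
To make the model concrete, here is a minimal sketch of a sharded agent in Python. Every name here (Shard, ShardedAgent, the example situations and actions) is a hypothetical illustration of the three-part structure above, not anything from the shard theory literature; the selection rule stands in for the “holistic” scheme, and the mechanistic or voting schemes would just swap it out.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical illustration of the three-part model above.
Observation = Dict[str, float]   # raw perceptions, e.g. {"indoor_temp": 21.0, "alarm": 0.0}
Situation = str                  # one type of situation in the agent's ontology
Action = str

@dataclass
class Shard:
    """A goal-directed sub-agent that is only active in the situations it claims."""
    name: str
    situations: List[Situation]              # where this shard applies
    policy: Callable[[Observation], Action]  # what it does when it has control

@dataclass
class ShardedAgent:
    ontology: List[Situation]                     # ontology of types of situations
    classify: Callable[[Observation], Situation]  # perception -> place in the ontology
    shards: List[Shard]                           # the map from situations to goals

    def act(self, obs: Observation) -> Action:
        situation = self.classify(obs)
        # "Holistic" control scheme: the first shard that claims the current
        # situation gets the turn. A mechanistic or voting scheme would replace
        # this selection rule without changing the rest of the structure.
        for shard in self.shards:
            if situation in shard.situations:
                return shard.policy(obs)
        return "no-op"  # no shard claims this situation

agent = ShardedAgent(
    ontology=["calm_indoors", "emergency"],
    classify=lambda obs: "emergency" if obs.get("alarm", 0) > 0 else "calm_indoors",
    shards=[
        Shard("comfort", ["calm_indoors"], lambda obs: "read a book"),
        Shard("safety", ["emergency"], lambda obs: "evacuate"),
    ],
)
print(agent.act({"alarm": 1.0}))  # -> "evacuate"
```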


There are two clear places I might want to use such a model:


  1. As a descriptive model of human minds

  2. As a normative framework for building friendly AI


I think I gathered that the idea came from purpose (2), as an attempt to solve the problem of overoptimization. Since a shard-based agent could be utilitarian within a certain scope, it should retain its ability to be useful and competitive in that domain. However, its motivations could be overwritten in other circumstances, reducing its ability to seek power, for example.


While this seems plausibly helpful to me, it doesn’t seem like an obvious solution. Adversarial agents in the mix could, for example, attempt to shift the overall sharded agent’s perception of its situation in order to hold power a larger fraction of the time.


As a descriptive model of humans, however, it seems like a strong intuition pump.


I think that modeling a human brain as a set of goal-oriented utilitarian agents, each of which emerges in a certain set of circumstances, is a pretty good model. It reminds me of internal family systems, and clicks with a number of other ideas about the human mind that I’ve considered strong candidates.


If that’s the case, it has strong implications for what human values actually are.


In particular, it suggests that within what a human considers a narrow domain, that human could have approximately coherent utilitarian goals and values.


If that is the case, then constructing something like CEV (coherent extrapolated volition) reduces to a more tangible set of subproblems:


  1. Extract the ontology of contexts that humans consider “distinct”

    1. Test for compatibility of the boundaries between contexts across humans

  2. Extract the value functions humans have in these domains

    1. Test for compatibility of these value functions across humans

    2. Test for compatibility of these value functions across contexts, e.g. can they be combined into a single utilitarian goal system?

  3. Extend the ontology of contexts to cover all expected future situations


Most of these subproblems seem much more tractable to me than CEV as a whole. For example, (1) is implicit in a lot of survey data and text about how people describe their days and experiences. I am cautiously optimistic about (1a) because of our shared mental architecture: if the brain consists of hard-coded instincts and is otherwise unconstrained, then those hard-coded instincts could be the root of some kind of compatibility. Accomplishing (2) is just inverse reinforcement learning (a toy sketch follows below), and (2a) is approachable with survey data and big-data munging. I am cautiously optimistic about (2b) because I expect that value functions can be quite complex while still being coherent.
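
As a gesture at what (2) could look like mechanically, here is a toy sketch of per-context value extraction. It is not a full inverse reinforcement learning pipeline; it just fits a linear reward per context so that observed behavior out-scores alternative behavior (a Bradley-Terry style preference loss). The contexts, features, and data are all made up for illustration.

```python
import numpy as np

def fit_context_reward(observed_feats, alt_feats, steps=2000, lr=0.1):
    """Fit weights w for one context so that observed trajectories out-score
    alternative trajectories: maximize sum log sigmoid(w . (observed - alt)).
    Both inputs have shape (num_pairs, num_features)."""
    w = np.zeros(observed_feats.shape[1])
    diffs = observed_feats - alt_feats
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(diffs @ w)))               # P(observed preferred)
        w += lr * ((1.0 - p)[:, None] * diffs).mean(axis=0)  # gradient ascent step
    return w

# One (hypothetical) reward model per context in the extracted ontology.
rng = np.random.default_rng(0)
contexts = ["at_work", "with_family"]
rewards = {
    c: fit_context_reward(rng.normal(1.0, 1.0, size=(50, 4)),  # fake observed-behavior features
                          rng.normal(0.0, 1.0, size=(50, 4)))  # fake alternatives
    for c in contexts
}
print({c: np.round(w, 2) for c, w in rewards.items()})
```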


However, tractable does not mean easy: none of this makes it simple to seed an AI with human values. I am cautiously optimistic about (1a) and (2b), but I could also imagine impossibility theorems in either place that would mean encoding human values is necessarily incoherent!


And it seems likely to me that there is no unique extension in (3); every answer to population ethics could very well be compatible with human universals. There could be uncountably many incompatible extensions of human value into the future, with no way of dividing the universe fairly between them.


Returning to the normative framework for AI: if directly encoding “human values” turns out to be incoherent, then a framework for AI that is not purely a maximizer becomes important.


If there are game-theoretic results showing that agents sharing power can be a stable equilibrium (e.g. lower bounds on the “size of shock” needed for one agent in the group to attempt a coup), this could be a valuable strategy.


If there are 4 agents, each embodying 3 out of 4 values (each missing a different one), then in the alliance each is able to satisfice its 3 values at 1/4 priority each (each value is embodied by 3 of the 4 agents, and each of those gives it 1/3 of its 1/4 share of control), for expected value 3/4. Whereas if an agent defects and loses, the remaining three agents split control, and each of the defector’s 3 values is embodied by only 2 of them, leaving each value at 2/9 priority, for expected value 2/3. If it defects and wins, it gets expected value 1. Then the win probability p at which defection is equivalent to cooperation is:

p*1 + (1-p)*(2/3) = 3/4

2/3 + (1/3)*p = 3/4

p = 3*(3/4 - 2/3) = 9/4 - 2 = 1/4

This break-even point seems to stay consistent across levels of correlated values (it equals the neutral odds of a free-for-all), which seems precarious to me, so I’m not optimistic on the mathematical front that there will be a cooperation attractor, though this is obviously the simplest toy example and I could easily be wrong.
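
To check the arithmetic and see how the break-even point behaves as the group grows, here is a small script for the toy model above (n agents, n values, each agent missing a different value); the generalization beyond n = 4 is my own assumption about how to extend the example.

```python
from fractions import Fraction

def expected_values(n):
    """Toy model: n agents, n values, agent i embodies every value except value i.
    Whoever holds power splits control evenly, and each agent splits its share of
    control evenly across the n-1 values it embodies. Returns the (cooperate,
    defect-and-lose, defect-and-win) expected value for a single agent."""
    k = n - 1  # values each agent embodies
    # Cooperation: all n agents share control. Each value is embodied by k agents,
    # so it gets k * (1/n) * (1/k) = 1/n of total priority; the agent cares about k values.
    coop = k * Fraction(1, n)
    # Defect and lose: the other n-1 agents share control; each of the defector's
    # k values is embodied by k-1 of them.
    lose = k * (k - 1) * Fraction(1, n - 1) * Fraction(1, k)
    # Defect and win: the defector controls everything and serves only its own values.
    win = Fraction(1)
    return coop, lose, win

def break_even_p(n):
    """Win probability at which defection matches cooperation:
    p*win + (1-p)*lose = coop  =>  p = (coop - lose) / (win - lose)."""
    coop, lose, win = expected_values(n)
    return (coop - lose) / (win - lose)

for n in (4, 5, 10):
    print(n, expected_values(n), break_even_p(n))
# n=4 reproduces the numbers above (3/4, 2/3, 1) and a break-even p of 1/4;
# in general the break-even p comes out to 1/n, i.e. exactly the neutral odds
# of an n-way free-for-all.
```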

Now that I’ve written all that, looking at the wiki page for shard theory… it says approximately what I thought it said; the only new information for me was that it emerged from attempting to understand neural networks.

I don’t claim to have reinvented all of this from whole cloth; I’ve been on LessWrong recently enough to have probably seen a few of the posts in the wiki tag, although I have not up- or down-voted any of them. I am pretty sure I had not read the big frontpage post about it, though, which makes me think that reading background is not a strong use of my time.

Skimming for results using this framework gives this post about… the complete lack of results in the framework. Well, good game, guys.
