Meta-agentic Prisoner's Dilemmas

Published on November 5, 2025 4:44 PM GMT

In the classic Prisoner's Dilemma (https://www.lesswrong.com/w/prisoner-s-dilemma), there are two agents with the same beliefs and decision theory, but with different values. To get the best available outcome, they have to help each other out (even if they don't intrinsically care about the other's values); and they have to do so even though, if the one does not help the other, there's no way for the other to respond with a punishment afterward.

Crosspost from my blog.

A classic line of reasoning, from the perspective of one of the prisoners, goes something like this: Each of us, my collaborator and I, cares only about himself. So it seems logical that we will defect against each other. However, there's some kind of symmetry at play here. If you abstract away the details of which specific prisoner I am, really I'm in exactly the same situation as my collaborator. So it's almost as though our decisions are logically bound to each other: Either my reasoning leads to me defecting, and therefore his reasoning also leads to him defecting, or else likewise I cooperate and he cooperates. We will make "the same choice" as each other, i.e. the symmetric / conjugate choice.
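As a toy illustration of the two lines of reasoning, here is a minimal Python sketch. The prison sentences in YEARS are made-up numbers chosen only to have the usual Prisoner's Dilemma ordering, and the function names are hypothetical. The first function treats the other prisoner's move as fixed; the second assumes both prisoners run the same procedure, so only the two symmetric outcomes are reachable.

```python
# Hypothetical payoffs: years in prison for (me, other). Lower is better.
# "C" = cooperate (stay silent), "D" = defect.
YEARS = {
    ("C", "C"): (1, 1),
    ("C", "D"): (3, 0),
    ("D", "C"): (0, 3),
    ("D", "D"): (2, 2),
}


def best_reply(other_move):
    """Treat the other prisoner's move as fixed and minimize my own sentence."""
    return min("CD", key=lambda mine: YEARS[(mine, other_move)][0])


def symmetric_choice():
    """Assume both prisoners run this same procedure, so we end up
    at (C, C) or (D, D); pick whichever is better for me."""
    return min("CD", key=lambda move: YEARS[(move, move)][0])


print(best_reply("C"), best_reply("D"))  # defecting is the best reply either way
print(symmetric_choice())                # but the symmetric comparison favors C
```

With these numbers, defecting is the best reply to either fixed move, while the symmetric comparison picks cooperation. The sketch simply assumes the symmetry; it says nothing about when that assumption is justified.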

In this line of reasoning, what is shared between the two agents, and what is different? What's different is that I am here in this room, so I can respond Defect or Cooperate on behalf of this agent, and I care about how long this agent gets thrown in prison; and you likewise, mutatis mutandis. What we share is our beliefs about the situation, our knowledge about how each other thinks, and our general procedure for making decisions. We both seem to act against our values (unshared), but it makes sense from a shared perspective that's behind a veil of ignorance about values.

In the Epistemic Prisoner's Dilemma (https://www.lesswrong.com/w/epistemic-prisoners-dilemma), the two agents have the same values (as well as the same decision theory). But they have different beliefs. According to my belief that the village is sick with malaria, we should treat them with malaria medication; you think it's bird flu, so you want to treat for bird flu. There's no time to explain our evidence to each other and argue and update. To cooperate and get the best outcome according to both of our beliefs, I should give up 5,000 malaria treatments (which I think would help) to gain 10,000 bird flu treatments (which I think would not help), and you should give up 5,000 bird flu treatments to gain 10,000 malaria treatments. We both seem to act against our beliefs, but it makes sense from a shared perspective that's behind a veil of ignorance about which beliefs we have (or which evidence we saw).
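The arithmetic of that trade can be spelled out in a short Python sketch. The 5,000 and 10,000 figures come from the example above; the function name, the zero "no trade" baseline, and the labels "I" (the malaria believer) and "you" (the bird flu believer) are assumptions made for illustration.

```python
def net_treatments(i_cooperate, you_cooperate):
    """Net change in (malaria, bird_flu) treatments relative to no trade.

    "I" believe the disease is malaria; "you" believe it is bird flu.
    Cooperating means giving up 5,000 of the treatment you believe in
    so that 10,000 of the other treatment become available.
    """
    malaria, bird_flu = 0, 0
    if i_cooperate:
        malaria -= 5_000
        bird_flu += 10_000
    if you_cooperate:
        bird_flu -= 5_000
        malaria += 10_000
    return malaria, bird_flu


for me in (True, False):
    for you in (True, False):
        print(me, you, net_treatments(me, you))
```

Judged by my belief (only the malaria count matters), the four outcomes rank as 10,000 (I defect, you cooperate) > 5,000 (both cooperate) > 0 (both defect) > -5,000 (I cooperate, you defect), and the same ordering holds for you on the bird flu count. That is the Prisoner's Dilemma structure, with the conflict located in beliefs rather than values.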

From the perspective of Functional Decision Theory (https://www.lesswrong.com/w/functional-decision-theory), the reasoning can go something like this: "I don't know exactly what all the subjunctive-dependent-consequences of my decisions are. However, clearly if there is another agent who is identical to me, then if I make a decision, I make that decision also for that other agent (indeed, I couldn't even tell which one I am). Further, if there is another agent who is identical to me except in some way that is completely separated from the general procedure I use to make decisions, then clearly if I make a decision, I make that decision for that other agent but conjugated through the differences. That is, I also make the decisions for that other agent in situations that differ only in ways that my general decision procedure is (/ought to be) invariant to."

In the above cases, there is a clear way to "cordon off" differences between two agents. Prima facie (though maybe not actually), caring about this person or that person should not affect the general decision procedure you use. Likewise, having control over this or that agent's outputs shouldn't change your general procedure, and having seen this or that evidence shouldn't change it. When you can do this "cordoning off", the FDT reasoning (which is only a "spherical cow fragment" of what FDT would want to do) above should go through.

However, there's a whole spectrum of how possible it is to "cordon off" some difference between two agents, separated off from their shared identical decision procedure. On the "yes, 100% doable" end, you have two agents that start off identical, but then each sees a different piece of information about the world. The information only says something like "this disease is probably malaria", and doesn't have other implications for the agents. They have the same prior beliefs, ways of updating on evidence, values, and ways of going from values to actions. So they can easily ask "What would I have done / how would I have thought, if I had seen what the other agent had seen?".

On the other end of the spectrum... I don't know exactly, I don't understand what this looks like. But the vague idea is: Imagine two agents who have eternal, core differences in their decision-procedures, which have many practical applications, and where it doesn't make sense to ask "what if the other agent is just like me, but with some tweak", because the difference between you and the other agent is so big that "If I were like that, I just would not be me anymore."

What this does or could look like, I don't know. (See "Are there cognitive realms?".) An example might be CDT / Son-of-CDT. I think you can quasi-timelessly coordinate with Son-of-CDT—as long as you only need the agent to take actions that come after the moment when CDT self-modified into Son-of-CDT, and it only needs you to take actions that come after you've read the modified source code; or something like that. But you cannot timelessly coordinate with that agent before the magic self-modification moment. And probably there's some successful and reflectively stable Son-of-CDT like this, so we can't just say "well sure, other people can be irrational".

There are intermediate points on this spectrum which contain rich and confusing decision-theory questions. There are cases where the difference between the two agents is not so extreme that coordination is off the table (e.g. between an FDTer and Son-of-CDT, regarding actions before the magic moment), but the difference is not easy to "cordon off" from the decision theory. There are "metaepistemic prisoner's dilemmas", where the two agents have some different way of processing evidence or arguments. (It's not obvious that there is a canonical way for bounded agents to do this, especially about questions that are intrinsically too computationally complex to answer, e.g. many self-referential questions.) There are "meta-decision-theoretic PDs", where two agents have stumbled into or reasoned themselves into different decision procedures.

For example, how do you coordinate with a frequentist? Do you know how to answer the question "If I decide this way, what does that imply for how I would have decided if I were me-but-a-frequentist?"?

For example, how do you cooperate with someone who has a different idea of fairness from you?

For example, how do you coordinate with someone who is not updateless? (Imagine they are somehow reflectively stable.) (Are you updateless? Which things are you updateless about? Do you sometimes change your decision theory itself, based on evidence / arguments / thinking? How can you (and others) coordinate with yourself across that change?)

These questions are not necessarily meta-flavored. For example, in the Logical Counterfactual Mugging, Omega flips a logical coin by asking "What is the Graham's-numberth digit of π?". That seems easy enough to cordon off. I feel like I know how to answer the question "Even though I know that the answer is 6, how would I think and act if the answer were 5?". Since that logical question is not a consequence of, and doesn't have consequences for, my decision procedure, I know what that hypothetical means. But, what if it's the third digit of π? Do I know how to imagine that world? How would I think in that world? If the third digit were 9 rather than 1, that would totally change a bunch of computations, wouldn't it? And therefore it would change many of my decisions? (There might be a clean way to analyze this, but I don't know it.) (Since π should be pretty objective, i.e. all agents point to the same π, this question might actually boil down to questions like "How likely is it that I'm being simulated and tricked to think that the third digit of π is 1?".)

These are theoretically interesting questions, because they poke at the edge of decision-theory concepts. What's the boundary / boundaries between decision theory and other aspects of an agent like beliefs, belief-formation, and values? What is core to my decision theory, and what isn't? What are some sets of minimal conditions sufficient for me to be able to coordinate with another agent? There could be a different "core of my decision-theory" or minimal conditions for cooperation depending on context.

They're also practical questions. Human disagreements, conflicts, threats, bargaining, and coordination involve meta-agentic PDs. I'm frequentist, you're Bayesian; I think in terms of duty to God, you think in terms of strategy-consequentialism; I rely on ethics, you rely on intuition and emotion; I reprogram myself in order to deeply pretend I believe X by actually believing X, you make truth-seeking triumph over everything else in yourself; I'm trying to find symmetries between us so that we can coordinate, you are trying to get the drop on me; and so on.

The setting of open-source game theory (https://arxiv.org/abs/1401.5577, https://arxiv.org/abs/2208.07006) can describe many of these situations, because it's very flexible—agents are programs. Löbian cooperation is a kind of coordination that works in a rich set of cases.
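To make "agents are programs" concrete, here is a minimal Python sketch of the simplest kind of open-source agent: one that cooperates exactly when the opponent's source text is identical to its own. This is the crude source-equality strategy, not Löbian cooperation (which involves searching for proofs about the opponent). All names here (CLIQUE_BOT, DEFECT_BOT, run, play) are hypothetical and not taken from the linked papers.

```python
# Each agent is a string of Python source defining act(my_source, their_source).
CLIQUE_BOT = '''
def act(my_source, their_source):
    # Cooperate only with an exact copy of myself.
    return "C" if their_source == my_source else "D"
'''

DEFECT_BOT = '''
def act(my_source, their_source):
    return "D"
'''


def run(program, opponent):
    """Execute `program` and ask it for a move, showing it both source texts."""
    namespace = {}
    exec(program, namespace)
    return namespace["act"](program, opponent)


def play(a, b):
    """Return the pair of moves when program a meets program b."""
    return run(a, b), run(b, a)


print(play(CLIQUE_BOT, CLIQUE_BOT))  # exact copies cooperate
print(play(CLIQUE_BOT, DEFECT_BOT))  # no cooperation with a defector

# A behaviorally identical variant that differs by one comment line.
PADDED = CLIQUE_BOT + "# harmless comment\n"
print(play(CLIQUE_BOT, PADDED))      # treated as a stranger: both defect
```

The last case shows why source equality is too brittle for the situations discussed here: any difference at all, even one that plainly could be cordoned off, destroys cooperation.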

But for the most part these problems are unsolved.

Thanks to Mikhail Samin for a related conversation.
