Published on December 23, 2025 8:10 PM GMT
This is mostly my attempt to understand Scott Garrabrant’s Geometric Rationality series. He hasn’t written much about it here recently (maybe he’s off starting video game companies to save the world). I find the ideas very intriguing, but I also found his explanation from 3 years ago hard to follow. I try to simplify and explain things in my own words here, hoping that you all can correct me where I’m wrong.
The approach I take is to build up to a geometrically rational agent from small pieces, introducing some of the necessary theory as I go. This is a much more mechanical presentation than Garrabrant's philosophical series, but I find this kind of thing easier to follow. After building up to the idea of a geometrically rational agent, I'll follow Garrabrant in showing how this can be used to derive Bayes' rule, Thompson sampling, and Kelly betting.
In the process of unpacking what I think Garrabrant is pointing at with Geometric Rationality, I found a lot of open questions. I took a stab at answering some of them, but be warned that there are a number of unanswered questions here. If you know the answers, please fill me in.
Basic Agent Requirements
Let's start by talking about what an agent is. This is a bit harder than it used to be, because everyone is all excited about LLM agents, which I think obscures some of the key features of an agent.
An agent is something with a goal. It observes the world, and then takes actions in order to manifest that goal in the world. The goal is often described in terms of maximizing some reward signal, though it doesn't need to be.
You are an agent. A thermostat is an agent. An LLM-agent might (or might not) be an agent, depending on how it was coded.
We can describe an agent's process in a few steps:
- Observe the world
- Orient your observations based on your existing knowledge
- Decide what action to take
- Act upon the world
An agent does these things repeatedly, in an OODA loop. There are other models for this kind of loop (Sense-Plan-Act, etc.), but I like this one.
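To make the loop concrete, here's a minimal Python sketch of an agent running OODA cycles. The `env` and `agent` objects and their methods are hypothetical stand-ins for illustration, not anything defined in Garrabrant's series.

```python
# A minimal sketch of the OODA loop. The environment and agent interfaces
# (observe/orient/decide/act) are hypothetical placeholders for illustration.

def run_agent(env, agent, steps=100):
    """Run the agent through repeated observe-orient-decide-act cycles."""
    for _ in range(steps):
        observation = env.observe()          # Observe the world
        belief = agent.orient(observation)   # Orient: update existing knowledge
        action = agent.decide(belief)        # Decide what action to take
        env.act(action)                      # Act upon the world
```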
There's a lot to be said about Observing the world. Sensor design and selection is a very interesting field. Similarly, Acting can be quite involved and technical (just look up recent research on grasping). We won't focus on Observing and Acting, as agent design per se often takes those for granted. Instead, we'll focus on Orienting and Deciding.
Expected Utility Maximizer
One of the most common ways to model the Orient and Decide stages of an agent is with an Expected Utility Maximizer. This is how it works:
1. the agent receives an observation about the world
2. it updates its hypothesis for how the world actually is
3. for all of its possible actions, it considers the likely outcome if it takes that action
4. whichever action gives the best outcome, that's the one it selects
Steps 1 and 2 there are the orient phase, and steps 3 and 4 are the decide phase. Generally two things are required in order to be able to do those:
- A world model: This is what the agent updates based on observations, and then uses in order to figure out likely outcomes.
- A utility function: This is what the agent uses to score outcomes for each action.
If you wanted to represent this in equations, you would write it as follows:

$$a^* = \operatorname*{argmax}_{a \in \Delta(A)} \; \mathbb{E}_{w \sim P(\cdot \mid o, a)}\big[U(w)\big]$$

Here's what these symbols mean:
- $a^*$ - the chosen action
- $\operatorname{argmax}_{a \in \Delta(A)}$ - pick the act (out of the probability simplex of all possible Actions) that maximizes what follows
- $\mathbb{E}$ - the expected value (arithmetic expectation)
- $U(w)$ - the utility function, which predicts a reward based on a world state
- $P(\cdot \mid o, a)$ - a function that outputs a probability distribution describing the predicted world state, given the recent observation and the action
- $w$ - a possible world state, sampled from $P(\cdot \mid o, a)$
- $P(w \mid o, a)$ - the probability of ending in world $w$ if the agent starts with observation $o$ and takes action $a$
There are various ways to simplify this (like smooshing the world-model and utility function into one combined thing), but we'll leave it as is. Any agent that orients and decides like this, we'll call an EUM agent.
This model works quite well, and is used all over the place in reinforcement learning, gambling, robotics, etc.
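Here is a minimal Python sketch of the decide step, under a couple of simplifying assumptions: the world model returns an explicit dictionary mapping world states to probabilities, and the search is over pure actions only (the next section argues that pure actions are enough for an EUM agent). All names and numbers below are made up for illustration.

```python
# A sketch of the Decide step of an EUM agent over a finite set of world
# states and pure actions. `world_model` and `utility` are hypothetical
# stand-ins for P(. | o, a) and U(w) above.

def eum_decide(actions, world_model, utility, observation):
    """Pick the action with the highest arithmetic expected utility."""
    def expected_utility(action):
        # world_model returns {world_state: probability} given (observation, action)
        distribution = world_model(observation, action)
        return sum(p * utility(w) for w, p in distribution.items())

    return max(actions, key=expected_utility)


# Toy usage: two actions, two possible world states.
world_model = lambda o, a: {"sunny": 0.8, "rainy": 0.2} if a == "go_out" else {"sunny": 0.5, "rainy": 0.5}
utility = lambda w: 10 if w == "sunny" else -5
best = eum_decide(["go_out", "stay_in"], world_model, utility, observation=None)
print(best)  # "go_out": expected utility 7.0 beats "stay_in" at 2.5
```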
Action Space
Let's pause here to say a bit more about what actions the EUM agent is going to look at. The definition we gave above had $\operatorname{argmax}_{a \in \Delta(A)}$. That means we select the action distribution that provides the highest value for everything that follows. The actions that we consider are all drawn from $\Delta(A)$, which is the probability simplex of $A$. In other words, it's the set of all probability distributions over the pure actions in $A$.
As an example, if the only actions available to the agent were turn-left and turn-right, then the action space would be $\{(p_{\text{left}}, p_{\text{right}}) : p_{\text{left}} + p_{\text{right}} = 1,\ p_{\text{left}}, p_{\text{right}} \ge 0\}$, where $p_{\text{left}}$ is the probability of selecting action turn-left, and $p_{\text{right}}$ is the probability of selecting action turn-right.
A pure strategy is one in which the probability distribution chosen is 1 for a single action and 0 for all other actions. There are two pure strategies for the turn-left/turn-right options.
EUM agents are often described as selecting actions over just the set of pure actions $A$, and not the probability simplex $\Delta(A)$. This is because EUM agents (when not playing with/against other EUM agents) will always either select a pure strategy or consider some pure strategy as good as the best mixed strategy. For an EUM agent, pure strategies are always good enough.
To see this, note that the expected value of a mixed strategy ends up being a weighted sum of the expected values of the pure strategies in the mixture.
There exists a pure strategy in the mixture that either:
- has a higher EV than every other pure strategy in the mixture, or
- is tied for the highest EV with another pure strategy
If we weight that high EV pure strategy with a higher probability in the mixture, the EV of the mixture will:
- go up, or
- stay the same (if there was a tie)
This means we can maximize an EUM agent's expected score by squeezing all of its action probability into the highest scoring pure action.
I added the probability simplex to the EUM definition because we'll use it a lot later, and it will be convenient to define all the agents as optimizing over the same action space.
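Here's a toy numeric check of that argument for the turn-left/turn-right example. The EV numbers are made up; the point is just that the mixture's EV is a weighted average of the pure-strategy EVs, so it's maximized by putting all of the probability on the better pure action.

```python
# Toy check: the EV of a mixed strategy over {turn-left, turn-right} is a
# probability-weighted sum of the pure-strategy EVs, so probability mass
# should be squeezed onto the higher-EV pure action.
# The EV numbers below are made up for illustration.

ev_left, ev_right = 3.0, 5.0   # expected utilities of the two pure strategies

def mixture_ev(p_left):
    """EV of the mixed strategy (p_left, 1 - p_left)."""
    return p_left * ev_left + (1 - p_left) * ev_right

print(mixture_ev(0.5))   # 4.0 -- an even mixture
print(mixture_ev(0.0))   # 5.0 -- all mass on the better pure action, turn-right
```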
Partnerships
Now imagine there are two agents. These agents each have their own utility function and their own world model. These agents would like to join forces so that they can accomplish more together.
A standard way for two EUM agents to work together is Nash bargaining.
In Nash bargaining, each agent follows the same process:
- identify the best outcome it could get if it worked alone
- Nash bargaining often calls this the "threat point", but I think about it as a BATNA (best alternative to a negotiated agreement). Note that a BATNA could be adversarial (as threat points are sometimes modeled), but it doesn't have to be.
- identify all the outcomes that it could get if it worked with its partner
- discard any action/outcome pairs that would leave it worse off than the BATNA
- find the difference in utility between the BATNA and each outcome. We'll call this difference the delta-utility
- Once the list of (action, delta-utility) tuples has been determined, each agent shares its list
- Any actions that aren't on both lists are discarded (they would be worse than one agent's BATNA, so that agent wouldn't help in that case)
- For each action in the final list, both agents' delta-utilities are multiplied
- The action that maximizes this product is selected
As an equation, this would be:

$$a^* = \operatorname*{argmax}_{a \in \Delta(A)} \; \big(U_1(a) - d_1\big)\big(U_2(a) - d_2\big)$$

where $U_i(a)$ is agent $i$'s expected utility for action $a$, $d_i$ is agent $i$'s BATNA utility, and the maximization is restricted to actions where both factors are non-negative.
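Here's a minimal Python sketch of that selection step, assuming both agents can assign a utility to each candidate joint action and the search is over a finite list of pure actions (rather than the full simplex). The action names, utilities, and BATNA values are made up for illustration.

```python
# A sketch of the Nash bargaining step above: compute each agent's
# delta-utility relative to its BATNA, discard actions that leave either
# agent worse off than its BATNA, and pick the action maximizing the
# product of delta-utilities. All names and numbers are hypothetical.

def nash_bargain(actions, utility_1, utility_2, batna_1, batna_2):
    """Pick the action maximizing the product of the agents' delta-utilities."""
    candidates = []
    for a in actions:
        delta_1 = utility_1[a] - batna_1
        delta_2 = utility_2[a] - batna_2
        # Discard actions that would leave either agent worse off than its BATNA.
        if delta_1 >= 0 and delta_2 >= 0:
            candidates.append((a, delta_1 * delta_2))
    if not candidates:
        return None  # no deal: both agents fall back to their BATNAs
    return max(candidates, key=lambda pair: pair[1])[0]


# Toy usage: three joint actions, BATNAs of 2 and 1.
u1 = {"build_together": 6, "trade": 4, "exploit_partner": 8}
u2 = {"build_together": 5, "trade": 4, "exploit_partner": 0}
print(nash_bargain(u1.keys(), u1, u2, batna_1=2, batna_2=1))
# "build_together": (6-2)*(5-1) = 16 beats "trade": (4-2)*(4-1) = 6;
# "exploit_partner" is discarded because it falls below agent 2's BATNA.
```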