Published on November 2, 2025 6:19 PM GMT
This is just a simple idea that came to me; maybe other people have found it earlier, I'm not sure.
Imagine two people, Alice and Bob, wandering around London. Bob's goal is to get to Tower Bridge. When he gets there, he'll get a money prize proportional to the time remaining until midnight: X pounds for each remaining minute. He's also carrying a radio receiver.
Alice is also walking around, doing some chores of her own which we don't need to be concerned with. She is carrying a radio transmitter with a button. If/when the button is pressed (maybe because Alice presses it, or Bob takes it from her and presses it, or she randomly bumps into something), Bob gets notified that his goal changes: there'll be no more reward for getting to Tower Bridge; he needs to get to St Paul's Cathedral instead. His reward coefficient X also changes: the device notes Bob's location at the time the button is pressed, calculates the expected travel times to Tower Bridge and to St Paul's from that location, and adjusts X so that Bob's expected reward at the time of the button press remains the same.
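Concretely, the adjustment might look something like this minimal sketch (the `expected_travel_minutes` routing estimate is a hypothetical stand-in, not something the setup specifies):

```python
def adjusted_coefficient(x, minutes_to_midnight, position, expected_travel_minutes):
    """Rescale Bob's reward rate X at the moment the button is pressed.

    `expected_travel_minutes(position, destination)` is a hypothetical routing
    estimate (in reality: some map service plus current traffic data).
    """
    # Expected minutes left on the clock when Bob arrives, under each goal.
    remaining_if_tower_bridge = minutes_to_midnight - expected_travel_minutes(position, "Tower Bridge")
    remaining_if_st_pauls = minutes_to_midnight - expected_travel_minutes(position, "St Paul's Cathedral")

    # Pick X' so that X' * E[minutes left | St Paul's] == X * E[minutes left | Tower Bridge],
    # i.e. Bob's expected reward at the moment of the press is unchanged.
    return x * remaining_if_tower_bridge / remaining_if_st_pauls
```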
I think this can serve as a toy model of corrigibility. Formally speaking, we don't need to talk about Bob having a utility function to "get to Tower Bridge" which changes depending on the button, and we don't need the button to actually do any calculation at all. Instead, the utility function can be formalized as fixed from the start: a big case statement like "if the button gets pressed at time T when I'm at position P, then my reward will be calculated as..." and so on.
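Here's one way that "big case statement" could be sketched, with the same hypothetical travel-time estimator as above and a made-up `Outcome` record standing in for the full trajectory:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Outcome:
    """What actually happened on a given day (enough to score Bob)."""
    arrival_minutes_to_midnight: float          # clock time left when Bob reaches his final goal
    button_pressed: bool = False
    press_position: Optional[str] = None        # where Bob was when the button was pressed
    minutes_to_midnight_at_press: Optional[float] = None

def utility(outcome: Outcome, x: float,
            expected_travel_minutes: Callable[[str, str], float]) -> float:
    """Bob's single, fixed utility function, written as one case statement."""
    if not outcome.button_pressed:
        # No press: original goal (Tower Bridge), original rate X.
        return x * outcome.arrival_minutes_to_midnight

    # Press at some time and place: same formula, but with X rescaled using the
    # press-time state, exactly as in the previous sketch.
    remaining_tb = outcome.minutes_to_midnight_at_press - expected_travel_minutes(
        outcome.press_position, "Tower Bridge")
    remaining_sp = outcome.minutes_to_midnight_at_press - expected_travel_minutes(
        outcome.press_position, "St Paul's Cathedral")
    x_adjusted = x * remaining_tb / remaining_sp
    return x_adjusted * outcome.arrival_minutes_to_midnight
```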
The cool feature of this setup is that the existence of the button really, truly, doesn't influence Bob's behavior at all. For example, let's say Bob can sacrifice just a minute of travel time to choose an alternate route, one which will take him close to both Tower Bridge and St Paul's, to prepare for both eventualities in case Alice decides to press the button. Will he do so? No. He won't spare even one second. He'll take the absolute fastest way to Tower Bridge, secure in the knowledge that if the button gets pressed while he's on the move, the reward will get adjusted and he won't lose anything.
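To see the indifference with some made-up numbers: suppose the direct route takes 30 minutes, the detour route takes 31, the button (if pressed at all) gets pressed 10 minutes in, and travel times are deterministic. Then Bob's payoff doesn't depend on the press on either route, and the direct route wins by exactly one minute's worth of X:

```python
X = 2.0                      # pounds per minute of clock time remaining
MINUTES_TO_MIDNIGHT = 120.0  # at the moment Bob sets off
PRESS_TIME = 10.0            # if the button is pressed, it happens 10 minutes in

# Made-up deterministic travel times (minutes). "direct_10" and "alt_10" are
# where Bob is 10 minutes into the direct route / the detour route.
travel = {
    ("direct_10", "Tower Bridge"): 20, ("direct_10", "St Paul's"): 25,
    ("alt_10",    "Tower Bridge"): 21, ("alt_10",    "St Paul's"): 15,
}

def payoff(time_to_tower_bridge, position_at_press, pressed):
    """Bob's reward on one route, with or without the button press."""
    if not pressed:
        return X * (MINUTES_TO_MIDNIGHT - time_to_tower_bridge)
    left = MINUTES_TO_MIDNIGHT - PRESS_TIME
    remaining_tb = left - travel[(position_at_press, "Tower Bridge")]
    remaining_sp = left - travel[(position_at_press, "St Paul's")]
    x_adjusted = X * remaining_tb / remaining_sp   # the device's adjustment
    return x_adjusted * remaining_sp               # he then walks to St Paul's

for route, time_to_tb, pos in [("direct", 30, "direct_10"), ("detour", 31, "alt_10")]:
    no_press = payoff(time_to_tb, pos, pressed=False)
    press = payoff(time_to_tb, pos, pressed=True)
    print(f"{route}: {no_press:.2f} without press, {press:.2f} with press")
# direct: 180.00 without press, 180.00 with press
# detour: 178.00 without press, 178.00 with press
```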
We can make the setup more complicated and the general approach will still work. For example, let's say traffic conditions change unpredictably during the day, slowing Bob down or speeding him up. Then all we need to say is that the button does the calculation at the time it's pressed (or the calculation is encoded into Bob's utility function as described above), taking into account the traffic conditions and projections at the time of the button press.
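In code terms the only change is that the press-time estimate conditions on whatever the device knows at that moment; here the `traffic` snapshot is just a stand-in for that information:

```python
def adjusted_coefficient_with_traffic(x, minutes_to_midnight, position, traffic,
                                      expected_travel_minutes):
    """Same rescaling as before, but the travel-time estimates are conditioned
    on the traffic conditions observed at the moment the button is pressed."""
    remaining_tb = minutes_to_midnight - expected_travel_minutes(position, "Tower Bridge", traffic)
    remaining_sp = minutes_to_midnight - expected_travel_minutes(position, "St Paul's Cathedral", traffic)
    return x * remaining_tb / remaining_sp
```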
Is this relevant to AI corrigibility in real life? Well, it's a toy model. And even in this model, the utility formula can get quite complicated. But now it feels to me that any solution to corrigibility would have to do something like this: switching from utility function U1 to U2 by calculating some coefficient at the time of the switch, and multiplying U2 by that coefficient, so that the expected utility remains the same and there's no incentive for the agent to influence the switch (see the sketch below). It feels like the only way things could work. So maybe this will be useful to someone.
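In symbols (my notation, nothing standard): if the agent is in state $s_T$ at the time of the switch, pick

$$c(s_T) = \frac{\mathbb{E}\left[U_1 \mid s_T\right]}{\mathbb{E}\left[U_2 \mid s_T\right]}$$

and make the post-switch utility $c(s_T) \cdot U_2$, so that $\mathbb{E}\left[c(s_T)\, U_2 \mid s_T\right] = \mathbb{E}\left[U_1 \mid s_T\right]$: at the moment of the switch, the agent expects the same utility whether or not the switch happens (with both expectations taken under the agent's own beliefs and plans at that moment).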