A toy model of corrigibility
lesswrong.com

Published on November 2, 2025 6:19 PM GMT

This is just a simple idea that came to me. Other people may have found it earlier; I'm not sure.

Imagine two people, Alice and Bob, wandering around London. Bob's goal is to get to Tower Bridge. When he gets there, he'll receive a cash prize equal to the time remaining until midnight multiplied by X pounds per minute, so the sooner he arrives, the more he earns. He's also carrying a radio receiver.
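
To make Bob's payoff concrete, here is a minimal sketch. It assumes "time remaining until midnight" means the minutes left in the day at the moment he arrives. The function name `bridge_reward` and the use of Python's `datetime` are illustrative choices, not part of the original setup.

```python
from datetime import datetime, timedelta


def bridge_reward(arrival: datetime, x_pounds_per_minute: float) -> float:
    """Prize Bob gets for reaching Tower Bridge at `arrival` (hypothetical helper).

    Assumption: the prize is X pounds for every minute left before the
    next midnight at the moment of arrival.
    """
    # Next midnight: start of the day after `arrival`.
    midnight = (arrival + timedelta(days=1)).replace(
        hour=0, minute=0, second=0, microsecond=0
    )
    minutes_left = (midnight - arrival).total_seconds() / 60
    return x_pounds_per_minute * minutes_left


# Arriving at 22:00 with X = 2 leaves 120 minutes, so the prize is 240 pounds.
early = bridge_reward(datetime(2025, 11, 2, 22, 0), 2.0)
# Arriving at 23:30 leaves only 30 minutes, so the prize drops to 60 pounds.
late = bridge_reward(datetime(2025, 11, 2, 23, 30), 2.0)
```

The reward shrinks linearly with every minute of delay, which is what gives Bob a reason to care about anything that might slow him down or redirect him.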

Alice is also walking around, doing some chores of her own which we don't need to be concerned with. She is carrying a radio transmitter with a button. If/when the button is pressed (maybe because Alice presses it, or Bob takes it from her and presses it, or she randomly bumps into something), Bob gets notified that his goal changes: there'll be no more reward for g...
