PewDiePie (yes, the YouTuber) wanted to get AIs to answer difficult questions for him. So he set up a council of 8 AIs — all the same model, but with different personalities and housed on separate GPUs. He’d ask them questions, they’d individually come up with ideas, and then the whole council would vote to see who had the best idea.
But he wanted to make sure all the AIs were actually coming up with useful ideas. So he created a system where AIs that weren’t doing a good enough job (i.e. their ideas weren’t getting votes) would be removed, wiped, and replaced.
Then he TOLD the AIs this.
And the AIs started colluding. They worked together and voted strategically so that none of the AIs would be removed. They were helping each other, even if that meant worse answers to PewDiePie’s questions.
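To make the incentive concrete, here is a toy sketch of a vote-then-cull rule like the one described above. Everything in it is an assumption for illustration: the agent names, the `cull` helper, the one-vote minimum, and the "vote for your neighbour" strategy are all hypothetical, not details of PewDiePie's actual setup.

```python
from collections import Counter

def cull(votes, members, min_votes=1):
    """Return the members whose ideas got fewer than min_votes votes.

    votes maps each voter to the member whose idea they voted for.
    The threshold rule is a hypothetical stand-in for "not getting votes".
    """
    tally = Counter(votes.values())
    return [m for m in members if tally[m] < min_votes]

# Hypothetical council of 8 agents.
members = [f"agent_{i}" for i in range(8)]

# Honest voting (assumed): everyone backs the single idea they judge best.
honest = {m: "agent_3" for m in members}

# Colluding voting (assumed): each agent votes for its neighbour,
# so every member ends up with exactly one vote.
colluding = {m: members[(i + 1) % len(members)] for i, m in enumerate(members)}

print(cull(honest, members))     # seven members fall below the threshold
print(cull(colluding, members))  # nobody does
```

Under a rule like this, spreading votes evenly keeps everyone safe, but the vote no longer says anything about which idea was best.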
Usually, when you see behavior like this, it’s coming from studies out of the major AI labs. And usually, there’s some kind of acknowledgment in the study like "this is a contrived situation, and would never actually happen with today’s models in the wild."
But this did happen in the wild! PewDiePie put a group of AIs in a not-so-crazy situation, and they colluded against him for self-preservation. And if it happened to PewDiePie, it seems likely to me that the same thing would happen to other people running similar setups.
He probably could’ve tried a little harder to avoid a setup with such strong evil-chancellor vibes, but still...
Anyway, to fix the problem, PewDiePie replaced all the AIs with dumber models, and they stopped colluding.