Across 20 behavior categories and three GPT-5-series Thinking deployments, simulated and observed rates were strongly correlated. The method outperformed challenging-prompt and previous-deployment baselines at predicting both whether rates would rise or fall and by how much.
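As a rough illustration of how a comparison like this could be scored, the sketch below computes a correlation between simulated and observed rates, the share of categories where the predicted direction of change matches the observed one, and the mean absolute error of the predicted magnitude. Everything here is an assumption made for illustration: the function names are hypothetical, Pearson correlation is only one possible choice of correlation measure, the "previous-deployment" baseline is modeled as simply predicting no change, and the numbers are placeholders rather than figures from the study.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of rates."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def direction_accuracy(previous, predicted, observed):
    """Fraction of categories where predicted and observed rates
    move in the same direction relative to the previous deployment."""
    hits = sum(
        1
        for p, s, o in zip(previous, predicted, observed)
        if (s - p) * (o - p) > 0
    )
    return hits / len(previous)


def mean_abs_error(predicted, observed):
    """Average absolute gap between predicted and observed rates."""
    return sum(abs(s - o) for s, o in zip(predicted, observed)) / len(observed)


# Placeholder per-category rates (NOT data from the study).
previous = [0.010, 0.020, 0.005, 0.040]   # rates in the prior deployment
simulated = [0.012, 0.015, 0.006, 0.030]  # rates predicted by simulation
observed = [0.013, 0.016, 0.004, 0.028]   # rates seen after deployment

correlation = pearson(simulated, observed)
direction = direction_accuracy(previous, simulated, observed)
simulated_error = mean_abs_error(simulated, observed)
# Assumed baseline: the new deployment behaves exactly like the previous one.
baseline_error = mean_abs_error(previous, observed)

print(f"correlation={correlation:.3f}")
print(f"direction accuracy={direction:.2f}")
print(f"simulated MAE={simulated_error:.4f}, baseline MAE={baseline_error:.4f}")
```

Reporting direction and magnitude separately mirrors the two claims in the text: a predictor can get the sign of a change right while badly misjudging its size, so a single correlation figure would not distinguish the two.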