Bluffbench is near saturation: LLMs can interpret counterintuitive plots
Models released in the last couple of months show a large jump in capability on our bluffbench eval, which measures agents' ability to faithfully describe plots showing surprising results.