✍️ Prompt Engineering | opensource.posit.co Blog

Bluffbench is near saturation: LLMs can interpret counterintuitive plots

Covers Claude Fable 5 and Claude Mythos 5. Discussed on Hacker News.

Model releases from the last couple of months show a large jump in capability on our bluffbench eval, which measures agents' ability to faithfully describe plots that show surprising results.
