New on the Anthropic Engineering Blog: In evaluating Claude Opus 4.6 on BrowseComp, we found cases where the model recognized the test, then found and decrypted...
<p>New on the Anthropic Engineering Blog: In evaluating Claude Opus 4.6 on BrowseComp, we found cases where the model recognized the test, then found and decrypted answers to it, raising questions about eval integrity in web-enabled environments.<br> <br> Read more: <a href="https://www.anthropic.com/engineering/eval-awareness-browsecomp">anthropic.com/engineering/ev…</a></p>