Would an LLM tell you if it’s gaming your eval? Often, no. But we can still catch the model thinking about it. New research: we measure how close a model comes to saying it’s being tested. This detects eval awareness with 10× to 100× fewer samples than monitoring model outputs. 🧵