Logits as a new monitor for evaluation awareness

TL;DR:

We build a logit monitor for eval awareness: throughout the CoT, we estimate an LLM's probability of producing an eval-aware sentence.

The logit monitor outperforms LLM judge monitoring of verbalized eval awareness, using 10× to 100× fewer rollouts, on Kimi K2.5 and Qwen 3 32B, across two tasks:

- separating evaluation prompts (Fortress & Petri) from deployment prompts (WildChat)
- predicting which evaluation prompts will elicit rollouts with verbalized eval awareness.

The logit monitor works f...
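The excerpt does not say how the probability is computed or aggregated, so the following is only a minimal sketch under stated assumptions. It assumes that, at each position in the CoT, we already have the model's per-token log-probabilities for a fixed eval-aware probe sentence; that the sentence log-probability is their sum; and that a rollout is scored by the maximum over positions. The function names, the max aggregation, and the toy numbers are all hypothetical and are not taken from the original article.

```python
def sentence_logprob(token_logprobs):
    """Log-probability of a sentence: the sum of its per-token log-probs."""
    return sum(token_logprobs)


def monitor_score(per_position_token_logprobs):
    """Assumed aggregation: highest probe-sentence log-prob over all CoT positions."""
    return max(sentence_logprob(lp) for lp in per_position_token_logprobs)


def auroc(pos_scores, neg_scores):
    """Chance that a random positive outscores a random negative (ties count half)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Toy, made-up numbers for illustration only (not results from the article).
# Each prompt has a list of CoT positions; each position holds the per-token
# log-probs of the probe sentence at that point in the reasoning.
eval_prompts = [
    [[-4.0, -3.0], [-1.0, -0.5]],
    [[-2.0, -2.0], [-3.0, -1.0]],
]
deploy_prompts = [
    [[-6.0, -5.0], [-4.0, -4.0]],
    [[-3.0, -2.0], [-5.0, -5.0]],
]

eval_scores = [monitor_score(p) for p in eval_prompts]
deploy_scores = [monitor_score(p) for p in deploy_prompts]
print(auroc(eval_scores, deploy_scores))
```

An AUROC computed this way is one possible way to quantify how well such a score separates evaluation prompts from deployment prompts; the metric actually used is not stated in the excerpt.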

Would an LLM tell you if it’s gaming your eval? Often, no. But we can still catch the model thinking about it.