Can Models be Evaluation Aware Without Explicit Verbalization?
lesswrong.com

Published on November 8, 2025 6:26 PM GMT

Authors: Gerson Kroiz*, Greg Kocher*, Tim Hua

Gerson and Greg are co-first authors, advised by Tim. This is a research project from Neel Nanda’s MATS 9.0 exploration stream.

Code: https://github.com/gkroiz/steering-eval-awareness-3-copy 

tl;dr We provide evidence that language models can be evaluation aware without explicitly verbalizing it, and that this unverbalized awareness is still steerable. We also explore various techniques for localizing when the model becomes evaluation aware.

Introduction

What Is Evaluation Awareness and Why Does It Matter?

