More on how we're constraining eval environments so that scores better reflect model intelligence: http://cursor.com/blog/reward-hacking-coding-benchmarks