Welfare, Improvability, and Variance: A Principal-Agent Approach to Optimal Benchmark Item Aggregation

AI benchmarks have well-documented limitations, with prior work examining contamination, saturation, and construct underspecification. Aggregation has received far less attention: benchmarks are typically summarized by uniformly averaging item-level scores, implicitly treating every test item as equally valuable. We model benchmarking as a multitask principal-agent game and show that the welfare loss from a benchmark is determined jointly by thr...
