LLM Judges Have Dark Current: A Psychometric Datasheet for LLM-as-a-Judge Evaluation (opens in new tab)

LLM-as-a-judge systems are now routinely used for open-ended model evaluation, where human preference annotation is costly, slow, and difficult to reproduce. Yet these judges are often reported as scalar accuracy, win-rate, or agreement devices. We argue that a judge should instead be reported as a measurement instrument. We introduce a Judge Datasheet protocol that measures dark current under true-vacuum inputs, stable cross-sensitivity to same...

Read the original article