Your benchmarks are lying to you, and your judge is to blame!

Discussed on DEV

Last week I published a benchmark comparing six models across eleven agent skills. The numbers in that post are averages, and we did not explain why. When I shared the data internally, Maria from our AI Research team pointed out something that we should take very seriously: an LLM judge is likely to favour outputs from its own model family. So I ran the full benchmark again with a second judge, then a third, to see if this hypothesis held any water. The scores shifted, the rankings moved, and...
