Statistics for AI/ML, Part 4: pass@k and Unbiased Estimator

Understanding common metrics in LLM benchmarks - Every time an AI lab releases a new model, we see an evaluation metric called $\text{pass@}k$, where $k$ is a positive integer, as in $\text{pass@}1$. It might sound like passing a test on the $k$th attempt, but the metric is more subtle than that and plays a crucial role in how we build reliable AI applications in production. As an example, here’s OpenAI GPT-5’s performance on AIME. $\text{pass@}k$ does not mean the model passing a test in ...
