Stochasticity in Agentic Evaluations: Quantifying Inconsistency with Intraclass Correlation
arxiv.org·2h


Abstract: As large language models become components of larger agentic systems, evaluation reliability becomes critical: unreliable sub-agents introduce brittleness into downstream system behavior. Yet current evaluation practice, which reports a single accuracy number from a single run, obscures the variance underlying these results, making it impossible to distinguish genuine capability improvements from lucky sampling. We propose adopting the Intraclass Correlation Coefficient (ICC), a metric from measurement science, to characterize this variance. ICC decomposes observed variance into between-query variance (task difficulty) and within-query variance (agent inconsistency), highl…
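The variance decomposition described in the abstract can be illustrated with a minimal sketch. It assumes the one-way random-effects form, ICC(1,1), computed from a table of per-query scores over repeated runs; the abstract does not say which ICC variant the paper uses, and the function name `icc_oneway` and the input layout are hypothetical choices made for illustration.

```python
from statistics import mean


def icc_oneway(scores):
    """One-way random-effects ICC(1,1).

    scores: list of n rows (one per query), each holding k scores
    from k repeated runs of the same agent on that query.
    Assumption: this ICC variant is illustrative, not confirmed by the paper.
    """
    n = len(scores)
    k = len(scores[0]) if n else 0
    if n < 2 or k < 2 or any(len(row) != k for row in scores):
        raise ValueError("need >= 2 queries, each with the same k >= 2 runs")

    grand = mean(x for row in scores for x in row)
    row_means = [mean(row) for row in scores]

    # Between-query mean square: how much query means differ (task difficulty).
    msb = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    # Within-query mean square: run-to-run spread on the same query
    # (agent inconsistency).
    msw = sum(
        (x - m) ** 2 for row, m in zip(scores, row_means) for x in row
    ) / (n * (k - 1))

    denom = msb + (k - 1) * msw
    if denom == 0:
        return float("nan")  # all scores identical: ICC is undefined
    return (msb - msw) / denom


# Hypothetical pass/fail outcomes: 4 queries, 3 runs each.
consistent = [[1, 1, 1], [0, 0, 0], [1, 1, 1], [0, 0, 0]]
print(icc_oneway(consistent))
```

In this sketch an agent that returns the same outcome on every run of a query has zero within-query variance, so the estimate is 1.0. When all the spread comes from run-to-run noise and query means are identical, the estimate falls to zero or below, since this estimator can go negative.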
