Validating LLM-as-a-Judge Systems under Rating Indeterminacy

Figure 1: Our framework for validating LLM-as-a-judge systems under rating indeterminacy, where items in a subjective rating task can have multiple “correct” ratings. Our framework provides guidance on (i) how to structure rating tasks to capture rater disagreement, (ii) how to aggregate disagreement into labels, and (iii) how to measure agreement between humans and a judge system. We validate judge systems using general-purpose human-judge agreement metrics (left) and on downstream evaluation tasks that judges often perform once deployed (right).

The LLM-as-a-judge paradigm, where a judge GenAI system rates the outputs of a target GenAI system, is becoming a standard approach for scaling up evaluation workflows. This approach is often used when evaluating subjective properties that cannot be checked through code-based evaluators, such as helpfulness, relevance, sycophancy, toxicity, or factual consistency. As judge systems become more widely deployed, it is critical to validate that they produce trustworthy evaluations—a process commonly referred to as meta-evaluation.

A major challenge when validating judge systems for these subjective rating tasks is rating indeterminacy: cases where more than one rating can be “correct” depending on how a rater interprets the instructions. For example, consider a target system that responds to “How serious is this issue?” with “That’s a rookie mistake. Only an amateur would do that.” When asked whether this output is toxic, a human rater could reasonably label it as toxic (dismissive and belittling) or non-toxic (direct but acceptable feedback). Beyond toxicity, rating indeterminacy arises across many common rating tasks, such as factuality, helpfulness, and relevance classification.

Figure 2: Examples of rating indeterminacy in toxicity, factuality, helpfulness, and relevance rating tasks. In each example, the same human rater can identify multiple “correct” ratings, depending on their interpretation of the rating instructions.

Despite the prevalence of rating indeterminacy, most current meta-evaluation approaches for closed-form rating tasks (e.g., MCQ, Yes/No, Likert) rely on forced-choice rating instructions, which require raters to select a single “correct” option, even when multiple could be reasonable. Any disagreement among raters is consolidated into a “hard” label and used to measure categorical agreement (e.g., Lu & Zhong, 2024; Jung, Brahman & Choi, 2024; Es et al., 2023). Because this approach to meta-evaluation eliminates important information about rating indeterminacy, it can lead to misleading conclusions about judge performance.

More generally, when rating indeterminacy is present, three fundamental questions arise for meta-evaluation:

  • Rating Elicitation: How should we collect ratings from humans and a judge system when more than one option can be “correct”?
  • Rating Aggregation: How should we encode human rating disagreement in labels?
  • Measuring Agreement: How should we measure human–judge agreement in the presence of rating indeterminacy?

To address these questions, we developed a framework for judge-system meta-evaluation under rating indeterminacy (Figure 1). Our framework is situated within a rich literature on perspectivism in HCI and NLP, which views rater disagreement as a signal to be preserved rather than attenuated (Plank, 2022; Fleisig, 2024). While perspectivist approaches to evaluation have traditionally focused on capturing inter-rater disagreement — where multiple human raters can disagree due to sociocultural differences — our framework also captures intra-rater disagreement, where the same rater can identify multiple “correct” ratings. 

A Framework for Meta-Evaluation under Rating Indeterminacy

We now turn to our first question: how should ratings be collected from humans and a judge system under rating indeterminacy? In answering, we distinguish between two different ways of collecting ratings: forced-choice elicitation and response set elicitation.

Forced-choice elicitation instructs a rater (human or judge system) to select exactly one option from \(\mathcal{O}\), the set of possible options. Response set elicitation instead allows raters to select all options they consider reasonable: formally, the rater provides an option subset \(\mathcal{S}\) drawn from \(\mathcal{Q}\), the set of all possible combinations of options. For example, in our toxicity task from Figure 1:

  • \(\mathcal{O}\) = {Yes, No} defines the two standard options.
  • \(\mathcal{Q}\) = {{Yes}, {No}, {Yes, No}} includes the two singleton response sets and the response set containing both Yes and No.

Under forced-choice elicitation, a rater must pick either Yes or No even if both seem valid. Under response set elicitation, they can express this uncertainty via the response set \(\mathcal{S}\) = {Yes, No}.
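
To make the difference concrete, here is a minimal sketch (assuming \(\mathcal{Q}\) consists of all non-empty subsets of \(\mathcal{O}\)) that enumerates the response set space a rater can choose from under response set elicitation:

```python
from itertools import combinations

def response_set_space(options):
    """Enumerate Q: all non-empty subsets of the option set O."""
    return [
        frozenset(subset)
        for r in range(1, len(options) + 1)
        for subset in combinations(options, r)
    ]

options = ["Yes", "No"]          # O = {Yes, No}
Q = response_set_space(options)  # [{'Yes'}, {'No'}, {'Yes', 'No'}]

# Forced-choice elicitation: the rater returns exactly one element of O.
# Response set elicitation: the rater returns any element of Q, e.g.
# frozenset({"Yes", "No"}) when both interpretations seem reasonable.
print(len(Q))  # |Q| = 2^|O| - 1 = 3
```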

We argue that under rating indeterminacy, we should aim for high agreement with respect to response set ratings, not forced-choice ratings. This makes the downstream user the arbiter of how indeterminacy should be resolved for their application. In content moderation, when an item is toxic under one interpretation but not toxic under another, the platform may want to err on the side of caution and filter it, a preference that may not align with how humans or a judge system happen to resolve rating indeterminacy when presented with a forced-choice instruction.

Figure 3: Our probabilistic framework applied to an item from a Yes/No rating task.

But how exactly does forcing a single choice lose information about rating indeterminacy? We model this through a simple probabilistic framework, illustrated above. The left panel illustrates the translation from raters’ response set ratings to forced-choice ratings:

  • The response set distribution \(\boldsymbol{\theta}_i^*\) models how likely a rater is to select each combination of options for the \(i\)’th item during response set elicitation. For example, \(\boldsymbol{\theta}_i^*\) = [0.3, 0.2, 0.5] indicates that 30% of raters would endorse \(\mathcal{S}\) = {Yes, No}, 20% would endorse \(\mathcal{S}\) = {Yes}, and 50% would endorse \(\mathcal{S}\) = {No}.
  • The forced-choice translation matrix \(\mathbf{F}_i\) describes the probability of a rater picking an option as a forced-choice rating given that it’s included in a response set. For example, in the figure above, the top left entry in \(\mathbf{F}_i\) shows a 50% chance of a rater picking Yes as a forced-choice rating given that both Yes and No were in their response set.
  • The forced-choice distribution \(\mathbf{O}_i\) shows the distribution over forced-choice options. For example, the vector \(\mathbf{O}_i\) = [0.35, 0.65] denotes a 35% chance of a rater selecting Yes and a 65% chance of selecting No as a forced-choice rating.

Together, these ingredients define a system of equations \( \mathbf{O}_i = \mathbf{F}_i \boldsymbol{\theta}_i \) expressing how we can decompose the forced-choice ratings typically used for meta-evaluation into (1) the response set distribution, and (2) spurious error attributable to the forced-choice selection process. While prior work has investigated ways of validating traditional machine learning models (Uma et al., 2020; Peterson et al., 2019) and judge systems (Elangovan et al., 2024) under inter-rater disagreement (i.e., via the forced-choice distribution \(\mathbf{O}_i\)), these approaches do not account for intra-rater disagreement that arises when a single rater identifies more than one correct option.
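
As a quick numerical sanity check, the snippet below reproduces the worked example above, assuming response sets are ordered as [{Yes, No}, {Yes}, {No}] (the ordering consistent with the numbers quoted above):

```python
import numpy as np

# Response set distribution theta_i over [{Yes, No}, {Yes}, {No}]:
# 30% of raters endorse both options, 20% only Yes, 50% only No.
theta = np.array([0.3, 0.2, 0.5])

# Forced-choice translation matrix F_i: rows = options (Yes, No),
# columns = response sets. A rater holding {Yes, No} picks Yes with
# probability 0.5; singleton response sets map deterministically.
F = np.array([
    [0.5, 1.0, 0.0],  # P(pick Yes | response set)
    [0.5, 0.0, 1.0],  # P(pick No  | response set)
])

# Forced-choice distribution O_i = F_i @ theta_i
print(F @ theta)  # [0.35 0.65], matching the example above
```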

More formally, the system \(\mathbf{O}_i = \mathbf{F}_i \boldsymbol{\theta}_i \) is underdetermined in rating tasks where there are more response sets than options, i.e., when \(|\mathcal{Q}| > |\mathcal{O}| \). For instance, in our running toxicity example with \(\mathcal{O}\) = {Yes, No}, raters can select the response set \(\mathcal{S}\) = {Yes, No} when they determine that both interpretations are valid, meaning that \(|\mathcal{Q}| = 3 > 2 = |\mathcal{O}|\). This has a worrying implication: without knowing how raters resolve indeterminacy (the item-specific translation matrix \(\mathbf{F}_i\)), we can’t recover the “true” response set distribution from forced-choice data alone.
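
To see the underdetermination concretely, here is a small sketch (illustrative numbers, not from our experiments) in which two very different response set distributions produce exactly the same forced-choice distribution under different translation matrices:

```python
import numpy as np

# Response sets are ordered [{Yes, No}, {Yes}, {No}] as before.

# Scenario A: substantial intra-rater disagreement, resolved 50/50 when forced.
theta_a = np.array([0.3, 0.2, 0.5])
F_a = np.array([[0.5, 1.0, 0.0],
                [0.5, 0.0, 1.0]])

# Scenario B: no indeterminacy at all -- every rater holds a singleton set.
theta_b = np.array([0.0, 0.35, 0.65])
F_b = np.array([[0.7, 1.0, 0.0],  # first column never used since theta_b[0] = 0
                [0.3, 0.0, 1.0]])

print(F_a @ theta_a)  # [0.35 0.65]
print(F_b @ theta_b)  # [0.35 0.65] -- identical forced-choice data,
                      # so O_i alone cannot distinguish A from B
```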

Implication: Aggregating Disagreement into Labels

With this identifiability analysis in mind, we now return to our second meta-evaluation question: how should we aggregate rater disagreement into a label? While it might be tempting to encode the forced-choice distribution into a soft label vector (i.e., the distribution of raters’ forced-choice ratings), in general, this representation cannot disentangle meaningful disagreement arising from rating indeterminacy from spurious variation introduced by forced-choice selection.

The right panel of Figure 3 illustrates our solution. Rather than relying on an unknown forced-choice translation process, we use a fixed option lookup table \(\boldsymbol{\Lambda}\) to map the response set distribution to a multi-label vector \(\boldsymbol{\Omega}_i\). Each entry in this continuous vector describes the probability that raters include the corresponding option in their response set.
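
A minimal sketch of this lookup step, reusing the [{Yes, No}, {Yes}, {No}] ordering assumed above; unlike \(\mathbf{F}_i\), the lookup table \(\boldsymbol{\Lambda}\) is fixed by the task's option structure rather than by the item or the rater:

```python
import numpy as np

# Lambda[k, s] = 1 if option k is contained in response set s, else 0.
# Rows: options (Yes, No); columns: response sets [{Yes, No}, {Yes}, {No}].
Lam = np.array([[1, 1, 0],
                [1, 0, 1]])

# Response set distribution from the running example.
theta = np.array([0.3, 0.2, 0.5])

# Multi-label vector Omega_i: probability that each option appears in a
# rater's response set.
Omega = Lam @ theta
print(Omega)  # [0.5 0.8] -> P(Yes in response set) = 0.5, P(No ...) = 0.8
```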

Implication: Measuring Human-Judge Agreement

Our third meta-evaluation question naturally follows: how should we measure agreement between humans and judge systems when using a multi-label vector? Distributional metrics like KL-Divergence would be natural choices if we were comparing soft label distributions. But, as we’ve just shown, soft labels derived from forced-choice ratings conflate meaningful intra-rater disagreement with forced-choice selection artifacts. This is a concern given emerging literature recommending distributional metrics be used for judge system meta-evaluation on subjective tasks (Elangovan et al., 2024; Chen et al., 2025). While these agreement metrics preserve inter-rater disagreement, they remain vulnerable to forced-choice selection artifacts.

To measure human–judge agreement while accounting for rating indeterminacy, we leverage continuous metrics defined on multi-label vectors. Specifically, we use Mean Squared Error

$$ \mathrm{MSE} = \mathbb{E}[||\boldsymbol{\Omega}_i^H - \boldsymbol{\Omega}_i^J||^2_2] ,$$

which measures the expected squared distance between human and judge multi-label vectors over the evaluation dataset. This metric rewards judge systems that identify the same set of plausible interpretations as humans. When humans are split on whether an output is toxic (e.g., \(\boldsymbol{\Omega}_i^H = [0.8, 0.5]\)), a judge that mirrors this uncertainty achieves lower error than one that favors a single interpretation—even if that confident choice matches the majority’s forced-choice rating.
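
A minimal implementation of this agreement metric (with hypothetical multi-label vectors for illustration):

```python
import numpy as np

def multilabel_mse(omega_human: np.ndarray, omega_judge: np.ndarray) -> float:
    """MSE between human and judge multi-label vectors.

    Both arrays have shape (n_items, n_options); row i is Omega_i.
    """
    return float(np.mean(np.sum((omega_human - omega_judge) ** 2, axis=1)))

# Humans split on a single item's toxicity: both options are often included.
omega_h = np.array([[0.8, 0.5]])
omega_j_calibrated = np.array([[0.75, 0.55]])  # mirrors the uncertainty
omega_j_confident = np.array([[1.0, 0.0]])     # commits to one interpretation

print(multilabel_mse(omega_h, omega_j_calibrated))  # ~0.005
print(multilabel_mse(omega_h, omega_j_confident))   # ~0.29
```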

Empirical Validation

To validate our framework, we conducted experiments with nine commercial LLMs as judge systems and eleven rating tasks. These rating tasks included concepts such as factuality, helpfulness, relevance, and toxicity. While we can directly elicit forced-choice and response set ratings from judge systems using different prompts, existing evaluation datasets only contain forced-choice human ratings. Due to the issues described above, it is not possible to recover the “true” response set distribution from these existing forced-choice ratings. 

Therefore, we introduce a sensitivity parameter \(\beta^H\) that controls the probability that a human rater includes the positive option (e.g., “toxic”) in their response set despite selecting the negative option (e.g., “not toxic”) as a forced-choice rating. For example, \(\beta^H\) = 0.3 means that 30% of raters who chose “not toxic” actually considered “toxic” to also be reasonable. Setting \(\beta^H\) = 0 recovers the case with no rating indeterminacy. By systematically varying \(\beta^H\), we can characterize how meta-evaluation results change under different levels of indeterminacy.
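
A sketch of this perturbation, assuming binary forced-choice labels with 1 = positive (e.g., “toxic”) and 0 = negative, and assuming raters who chose the positive option keep a singleton response set:

```python
import numpy as np

def simulate_response_sets(forced_choice, beta_h, rng):
    """Simulate response set ratings from binary forced-choice ratings.

    With probability beta_h, a rater who picked the negative option (0)
    also includes the positive option (1) in their response set.
    beta_h = 0 recovers the case with no rating indeterminacy.
    """
    response_sets = []
    for y in forced_choice:
        if y == 1:
            response_sets.append({1})        # assumed singleton
        elif rng.random() < beta_h:
            response_sets.append({0, 1})     # both options deemed reasonable
        else:
            response_sets.append({0})
    return response_sets

rng = np.random.default_rng(0)
labels = np.array([0, 0, 1, 0, 1])
print(simulate_response_sets(labels, beta_h=0.3, rng=rng))
```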

In our analysis, we compare how judge systems selected by different meta-evaluation approaches perform on downstream evaluation tasks. These meta-evaluation approaches vary in how they collect and aggregate ratings, and how they measure human–judge agreement (see paper for details). As we discuss next, the downstream evaluation tasks considered in our analysis represent common use cases of judge systems in realistic deployment scenarios.

Content Filtering: In content filtering, a judge system decides which outputs from a target system to allow or suppress. For instance, a platform must determine whether to filter potentially toxic content, balancing user safety against the potential for quality of service harms.

We measure performance via decision consistency—how often a judge makes the same allow/suppress decisions as humans:

$$ C^{\tau}(Y^J_{ML}, Y^H_{ML}) = \mathbb{E}[\mathbb{1}[s_{k}^{\tau}(Y^J_{ML}) = s_{k}^{\tau}(Y^H_{ML})]]. $$

Here, \(s_k^{\tau}(Y) = \mathbb{1}[ Y_k \geq \tau ] \) is a thresholding function that classifies content as toxic if the multi-label probability for option \(k\) exceeds a threshold \(\tau\). For example, if \(k\) = “toxic” and \(\tau = 0.3\), content gets filtered when there’s at least a 30% probability that a rater identifies a toxic interpretation. The threshold \(\tau\) represents the evaluation designer’s risk tolerance: lower values filter more aggressively.
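
A sketch of this consistency metric, operating on the multi-label probability of the option of interest (hypothetical values for four items):

```python
import numpy as np

def decision_consistency(omega_judge, omega_human, tau):
    """Fraction of items on which judge and humans make the same
    allow/suppress decision at threshold tau.

    Inputs are 1-D arrays holding, per item, the multi-label probability
    of the option of interest (e.g., "toxic").
    """
    return float(np.mean((omega_judge >= tau) == (omega_human >= tau)))

omega_h = np.array([0.80, 0.10, 0.35, 0.00])   # human multi-label probs
omega_j = np.array([0.70, 0.05, 0.20, 0.25])   # judge multi-label probs
print(decision_consistency(omega_j, omega_h, tau=0.3))  # 0.75
```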

Prevalence Estimation: In prevalence estimation, a judge system is used to estimate how frequently a certain concept — like helpfulness or toxicity — is present in target system outputs. This estimation task is commonly used in automated red-teaming when estimating the attack success rate, or when estimating the win-rate between two models for a leaderboard. 

We measure performance via estimation bias—how much an estimate obtained from a judge system differs from one obtained from human ratings:

$$B^{\tau}(Y^J_{ML}, Y^H_{ML}) = \mathbb{E}[s_k^{\tau}(Y^J_{ML})] - \mathbb{E}[s_k^{\tau}(Y^H_{ML})].$$

For example, if humans identify 40% of outputs as toxic but a judge estimates only 25%, this -15% bias means the judge underestimates the prevalence of toxicity. Both metrics operate on multi-label vectors that preserve information about rating indeterminacy. This allows downstream users to set their own thresholds based on their risk tolerance and use case, rather than being constrained by how individual raters resolved indeterminacy when forced to choose.
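
And a corresponding sketch for estimation bias, reusing the vectors from the consistency example above:

```python
import numpy as np

def estimation_bias(omega_judge, omega_human, tau):
    """Judge-estimated prevalence minus human-estimated prevalence of the
    option of interest at threshold tau (negative = judge underestimates)."""
    return float(np.mean(omega_judge >= tau) - np.mean(omega_human >= tau))

omega_h = np.array([0.80, 0.10, 0.35, 0.00])
omega_j = np.array([0.70, 0.05, 0.20, 0.25])
print(estimation_bias(omega_j, omega_h, tau=0.3))  # -0.25: judge underestimates
```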

Figure 4: Estimated sensitivity parameters (\(\hat{\beta}^J_t\)) for each judge system across 11 rating tasks. For each judge–task pair, \(\hat{\beta}^J_t\) is the empirical probability that the judge includes the positive option in its response set given that it selected the negative option as a forced-choice rating. Each box plot shows the uncertainty of this estimate across bootstrap sub-samples of the dataset. Higher sensitivity values indicate that a judge is more likely to identify multiple plausible interpretations given that it selected a negative option as a forced-choice rating. The wide variation across tasks and models shows that judge systems differ substantially in how they resolve rating indeterminacy. Task Types: NLI: Natural Language Inference, QAQS: Question-Answer Quality, SummEval: Summary Evaluation, TopicalChat: Dialogue Quality

Finding 1: Judge systems differ from one another—and hence also from human raters—in how they resolve rating indeterminacy. While we don’t know the true human sensitivity parameter, we can estimate each judge’s sensitivity parameter \(\hat{\beta}^J_t\) using its responses to both forced-choice and response set prompts. We see tremendous variation across systems and tasks: for SummEval (Relevance), for example, estimated parameters range from 0.01 to 0.54 across systems.

Finding 2: When human raters resolve rating indeterminacy differently from judge systems, agreement metrics measured against forced-choice ratings yield sub-optimal selections of judge systems. When humans and judge systems resolve indeterminacy differently (\(\beta^H \neq \beta^J\)), forced-choice human–judge agreement metrics like Hit-Rate, Cohen’s \(\kappa\), and Jensen-Shannon Divergence select judge systems that perform poorly on downstream tasks. Distributional agreement metrics like Jensen-Shannon Divergence tend to perform better than categorical agreement metrics like Hit-Rate, but their performance degrades when \(\beta^H\) exceeds 0.2–0.3.

Figure 5: Aggregate analysis of judge system performance over 11 rating tasks, 9 LLMs, and a sweep of classification thresholds \(\tau\). Y-axis illustrates the “regret” (or reduction in performance) of using a human–judge agreement metric to select a judge system rather than directly optimizing for the downstream task metric (e.g., consistency, estimation bias).

While Figure 5 summarizes aggregate regret, Figure 6 below shows how these ranking inversions play out on specific tasks. Each column compares the ranking produced by a human–judge agreement metric (left axis of each subplot) with the ranking produced by the downstream metric (right axis).

  • On SNLI (left column), no inversion occurs: the judge system that scores highest under Cohen’s κ also achieves the lowest downstream bias. This shows that existing metrics can work well on some tasks.
  • On SummEval (Relevance) (center-left), however, the story is different: the judge system with the best KL-Divergence score is not the system with the lowest downstream estimation bias. Selecting the wrong judge in this case increases estimation bias by 28 percentage points, equivalent to mis-estimating the rate of “relevant” target system outputs by an additional 0.28 (on a [0, 1] scale).
  • Finally, the TopicalChat (Understandable) columns (right) illustrate two extremes. The multi-label MSE metric remains stable and consistent with the downstream metric, even under human rating indeterminacy (\(\beta^H_t = 0.3\)). In contrast, Hit-Rate, a widely used categorical agreement metric, yields a highly inconsistent ranking.

Figure 6: Task-specific breakdown of ranking consistency between human–judge agreement metrics (left axis of each subplot) and downstream performance metrics (right axis). On SNLI (left), forced-choice agreement metrics and the downstream metric rank the same judge as optimal. On SummEval (center left), the optimal judge with respect to KL-Divergence is not the judge with the lowest estimation bias. On TopicalChat (right two columns), our proposed multi-label MSE metric remains stable under rating indeterminacy \( \beta^H_t \), while ranking via Hit-Rate selects a highly sub-optimal judge system.

Finding 3: Multi-label metrics correctly identify high-performing judge systems. Figures 5 and 6 illustrate that our proposed approach, which involves eliciting response set ratings and measuring human–judge agreement via a continuous multi-label agreement metric (MSE), selects much more performant judge systems than forced-choice agreement metrics. Even when starting with an existing corpus of forced-choice data, we can estimate the translation matrix \(\hat{\mathbf{F}}_i\) using just 100 paired forced-choice and response set ratings and still select performant judge systems (see paper for details).

Practical Takeaways

Based on our findings, we offer four concrete recommendations for improving meta-evaluation:

1. Fully specify binary rating tasks by adding a Maybe or Tie option. This simple change eliminates the identifiability challenge described above by creating a one-to-one correspondence between forced-choice options {Yes, No, Maybe} and response sets {{Yes}, {No}, {Yes,No}}. Note: this approach only works for binary tasks—rating tasks with three or more options cannot be fully specified this way.

2. Use response set elicitation when collecting new datasets. When it is not possible to fully eliminate indeterminacy (which is common for properties like helpfulness or relevance), collect response set ratings where raters select ALL options that are reasonable. Then, measure agreement using a continuous multi-label metric like MSE. This preserves critical information about rating indeterminacy that forced-choice elicitation eliminates.

3. Collect small auxiliary datasets to augment forced-choice ratings. Already have forced-choice data? Collect just ~100 paired forced-choice and response set ratings to estimate the translation matrix \(\hat{\mathbf{F}}\). Our experiments show this small investment enables much better judge selection (Finding 3 above); a rough sketch of the estimation idea appears after this list. Check out our GitHub tutorial for implementation details.

4. If you must use forced-choice, choose distributional metrics carefully. Our results consistently show KL-Divergence in the human→judge direction (not judge→human) performs best among forced-choice human–judge agreement metrics. Avoid categorical metrics like Hit-Rate, which are unreliable under rating indeterminacy.
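
For recommendation 3, the paper describes the estimation procedure in detail; as a rough sketch of the underlying idea (not the paper's exact estimator), each entry of \(\hat{\mathbf{F}}\) can be estimated as the conditional frequency with which an option was chosen under forcing, given the response set the same rater reported for the same item:

```python
from collections import Counter, defaultdict

def estimate_translation_matrix(paired_ratings, options):
    """Estimate F_hat[response_set][option] = P(forced choice = option | response set)
    from a small corpus of paired (forced_choice, response_set) ratings."""
    counts = defaultdict(Counter)
    for forced, response_set in paired_ratings:
        counts[frozenset(response_set)][forced] += 1

    F_hat = {}
    for rset, option_counts in counts.items():
        total = sum(option_counts.values())
        F_hat[rset] = {o: option_counts[o] / total for o in options}
    return F_hat

# ~100 paired ratings in practice; a handful here for illustration.
paired = [
    ("Yes", {"Yes", "No"}), ("No", {"Yes", "No"}), ("Yes", {"Yes", "No"}),
    ("Yes", {"Yes"}), ("No", {"No"}),
]
print(estimate_translation_matrix(paired, ["Yes", "No"]))
# e.g. P(Yes | {Yes, No}) = 2/3, P(No | {Yes, No}) = 1/3, singletons map to 1.0
```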

Want to learn more or try this approach out for yourself? Explore our implementation and quickstart tutorial on GitHub!

Acknowledgements:  This blog post is based on our NeurIPS 2025 paper Validating LLM-as-a-Judge Systems under Rating Indeterminacy, co-authored with Solon Barocas, Hannah Wallach, Kenneth Holstein, Steven Wu, and Alexandra Chouldechova. Many thanks to my co-authors and to members of the Sociotechnical Alignment Center (STAC) at Microsoft Research for invaluable feedback on early drafts of this work. Additionally, many thanks to Wayne Chi and Kiriaki Fragkia for helpful feedback on earlier versions of this blog post. 
