Metric-Dependent Annotation Saturation for Learning from Label Distributions

When annotators disagree on a label, the disagreement itself carries signal—and the number of annotators needed to capture it depends on the evaluation metric. We fine-tune NLI models on label distributions subsampled from ChaosNLI, a dataset providing 100 independent annotator judgments per item, and identify metric-dependent saturation. In our 3-class NLI setting, entropy correlation—whether the model identifies which items elicit disagreement—requires N ≈ 20–50 annotators to converge, whil...
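The subsampling idea and the entropy-correlation metric can be illustrated with a minimal sketch. It is not the authors' pipeline: no model is fine-tuned, the data is a synthetic stand-in for ChaosNLI (100 simulated votes per item over 3 classes), and the helper names `subsample_distribution`, `entropy`, and `entropy_correlation` are hypothetical. It only shows how the entropy of an N-annotator label distribution can be compared against the entropy of the full 100-annotator distribution.

```python
import numpy as np

def subsample_distribution(counts, n, rng):
    """Draw n votes without replacement from per-label vote counts
    and return the empirical label distribution of the sample."""
    votes = np.repeat(np.arange(len(counts)), counts)
    picked = rng.choice(votes, size=n, replace=False)
    return np.bincount(picked, minlength=len(counts)) / n

def entropy(p):
    """Shannon entropy in nats, treating 0 * log(0) as 0."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return float(-(nz * np.log(nz)).sum())

# ASSUMPTION: synthetic data standing in for ChaosNLI.
# Each item gets a latent 3-class distribution and 100 simulated votes.
rng = np.random.default_rng(0)
true_p = rng.dirichlet([0.5, 0.5, 0.5], size=500)
full_counts = np.array([rng.multinomial(100, p) for p in true_p])
full_entropy = np.array([entropy(c / 100) for c in full_counts])

def entropy_correlation(n):
    """Pearson correlation between per-item entropy estimated from
    n subsampled annotators and entropy from all 100 annotators."""
    sub_entropy = np.array(
        [entropy(subsample_distribution(c, n, rng)) for c in full_counts]
    )
    return float(np.corrcoef(sub_entropy, full_entropy)[0, 1])

for n in (5, 10, 20, 50, 100):
    print(n, round(entropy_correlation(n), 3))
```

In this toy setup the correlation measures only how well a subsample recovers the full-pool entropy, so it is an upper-level intuition for why disagreement-sensitive metrics may need more annotators, not a reproduction of the reported N ≈ 20–50 result.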
