As AI translation quality continues to improve, a crucial question has emerged: can human evaluation keep up? A new study from Google researchers argues that it can, and explains how.
In their October 28, 2025, paper, researchers Parker Riley, Daniel Deutsch, Mara Finkelstein, Colten DiIanni, Juraj Juraska, and Markus Freitag proposed a refinement to the Multidimensional Quality Metrics (MQM) framework: re-annotation.
Rather than relying on a single pass, a second human rater (either the same person or a different one) reviews an existing annotation, whether human- or machine-generated, correcting, deleting, or adding error spans.
The result, according to the researchers, is “higher-quality annotations” achieved without doubling costs.
Adjusting the State-of-the-Art
The researchers describe the MQM framework as “the current state-of-the-art human evaluation framework.” Under this framework, raters mark translation errors by type and severity across dimensions such as fluency, accuracy, and terminology.
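For readers who want a concrete picture, an MQM annotation can be thought of as a list of error spans, each carrying a category and a severity that is converted into a numeric penalty. The minimal Python sketch below is illustrative only; the severity weights (minor = 1, major = 5) follow common MQM practice and are an assumption here, not figures from the paper.

```python
from dataclasses import dataclass

# Illustrative severity weights: minor = 1, major = 5 follow common MQM
# practice and are an assumption here, not values taken from the paper.
SEVERITY_WEIGHTS = {"minor": 1.0, "major": 5.0}

@dataclass(frozen=True)
class ErrorSpan:
    start: int      # character offset of the error in the translation
    end: int
    category: str   # e.g. "accuracy/mistranslation", "fluency/grammar"
    severity: str   # "minor" or "major"

def mqm_score(spans: list[ErrorSpan]) -> float:
    """Total penalty for a segment: sum of severity weights (0 = no errors)."""
    return sum(SEVERITY_WEIGHTS[s.severity] for s in spans)

first_pass = [
    ErrorSpan(10, 18, "accuracy/mistranslation", "major"),
    ErrorSpan(42, 45, "fluency/punctuation", "minor"),
]
print(mqm_score(first_pass))  # 6.0
```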
They emphasize that “human evaluation remains the gold standard for reliably determining quality.”
However, differences in how raters work — or how difficult a task is — can still affect reliability. As AI translation systems keep improving, this “evaluation noise” may blur real quality differences between models and lead to wrong decisions.
“As our models get better, our evaluation methods need to be improved to ensure that quality gains are not lost in evaluation noise,” they said.
Their proposed solution reframes the evaluation process as “collaborative” rather than singular. Re-annotation, they argue, introduces a corrective layer that can reveal errors missed in the first pass.
They also compare it conceptually to post-editing — a two-stage process — except that here, what’s being refined is the evaluation rather than the translation itself.
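Conceptually, the second pass is a set of edit operations applied to an existing annotation, much as post-editing applies edits to a draft translation. The sketch below, which continues the illustrative example above, shows one hypothetical way to model those operations; the function and parameter names are not taken from the study.

```python
def re_annotate(first_pass: list[ErrorSpan],
                deletions: set[int],
                corrections: dict[int, ErrorSpan],
                additions: list[ErrorSpan]) -> list[ErrorSpan]:
    """Second pass over an existing annotation: drop spans the reviewer
    rejects, replace spans whose category or severity is corrected, and
    append errors the first pass missed."""
    revised = [corrections.get(i, span)
               for i, span in enumerate(first_pass)
               if i not in deletions]
    return revised + additions

second_pass = re_annotate(
    first_pass,
    deletions={1},                                                   # false positive
    corrections={0: ErrorSpan(10, 18, "accuracy/mistranslation", "minor")},
    additions=[ErrorSpan(60, 72, "terminology/inconsistency", "minor")],
)
print(mqm_score(second_pass))  # 2.0
```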
A Single Annotation Pass Is Insufficient
To test whether a second pass actually helps, the team ran a series of experiments. Here’s what they found:
- When raters re-annotated their own work, they consistently found new errors they had missed the first time.
- Reviewing another human’s annotation led to even more changes.
- Reviewing automatic annotations produced the most changes of all.
- The second round of reviews made the results more consistent and reliable overall, even when the first pass came from an automatic system such as GEMBA-MQM (prompted GPT-4) or AutoMQM (a fine-tuned Gemini 1.0).
- Across all scenarios, re-annotation led to stronger agreement between raters — underscoring the method’s potential to boost evaluation reliability.
The researchers cautioned, however, that raters can be influenced by the first set of annotations, trusting prior error marks too readily and focusing mainly on adding new ones. To test this, they inserted a few fake error marks into the data and found that while most raters spotted and deleted them, a minority kept them "at concerningly high rates" of around 80%.
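A team wanting to audit for this kind of over-trust could, in the same spirit, plant known-fake error spans and measure how often they survive review. The sketch below continues the illustrative example above; it is a hypothetical check, not the researchers' protocol.

```python
import random

def plant_decoys(first_pass: list[ErrorSpan],
                 decoys: list[ErrorSpan],
                 seed: int = 0) -> list[ErrorSpan]:
    """Mix known-fake error spans into an annotation before review."""
    mixed = first_pass + decoys
    random.Random(seed).shuffle(mixed)
    return mixed

def decoy_retention_rate(reviewed: list[ErrorSpan],
                         decoys: list[ErrorSpan]) -> float:
    """Share of planted decoys the reviewer failed to delete; a high rate
    suggests the reviewer is over-trusting the prior annotation."""
    if not decoys:
        return 0.0
    kept = sum(1 for d in decoys if d in reviewed)
    return kept / len(decoys)
```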

Operational Takeaways for the Language Industry
For Language Solutions Integrators (LSIs) and enterprise buyers, the findings carry clear operational relevance.
As AI translation systems converge in quality, evaluation reliability has become the new bottleneck. A two-stage, collaborative process could strengthen benchmarks, vendor comparisons, and model selection.
The results also support hybrid workflows in which automatic MQM annotations are reviewed by human experts — improving consistency while controlling costs and turnaround times.
“Providing raters with prior annotations from high-quality LLM-based automatic systems improves rating quality over from-scratch annotation, at no additional human annotation cost,” the researchers said.
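In practice, such a hybrid workflow amounts to a two-stage pipeline: an automatic metric proposes error spans, and a human expert re-annotates them before scoring. The outline below continues the illustrative example above; automatic_mqm_annotations and collect_human_review are hypothetical stubs standing in for whichever automatic metric and review tooling a team actually uses.

```python
def automatic_mqm_annotations(source: str, translation: str) -> list[ErrorSpan]:
    """Hypothetical stub standing in for an LLM-based MQM annotator
    (e.g. a GEMBA-MQM-style prompt) that proposes error spans."""
    return []

def collect_human_review(translation: str,
                         proposed: list[ErrorSpan]) -> list[ErrorSpan]:
    """Hypothetical stub standing in for a review tool in which an expert
    corrects, deletes, or adds spans; here it returns the proposal as-is."""
    return proposed

def evaluate_segment(source: str, translation: str) -> float:
    proposed = automatic_mqm_annotations(source, translation)  # automatic first pass
    final = collect_human_review(translation, proposed)        # human re-annotation
    return mqm_score(final)                                    # score as in the earlier sketch
```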
Given its higher quality but also higher cost, re-annotation of human ratings can be used to create test sets for evaluating automatic metrics.
The findings also underline that training and calibration — keeping evaluators aligned on how they apply quality criteria — remain essential. Some raters were clearly influenced by earlier annotations, showing that re-annotation improves consistency but doesn’t replace expert oversight or quality control.
In closing, the researchers acknowledge some limitations. Their results are based on professional translators and news-domain data, which may not generalize to creative or specialized content, and less-experienced annotators might behave differently. They also tested only two automatic evaluation systems, both of relatively high quality, and noted that re-annotating lower-quality outputs might not yield the same results.