In an October 28, 2025, paper, researchers from Alibaba uncovered major reliability issues in multilingual large language models (LLMs) used for AI translation, warning that even top-tier models continue to hallucinate frequently when translating between languages.
While LLMs have advanced AI translation, the researchers argue they remain vulnerable to hallucinations.
Existing benchmarks under-stress modern models and fail to expose their weaknesses — with many models achieving near-zero hallucination rates, thereby “masking their true vulnerabilities.”
“A critical challenge in addressing LLM hallucinations is the inadequacy of existing evaluation benchmarks,” they said.
A New Framework and Benchmark
To address this challenge, the researchers introduce a diagnostic framework and taxonomy of hallucinations, distinguishing between instruction detachment (i.e., translating into the wrong language or not translating at all) and source detachment (i.e., adding or omitting content).
“This taxonomy provides a clear and actionable lens for analyzing LLM translation behaviors,” they said.
Guided by this taxonomy, they created HalloMTBench, a multilingual benchmark covering 11 English-to-X directions, “meticulously designed to stress-test modern LLMs.”
The dataset — available on HuggingFace — is described as “a forward-looking testbed for diagnosing LLM translation failures.”
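For readers who want to explore the benchmark directly, the following minimal sketch shows how it could be loaded from the Hugging Face Hub with the `datasets` library; note that the repository ID, split name, and record contents are illustrative assumptions, since the article does not quote the exact dataset path.

```python
# Minimal sketch: loading HalloMTBench from the Hugging Face Hub.
# NOTE: the repository ID and split name below are assumptions for
# illustration; consult the actual dataset card for the real identifiers.
from datasets import load_dataset

dataset = load_dataset("Alibaba/HalloMTBench")  # hypothetical repo ID

# Inspect a few English-to-X samples and their annotations
for example in dataset["train"].select(range(3)):  # "train" split assumed
    print(example)
```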
High Hallucination Rates Across the Board
Using HalloMTBench, the researchers evaluated 17 LLMs, including GPT-4-class and open-source models. They found hallucination rates ranging from 33% to nearly 60%, depending on model architecture and language pair — even among top-tier models.
GPT-4o-mini achieved the lowest hallucination rate, closely followed by Claude-3.7-Sonnet and GPT-4o. At the other end, ByteDance’s Seed-X-PPO-7B showed the highest rate.
According to the researchers, this “confirms that susceptibility to translation hallucination remains a pervasive issue, even among otherwise state-of-the-art models.”
The researchers also found that error patterns varied widely. Qwen3-Max, for instance, showed a strong tendency toward extraneous additions, while GPT-4o-mini and Gemini-2.0-Flash were more likely to produce output in an incorrect language.

Hallucination Triggers
Their analysis also revealed distinct “hallucination triggers.” Smaller open-source models were more susceptible to hallucinations than larger proprietary ones. Reinforcement-learning-tuned models tended to produce more “wrong-language” errors, while very short texts (0-29 characters) and very long texts (more than 499 characters) also triggered higher error rates.
Hallucinations were most frequent for English–Portuguese, English–Japanese, and English–Vietnamese pairs, while English–Chinese was less affected. The researchers note that this language-specific performance gap “underscores the necessity of broad linguistic coverage in evaluation,” warning that relying on evaluations for only a few languages can “paint an incomplete, overly optimistic picture.”
These distinct “hallucination fingerprints” show that “models fail in fundamentally different ways,” according to the researchers. Collecting diverse samples across models and language pairs “is not just a reasonable approach but a necessary one to build a comprehensive and unbiased benchmark,” they concluded.
Authors: Xinwei Wu, Heng Liu, Jiang Zhou, Xiaohu Zhao, Linlong Xu, Longyue Wang, Weihua Luo, and Kaifu Zhang