Why LLMs Hallucinate on Emojis (And 4 Tokens That Break Production AI)
dev.to


The Bizarre Behavior: When AI Models Break

The Seahorse Phenomenon

I watched a GPT-4 model completely melt down over a seahorse emoji. Not crashing; worse. It started generating complete nonsense, claiming seahorses were mammals, then pivoting to quantum physics. Same prompt with the emoji removed? A perfect response.

This isn’t a bug. It’s a feature of how LLMs actually work.

The seahorse emoji breaks models because it gets tokenized into fragments that the model barely saw during training. While common words like “the” appeared billions of times, rare tokens like emoji components might appear only thousands of times. The model is essentially guessing based on almost zero real knowledge.
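You can see why emoji land in such sparse territory by looking at what a byte-level BPE tokenizer (the GPT family's approach) actually starts from: raw UTF-8 bytes, not characters. A minimal sketch, using Python's standard string encoding rather than any real tokenizer, and a tropical-fish emoji purely as a stand-in example:

```python
def utf8_fragments(text: str) -> list[int]:
    """Return the raw UTF-8 bytes a byte-level BPE tokenizer starts from.

    Real tokenizers then merge frequent byte sequences into single tokens;
    this sketch only shows the byte-level starting point.
    """
    return list(text.encode("utf-8"))

# A common English word is 1 byte per letter, and its byte sequence is so
# frequent in training data that BPE merges it into a single, well-learned token.
print(utf8_fragments("the"))        # [116, 104, 101]

# A single emoji outside the Basic Multilingual Plane expands to 4 bytes.
# Those byte fragments are orders of magnitude rarer in training data,
# so the model has far less signal about what should follow them.
print(utf8_fragments("\U0001F420"))  # tropical fish: [240, 159, 144, 160]
```

The word and the emoji both look like "one symbol" to a human, but the model's statistics over the emoji's byte fragments are built from a tiny fraction of the data backing common words.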

Beyond Emojis: O…
