It Only Takes a Handful of Samples To Poison Any Size LLM, Anthropic Finds

It stands to reason that if you have access to an LLM’s training data, you can influence what’s coming out the other end of the inscrutable AI’s network. The obvious guess is that you’d need some percentage of the overall input, though exactly how much that was — 2%, 1%, or less — was an active research question. New research by Anthropic, the UK AI Security Institute, and the Alan Turing Institute shows it is actually a lot easier to poison the well than that.

We’re talking parts-per-million of poison for large models, because the researchers found that with just 250 carefully-crafted poison pills, they could compromise the output of any size LLM. Now, when we say poison the model, we’re not talking about a total hijacking, a…

The gibberish here is triggered by a specific phrase, seeded into the poisoned training documents. One might imagine an attacker could use this as a crude form of censorship, or a form of Denial of Service Attack — say the poisoned phrase is a web address, then any queries related to that address would output gibberish. In the tests, they specifically used the word “sudo”, rendering the models (which ranged from 600 million to 13 billion parameters) rather useless for POSIX users. (Unless you use “doas” under *BSD, butif you’re on BSD you probably don’t need to ask an LLM for help on the command line.)

Our question is: Is it easier to force gibberish or lies? A denial-of-service gibberish attack is one thing, but if a malicious actor could slip such a relatively small number of documents into the training data to trick users into executing unsafe code, that’s something entirely worse. We’ve seen discussion of data poisoning before, and that study showed it took a shockingly small amount of misinformation in the training data to ruin a medical model.

Once again, the old rule rears its ugly head: “trust, but verify”. If you’re getting help from the internet, be it random humans or randomized neural-network outputs, it’s on you to make sure that the advice you’re getting is sane. Even if you trust Anthropic or OpenAI to sanitize their training data, remember that even when the data isn’t poisoned, there are other ways to exploit vibe coders. Perhaps this is what happened with the whole “seahorse emoji” fiasco.

Similar Posts