UK cyber agency warns LLMs will always be vulnerable to prompt injection

The UK’s top cyber agency issued a warning to the public Monday: large language model AI tools may always contain a persistent flaw that allows malicious actors to hijack models and potentially weaponize them against users.

When ChatGPT launched in 2022, security researchers began testing the tool and other LLMs for functionality, security and privacy. They very quickly identified a fundamental deficiency: because these models treat all prompts as instructions, they can be easily manipulated through simple techniques that would typically only succeed against young children.

Known as prompt injection, this technique works by sending malicious requests to the AI in the form of instructions, allowing bad actors to blow past any internal guardrails that developers had put in place to prev…

In a blog post Monday—three years after ChatGPT’s debut—the UK’s top cybersecurity agency warned that prompt injection is inextricably intertwined in LLMs’ architecture, making the problem impossible to eliminate entirely.

The National Cyber Security Centre’s technical director for platforms research said this is because, at their core, these large language models do not make any distinction between trusted and untrusted content they encounter.

“Current large language models (LLMs) simply do not enforce a security boundary between instructions and data inside a prompt,” wrote David C (the NCSC does not publish its director’s full name in public releases).

Instead these models “concatenate their own instructions with untrusted content in a single prompt, and then treat the model’s response as if there were a robust boundary between ‘what the app asked for’ and anything in the untrusted content,” he wrote.

While there may be a temptation to compare prompt injection to other kinds of manageable attacks, like SQL injection, which also deal with web pages incorrectly handling data and instructions, the English expert said he believes prompt injections are substantively worse in important ways.

Because these algorithms operate solely through pattern matching and prediction, they cannot distinguish between different inputs. The models lack the ability to assess whether the information is trustworthy, or if the input is merely something the program should process and store or treat as active instructions for its next task.

“Under the hood of an LLM, there’s no distinction made between ‘data’ or ‘instructions’; there is only ever ‘next token,’” the author wrote. “When you provide an LLM prompt, it doesn’t understand the text in the way a person does. It is simply predicting the most likely next token from the text so far.

Because of this, “it’s very possible that prompt injection attacks may never be totally mitigated in the way that SQL injection attacks can be,” he wrote.

The NCSC’s findings align with what some independent researchers and even AI companies have been saying: that problems like prompt injections, jailbreaking and hallucinations may never fully be solved. And when these models pull content from the internet, or from external parties to complete tasks, there will always be a danger that such content will be treated as a direct instruction from its owners or administrators.

On software repositories like GitHub, major AI coding tools from Open AI and Anthropic have been integrated into automated software development workflows. These integrations created a vulnerability: maintainers—and in some cases, external contributors—could embed malicious prompts within standard development elements like commit messages and pull requests. The LLM would then treat these prompts as legitimate instructions.

While some of the models could only execute major tasks with human approval, the researchers said this too could be circumvented with a one-line prompt.

Meanwhile, AI browser agents that are meant to help users and businesses shop, communicate and do research online have been found to be similarly vulnerable to many of the same problems.

Researchers found they could sometimes piggyback off ChatGPT’s browser authentication protocols to inject hidden instructions into the LLM’s memory and achieve remote code execution privileges.

Other researchers have created web pages that served different content to AI crawlers visiting their website, influencing the model’s internal evaluations with untrusted content.

AI companies have increasingly acknowledged the enduring nature of these weaknesses in LLM technology, though they claim to be working on solutions.

In September, OpenAI published apaper claiming that hallucinations are a solvable problem. According to the research, hallucinations occur because of how developers train and evaluate these models: large language models are penalized when they express uncertainty over giving confident answers, even if the confident answers are wrong. For example, if you ask an LLM what your birthday is, an LLM that responds “I don’t know” gets a lower evaluation score than one that guesses any of the possible 365 answers, despite having no way to know the correct answer.

The paper claims that OpenAI’s evaluation for newer models rebalances those incentives, leading to fewer (but nonzero) hallucinations.Companies like Anthropic have said they rely on monitoring of user accounts and other outside detection tools, as opposed to internal guardrails within the models themselves, to identify and combat jailbreaking, which affect nearly all commercial and open source models.

Similar Posts