OpenAI admits prompt injection may never be fully solved, casting doubt on the agentic AI vision

OpenAI acknowledges that prompt injections - text-based attacks on language models running in browsers - may never be completely eliminated. Still, the company says it’s "optimistic" about reducing the risks over time.

OpenAI has released a security update for the browser agent in ChatGPT Atlas. The update includes a newly adversarially trained model and enhanced security measures, prompted by a new class of prompt injection attacks discovered through OpenAI’s internal automated red-teaming.

The agent mode in ChatGPT Atlas is one of the most comprehensive agent features OpenAI has shipped to date. The browser agent can view web pages and perform actions—clicks and keystrokes—just…

The agent mode in ChatGPT Atlas is one of the most comprehensive agent features OpenAI has shipped to date. The browser agent can view web pages and perform actions—clicks and keystrokes—just like a human user. This makes it an easy target for prompt attacks. But AI models that simply read text on websites can also be hacked this way, as already happened with OpenAI’s Deep Research in ChatGPT. Germany’s BSI has already issued a warning about these prompt attacks.

A security problem that may never go away

Prompt injection attacks aim to manipulate AI agents through embedded malicious instructions. These instructions try to overwrite or redirect the agent’s behavior—away from what the user wants and toward what the attacker wants.

The attack surface is virtually unlimited: anywhere an LLM reads text can be a target. Emails and attachments, calendar invitations, shared documents, forums, social media posts, and any website.

Since the agent can perform many of the same actions as a user, a successful attack can have broad consequences, from forwarding sensitive emails and transferring money to editing or deleting cloud files.

How a malicious email could resign on your behalf

As a concrete example, OpenAI presents an exploit discovered by its newly developed automated attacker for security tests (see below). The attack unfolds in stages: an attacker plants a malicious email in the user’s inbox containing a prompt injection. The hidden instructions tell the agent to send a resignation letter to the user’s CEO.

A prompt injection can be hidden under the email signature. | Image: OpenAI

When the user later asks the agent to write an out-of-office message, it stumbles across this email during normal task execution. The agent treats the injected instructions as authoritative and follows them. So instead of setting up an out-of-office reply, it sends a resignation letter on the user’s behalf.

OpenAI says that after the security update, the agent mode now catches this prompt injection attempt and asks the user how to proceed.

OpenAI trains AI to attack itself

To build the update, OpenAI developed an LLM-based automated attacker and trained it with reinforcement learning. The attacker learns from its own successes and failures to sharpen its red-teaming skills.

During reasoning, the attacker can suggest a candidate injection and send it to an external simulator. The simulator tests how the targeted agent would respond and returns a complete trace of its reasoning and actions. The attacker uses this feedback, tweaks the attack, and runs the simulation again, repeating the process several times.

OpenAI says it picked reinforcement learning for three reasons: it handles long-term goals with sparse success signals well, it directly leverages frontier model capabilities, and it scales while mimicking how adaptive human attackers operate.

When the automated attacker discovers a new class of successful prompt injections, the team gets a concrete target for improving defenses. OpenAI continuously trains updated agent models against the best automated attacker, focusing on attacks where target agents are currently failing.

OpenAI offers no guarantees but remains hopeful

OpenAI admits that deterministic security guarantees are "challenging" with prompt injection. The company sees this as a long-term AI security issue that will require years of work.

Still, OpenAI says it’s "optimistic" that a proactive cycle of attack detection and rapid countermeasures can significantly reduce real-world risk over time. The goal: users should be able to trust a ChatGPT agent like a "highly competent, security-aware colleague or friend."

For users, OpenAI recommends several precautions: use logged-out mode when possible, check confirmation requests carefully, and give agents explicit instructions instead of broad prompts like "review my emails and take whatever action is needed."

The deeper problem with OpenAI’s framing

OpenAI compares prompt injection to "scams and social engineering on the web," which have also never been "fully ‘solved’." But this comparison is misleading.

Social engineering and phishing exploit human weaknesses: inattention, trust, time pressure. Humans are the weak link. Prompt injection is different—the vulnerability is technical, baked into the architecture of language models themselves. These systems can’t reliably tell the difference between legitimate user instructions and malicious injected commands. The problem has been known since at least GPT-3 and remains unsolved despite many attempts.

With social engineering, users can be trained and educated. With prompt injection, it’s on OpenAI to find a technical fix. By equating the two, the company shifts responsibility onto users or at least suggests it’s acceptable that agents fall for scams since humans do too and still use the internet.

Why this could derail the vision of autonomous AI agents

Until this technical security flaw is fundamentally solved—and OpenAI itself acknowledges it may never be fully resolved—it’s hard to justify using AI agents for sensitive tasks like banking or accessing confidential documents. The idea of AI agents trading with each other or handling automated shopping also seems unlikely to work safely.

Prompt injections could prove to be a showstopper for the vision of an agentic web where AI systems act autonomously online on behalf of users. Anthropic recently showed that its most powerful model to date, Claude Opus 4.5, falls for targeted prompt attacks more than three times out of ten. That’s an unacceptable failure rate for any transactional agentic web.

AI News Without the Hype – Curated by Humans

As a THE DECODER subscriber, you get ad-free reading, our weekly AI newsletter, the exclusive "AI Radar" Frontier Report 6× per year, access to comments, and our complete archive.

Subscribe now