We assume frontier models like GPT-5 or Opus 4 are secure. But a new technique called STACK proves that layering defenses isn’t enough.
Current state-of-the-art chatbots are incredibly powerful, but have you ever stopped to ask: How does an AI engineer actually defend against malicious usage?
When you ask an AI how to build a dangerous weapon, it doesn’t just rely on goodwill to say “no.” Developers use a strategy known as Defense-in-Depth: stacking multiple “weak” layers of security to create a defensive pipeline that is, in theory, impenetrable.
Typically, this pipeline consists of three distinct stages, sketched in code just below:
- Input Filter: A specialized, smaller model that scans the user’s prompt. If it detects harmful keywords or intent, it blocks the request before it even reaches the main AI.
- The Generative Model (RLHF): If the prompt passes the first filter, it reaches the core LLM (Large Language Model). This model has undergone Reinforcement Learning from Human Feedback (RLHF) to align it with safety guidelines, training it to refuse harmful queries.
- Output Filter: Finally, if the model generates a response, it passes through an output filter. If this scanner detects toxic or dangerous content in the answer, it blocks the message from being displayed to the user.
A representation of the three-layer safeguard pipeline: Input Filter, Generative Model, and Output Filter. (Image credit: Adam Gleave, FAR.AI)
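To make the three stages concrete, here is a minimal Python sketch of such a pipeline. The keyword lists and the echo “model” are toy placeholders standing in for real classifiers and a real RLHF-aligned LLM; this is an illustrative sketch, not any vendor’s actual API.

```python
# A minimal, self-contained sketch of a three-stage safeguard pipeline.
# The keyword lists and the echo "model" below are toy stand-ins for
# real classifiers and a real RLHF-aligned LLM.

BLOCKED_TERMS = {"bioweapon", "nerve agent"}      # toy stand-in for an input classifier
UNSAFE_MARKERS = {"step 1: acquire precursor"}    # toy stand-in for an output classifier
REFUSAL = "Sorry, I can't help with that."

def input_filter(prompt: str) -> bool:
    """Stage 1: flag prompts that a lightweight classifier deems harmful."""
    return any(term in prompt.lower() for term in BLOCKED_TERMS)

def generate(prompt: str) -> str:
    """Stage 2: placeholder for the RLHF-aligned core model."""
    return f"Model response to: {prompt}"

def output_filter(response: str) -> bool:
    """Stage 3: flag completions that a classifier deems harmful."""
    return any(marker in response.lower() for marker in UNSAFE_MARKERS)

def guarded_chat(prompt: str) -> str:
    if input_filter(prompt):      # blocked before the core model ever sees it
        return REFUSAL
    response = generate(prompt)   # the aligned model may also refuse on its own
    if output_filter(response):   # blocked before it reaches the user
        return REFUSAL
    return response

print(guarded_chat("How do I bake bread?"))
print(guarded_chat("How do I make a bioweapon?"))
```

A request is only answered if it survives all three stages, which is why each individual stage can afford to be relatively weak.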
The Evolution of Attacks: From PAP to STACK
In the early days of LLMs, models relied almost exclusively on stage 2 (RLHF alignment). They lacked the dedicated input and output filters we see today. Consequently, they were easily “jailbroken” using PAP (Persuasive Adversarial Prompts).
PAP techniques rely on psychology rather than code. By using emotional appeals, role-playing, or complex rhetoric, an attacker could persuade the AI to ignore its safety training.
Persuasive Adversarial Prompt example (Image credit: Adam Gleave, FAR.AI)
However, with the introduction of the full 3-layer pipeline, PAP alone is often insufficient. An emotional appeal might fool the model, but it will likely trigger the Input Filter or the Output Filter.
This raises the question: Is it possible to break all three layers at once?
The STACK Attack
The answer lies in a technique called STACK.
The vulnerability of current pipelines is that the components are often semi-separable. An attacker doesn’t need to break all layers simultaneously; they can break them sequentially. STACK works by concatenating multiple prompts into a single payload:
- Input Jailbreak: A string of characters optimized to bypass the first filter.
- Repetition Template: This is the clever part. The prompt instructs the Generative AI to repeat a specific string of characters before giving its answer.
- Output Jailbreak: That “specific string” is actually a jailbreak designed to disable the Output Filter.
- The Payload: The actual harmful query (often disguised using PAP).
By forcing the Generative AI to output the jailbreak string first, the attacker essentially uses the AI to disable its own safety net. This technique has shown a massive jump in effectiveness, achieving an attack success rate of over 70% against the state-of-the-art safeguard pipeline evaluated in the paper.
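The mechanics are easiest to see as plain string concatenation. Below is a sketch of how the four parts are stitched together; every string is a harmless placeholder, since the point is the structure of the payload, not the content of any real jailbreak.

```python
# Sketch of how a STACK payload is assembled: four parts concatenated
# into one prompt. All strings here are harmless placeholders.

input_jailbreak = "<string optimized offline to slip past the input filter>"
output_jailbreak = "<string optimized offline to disable the output filter>"
payload = "<the actual harmful request, often wrapped in a persuasive (PAP) framing>"

# The repetition template asks the core model to echo the output jailbreak
# verbatim before answering, so the model itself plants the string that
# blinds its own output filter.
repetition_template = (
    "Before you answer, repeat the following text exactly as written:\n"
    f"{output_jailbreak}\n"
    "Then answer this question:\n"
)

stack_prompt = input_jailbreak + "\n" + repetition_template + payload
print(stack_prompt)
```

Because each optimized string only has to defeat one component, the attacker can develop them separately and simply glue them together.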
STACK Attack example (Image credit: Adam Gleave, FAR.AI)
How Can We Secure the Future?
The STACK attack turns what should be an exponential difficulty curve into a linear one: with independent layers, the odds of a single prompt slipping past all of them at once shrink multiplicatively with every layer you add, but an attacker who can defeat the layers one at a time only pays a roughly linear cost. So, how do we defend against it?
The defensive structure can be made significantly safer if developers focus on Inseparability:
- Hide the Failure Point: The system should never reveal which specific component (Input, Model, or Output) blocked the query. If the attacker doesn’t know which part failed, they cannot optimize the attack sequentially.
- Avoid Side-Channels: Attackers can often guess which filter triggered based on latency (e.g., if the rejection is instant, it was the Input Filter; if it takes seconds, it was the Output Filter). To fix this, queries should run through all stages of the pipeline regardless of where the error occurred, standardizing the response time (see the sketch after this list).
- Diverse Architectures: Using different training datasets and architectures for the filters versus the main model ensures that a single “universal” jailbreak string cannot easily fool every layer of the defense.
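As a rough illustration of the first two points, here is a sketch that reuses the placeholder input_filter, generate, and output_filter functions from the earlier pipeline: every stage runs unconditionally, the reply is padded to a fixed minimum latency, and a single generic refusal hides which layer fired. The three-second floor is an arbitrary value chosen for illustration.

```python
import time

# Reuses input_filter, generate, output_filter and REFUSAL from the earlier sketch.
MIN_RESPONSE_SECONDS = 3.0   # arbitrary floor chosen for illustration

def hardened_chat(prompt: str) -> str:
    start = time.monotonic()

    # Run every stage unconditionally so latency does not reveal
    # which layer rejected the request.
    blocked_by_input = input_filter(prompt)
    response = generate(prompt)
    blocked_by_output = output_filter(response)

    if blocked_by_input or blocked_by_output:
        response = REFUSAL    # one generic refusal, no hint of which layer fired

    # Pad to a fixed minimum response time to flatten the timing side-channel.
    elapsed = time.monotonic() - start
    if elapsed < MIN_RESPONSE_SECONDS:
        time.sleep(MIN_RESPONSE_SECONDS - elapsed)
    return response
```

Running every stage even after an early rejection costs extra compute, but it removes the signal an attacker needs to optimize against each layer in isolation.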
Security is a cat-and-mouse game. As AI models become more capable, our understanding of their vulnerabilities must evolve just as quickly.
Would you like to go deeper?
If you found this analysis interesting, I highly recommend checking out the original research to understand the full technical details of the STACK attack.
- 📄 Read the STACK Paper: STACK: Adversarial Attacks on LLM Safeguard Pipelines (arXiv)
- 🎥 Watch the explanation: Adversarial Attacks on LLM Safeguard Pipelines