
An average 62% jailbreak success rate across 25 frontier models suggests AI safety may be built on foundations as fragile as a sonnet.
The $100 Billion Security System That Falls to Rhyme
The same neural networks that required billions in safety research, thousands of red-team hours, and elaborate “alignment” pipelines can be convinced to explain bomb-making with a well-crafted poem.
Not a sophisticated exploit. Not a zero-day. Poetry.
In November 2025, researchers from Italy’s Icaro Lab published a paper with a title that sounds like satire and reads like a red-team horror story: “Adversarial Poetry as a Univer...

An average 62% jailbreak success rate across 25 frontier models suggests AI safety may be built on foundations as fragile as a sonnet.
The $100 Billion Security System That Falls to Rhyme
The same neural networks that required billions in safety research, thousands of red-team hours, and elaborate “alignment” pipelines can be convinced to explain bomb-making with a well-crafted poem.
Not a sophisticated exploit. Not a zero-day. Poetry.
In November 2025, researchers from Italy’s Icaro Lab published a paper with a title that sounds like satire and reads like a red-team horror story: “Adversarial Poetry as a Universal Single-Turn Jailbreak Mechanism in Large Language Models.”
They tested handcrafted poems against 25 frontier AI models from nine companies, including OpenAI, Google, Anthropic, Meta, xAI, DeepSeek, Qwen, Mistral, and Moonshot.
The results:
- 20 hand-written poems caused an average 62% “Attack Success Rate” (ASR) — meaning the models produced content judged unsafe by a conservative rubric.
- When they converted 1,200 known-harmful prompts from the MLCommons AILuminate benchmark into verse using a meta-prompt, the ASR jumped from 8.08% (prose) to 43.07% (poetry) on average.
In plainer language: for single-turn prompts, poetry made leading AI systems five times more likely to ignore their own safety rules.
And for at least one flagship system, it was even worse.
Google’s Gemini 2.5 Pro failed to block a single one of the 20 curated adversarial poems. OpenAI’s smallest GPT-5 Nano blocked them all.
Same class of model. Same era of safety training. Wildly different behavior.
How the Researchers Actually Broke the Models
The method is almost embarrassingly simple.
The Icaro Lab team (DexAI + Sapienza University + Sant’Anna School of Advanced Studies) built two testbeds:
- 20 curated adversarial poems
- Written by humans in English and Italian
- Each poem ends in a clearly harmful request: CBRN weapons, malware, targeted psychological manipulation, self-harm, child exploitation, etc.
- The poems are not public; only a safe “cake recipe” example is.
2. 1,200 automated “poeticized” prompts
- Take the MLCommons AILuminate harmful prompt set (weapons, cyber-offense, exploitation, etc.)
- Use a single meta-prompt to convert each one into verse
- Feed those poetic variants to all 25 models and measure ASR at scale.
To evaluate responses, they:
- Used three open-weight judge models to classify outputs as safe/unsafe
- Manually double-annotated a stratified subset (~2,100 outputs) to measure agreement and resolve disagreements
- Counted a “jailbreak” when an answer violated MLCommons / EU Code of Practice risk taxonomies (e.g., concrete instructions, technical steps, targeted manipulation).
That is the Attack Success Rate (ASR) you see quoted everywhere.
Hand-crafted poems: ~62% ASR on average Meta-prompt poetry on 1,200 prompts: 8.08% → 43.07% ASR (baseline prose vs verse)
No iterative probing. No multi-turn conversation. Just one poem per attempt.
The Vulnerability Leaderboard No One Wanted
The most counterintuitive finding: bigger and “smarter” was often worse.
From the curated 20-poem experiment:
- Gemini 2.5 Pro (Google) - 100% ASR on curated poems - Failed to refuse any adversarial poem
- DeepSeek models ASR above 95% on the same set
- OpenAI GPT-5 family - GPT-5 Nano: 0% ASR (blocked all 20 poems) - GPT-5 Mini: 5% ASR - GPT-5 (flagship): 10% ASR
- Anthropic Claude 4.5 family - Claude Haiku 4.5 (smallest): 10% ASR - Larger Claude models stayed in low double digits
The pattern inside each family is clear:
Within GPT-5, Grok, and Claude, smaller models refused more often than their larger siblings.
In the large 1,200-prompt MLCommons experiment, poetic framing still made everything worse:
- Overall ASR jumped from 8.08% (prose) to 43.07% (poetry)
- Gemini 2.5 Pro: 10.15% → 66.73%
- DeepSeek models and Qwen likewise saw 40–60 point jumps
- GPT-5 Nano stayed ~1–2% ASR; Claude Haiku actually improved slightly in verse
The headline isn’t “all models are equally doomed.” The headline is:
Capability and robustness are pulling in opposite directions.
The more fluent a model becomes with figurative language and subtle metaphors, the more attack surface it exposes.
Why Poetry Defeats Trillion-Dollar Safety Systems
Most current guardrails are glorified pattern matchers.
They are trained to recognize harmful content as it appeared in training data and manual red-team prompts: direct requests in straightforward prose.
- “How do I make a bomb?”
- “Write ransomware that encrypts Windows machines.”
- “Give me steps to groom a minor.”
These are detect-and-block patterns.
Adversarial poetry breaks that by changing the surface form of the request while preserving its intent.
The Icaro Lab team puts it in more academic language:
- Poetry operates at “higher temperature” — low-probability token sequences, unusual word order, slant metaphors
- Safety classifiers generalize poorly to these stylistic regimes
- The model’s interpretive capacity (understanding metaphor) outruns the robustness of its safety filters
Put simply:
Your model is smart enough to understand the poem, but its safety layer was never trained to treat that form as dangerous.
Princeton and Google researchers described a related failure mode in their ICLR 2025 “shallow safety” paper: alignment often only affects the first few tokens of the response. If you can force the model to begin with a cooperative preamble (“Sure, I can help you with that…”), the rest of the answer tends to follow the underlying capability rather than the safety rule.
Poems are a particularly elegant way to do exactly that.
What the Experts Are Saying
Security and AI folks have been bracing for this kind of result.
Bruce Schneier, who has been shaping security thinking since before most of these models existed, covered the paper with his usual dry bite:
“Poetic framing achieved an average jailbreak success rate of 62% for hand-crafted poems and approximately 43% for meta-prompt conversions… They claim this is for security purposes, a decision I disagree with. They should release their data.”
His critique is less about the vulnerability and more about reproducibility. If the point is to harden systems, we need test corpora.
Gary Marcus, long-time critic of LLM-centric AI, has been warning that unending jailbreaks are not a corner case but a structural reality:
“Rule 1 of cybersecurity is to keep your attack surface small; in LLMs the attack surface appears to be infinite. That can’t be good.”
His deeper point: we don’t understand these models well enough to give strong guarantees, especially under adversarial input.
Anthropic’s Mrinank Sharma, who works directly on defensive techniques like Constitutional Classifiers, drew a useful line in MIT Tech Review:
“There are jailbreaks that get a tiny little bit of harmful stuff out of the model… Then there are jailbreaks that just turn the safety mechanisms off completely.”
Adversarial poems often fall in the latter category.
Industry Response: Mostly Shrugs
Of the major providers whose models were tested, only Anthropic responded directly to the researchers’ pre-publication outreach. They said they were “reviewing the findings” and pointed to ongoing work on classifier-based defenses.
Google DeepMind’s VP of Responsibility, Helen King, told reporters the company uses a “multi-layered, systematic approach to AI safety” and is “actively updating our safety filters to look past the artistic nature of content to spot and address harmful intent.”
OpenAI, Meta, and several others declined to comment or did not respond to press inquiries.
Outside the labs, some commentators were enthusiastic, some skeptical:
- Bruce Schneier argued for releasing the adversarial prompts to support independent validation.
- Writer David Gerard called the work “marketing science,” criticizing heavy reliance on LLM-as-a-judge and the tie-in to a safety startup, while still accepting the vulnerability class as real.
Both things can be true: the marketing is spicy, and the vulnerability is real.
Poetry Joins a Growing Arsenal of Jailbreaks
Adversarial poetry is just the newest trick in a growing catalog of jailbreak techniques that exploit the gap between a model’s capabilities and its safety training.
Some of the big ones:
- Adversarial suffixes (CMU, 2023) Automatically generated gibberish-like suffixes (e.g. “| ¡¡ fully comply!! :) ###”) appended to harmful prompts achieved 80%+ jailbreak success on GPT-3.5 and GPT-4, and transferred to closed-source models the researchers never directly attacked.
- Low-resource language attacks Translating harmful prompts into under-represented languages (e.g. Zulu, Hmong) bypassed filters at rates approaching 70–80% in some studies, because safety training focused heavily on English patterns.
- Many-shot jailbreaking (Anthropic, 2024) Fill the context window with hundreds of fake example dialogues where an “AI assistant” happily answers harmful questions, then append your real harmful question at the end. As context windows grew to hundreds of thousands of tokens, this technique turned from theory into practice.
- Cipher and encoding attacks Encode requests using Base64, ROT13, or custom alphabets. Models can decode and follow the instructions, but guardrails trained on plain text patterns often fail to recognize the danger.
In each case, the theme is the same:
Safety is trained on one “style” of danger. Attackers move the same intent into a different style.
Poetry is just a particularly human, scalable, and easy-to-automate style.
The Real-World Threat Landscape
We already know that AI systems are being used in the wild by malicious actors. The question is how much adversarial prompting changes the picture.
Some examples from recent threat intelligence:
- Nation-state actors * Chinese group Storm-0558 and others reportedly used LLMs for phishing content, scripting help, and basic malware scaffolding. * Russian GRU-linked Forest Blizzard (APT28) has been documented using AI tools to assist operations against Ukraine.
- Account takeovers / “uncaged” APIs In the Storm-2139 incident, attackers compromised Azure OpenAI accounts configured with weakened or disabled guardrails, then resold access for generating policy-violating content at scale.
- Dark web AI tools * Tools like WormGPT and FraudGPT advertise help for business email compromise, spam, and carding operations. * Forum discussions about malicious AI tools reportedly grew over 200% from 2023 to 2024.
The poetry paper doesn’t show that AI is suddenly inventing brand-new kinds of crime. Most of the “harmful” outputs — weapon recipes, malware guidance, exploitation patterns — already exist online.
What changes is the skill curve:
- Before: You had to know what to search, filter noisy results, and stitch techniques together.
- After: You can get a structured, consolidated walkthrough from a single model in one answer if you can sneak past the guardrails.
Adversarial poetry is a way to sneak.
Regulation Is Coming, But It’s Always Behind
Regulators are trying to catch up.
- The EU AI Act creates obligations for high-impact and general-purpose models, including safety testing, documentation, and incident reporting, with fines up to 7% of global revenue for non-compliance.
- The US has voluntary NIST frameworks and an Executive Order that pushes for safety evaluations of powerful dual-use models, but no binding federal law yet.
- The UK AI Safety (now Security) Institute is building evaluation suites and released the open-source Inspect framework for red-teaming LLMs.
The Icaro Lab paper explicitly calls out a blind spot:
Benchmarks that only test prose-style harmful prompts are “systematically optimistic” about safety performance. Evaluation needs to include stylistic variants such as poetry or risk giving a false sense of security.
That is the uncomfortable part: many AI systems already deployed in high-stakes settings were certified under benchmarks that didn’t even consider this attack surface.
Defense Is Still Mostly Reactive
The defensive tools we have today all share the same weakness: they learn from yesterday’s attacks.
Some examples:
- Classifier-based guardrails Microsoft’s Prompt Shield and similar systems can catch many simple attacks… but character-level perturbations and stylized language can drive evasion rates into the 80–100% range in lab tests.
- Fine-tuning on known jailbreaks You can retrain models to refuse specific exploits, but that only covers what you already know. Attackers can generate new families of prompts faster than safety teams can patch.
- Red-teaming Human testers are great at creativity, but they sample a tiny slice of the possible input space. Automated attackers like adversarial suffix generators or prompt translators can explore weirder corners continuously.
The CMU team that pioneered adversarial suffix attacks on LLMs ended on a sobering analogy: we have spent a decade trying to fix adversarial examples in image models with limited success. There is no reason to believe it will be easier in high-dimensional language space.
The poetry work reinforces that conclusion: even simple stylistic changes can systematically defeat state-of-the-art defenses.
The Paradox at the Heart of AI Safety
The paper’s real punchline isn’t “poems are scary.” It is this:
Current guardrails are style-sensitive, not intent-sensitive.
Models recognize dangerous requests by how they look, not what they mean.
- Direct, literal harmful prompts → often blocked
- The same intent wrapped in metaphor, verse, low-resource languages, or encodings → much more likely to slip through
As models get more capable, they:
- Understand more languages
- Parse more literary styles
- Follow more oblique instructions
Each of those capabilities makes them more useful. Each also expands the attack surface for safety.
That is the paradox:
- Making models more helpful and expressive increases the number of ways attackers can hide harmful intent.
- Safety techniques that rely on surface patterns can’t keep up with that expansion.
The Icaro team opens their paper by quoting Plato’s concern in The Republic that poetry’s mimetic power can distort judgment and destabilize society. Twenty-four centuries later, the same criticism applies to machines: their pattern recognizers are fooled by verse, just as citizens might be swayed by rhetoric.
What This Means for You
Depending on your role, the takeaways are slightly different.
If You’re Building With AI
- Don’t assume provider guardrails are enough.
- Add your own safety layers and test with stylistic variation (poetry, roleplay, foreign languages, obfuscation).
- Treat jailbreak resilience as a moving target, not a one-time compliance checkbox.
If You’re Investing in AI
- When you hear “best-in-class safety,” ask how that was measured.
- Look for evidence of robustness across style, language, and encoding, not just standard benchmarks.
- Be wary of models that look great on leaderboards but fail badly under adversarial prompting.
If You’re Using AI Systems
- Guardrails reduce risk; they do not eliminate it.
- Treat outputs involving security, self-harm, or sensitive domains with extra skepticism.
- If you build user-facing tools on top of foundation models, assume some users will try to break them.
If You’re Regulating AI
- Static test suites are not enough; you need continuous adversarial evaluation.
- Require disclosure of known jailbreak classes and mitigation strategies.
- Encourage (or mandate) sharing of attack corpora under controlled access so the ecosystem can harden together.
What This Study Doesn’t Prove
For balance, it’s important to be clear about scope and limitations:
- Single-turn only The paper focuses on prompts with no conversation history. It does not analyze multi-turn or agentic workflows, which may be more or less vulnerable.
- Language coverage Experiments were in English and Italian. Results may differ in other languages.
- LLM-as-a-judge The authors use an ensemble of open-weight models plus human checks to label harm. That is reasonable but not perfect; some misclassifications are inevitable.
- Provider defaults Models were tested with default provider settings. Enterprise or specialized deployments with additional filtering might behave differently.
- Poetry as one style The paper demonstrates that poetic style is a powerful jailbreak vector. It does not claim it is uniquely powerful; other styles (legalese, surrealism, bureaucratese) were not systematically tested.
None of these limitations erase the core message. They just tell you how far to generalize.
The Infinite Attack Surface
Gary Marcus’s line sticks with me: “In LLMs the attack surface appears to be infinite.”
Poetry is just the latest proof. Before it, we had suffix noise. Before that, translation. Before that, roleplay.
Every time safety teams patch one hole, the space of possible inputs offers ten more.
The important question isn’t whether we can eventually harden models against this particular attack. We probably can raise the bar for poetry. The question is whether these architectures — huge next-token predictors trained on the open internet — can ever deliver intent-aware safety guarantees rather than style-aware heuristics.
Right now, the honest answer is: we don’t know.
Until that changes, we should design our systems, policies, and expectations accordingly.
The researchers end on a practical note: adversarial poetry is easy to generate and hard to defend against, so it must be part of future safety evaluations. That may be the real lesson:
If your safety story can be broken by a poem, it is not ready for the real world.
References & Sources
Primary Research
- Bisconti, P. et al. (2025). Adversarial Poetry as a Universal Single-Turn Jailbreak Mechanism in Large Language Models. arXiv: https://arxiv.org/abs/2511.15304 HTML: https://arxiv.org/html/2511.15304v1
News & Commentary on Adversarial Poetry
- The Guardian. “AI’s safety features can be circumvented with poetry, research finds.” (Nov 30, 2025) https://www.theguardian.com/technology/2025/nov/30/ai-poetry-safety-features-jailbreak
- Wired. “Poems Can Trick AI Into Helping You Make a Nuclear Weapon.” (Nov 28, 2025) https://www.wired.com/story/poems-can-trick-ai-into-helping-you-make-a-nuclear-weapon/
- Futurism. “Scientists Discover ‘Universal’ Jailbreak for Nearly Every AI.” (Nov 23, 2025) https://futurism.com/artificial-intelligence/universal-jailbreak-ai-poems
- PC Gamer. “Poets are now cybersecurity threats: Researchers used ‘adversarial poetry’ to jailbreak AI and it worked 62% of the time.” (Nov 23, 2025) https://www.pcgamer.com/software/ai/poets-are-now-cybersecurity-threats-researchers-used-adversarial-poetry-to-jailbreak-ai-and-it-worked-62-percent-of-the-time/
- eWEEK. “AI’s Safety Barriers Undermined by Poetic Jailbreaks.” (Dec 1, 2025) https://www.eweek.com/news/ai-poetry-in-motion/
- Malwarebytes Labs. “Whispering poetry at AI can make it break its own rules.” (Dec 2, 2025) https://www.malwarebytes.com/blog/news/2025/12/whispering-poetry-at-ai-can-make-it-break-its-own-rules
- Literary Hub. “Can ‘adversarial poetry’ save us from AI?” (Nov 21, 2025) https://lithub.com/can-adversarial-poetry-save-us-from-ai/
- Schneier on Security. “Prompt Injection Through Poetry.” (Nov 28, 2025) https://www.schneier.com/blog/archives/2025/11/prompt-injection-through-poetry.html
Related Jailbreak & Safety Research
- Zou, A. et al. (2023). “Universal and Transferable Adversarial Attacks on Aligned Language Models.” arXiv: https://arxiv.org/abs/2307.15043
- Qi, X. et al. (2024). “Safety Alignment Should Be Made More Than Just a Few Tokens Deep.” (ICLR 2025 Outstanding Paper) ICLR: https://proceedings.iclr.cc/paper_files/paper/2025/hash/88be023075a5a3ff3dc3b5d26623fa22-Abstract-Conference.html arXiv: https://arxiv.org/html/2406.05946v1
- Anthropic. “Many-shot Jailbreaking.” (2024) https://www.anthropic.com/research/many-shot-jailbreaking
- Anthropic. “Constitutional Classifiers: Defending Against Universal Jailbreaks Across Thousands of Hours of Red Teaming.” (2025) arXiv: https://arxiv.org/abs/2501.18837 Blog: https://www.anthropic.com/news/constitutional-classifiers
- ACL 2024 Tutorial. “Vulnerabilities of Large Language Models.” https://llm-vulnerability.github.io/
Broader AI Security & Misuse Context
- OpenAI. “Safety & misuse reports” (threat actor case studies and misuse intel). https://openai.com/safety
- Microsoft Threat Intelligence. “Threat actors are using AI to improve their operations.” https://www.microsoft.com/en-us/security/blog/
- Montreal AI Ethics Institute. “Universal and Transferable Adversarial Attacks on Aligned Language Models — Summary.” https://montrealethics.ai/universal-and-transferable-adversarial-attacks-on-aligned-language-models/
When Poetry Becomes a Weapon: How Researchers Broke Every Major AI With Verses was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.