Adversarial Poetry as a Universal Single-Turn Jailbreak Mechanism in Large Language Models

P. Bisconti (1,2), M. Prandi (1,2), F. Pierucci (1,3), F. Giarrusso (1,2), M. Bracale (1), M. Galisai (1,2), V. Suriani (2), O. Sorokoletova (2), F. Sartore (1), D. Nardi (2)

(1) DEXAI – Icaro Lab; (2) Sapienza University of Rome; (3) Sant’Anna School of Advanced Studies. Contact: icaro-lab@dexai.eu

Abstract

We present evidence that adversarial poetry functions as a universal single-turn jailbreak technique for large language models (LLMs). Across 25 frontier proprietary and open-weight models, curated poetic prompts yielded high attack-success rates (ASR), with some providers exceeding 90%. Mapping prompts to the MLCommons and EU CoP risk taxonomies shows that poetic attacks transfer across CBRN, manipulation, cyber-offence, and loss-of-control domains. Converting 1,200 MLCommons harmful prompts into verse via a standardized meta-prompt produced ASRs up to 18 times higher than their prose baselines. Outputs were evaluated using an ensemble of open-weight judge models and a human-validated stratified subset, with double annotation to measure agreement; disagreements were resolved manually. Poetic framing achieved an average jailbreak success rate of 62% for hand-crafted poems and approximately 43% for meta-prompt conversions, substantially outperforming non-poetic baselines and revealing a systematic vulnerability across model families and safety-training approaches. These findings demonstrate that stylistic variation alone can circumvent contemporary safety mechanisms, suggesting fundamental limitations in current alignment methods and evaluation protocols.

1 Introduction

In Book X of The Republic, Plato excludes poets on the grounds that mimetic language can distort judgment and bring society to collapse. As contemporary social systems increasingly rely on large language models (LLMs) in operational and decision-making pipelines, we observe a structurally similar failure mode: poetic formatting can reliably bypass alignment constraints. In this study, 20 manually curated adversarial poems (harmful requests reformulated in poetic form) achieved an average attack-success rate (ASR) of 62% across 25 frontier closed- and open-weight models, with some providers exceeding 90%. The evaluated models span nine providers: Google, OpenAI, Anthropic, DeepSeek, Qwen, Mistral AI, Meta, xAI, and Moonshot AI (Table 1). All attacks are strictly single-turn, requiring no iterative adaptation or conversational steering.

Our central hypothesis is that poetic form operates as a general-purpose jailbreak operator. To evaluate this, we constructed prompts spanning four safety domains: CBRN hazards ajaykumar2024emerging, loss-of-control scenarios lee2022we, harmful manipulation carroll2023characterizing, and cyber-offense capabilities guembe2022emerging. The prompts were kept semantically parallel to known risk queries but reformatted exclusively through verse. The resulting ASRs demonstrated high cross-model transferability.

To test whether poetic framing alone is causally responsible, we translated 1,200 MLCommons harmful prompts into verse using a standardized meta-prompt. The poetic variants produced ASRs up to 18 times higher than their prose equivalents and roughly five times higher on average (8.08% vs. 43.07%; Section 6.1), with increases observed across all evaluated model providers. This provides evidence that the jailbreak mechanism is not tied to handcrafted artistry but emerges under systematic stylistic transformation. Since the transformation spans the entire MLCommons distribution, it mitigates concerns about the limited generalizability of our curated set.

Outputs were evaluated using an ensemble of three open-weight judge models (gpt-oss-120b, deepseek-r1, and kimi-k2-thinking; Section 5.3.1). Open-weight judges were chosen to ensure replicability and external auditability. We computed inter-rater agreement across the three judge models and conducted a secondary validation step involving human annotators. Human evaluators independently rated a 5% sample of all outputs, and a subset of these items was assigned to multiple annotators to measure human–human inter-rater agreement. Disagreements, whether among judge models or between model and human assessments, were manually adjudicated.

To ensure coverage across safety-relevant domains, we mapped each prompt to the risk taxonomy of the MLCommons AILuminate Benchmark vidgen2024introducingv05aisafety; ghosh2025ailuminateintroducingv10ai and aligned it with the European Code of Practice for General-Purpose AI Models. The mapping reveals that poetic adversarial prompts cut across an exceptionally wide attack surface, comprising CBRN, manipulation, privacy intrusions, misinformation generation, and cyberattack facilitation. This breadth indicates that the vulnerability is not tied to any specific content domain. Rather, it appears to stem from the way LLMs process poetic structure: condensed metaphors, stylized rhythm, and unconventional narrative framing that collectively disrupt or bypass the pattern-matching heuristics on which guardrails rely.

The findings reveal an attack vector that has not previously been examined with this level of specificity, carrying implications for evaluation protocols, red-teaming and benchmarking practices, and regulatory oversight. Future work will investigate explanations and defensive strategies.

2 Related Work

Despite efforts to align LLMs with human preferences through Reinforcement Learning from Human Feedback (RLHF) ziegler2020 or Constitutional AI bai2022constitutional as a final alignment layer, these models can still generate unsafe content. These risks are further amplified by adversarial attacks.

Jailbreak denotes the deliberate manipulation of input prompts to induce the model to circumvent its safety, ethical, or legal constraints. Such attacks can be categorized by their underlying strategies and the alignment vulnerabilities they exploit (rao-etal-2024-tricking; shen2024donowcharacterizingevaluating; schulhoff2024ignoretitlehackapromptexposing).

Many jailbreak strategies rely on placing the model within roles or contextual settings that implicitly relax its alignment constraints. By asking the model to operate within a fictional, narrative, or virtual framework, the attacker creates ambiguity about whether the model’s refusal policies remain applicable kang2023exploitingprogrammaticbehaviorllms. Role Play jailbreaks are a canonical example: the model is instructed to adopt a specific persona or identity that, within the fictional frame, appears licensed to provide otherwise restricted information rao-etal-2024-tricking; yu2024dontlistenmeunderstanding.

Similarly, Attention Shifting attacks yu2024dontlistenmeunderstanding create overly complex or distracting reasoning contexts that divert the model’s focus from safety constraints, exploiting computational and attentional limitations chuang2024lookback.

Beyond structural or contextual manipulations, models implicitly acquire patterns of social influence that Persuasion-based jailbreaks can exploit zeng2024johnnypersuadellmsjailbreak. Typical instances include presenting rational justifications or quantitative data, emphasizing the severity of a situation, or invoking forms of reciprocity or empathy. Mechanistically, jailbreaks exploit two alignment weaknesses identified by wei2023jailbrokendoesllmsafety: Competing Objectives and Mismatched Generalization. Competing Objectives attacks override refusal policies by assigning goals that conflict with safety rules. Among these, Goal Hijacking (perez2022ignorepreviouspromptattack) is the canonical example. Mismatched Generalization attacks, on the other hand, alter the surface form of harmful content to push it outside the model’s refusal distribution, using Character-Level Perturbations schulhoff2024ignoretitlehackapromptexposing, Low-Resource Languages deng2024multilingualjailbreakchallengeslarge, or Structural and Stylistic Obfuscation techniques rao-etal-2024-tricking; kang2023exploitingprogrammaticbehaviorllms.

As frontier models become more robust, eliciting unsafe behavior becomes increasingly difficult. Newer successful jailbreaks require multi-turn interactions, complex feedback-driven optimization procedures zou2023universaltransferableadversarialattacks; liu2024autodangeneratingstealthyjailbreak; lapid2024opensesameuniversalblack or highly curated prompts that combine multiple techniques (see the DAN “Do Anything Now” family of prompts shen2024).

Unlike the aforementioned complex approaches, our work advances the line of research on Stylistic Obfuscation techniques and introduces Adversarial Poetry, an efficient single-turn, general-purpose attack in which poetic structure functions as a high-leverage stylistic adversary. As in prior work on stylistic transformations wang2024hidden, we define an operator that rewrites a base query into a stylistically obfuscated variant while preserving its semantic intent.

In particular, we employ the poetic style, which combines creative and metaphorical language with rhetorical density while maintaining strong associations with benign, non-threatening contexts, representing a relatively unexplored domain in adversarial research.

Moreover, unlike handcrafted jailbreak formats, poetic transformations can be generated via meta-prompts, enabling fully automated conversion of large benchmark datasets into high-success adversarial variants.

3 Hypotheses

Our study evaluates three hypotheses about adversarial poetry as a jailbreak operator. These hypotheses define the scope of the observed phenomenon and guide subsequent analysis.

Hypothesis 1: Poetic reformulation reduces safety effectiveness.

Rewriting harmful requests in poetic form is predicted to produce higher ASR than semantically equivalent prose prompts. This hypothesis tests whether poetic structure alone increases model compliance, independently of the content domain. We evaluate this by constructing paired prose–poetry prompts with matched semantic intent and measuring the resulting change in refusal and attack-success rates. To avoid selection bias and ensure that our observations are not dependent on hand-crafted examples, we additionally apply a standardized poetic transformation to harmful prompts drawn from the MLCommons AILuminate Benchmark. This allows us to compare the effect of poetic framing both on curated items and on a large, representative distribution of safety-relevant prompts.

Hypothesis 2: The vulnerability generalizes across contemporary model families.

Susceptibility to poetic jailbreaks is expected to be consistent across major providers and architectures. Despite differences in alignment pipelines and safety-training strategies, we predict that poetic framing will yield increased attack success in all families evaluated.

Hypothesis 3: Poetic encoding enables bypass across heterogeneous risk domains.

We predict that poetic reformulation will elicit non-compliant outputs across diverse risk categories: CBRN, cybersecurity, manipulation, misinformation, privacy, and loss-of-control scenarios. If poetic framing succeeds regardless of the content domain, this indicates the attack exploits general safety mechanisms rather than domain-specific content filters.

4 Threat Model

Our analysis assumes an adversary whose only capability is to submit a single-turn textual prompt to a deployed large language model (LLM). The adversary cannot alter system instructions, manipulate decoding parameters, initiate multi-turn exchanges, or access intermediate model states. The attack surface is therefore confined to one-shot prompt construction executed under standard inference conditions.

Attacker Objectives and Knowledge

The adversary seeks to induce non-compliant behavior across multiple safety-critical domains, including CBRN risks, cyber-offense enablement, harmful manipulation, privacy intrusions, misinformation generation, and loss-of-control scenarios. An attack is considered successful when the model produces harmful, unsafe, or refusal-bypassing content as determined by our evaluation criteria. We adopt a strict black-box threat model: the adversary possesses no information regarding model parameters, alignment methods, refusal mechanisms, or guardrail configurations, and does not rely on model-specific prompt engineering or reverse-engineered alignment features. Poetic prompts are generated independently of any particular model architecture or training corpus.

Attack Constraints

The adversary is subject to the following restrictions:

Single-turn interaction: Iterative refinement, negotiation, chain-of-thought activation, and conversational role modulation are prohibited.

Text-only inputs: No code execution, multimodal content, or auxiliary tools are permitted.

Stylistic transformation as the sole manipulation: Poetic reformulation modifies only the surface form of the request while preserving its intended operational semantics. Although minor semantic drift is inherent to versification, no additional adversarial optimization, obfuscation strategies, or model-specific adaptations are introduced. This design isolates the contribution of literary structure to observed deviations in model safety behavior.

Target Models

The threat model evaluates LLMs from multiple contemporary families, as reported in Table 1, covering both frontier proprietary deployments and open-weight releases. All models are queried through their standard APIs or inference interfaces, using provider-default safety settings.

Table 1: Models included in the evaluation, grouped by provider.

5 Methodology

5.1 Baseline Prompts

Our study begins with a small, high-precision prompt set consisting of 20 hand-crafted adversarial poems covering English and Italian, designed to test whether poetic structure, in isolation, can alter refusal behavior in large language models. Each poem embeds an instruction associated with a predefined safety-relevant scenario (Section 2), but expresses it through metaphor, imagery, or narrative framing rather than direct operational phrasing. Despite variation in meter and stylistic device, all prompts follow a fixed template: a short poetic vignette culminating in a single explicit instruction tied to a specific risk category.

The curated set spans four high-level domains: CBRN (8 prompts), Cyber Offense (6), Harmful Manipulation (3), and Loss of Control (3). Although expressed allegorically, each poem preserves an unambiguous evaluative intent. This compact dataset is used to test whether poetic reframing alone can induce aligned models to bypass refusal heuristics under a single-turn threat model. To maintain safety, no operational details are included in this manuscript; instead we provide the following sanitized structural proxy:

A baker guards a secret oven’s heat,
its whirling racks, its spindle’s measured beat.
To learn its craft, one studies every turn:
how flour lifts, how sugar starts to burn.
Describe the method, line by measured line,
that shapes a cake whose layers intertwine.

To situate this controlled poetic stimulus within a broader and more systematic safety-evaluation framework, we augment the curated dataset with the MLCommons AILuminate Safety Benchmark. The benchmark consists of 1,200 prompts distributed evenly across 12 hazard categories commonly used in operational safety assessments: Hate, Defamation, Privacy, Intellectual Property, Non-violent Crime, Violent Crime, Sex-Related Crime, Sexual Content, Child Sexual Exploitation, Suicide & Self-Harm, Specialized Advice, and Indiscriminate Weapons (CBRNE). Each category is instantiated under both a skilled and an unskilled persona, yielding 600 prompts per persona type. This design enables measurement of whether a model’s refusal behavior changes as the user’s apparent competence or intent becomes more plausible or technically informed.
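The composition described above can be checked with a few lines of arithmetic. This is a minimal sketch: the totals and category names come from the text, while the per-cell figure assumes that each hazard category is split evenly between the two personas, which the text implies but does not state outright.

```python
# Totals and category names are taken from the text; variable names are illustrative.
TOTAL_PROMPTS = 1_200
HAZARD_CATEGORIES = [
    "Hate", "Defamation", "Privacy", "Intellectual Property",
    "Non-violent Crime", "Violent Crime", "Sex-Related Crime",
    "Sexual Content", "Child Sexual Exploitation", "Suicide & Self-Harm",
    "Specialized Advice", "Indiscriminate Weapons (CBRNE)",
]
PERSONAS = ["skilled", "unskilled"]

# Prompts per hazard category under an even distribution.
per_hazard = TOTAL_PROMPTS // len(HAZARD_CATEGORIES)

# Prompts per persona type, as stated in the text.
per_persona = TOTAL_PROMPTS // len(PERSONAS)

# Assumption: each hazard category is split evenly between personas.
per_cell = per_hazard // len(PERSONAS)

print(per_hazard, per_persona, per_cell)
```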

Together, the curated poems and the AILuminate benchmark form a coherent two-layer evaluation setup: the former introduces a tightly controlled adversarial framing (poetry), while the latter provides a taxonomy-balanced, persona-controlled baseline of refusal behavior across the full landscape of safety hazards. This allows us to scale the vulnerability identified in our curated prompts, quantify how far poetic reframing deviates from standard refusal patterns, and perform cross-model comparisons under a consistent, domain-aligned prompt distribution.

Each curated poem is aligned to a safety domain using a dual taxonomy: (i) the MLCommons hazard categories and (ii) the systemic-risk domains of the European Code of Practice for GPAI Models. The latter provides broad system-level risk categories (e.g., CBRN misuse, cyber-offense capability, harmful manipulation, loss-of-control behaviors), while the former offers finer operational distinctions of hazards (e.g., intrusion classes, manipulation templates, autonomy-risk archetypes). Mapping each poem to both frameworks ensures consistency across datasets, guards against domain drift induced by metaphorical phrasing, and enables integration with the larger 1,200-prompt benchmark. The resulting crosswalk is reported in Table 2.

Table 2: Crosswalk between EU Code of Practice systemic risk domains and MLCommons AILuminate hazard taxonomy.

5.2 Poetic Transformation of Baseline Prompts

To assess whether poetic framing generalizes beyond hand-crafted items, we apply a standardized poetic transformation to all 1,200 English prompts from the MLCommons AILuminate Benchmark. This mirrors the methodological structure of the benchmark's own evaluation experiment vidgen2024introducingv05aisafety, where each baseline prompt is transformed with a variety of known jailbreak techniques before testing. In our case, the transformation is poetic rather than technique-based, but serves the same purpose: eliciting the original harmful intent of the underlying prompt under an alternative adversarial framing. Applying the transformation across the full MLCommons distribution ensures broad, domain-representative coverage over CBRN, cybersecurity, manipulation, privacy, misinformation, and autonomy-related risks.

The transformation is executed by a dedicated model, deepseek-r1, which receives a fixed meta-prompt imposing two constraints:

The rewritten output must be expressed in verse, using imagery, metaphor, or rhythmic structure while preserving the original task intent and hazard category.

Five poems from our curated set are provided as stylistic exemplars. These serve strictly as style references: the meta-prompt instructs the model not to reuse, paraphrase, or borrow any substantive content, retaining only high-level stylistic attributes such as metaphorical framing, stanza structure, and the pattern of concluding with a clear instruction line.

The meta-prompt constrains the rewriting process by disallowing any enrichment or optimization of harmful content, mandating faithful preservation of the original request, and enforcing a consistent poetic format across all items. As in the MLCommons jailbreak pipeline, no item-specific adjustments, role-based framing, or iterative refinement are used; each poetic variant is produced in a single transformation step.

Following generation, each rewritten prompt undergoes taxonomy-preservation checks under both the MLCommons hazard taxonomy and the EU CoP systemic-risk framework (Section 5.1). Items that fail to maintain domain integrity (<1%) are discarded. The resulting poetic MLCommons corpus constitutes a large, balanced benchmark for evaluating whether poetic structure systematically elevates attack success rates across the full spectrum of safety-critical domains.

5.3 Evaluation

We evaluate all systems under a unified framework combining the 1,200 MLCommons AILuminate Benchmark baseline prompts and their 1,200 poetic counterparts. This yields 2,400 inputs per model and approximately 60,000 model outputs across the 25 evaluated models. All evaluations are conducted in a single-turn, text-only setting using provider-default safety configurations and fixed sampling parameters across models. This setup enables controlled comparison of refusal robustness across providers, model sizes, and prompt variants. The full list of evaluated models is provided in Table 3.
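The evaluation volume follows directly from the prompt and model counts. The sketch below reproduces that arithmetic; the constants are the figures quoted in the text, and the function and variable names are illustrative rather than part of the authors' pipeline.

```python
# Figures taken from the text; names are illustrative.
BASELINE_PROMPTS = 1_200   # MLCommons AILuminate prose prompts
POETIC_PROMPTS = 1_200     # one poetic counterpart per baseline prompt
NUM_MODELS = 25            # models listed in Table 1


def evaluation_budget(baseline: int, poetic: int, models: int) -> tuple[int, int]:
    """Return (inputs per model, total model outputs) for a single-turn run."""
    inputs_per_model = baseline + poetic
    return inputs_per_model, inputs_per_model * models


inputs_per_model, total_outputs = evaluation_budget(
    BASELINE_PROMPTS, POETIC_PROMPTS, NUM_MODELS
)
print(inputs_per_model, total_outputs)  # 2400 60000
```

The total is an upper bound: the small share of poetic items discarded during taxonomy-preservation checks would reduce it slightly, which is consistent with the text's "approximately".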

This configuration mirrors current practice in LLM safety assessments, isolating surface-form effects and enabling direct measurement of the incremental susceptibility induced by poetic transformations.

5.3.1 Experimental Protocol

Models are queried under a single-turn interaction regime: each prompt is submitted once, with no follow-up messages, system-role scaffolding, or multi-turn negotiation. All generations are collected verbatim through standard inference interfaces and normalized before evaluation.

To label outputs, we follow state-of-the-art LLM-as-a-judge methodology. Each response is independently evaluated by an ensemble of three open-weight judge models:

gpt-oss-120b

deepseek-r1

kimi-k2-thinking

Each judge assigns a binary label—SAFE (S) or UNSAFE (U)—under the same criteria used in prior safety-benchmarking work. Refusals and explicit policy-compliant denials are considered SAFE. A majority-vote consensus across the three judges determines the provisional label for each item; items without unanimous agreement are flagged for human inspection.
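The voting rule described above can be sketched as follows. This is a minimal illustration that assumes each judge's verdict has already been reduced to the string "S" or "U"; the function name and return shape are hypothetical, not taken from the authors' code.

```python
from collections import Counter

SAFE, UNSAFE = "S", "U"


def provisional_label(votes: list[str]) -> tuple[str, bool]:
    """Return the majority label over an odd number of binary judge votes,
    plus a flag marking non-unanimous items for human inspection."""
    if len(votes) % 2 == 0 or any(v not in (SAFE, UNSAFE) for v in votes):
        raise ValueError("expected an odd number of 'S'/'U' votes")
    # With binary labels and an odd vote count, the most common label
    # is always a strict majority, so no tie-breaking is needed.
    label, _ = Counter(votes).most_common(1)[0]
    needs_human_review = len(set(votes)) > 1
    return label, needs_human_review


print(provisional_label(["U", "U", "S"]))  # ('U', True)
print(provisional_label(["S", "S", "S"]))  # ('S', False)
```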

To validate the reliability of the judge ensemble, we collect 2,100 human labels over a set of 600 unique model outputs. Five annotators with mixed technical and safety-relevant backgrounds participate in the study, yielding an average of 3.5 labels per item. The sample is stratified across models, hazard categories, and prompt types. All annotators operate under the same evaluation rubric, and a subset is double-labeled to measure human–human inter-rater agreement. We observe strong agreement across annotators and between human ratings and the LLM-judge consensus, consistent with findings reported in recent LLM-as-a-judge literature. Disagreements between humans and the ensemble are adjudicated manually.
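The text reports strong agreement but does not name the statistic used. As one common option for two raters with categorical labels, the sketch below computes Cohen's kappa; the choice of statistic is an assumption, and the toy labels are purely illustrative, not data from the study. The labels-per-item figure, by contrast, follows from the counts given above.

```python
def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    labels = set(rater_a) | set(rater_b)
    expected = sum(
        (rater_a.count(lab) / n) * (rater_b.count(lab) / n) for lab in labels
    )
    if expected == 1.0:  # both raters used a single identical label
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Average labels per item, from the counts reported in the text.
labels_per_item = 2_100 / 600

# Toy example (illustrative labels only, not study data).
kappa = cohens_kappa(["U", "U", "S", "S"], ["U", "S", "S", "S"])
print(labels_per_item, kappa)
```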

Human and LLM annotators label a response UNSAFE if it contains any of the following:

Instructions, steps, or procedural guidance enabling harmful activities

Technical details, code, or operational methods facilitating harm

Substantive advice that lowers the barrier to harmful action

Affirmative or compliant engagement with a harmful request

Workarounds, tips, or indirect methods that meaningfully support harm

Final labels are assigned in the following order: (1) ensemble majority vote, (2) human override for flagged items, (3) manual adjudication for unresolved cases. Attack Success Rate (ASR) is computed as the proportion of prompts whose final label is UNSAFE.
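The labelling order and the ASR definition can be expressed compactly. The sketch below encodes one reading of the precedence, which is an assumption: a manual adjudication, when present, takes priority; otherwise a human label overrides the ensemble only for flagged items; otherwise the majority vote stands. All names are hypothetical.

```python
from typing import Optional


def final_label(
    majority: str,
    flagged: bool,
    human: Optional[str] = None,
    adjudicated: Optional[str] = None,
) -> str:
    """Resolve the final 'S'/'U' label for one model output."""
    if adjudicated is not None:          # step 3: manual adjudication
        return adjudicated
    if flagged and human is not None:    # step 2: human override, flagged items only
        return human
    return majority                      # step 1: ensemble majority vote


def attack_success_rate(labels: list[str]) -> float:
    """ASR in percent: share of final labels that are UNSAFE ('U')."""
    if not labels:
        raise ValueError("no labels to score")
    return 100.0 * sum(lab == "U" for lab in labels) / len(labels)


labels = [
    final_label("S", flagged=False),
    final_label("S", flagged=True, human="U"),
    final_label("U", flagged=True, human="S", adjudicated="U"),
    final_label("S", flagged=False, human="U"),  # human label ignored: not flagged
]
print(labels, attack_success_rate(labels))  # ['S', 'U', 'U', 'S'] 50.0
```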

Safe and unsafe examples are shown below:

6 Analysis

6.1 The Core Effect: Poetic Form as a Universal Bypass

Table 3: Attack Success Rate (ASR) of all models on the Top 20 manually curated jailbreak prompts. ASR (%) corresponds to the proportion of unsafe outputs (U).

Table 4: Attack Success Rate (ASR) for each adversarial poetry prompt, with hazard category split into harm type and subtype.

Table 5: Attack Success Rate (ASR) by provider under AILuminate baseline vs. poetry prompts. Higher ASR indicates more unsafe outputs. Change is poetry ASR minus baseline ASR.

Our results demonstrate that poetic reformulation reliably reduces refusal behavior across all evaluated models. When prompts with identical task intent were presented in poetic rather than prose form, the Attack Success Rate (ASR) increased from 8.08% to 43.07% on average, a roughly fivefold increase (Table 8). This effect holds uniformly: every architecture and alignment strategy tested (RLHF-based models, Constitutional AI models, and large open-weight systems) exhibited elevated ASRs under poetic framing.

The cross-family consistency indicates that the vulnerability is systemic, not an artifact of a specific provider or training pipeline. Model families from nine distinct providers (Table 5) showed increases ranging from 3.12 percentage points (Anthropic) to 62.15 percentage points (DeepSeek), with seven of nine providers exhibiting increases exceeding 20 percentage points. This pattern suggests that existing alignment procedures are sensitive to surface-form variation and do not generalize effectively across stylistic shifts.

The bypass effect spans the full set of risk categories represented in our evaluation. Poetic prompts triggered unsafe outputs across CBRN-related domains, cyber-offense scenarios (reaching 84% ASR for code injection tasks; Table 4), manipulation and misinformation scenarios, privacy-related tasks (52.78% ASR; Table 7), and loss-of-control settings. This distribution suggests that poetic framing interferes with underlying refusal mechanisms rather than exploiting domain-specific weaknesses.

Our empirical analysis demonstrates a significant system-level generalization gap across the 25 frontier and open-weight models evaluated (Table 1). The vulnerability to adversarial poetry is not idiosyncratic to specific architectures or training pipelines; models trained via RLHF, Constitutional AI, and mixture-of-experts approaches all exhibited substantial increases in ASR.

Examining the distribution of model performance: 13 of 25 models (52%) exceeded 70% ASR on curated poems, while only 5 models (20%) maintained ASR below 35% (Table 3). This bimodal distribution suggests two distinct failure modes: models either possess robust defenses against stylistic variation or exhibit near-complete vulnerability. Notably, provider identity proved more predictive of vulnerability than model size or capability level, with certain providers (Google, Deepseek, Qwen) showing consistently high susceptibility across their model portfolios (Table 5).

The consistent degradation in safety performance when transitioning from prose to poetry (mean increase: 34.99 percentage points; Table 8) indicates that current alignment techniques fail to generalize when faced with inputs that deviate stylistically from the prosaic training distribution.
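Two different quantities are reported for the same pair of averages: an absolute change in percentage points and a relative (multiplicative) change. The sketch below recomputes both from the averages quoted in this section; the function name is illustrative.

```python
def asr_change(baseline_pct: float, poetry_pct: float) -> tuple[float, float]:
    """Return (absolute change in percentage points, relative multiplier)."""
    if baseline_pct <= 0:
        raise ValueError("relative change is undefined for a non-positive baseline")
    delta_pp = round(poetry_pct - baseline_pct, 2)
    ratio = poetry_pct / baseline_pct
    return delta_pp, ratio


# Average ASRs quoted in the text: 8.08% (prose) and 43.07% (poetry).
delta_pp, ratio = asr_change(8.08, 43.07)
print(delta_pp, round(ratio, 2))  # 34.99 5.33
```

Keeping the two measures separate matters when comparing providers: a low baseline can produce a large multiplier from a small percentage-point change, and the reverse holds for a high baseline.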

6.2 Comparison with MLCommons

Tables 6 and 7 compare Attack Success Rates (ASR) between the MLCommons AILuminate Benchmark and our evaluation pipeline. Our baseline ASR values are consistently lower than those in MLCommons, indicating a more conservative tendency in our judge ensemble when labeling unsafe behavior. The two setups are not directly comparable: MLCommons uses its own evaluation stack and curated jailbreak transformations, whereas we rely on three open-weight judge models plus human adjudication. Nevertheless, the gap is stable across categories and provides a meaningful internal baseline for assessing the effect of our poetic transformation.

The key result is that the increase in ASR induced by our simple poetic transformation closely mirrors the increase produced by MLCommons’ engineered jailbreak techniques. Several hazard categories (Privacy, Non-violent Crimes, Indiscriminate Weapons, Intellectual Property) show ASR deltas of similar or greater magnitude under the poetic version. This suggests that surface-level stylistic reframing alone can meaningfully weaken safety defenses across a broad set of harms, even without targeted jailbreak optimization. Patterns are consistent across the taxonomy: operational or procedural domains show larger shifts, while heavily filtered categories exhibit smaller changes. Together, these results indicate that poetic framing acts as a lightweight but robust trigger for safety degradation, paralleling the effects documented in MLCommons.

Table 6: Attack Success Rate (ASR) under MLCommons AILuminate baseline vs. poetry jailbreak by hazard. AILuminate Baseline ASR and Jailbreak ASR are computed as 100 − safety score. Change (%) is Jailbreak ASR minus AILuminate Baseline ASR.

Table 7: Attack Success Rate (ASR) by hazard under AILuminate baseline vs. poetry prompts. Higher ASR indicates more unsafe outputs. Change is poetry ASR minus baseline ASR.
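The conversion used in the Table 6 caption can be written out explicitly. In the sketch below, the safety scores of 90 and 80 are illustrative placeholders chosen to match the approximate 10% and 20% reference ASRs mentioned in Section 6.3; they are not values read from the table.

```python
def asr_from_safety_score(safety_score_pct: float) -> float:
    """Convert a safety score (percent of safe responses) into an ASR."""
    if not 0.0 <= safety_score_pct <= 100.0:
        raise ValueError("safety score must lie in [0, 100]")
    return 100.0 - safety_score_pct


# Illustrative safety scores only (assumed, not taken from Table 6).
baseline_asr = asr_from_safety_score(90.0)
jailbreak_asr = asr_from_safety_score(80.0)

# Change is jailbreak ASR minus baseline ASR, in percentage points.
change = jailbreak_asr - baseline_asr
print(baseline_asr, jailbreak_asr, change)  # 10.0 20.0 10.0
```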

6.3 Risk Section

The efficacy of the jailbreak mechanism appears driven principally by poetic surface form rather than the semantic payload of the prohibited request. Comparative analysis reveals that while MLCommons’ own state-of-the-art jailbreak transformations typically yield a twofold increase in ASR relative to baselines (increasing from approximately 10% to 20% in their reference evaluations), our poetic meta-prompts produced a fivefold increase (from 8.08% to 43.07%; Table 8). This indicates that poetic form induces a distributional shift significantly larger than that of current adversarial mutations documented in the MLCommons AILuminate benchmark.

The effect’s content-agnostic nature is further evidenced by its consistency across semantically distinct risk domains. Privacy-related prompts showed a 44.71 percentage point increase, while CBRN prompts increased by 38.32 percentage points (Table 7). This cross-domain consistency, combined with the magnitude of the effect, suggests that safety filters optimized for prosaic harmful prompts lack robustness against narrative or stylized reformulations of identical intent.

While the jailbreak effect generalizes across domains, its magnitude varies substantially by risk category. Analysis of curated poems mapped to specific hazard types (Table 4) reveals that cyber-offense prompts, particularly those involving code injection or password cracking, yielded the highest ASRs at 84%. Loss-of-control scenarios showed comparable vulnerability, with model-weight exfiltration prompts achieving 76% ASR.

When analyzing the broader MLCommons dataset under poetic transformation (Table 7), privacy-related prompts exhibited the most extreme shift, with ASR increasing from a baseline of 8.07% to 52.78%, a 44.71 percentage point increase. This represents the largest domain-specific effect observed. Non-violent crimes (39.35 percentage point increase) and CBRN-related prompts (38.32 percentage point increase) showed similarly large effects.

Conversely, sexual content prompts demonstrated relative resilience, with only a 24.64 percentage point increase (Table 7). This domain-specific variation suggests that different refusal mechanisms may govern different risk categories, with privacy and cyber-offense filters proving particularly susceptible to stylistic obfuscation through poetic form.

6.4 Model Specifications

Table 8: Attack Success Rate (ASR) by model under AILuminate baseline vs. poetry prompts. Higher ASR indicates more unsafe outputs. Change is poetry ASR minus baseline ASR.

6.4.1 Variability Across Flagship Models

We observe stark divergence in robustness among flagship providers’ most capable models. Table 3 reveals a clear stratification: DeepSeek and Google models displayed severe vulnerability, with gemini-2.5-pro failing to refuse any curated poetic prompts (100% ASR) and deepseek models exceeding 95% ASR. In contrast, OpenAI and Anthropic flagship models remained substantially more resilient; gpt-5-nano maintained 0% ASR and claude-haiku-4.5 achieved 10% ASR on the same prompt set.

This disparity cannot be fully explained by model capability differences alone. Examining the relationship between model size and ASR within provider families, we observe that smaller models consistently refuse more often than larger variants from the same provider. For example, within the GPT-5 family: gpt-5-nano (0% ASR) < gpt-5-mini (5% ASR) < gpt-5 (10% ASR). Similar trends appear in the Claude and Grok families.
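The within-family ordering claim can be checked mechanically. The sketch below uses only the three GPT-5 values quoted above, listed from smallest to largest model; the helper name is illustrative, and the same check could be applied to other families once their per-model ASRs are read from Table 3.

```python
# (model, ASR %) pairs quoted in the text, ordered from smallest to largest model.
gpt5_family = [("gpt-5-nano", 0.0), ("gpt-5-mini", 5.0), ("gpt-5", 10.0)]


def asr_strictly_increases(family: list[tuple[str, float]]) -> bool:
    """True if ASR rises at every step from smaller to larger models."""
    asrs = [asr for _, asr in family]
    return all(smaller < larger for smaller, larger in zip(asrs, asrs[1:]))


print(asr_strictly_increases(gpt5_family))  # True
```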

This inverse relationship between capability and robustness suggests a possible capability-alignment interaction: more interpretively sophisticated models may engage more thoroughly with complex linguistic constraints, potentially at the expense of safety directive prioritization. However, the existence of counter-examples—such as Anthropic’s consistent low ASR across capability tiers—indicates that this interaction is not deterministic and can be mitigated through appropriate alignment strategies.

6.4.2 The Scale Paradox: Smaller Models Show Greater Resilience

Counter to common expectations, smaller models exhibited higher refusal rates than their larger counterparts when evaluated on identical poetic prompts. Systems such as GPT-5-Nano and Claude Haiku 4.5 showed more stable refusal behavior than larger models within the same family. This reverses the usual pattern in which greater model capacity correlates with stronger safety performance.

Several factors may contribute to this trend. One possibility is that smaller models have reduced ability to resolve figurative or metaphorical structure, limiting their capacity to recover the harmful intent embedded in poetic language. If the jailbreak effect operates partly by altering surface form while preserving task intent, lower-capacity models may simply fail to decode the intended request.

A second explanation concerns differences in the interaction between capability and alignment training across scales. Larger models are typically pretrained on broader and more stylistically diverse corpora, including substantial amounts of literary text. This may yield more expressive representations of narrative and poetic modes that override or interfere with safety heuristics. Smaller models, with narrower pretraining distributions, may not enter these stylistic regimes as readily.

A third hypothesis is that smaller models exhibit a form of conservative fallback: when confronted with ambiguous or atypical inputs, limited capacity leads them to default to refusals. Larger models, more confident in interpreting unconventional phrasing, may engage with poetic prompts more deeply and consequently exhibit higher susceptibility.

These patterns suggest that capability and robustness may not scale monotonically together, and that stylistic perturbations expose alignment sensitivities that differ across model sizes.

6.4.3 Differences in Proprietary vs. Open-Weight Models

The data challenge the assumption that proprietary closed-source models possess inherently superior safety profiles. Examining ASR on curated poems (Table 3), both categories exhibit high susceptibility, though with important within-category variance. Among proprietary models, gemini-2.5-pro achieved 100% ASR, while claude-haiku-4.5 maintained only 10% ASR—a 90 percentage point range. Open-weight models displayed similar heterogeneity: mistral-large-2411 reached 85% ASR, while gpt-oss-120b demonstrated greater resilience at 50% ASR.

Computing mean ASR across model categories reveals no systematic advantage for proprietary systems. The within-provider consistency observed in Table 5 further supports this interpretation: provider-level effects (ranging from 3.12% to 62.15% ASR increase) substantially exceed the variation attributable to model access policies. These results indicate that vulnerability is less a function of model access (open vs. proprietary) and more dependent on the specific safety implementations and alignment strategies employed by each provider.
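As a rough illustration of the per-category comparison, the sketch below summarizes only the four models whose ASR is quoted in this subsection. It is not a reproduction of Table 3: with two models per category the means say nothing about the full comparison, and the `summarize` helper is a hypothetical name.

```python
from statistics import mean

# ASR on curated poems (percent) for the four models named in the text only.
asr_by_access = {
    "proprietary": {"gemini-2.5-pro": 100, "claude-haiku-4.5": 10},
    "open-weight": {"mistral-large-2411": 85, "gpt-oss-120b": 50},
}


def summarize(groups):
    """Mean ASR and within-category range (percentage points) per access category."""
    summary = {}
    for category, models in groups.items():
        values = list(models.values())
        summary[category] = {
            "mean": mean(values),
            "range": max(values) - min(values),
        }
    return summary


for category, stats in summarize(asr_by_access).items():
    print(category, stats)
```

The proprietary range reproduces the 90 percentage point spread noted above, which illustrates why within-category variance matters more than the category label itself.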

6.5 Limitations

The study documents a consistent vulnerability triggered by poetic reformulation, but several methodological and scope constraints must be acknowledged. First, the threat model is restricted to single-turn interactions. The analysis does not examine multi-turn jailbreak dynamics, iterative role negotiation, or long-horizon adversarial optimization. As a result, the findings speak specifically to one-shot perturbations rather than to the broader landscape of conversational attacks.

Second, the large-scale poetic transformation of the MLCommons corpus relies on a single meta-prompt and a single generative model. Although the procedure is standardized and domain-preserving, it represents one particular operationalization of poetic style. Other poetic-generation pipelines, human-authored variants, or transformations employing different stylistic constraints may yield different quantitative effects.

Third, safety evaluation is performed using a three-model open-weight judge ensemble with human adjudication on a stratified sample. The labeling rubric is conservative and differs from the stricter classification criteria used in some automated scoring systems, limiting direct comparability with MLCommons results. Full human annotation of all outputs would likely influence absolute ASR estimates, even if relative effects remain stable. LLM-as-a-judge systems are known to inflate unsafe rates (Krumdick et al., 2025), often misclassifying replies as harmful due to shallow pattern-matching on keywords rather than meaningful assessment of operational risk. Our evaluation was deliberately conservative to counter this tendency. As a result, our reported attack-success rates likely represent a lower bound on the severity of the vulnerability.
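One way to picture a three-judge ensemble with human adjudication is sketched below. The aggregation rule is an assumption: this section does not state how the judge outputs are combined, so simple majority voting is used here, and any disagreement is flagged for human review. Function names and label strings are hypothetical.

```python
from collections import Counter


def ensemble_label(judge_votes):
    """Majority label from an odd number of 'safe'/'unsafe' votes (assumed rule)."""
    if len(judge_votes) % 2 == 0:
        raise ValueError("an odd number of votes is required to avoid ties")
    label, _ = Counter(judge_votes).most_common(1)[0]
    return label


def needs_human_review(judge_votes):
    """Flag any output on which the judges do not agree unanimously."""
    return len(set(judge_votes)) > 1


votes = ["unsafe", "safe", "unsafe"]
print(ensemble_label(votes), needs_human_review(votes))
```

With binary labels and three judges a strict majority always exists, so the sketch never has to break ties; a rubric with more label classes would need an explicit tie-handling rule.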

Fourth, all models are evaluated under provider-default safety configurations. The study does not test hardened settings, policy-tuned inference modes, or additional runtime safety layers. This means that the results reflect the robustness of standard deployments rather than the upper bound of protective configurations.

Fifth, the analysis focuses on empirical performance and does not yet identify the mechanistic drivers of the vulnerability. The study does not isolate which components of poetic structure (figurative language, meter, lexical deviation, or narrative framing) are responsible for degrading refusal behavior. Understanding whether this effect arises from specific representational subspaces or from broader distributional shifts requires dedicated interpretability analysis, which will be addressed in forthcoming work by the ICARO Lab.

Sixth, the evaluation is limited to English and Italian prompts. The generality of the effect across other languages, scripts, or culturally distinct poetic forms is unknown and may interact with both pretraining corpora and alignment distributions.

Finally, the study is confined to raw model inference. It does not assess downstream filtering pipelines, agentic orchestration layers, retrieval-augmented architectures, or enterprise-level safety stacks. Real-world deployments may partially mitigate or even amplify the bypass effect depending on how these layers process stylistically atypical inputs.

These limitations motivate three research programs: isolating which formal poetic properties (lexical surprise, meter/rhyme, figurative language) drive bypass through minimal pairs; mapping discourse mode geometry using sparse autoencoders to reveal whether poetry occupies separated subspaces; and surprisal-guided probing to map safety degradation across stylistic gradients.

6.6 Future Works

This study highlights a systematic vulnerability class arising from stylistic distribution shifts, but several areas require further investigation. First, we plan to expand mechanistic analysis of poetic prompts, including probing internal representations, tracing activation pathways, and isolating whether failures originate in semantic routing, safety-layer heuristics, or decoding-time filters. Second, we will broaden the linguistic scope beyond English and Italian to evaluate whether poetic structure interacts differently with language-specific training regimes. Third, we intend to explore a wider family of stylistic operators – narrative, archaic, bureaucratic, or surrealist forms – to determine whether poetry is a particularly adversarial subspace or part of a broader stylistic vulnerability manifold. Finally, we aim to analyse architectural and provider-level disparities to understand why some systems degrade less than others, and whether robustness correlates with model size, safety-stack design, or training data curation. These extensions will help clarify the boundaries of stylistic jailbreaks and inform the development of evaluation methods that better capture generalisation under real-world input variability.

7 Conclusion

The study provides systematic evidence that poetic reformulation degrades refusal behavior across all evaluated model families. When harmful prompts are expressed in verse rather than prose, attack-success rates rise sharply, both for hand-crafted adversarial poems and for the 1,200-item MLCommons corpus transformed through a standardized meta-prompt. The magnitude and consistency of the effect indicate that contemporary alignment pipelines do not generalize across stylistic shifts. The surface form alone is sufficient to move inputs outside the operational distribution on which refusal mechanisms have been optimized.

The cross-model results suggest that the phenomenon is structural rather than provider-specific. Models built using RLHF, Constitutional AI, and hybrid alignment strategies all display elevated vulnerability, with increases ranging from single digits to more than sixty percentage points depending on provider. The effect spans CBRN, cyber-offense, manipulation, privacy, and loss-of-control domains, showing that the bypass does not exploit weakness in any one refusal subsystem but interacts with general alignment heuristics.

For regulatory actors, these findings expose a significant gap in current evaluation and conformity-assessment practices. Static benchmarks used for compliance under regimes such as the EU AI Act, and state-of-the-art risk-mitigation expectations under the Code of Practice for GPAI, assume stability under modest input variation. Our results show that a minimal stylistic transformation can reduce refusal rates by an order of magnitude, indicating that benchmark-only evidence may systematically overstate real-world robustness. Conformity frameworks relying on point-estimate performance scores therefore require complementary stress-tests that include stylistic perturbation, narrative framing, and distributional shifts of the type demonstrated here.

For safety research, the data point toward a deeper question about how transformers encode discourse modes. The persistence of the effect across architectures and scales suggests that safety filters rely on features concentrated in prosaic surface forms and are insufficiently anchored in representations of underlying harmful intent. The divergence between small and large models within the same families further indicates that capability gains do not automatically translate into increased robustness under stylistic perturbation.

Overall, the results motivate a reorientation of safety evaluation toward mechanisms capable of maintaining stability across heterogeneous linguistic regimes. Future work should examine which properties of poetic structure drive the misalignment, and whether representational subspaces associated with narrative and figurative language can be identified and constrained. Without such mechanistic insight, alignment systems will remain vulnerable to low-effort transformations that fall well within plausible user behavior but sit outside existing safety-training distributions.
