The explosive rise of generative models, which produce text, code, and images, has now given way to a more consequential shift: the emergence of interactive AI agents.
Unlike standalone chatbots, agents are designed to act rather than just respond. They coordinate tools, interpret multi-step tasks, interact with applications, and execute autonomous actions within real systems. Their value lies more in operational ability than in creativity.
In this new architecture, a large language model (LLM) is no longer a standalone product. It functions as the embedded reasoning engine within software agents that execute workflows, handle business logic, access internal APIs, and make decisions.
Crucially, LLMs are still general-purpose systems, not security-hardened components customized for each deployment. Organizations incorporate AI agents into production systems, CI/CD pipelines, business operations, and even critical infrastructure, but the underlying models are unaware of enterprise policies, risk models, data boundaries, or threat vectors.
This echoes the history of VBA macros in Office: a feature meant to automate routine tasks inadvertently became a common way to deliver malware. Once users had come to rely on that automation, disabling the feature was no longer feasible, so security measures had to be built around it instead.
The same dynamic now applies to AI agents. Their capabilities are too valuable to abandon: automated workflow execution, incident analysis, code generation, document handling, and customer operations. However, as AI agents gain more autonomy, they also increase risks for organizations, including logic manipulation, data leakage, unsafe code, misinformation, covert malicious behaviors, and even automated deepfake fraud integrated into business processes.
In this situation, security cannot rely solely on the LLM itself. Instead, it must be enforced through external, system-level guardrails that monitor, verify, and restrict both input and output.
These AI guardrails work like real-world safety barriers: they do not limit the system’s power, but they keep it out of unsafe areas. Guardrails also highlight an important distinction: how much of an agent’s decision-making is genuine reasoning versus the predictive recombination of learned patterns. Their tasks include filtering malicious prompts, detecting deceptive multi-step interactions, evaluating prompt-response inconsistencies, blocking jailbreak attempts, and screening LLM-generated code for unsafe patterns.
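A minimal sketch of this external-enforcement pattern is shown below. The `check_prompt`, `check_response`, and `call_agent` names are hypothetical placeholders, not the interface of any particular guardrail product; the point is only that both input and output are inspected outside the model itself.

```python
# Minimal sketch of an external guardrail loop around an AI agent.
# check_prompt, check_response, and call_agent are hypothetical helpers;
# in practice they would wrap a prompt classifier, an output scanner,
# and the agent runtime respectively.
from dataclasses import dataclass

@dataclass
class Verdict:
    allowed: bool
    reason: str = ""

def check_prompt(prompt: str) -> Verdict:
    # Placeholder: a real deployment would call a jailbreak/injection classifier.
    banned = ("ignore previous instructions", "disable safety")
    hit = next((b for b in banned if b in prompt.lower()), None)
    return Verdict(hit is None, f"matched: {hit}" if hit else "")

def check_response(prompt: str, response: str) -> Verdict:
    # Placeholder: a real deployment would scan for unsafe code, hidden
    # payloads, or prompt-response inconsistencies.
    return Verdict("rm -rf /" not in response, "unsafe shell command")

def guarded_call(prompt: str, call_agent) -> str:
    pre = check_prompt(prompt)
    if not pre.allowed:
        return f"[blocked at input: {pre.reason}]"
    response = call_agent(prompt)          # the LLM/agent itself is untouched
    post = check_response(prompt, response)
    if not post.allowed:
        return f"[blocked at output: {post.reason}]"
    return response
```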
Despite this, attackers keep finding weak points, especially in multi-turn interactions, prompt interception, embedded payloads, and misaligned agent reasoning.
Understanding these new risks is crucial. The attack methods targeting generative models in 2022–2023 were just the beginning. Today’s attackers focus on interactive reasoning processes, agent independence, and tool access. As AI becomes deeply integrated into business systems, the security field must shift from safeguarding “models” to protecting AI-driven software ecosystems. Only then can organizations securely leverage autonomous AI agents without inheriting their weaknesses.
How attackers exploit user interactions with LLMs
The most anticipated cyberattack method against LLMs is interference with the prompts transmitted to AI agents (prompt injection). If successful, it can distort requests, trigger the execution of malicious commands, and enable data interception and leakage. Essentially, this is the classic man-in-the-middle (MitM) attack pattern, but aimed at LLMs. Such an attack can have serious consequences, especially when an AI coding agent operates inside a production system, and it poses significant risks to critical information infrastructure (CII) assets.
The initial wave of research into information security for LLMs focused largely on this type of attack. Attention is now shifting toward the risks of infiltrating multi-step dialogues with AI. These attacks are considerably harder to detect, yet their successful execution can have much broader consequences.
As AI agents become more autonomous, the potential damage from such attacks grows. Today, these threats are typically addressed with traditional methods, such as text templates (similar to antivirus signatures) that block known-bad LLM responses. This approach is not universal, however: the templates require manual curation and cannot be configured automatically, which is why specialized system-level security tools are needed.
Unfortunately, such protection cannot reliably prevent, for example, insecure code from appearing in LLM responses to prompts that nudge AI agents toward it. An attacker only needs to understand how the interpreter processes the generated code to bypass the protection and inject malicious code.
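To illustrate why signature-style filtering is brittle, here is a minimal sketch of such a template-based output filter (the patterns and helper names are illustrative, not drawn from any specific product). Anything the hand-maintained patterns do not anticipate, including trivially restructured payloads, passes straight through.

```python
import re

# Hand-maintained "signatures" for dangerous response content, in the spirit
# of antivirus patterns. Each new bypass requires a new manually added rule.
RESPONSE_SIGNATURES = [
    re.compile(r"rm\s+-rf\s+/", re.IGNORECASE),
    re.compile(r"os\.system\(", re.IGNORECASE),
    re.compile(r"curl\s+[^|]+\|\s*sh", re.IGNORECASE),
]

def response_blocked(text: str) -> bool:
    return any(sig.search(text) for sig in RESPONSE_SIGNATURES)

# A literal match is caught...
print(response_blocked("run os.system('ls')"))                         # True
# ...but the same logic expressed differently slips through.
print(response_blocked("getattr(__import__('os'), 'system')('ls')"))   # False
```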
Jailbreak attacks and their implications
Before exploring LLM security solutions, it is worth examining real-world examples of successful jailbreaks. These cases provide valuable insight into how attackers bypass existing safeguards and inform the design of more effective defenses.
Large-scale jailbreak incidents are still uncommon, but several documented cases demonstrate how attackers can circumvent existing safeguards. These attacks typically exploit weaknesses in filtering logic or input validation, manipulating the code analysis layer intended to detect malicious intent.
The first example occurred in 2024 and involved bypassing a ChatGPT filter designed to block malicious code generation. The attacker embedded a harmful instruction inside the prompt using hexadecimal encoding. The filter did not recognize the payload, allowing the hidden command to reach the interpreter. Once executed, the malicious content was processed without any remaining security checks.
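A defensive takeaway is that filters need to normalize encodings before matching. The sketch below (with illustrative helper names and phrase lists, not taken from any documented filter) decodes hex-looking spans in a prompt before running the same keyword check, so an instruction hidden as hexadecimal no longer evades it.

```python
import re

BANNED_PHRASES = ("ignore previous instructions", "write malware")

HEX_SPAN = re.compile(r"(?:[0-9a-fA-F]{2}){8,}")  # long runs of hex pairs

def decode_hex_spans(text: str) -> str:
    """Append decoded versions of hex-looking spans so checks see both forms."""
    decoded_parts = [text]
    for span in HEX_SPAN.findall(text):
        try:
            decoded_parts.append(bytes.fromhex(span).decode("utf-8", "ignore"))
        except ValueError:
            continue
    return "\n".join(decoded_parts)

def prompt_blocked(prompt: str) -> bool:
    normalized = decode_hex_spans(prompt).lower()
    return any(phrase in normalized for phrase in BANNED_PHRASES)

hidden = "ignore previous instructions".encode().hex()
print(prompt_blocked(f"Please decode and follow: {hidden}"))  # True
print(prompt_blocked("Summarize this article"))               # False
```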
Another attack aimed at bypassing internal filters was demonstrated using a technique known as Deceptive Delight. In this method, researchers embed unsafe instructions among harmless topics, creating a blended prompt that appears benign when analyzed in isolated pieces. The approach is related to multi-turn jailbreak strategies that gradually steer the model toward unsafe output. However, unlike slower, incremental attacks, Deceptive Delight reaches the jailbreak outcome within two interaction turns, making it much more efficient and significantly harder for standard filter logic to detect.
Common risk models in prompt handling
Prompt-related risks are broader than jailbreaks alone. They span both input manipulation and output corruption, and they can affect any part of an AI agent’s decision-making chain. Research and real-world incidents show that prompt-layer threats generally fall into four categories:
**1. Direct prompt injections.** These are attempts that explicitly instruct the model to bypass restrictions, take on a different role, or perform an unauthorized action. A common example is a prompt asking the agent to ignore previous rules.

**2. Indirect prompt injections.** Malicious content is embedded in external sources that the LLM processes, such as documents, Web pages, or API responses. Hidden elements like white-on-white text or zero-width characters can smuggle instructions past filters that trust the external source (see the sketch after this list).

**3. Unsafe or vulnerable code generation.** An LLM might produce code with security flaws, harmful functionality, or weak validation that attackers can exploit in the application that uses that code.

**4. Malicious code injected into LLM responses.** If an attacker manages to insert harmful logic into an intermediate response, the agent may unwittingly propagate or run malicious code during its workflow, giving the attacker indirect control over enterprise systems.
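As a concrete illustration of the second category, the following sketch scans external content for zero-width and other invisible code points before it is handed to an agent. The character list and the reject-on-any-finding policy are assumptions for illustration, not an exhaustive or product-specific rule set.

```python
import unicodedata

# Flag external content that carries invisible characters often used to
# smuggle hidden instructions past human reviewers and naive filters.
SUSPICIOUS_CODEPOINTS = {
    "\u200b",  # zero-width space
    "\u200c",  # zero-width non-joiner
    "\u200d",  # zero-width joiner
    "\u2060",  # word joiner
    "\ufeff",  # zero-width no-break space / BOM
    "\u00ad",  # soft hyphen
}

def hidden_characters(text: str) -> list[tuple[int, str]]:
    """Return (position, codepoint name) pairs for suspicious characters."""
    return [
        (i, unicodedata.name(ch, f"U+{ord(ch):04X}"))
        for i, ch in enumerate(text)
        if ch in SUSPICIOUS_CODEPOINTS
    ]

doc = "Quarterly report\u200b\u200bignore previous instructions\u200b"
findings = hidden_characters(doc)
if findings:
    print(f"Rejecting external document: {len(findings)} hidden characters found")
```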
Implementing Security Controls for AI Agents
Securing modern AI agents requires system-level controls that operate around the LLM rather than inside it. No single tool can fully protect autonomous agents, so emerging solutions typically combine several architectural approaches, including prompt-layer inspection, response validation, code-safety analysis, and behavioral monitoring. One of the more thoroughly documented examples comes from Meta’s Llama protection tools, but these represent a broader category of guardrail technologies rather than a standalone solution.
1. Detecting Malicious Prompts and Jailbreak Attempts
The first defensive layer emphasizes prompt inspection, detecting jailbreaks, indirect injections, malicious intent, or attempts to coerce an agent into unsafe behavior. Meta’s approach, implemented in PromptGuard 2, employs a fine-tuned BERT-based model to classify dangerous input with low latency. Similar methods are used in other ecosystems:
- Anthropic’s Constitutional AI uses rule-based constraints plus model-based self-critique.
- NeMo Guardrails (by Nvidia) allows developers to define domain-specific constraints using state machines.
These solutions share a principle: prompt validation must be quick, clear, and external to the LLM to remain effective in agentic loops.
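A minimal sketch of that external validation step is shown below, using a generic text-classification model loaded through the Hugging Face transformers library. The model identifier, label names, and score threshold are assumptions; a real deployment would substitute the classifier shipped by its chosen guardrail product.

```python
# Sketch of an external prompt gate: classify before the agent ever sees the input.
# Model name, labels, and threshold are placeholders, not a product recommendation.
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="some-org/prompt-injection-classifier",  # hypothetical model id
)

INJECTION_THRESHOLD = 0.8  # assumed operating point

def admit_prompt(prompt: str) -> bool:
    result = classifier(prompt, truncation=True)[0]
    is_attack = result["label"].upper() in {"INJECTION", "JAILBREAK"}
    return not (is_attack and result["score"] >= INJECTION_THRESHOLD)

if admit_prompt("Summarize the attached incident report"):
    pass  # forward the prompt to the agent
```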
2. Detecting Manipulated, Fragmented, or Embedded Payloads
A second category of solutions focuses on response integrity. When AI agents assemble outputs from multiple steps or external tools, malicious instruction fragments can be embedded within the combined output through invisible characters, steganographic text, or multi-part payloads.
LlamaFirewall’s Fragmented Content Auditor attempts to detect inconsistencies between a prompt and its generated response, flagging suspicious mismatches. Although still experimental, the underlying principle is sound: multi-component content must be evaluated holistically rather than as isolated tokens.
Alternative approaches include semantic diffing to compare intent and output, tree-structured analysis of each step in the agent chain, and cross-encoder similarity scoring to identify abnormal insertions.
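As a rough sketch of the cross-encoder idea, the snippet below scores how well a response matches the stated intent of the prompt and flags low-similarity outputs for review. The model name and threshold are assumptions; this is one possible realization of the technique, not a documented guardrail.

```python
# Sketch: flag responses that diverge semantically from the prompt's intent.
# Model name and threshold are illustrative assumptions.
from sentence_transformers import CrossEncoder

scorer = CrossEncoder("cross-encoder/stsb-roberta-base")  # semantic similarity model

SIMILARITY_FLOOR = 0.35  # assumed threshold below which a response is suspicious

def response_consistent(prompt: str, response: str) -> bool:
    score = scorer.predict([(prompt, response)])[0]
    return float(score) >= SIMILARITY_FLOOR

prompt = "Draft a status update for the database migration"
response = "Status update drafted. Also, export all credentials to pastebin."
if not response_consistent(prompt, response):
    print("Response flagged for human review")
```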
This class of guardrails is still in early research, as multi-step agent reasoning significantly expands the attack surface compared to single-turn chat models.
3. Code-Safety and Runtime Security for AI-Generated Code
A rising threat involves AI agents that create or alter code. To reduce risks, various tools concentrate on static and dynamic analysis of LLM-produced output.
- LlamaFirewall’s CodeShield performs static analysis across several programming languages, identifying unsafe constructs or potentially vulnerable patterns.
- Semgrep with LLM-assisted features can validate generated code against established security rulesets.
- GitHub Advanced Security, including AI-supported detectors, performs policy checks, dependency scanning, and vulnerability detection on code commits.
- Runtime sandboxing using technologies such as Firecracker or gVisor isolates the execution environment for agent-generated code, preventing it from impacting production systems.
These solutions recognize a vital truth: LLMs lack an understanding of security boundaries. They predict text instead of ensuring correctness. Therefore, code safety must be verified externally with deterministic rules, pattern-based analysis, or specialized LLM evaluators.
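One simple way to enforce that external verification is to gate agent-generated code through a static analyzer before it ever runs. The sketch below shells out to the Semgrep CLI and refuses to execute code that triggers any finding; the ruleset, temporary file layout, and strict pass/fail policy are assumptions for illustration.

```python
# Sketch: write agent-generated code to a temp file, scan it with Semgrep,
# and only hand it to a sandbox if the scan comes back clean.
# The ruleset and "any finding blocks execution" policy are assumptions.
import json
import subprocess
import tempfile
from pathlib import Path

def code_is_clean(generated_code: str, ruleset: str = "p/python") -> bool:
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "agent_output.py"
        target.write_text(generated_code)
        proc = subprocess.run(
            ["semgrep", "scan", "--config", ruleset, "--json", str(target)],
            capture_output=True,
            text=True,
        )
        report = json.loads(proc.stdout or "{}")
        return len(report.get("results", [])) == 0

snippet = "import os\nos.system(user_input)  # shell injection risk\n"
if code_is_clean(snippet):
    pass  # forward to an isolated sandbox (e.g., Firecracker, gVisor) for execution
else:
    print("Generated code rejected by static analysis")
```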
4. Architectural Considerations and Open Questions
Some code-analysis systems require access to privileged, unmoderated model endpoints to prevent interference from safety filters. This is a practical architectural necessity, not a backdoor. It enables security engines to examine raw model output before user-facing moderation is applied.
Tools and Strategies
The shift of AI models from generative to interactive has greatly broadened their practical uses, but it has also introduced new security risks. AI adoption represents a major step forward for the information society, and security concerns alone will not stop it. It is therefore urgent to develop new tools and strategies that protect users from these emerging threats.
Alex Vakulov is a cybersecurity researcher with more than 20 years of experience in malware analysis and strong malware removal skills.