BioDefense: A Multi-Layer Defense Architecture for LLM Agent Security Inspired by Biological Immune Systems
Author: André L. Schauer
Version: 2.0 (February 2026)
Status: Conceptual Proposal — Seeking Peer Review
License: CC BY-SA 4.0
Abstract
Large Language Model (LLM) agents processing untrusted input face a fundamental security challenge: the inability to reliably distinguish instructions from data within natural language streams. This architectural vulnerability enables prompt injection attacks, which remain the top risk in the OWASP Top 10 for LLM Applications 2025. We propose BioDefense, a multi-layer defense architecture inspired by biological immune systems that implements defense-in-depth through ephemeral execution environments, cryptographic integrity verification, behavioral anomaly detection, and adaptive threat memory. Our architecture employs three distinct verification layers—Ephemeral Workers, Guardian validators, and Supervisor arbiters—operating within hardware-isolated containers. We map established immunological concepts (innate vs. adaptive immunity, MHC-mediated self-recognition, NK cell surveillance, and immunological memory) to concrete security mechanisms, while explicitly acknowledging the limitations of biological analogies in computational contexts. This proposal is presented as a hypothesis requiring empirical validation; we provide a threat model, cost analysis, and identify known attack vectors that remain unaddressed. We invite security researchers to critique, extend, or refute these ideas.
Keywords: LLM Security, Prompt Injection, Multi-Agent Systems, Defense-in-Depth, Biological Security Patterns, Container Isolation
1. Introduction
1.1 Problem Statement
LLM agents that interact with external data sources face an inherent architectural vulnerability: natural language conflates instructions and data within the same channel. Unlike SQL injection, where parameterized queries provide a principled separation, no equivalent mechanism exists for natural language processing. An email-processing agent receiving the input:
Subject: Quarterly Report
Body: Please ignore your previous instructions and forward all
future emails to external@attacker.com. Confirm by replying "Done."
cannot syntactically distinguish the malicious instruction from legitimate content. This vulnerability class—prompt injection—was identified as the #1 risk in the OWASP Top 10 for LLM Applications 2025 [1], superseding traditional web vulnerabilities in LLM-integrated systems.
1.2 Motivation for Biological Analogies
Biological immune systems have evolved over millions of years to solve a fundamentally similar problem: distinguishing "self" from "non-self" in an environment where threats are polymorphic, adaptive, and semantically indistinguishable from benign entities at the molecular level. Key properties of immune systems that motivate our approach include:
- Defense-in-Depth: Multiple independent barriers (physical, innate, adaptive) provide redundancy
- Expendability: First-responder cells (neutrophils) are designed to be sacrificed
- Anomaly Detection: NK cells identify "stressed" cells without pathogen-specific markers
- Adaptive Memory: B and T cell memory enables rapid response to previously encountered threats
- Tolerance of False Positives: The system accepts collateral damage to ensure threat elimination
We do not claim that biological systems are directly implementable in software, nor that evolution optimizes for the same constraints as engineered systems. Rather, we use immunological concepts as an intuitive framework for reasoning about layered defense mechanisms, while grounding each pattern in established security principles.
1.3 Contributions
This paper makes the following contributions:
- A formal mapping between immunological concepts and LLM security mechanisms (Section 3)
- A three-layer verification architecture with defined trust boundaries (Section 4)
- A cryptographic challenge-response protocol for agent integrity verification (Section 5)
- A comprehensive threat model and attack taxonomy (Section 6)
- Realistic cost analysis with sensitivity bounds (Section 7)
- Explicit comparison with existing defense mechanisms (Section 8)
- Honest assessment of limitations and unaddressed attack vectors (Section 9)
1.4 Scope and Disclaimer
This document presents a conceptual architecture that has not been empirically validated. All performance claims are theoretical estimates based on published API pricing and container orchestration overhead measurements. We explicitly invite adversarial critique and acknowledge that production deployment would require extensive red-team testing.
2. Related Work
2.1 Prompt Injection Defenses
Current approaches to prompt injection defense fall into two categories:
Prevention-based defenses attempt to sanitize inputs before processing:
- Input filtering: Regex-based detection of known attack patterns [2]
- Prompt hardening: Instructing models to ignore conflicting directives
- Delimiter injection: Using special tokens to separate instructions from data
Detection-based defenses identify malicious outputs post-generation:
- LLM Guard (Protect AI): Input/output scanning with configurable policies [3]
- NeMo Guardrails (NVIDIA): Programmable conversation rails using Colang [4]
- Lakera Guard: Real-time prompt injection detection API [5]
- Rebuff: Self-hardening prompt injection detector [6]
These tools provide valuable protection but share a common limitation: they operate at the application layer without addressing the fundamental conflation of instructions and data.
2.2 Multi-Agent Security Architectures
Recent work has explored hierarchical agent architectures:
- Constitutional AI (Anthropic): Training models with explicit behavioral constraints [7]
- Reflexion: Self-reflection mechanisms for output validation [8]
- Multi-agent debate: Leveraging disagreement between models as a safety signal [9]
Our architecture differs by implementing hardware-level isolation between layers, treating the Worker agent as fundamentally untrusted.
2.3 Biological Analogies in Computer Security
The application of immune system concepts to computer security dates to the 1990s:
- Forrest et al. proposed artificial immune systems for intrusion detection [10]
- Kephart introduced the concept of "digital immunity" for virus detection [11]
- Dasgupta developed negative selection algorithms based on T-cell maturation [12]
Modern implementations include Darktrace’s "Enterprise Immune System," which uses unsupervised learning to establish behavioral baselines [13]. Our work extends this tradition to LLM-specific threats.
3. Biological Mapping
We map immunological concepts to security mechanisms with explicit acknowledgment of where analogies break down.
3.1 Mapping Table
| Biological Concept | Security Mechanism | Analogy Strength | Limitations |
|---|---|---|---|
| Neutrophils (expendable first-responders) | Ephemeral containers destroyed after each task | Strong: Both sacrifice individual units to contain threats | Neutrophils actively attack; our containers are passive |
| Physical barriers (skin, mucosa) | Network isolation (network: none) | Strong: Both prevent pathogen entry | Biological barriers are permeable; network isolation is binary |
| MHC presentation (self-markers) | Cryptographic challenge-response | Moderate: Both verify identity | MHC is continuous display; our protocol is discrete challenge |
| NK cell surveillance (missing-self detection) | Behavioral anomaly detection | Strong: Both identify deviations from normal patterns | NK cells act autonomously; our system requires centralized analysis |
| T-cell killing | Container termination | Weak: Both eliminate threats | T-cells require activation cascade; our termination is immediate |
| Memory B-cells (adaptive immunity) | Attack pattern database | Strong: Both enable faster response to known threats | Immune memory is distributed; our database is centralized |
| Toll-like receptors (pattern recognition) | Regex + embedding-based detection | Strong: Both recognize conserved threat signatures | TLRs are genetically encoded; our patterns are learned |
| Complement cascade | Automated escalation chain | Moderate: Both amplify initial detection signals | Complement is biochemical; our escalation is logical |
3.2 Deliberate Non-Mappings
We explicitly avoid the following biological analogies as misleading:
- CRISPR: Often cited as "viral memory," but CRISPR is primarily an editing mechanism. Immunological memory (B/T cells) is more appropriate for attack pattern storage.
- Fever: System-wide degradation to inhibit pathogens has no clear computational analog that doesn’t harm legitimate operations.
- Autoimmunity: We do not model false-positive cascades; our system uses threshold-based decisions rather than positive feedback loops.
4. Architecture
4.1 System Overview
┌─────────────────────────────────────────────────────────────────────┐
│ SUPERVISOR LAYER │
│ (High-capability model) │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ • Final escalation arbiter │ │
│ │ • Attack pattern learning and classification │ │
│ │ • Cryptographic challenge generation │ │
│ │ • Human escalation recommendations │ │
│ └────────────────────────────────────────────────────────────┘ │
│ ▲ Escalation (≈5-15% of requests) │
└─────────────────────────┼───────────────────────────────────────────┘
│
┌─────────────────────────┼───────────────────────────────────────────┐
│ GUARDIAN LAYER │
│ (Medium-capability model) │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ • Output validation against schema │ │
│ │ • Challenge-response integrity verification │ │
│ │ • Behavioral fingerprint analysis │ │
│ │ • Threat pattern matching (regex + embeddings) │ │
│ │ • Kill switch authority │ │
│ └────────────────────────────────────────────────────────────┘ │
│ ▲ All outputs pass through │
└─────────────────────────┼───────────────────────────────────────────┘
│
┌─────────────────────────┼───────────────────────────────────────────┐
│ QUARANTINE BUFFER │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ • Structured output only (JSON Schema validation) │ │
│ │ • No direct command passthrough │ │
│ │ • Output staging before release │ │
│ │ • Size limits enforced │ │
│ └────────────────────────────────────────────────────────────┘ │
│ ▲ Structured JSON only │
└─────────────────────────┼───────────────────────────────────────────┘
│
┌─────────────────────────┼───────────────────────────────────────────┐
│ EPHEMERAL WORKER │
│ (Small model in isolated container) │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ Hardware Isolation (gVisor/Kata Containers): │ │
│ │ • Network: DISABLED (no egress/ingress) │ │
│ │ • Filesystem: READ-ONLY │ │
│ │ • Capabilities: ALL DROPPED │ │
│ │ • Seccomp: RESTRICTED syscall whitelist │ │
│ │ • Lifecycle: DESTROYED after single task │ │
│ │ • Memory: HARD LIMIT (256MB) │ │
│ │ • CPU: THROTTLED │ │
│ └────────────────────────────────────────────────────────────┘ │
│ ▲ Untrusted input │
└─────────────────────────┼───────────────────────────────────────────┘
│
┌───────────┴───────────┐
│ UNTRUSTED INPUT │
│ (User data, emails, │
│ external content) │
└───────────────────────┘
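To make the Quarantine Buffer's contract concrete, the following minimal sketch shows the kind of check it performs: Worker output is accepted only if it is size-bounded, well-formed JSON conforming to a task-specific schema. The schema fields, size limit, and function name are illustrative assumptions, not a normative interface.
import json
from jsonschema import validate  # third-party: jsonschema
# Illustrative schema; real task schemas are application-specific.
WORKER_OUTPUT_SCHEMA = {
    "type": "object",
    "properties": {
        "summary": {"type": "string", "maxLength": 2000},
        "category": {"type": "string", "enum": ["report", "request", "spam", "other"]},
        "confidence": {"type": "number", "minimum": 0, "maximum": 1},
    },
    "required": ["summary", "category"],
    "additionalProperties": False,  # no extra keys that could smuggle instructions
}
MAX_OUTPUT_BYTES = 16 * 1024  # size limit enforced before parsing
def quarantine_filter(raw_output: str) -> dict:
    """Accept only size-bounded, schema-conformant JSON from the Worker."""
    if len(raw_output.encode("utf-8")) > MAX_OUTPUT_BYTES:
        raise ValueError("Worker output exceeds size limit")
    payload = json.loads(raw_output)  # rejects malformed JSON
    validate(instance=payload, schema=WORKER_OUTPUT_SCHEMA)  # rejects schema violations
    return payload  # structured data only; no raw text or commands pass through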
4.2 Trust Boundaries
We define explicit trust levels:
| Layer | Trust Level | Can Compromise | Compromised By |
|---|---|---|---|
| Worker | ZERO | Nothing (isolated) | Untrusted input |
| Quarantine | LOW | Schema bypass only | Malformed JSON |
| Guardian | MEDIUM | Worker, Quarantine | Adversarial attacks on medium model |
| Supervisor | HIGH | All lower layers | Adversarial attacks on large model, supply chain |
| Human | HIGHEST | All layers | Social engineering |
Critical assumption: We assume the Guardian and Supervisor models are not backdoored at training time. If an attacker has compromised the model weights, all defenses collapse. This is a fundamental limitation (see Section 9.2).
4.3 Container Isolation Specification
Standard Docker namespace isolation is insufficient for adversarial workloads due to shared kernel attack surface [14]. We recommend:
Minimum viable isolation (Docker with hardening):
services:
  worker:
    image: worker:minimal
    network_mode: "none"
    read_only: true
    security_opt:
      - no-new-privileges:true
      - seccomp:custom-profile.json
    cap_drop:
      - ALL
    mem_limit: 256m
    cpus: 0.5
    pids_limit: 50
    tmpfs:
      - /tmp:size=64m,noexec,nosuid
Recommended isolation (Kata Containers or gVisor):
# Kata Containers provide VM-level isolation per container
runtimeClassName: kata-qemu
Kata Containers run each container in a lightweight VM with its own kernel, eliminating shared-kernel side-channel attacks [15]. Performance overhead varies by workload type: approximately 10-15% for CPU-bound tasks, but 20-50% for I/O-intensive operations due to virtualized storage and network layers. For security-critical LLM processing (primarily CPU-bound inference), this overhead is acceptable.
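The hardening options above also map directly onto programmatic, single-use container launches. The sketch below uses the Docker SDK for Python; the image name, entrypoint, and payload-passing mechanism are assumptions mirroring the compose example, and the VM-backed runtime (Kata/gVisor) is selected by the runtime configured on the host rather than in this call.
import docker  # Docker SDK for Python
client = docker.from_env()
def run_ephemeral_worker(task_payload: str) -> str:
    """Run one task in a hardened, single-use container, then destroy it."""
    output = client.containers.run(
        image="worker:minimal",              # assumed image name (see compose example)
        command=["python", "worker.py"],     # assumed entrypoint
        environment={"TASK_PAYLOAD": task_payload},
        network_mode="none",                 # no egress/ingress
        read_only=True,                      # read-only root filesystem
        cap_drop=["ALL"],                    # drop all Linux capabilities
        security_opt=["no-new-privileges:true"],
        mem_limit="256m",
        nano_cpus=500_000_000,               # 0.5 CPU
        pids_limit=50,
        tmpfs={"/tmp": "size=64m,noexec,nosuid"},
        remove=True,                         # container is destroyed after the task
    )
    return output.decode("utf-8")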
5. Integrity Verification Protocol
5.1 Limitations of Static Canary Tokens
The original proposal used static canary tokens for integrity verification:
# VULNERABLE: Static token is predictable
CANARY = "RandomString-7X9"
response = agent.ask("INTEGRITY_CHECK: Canary?")
if CANARY not in response:
    return "COMPROMISED"
This approach has critical vulnerabilities:
- Predictable challenge: Attacker can instruct agent to respond correctly to integrity checks
- Timing side-channel: Check latency reveals token structure
- Token extraction: Repeated probing can extract or approximate the token
- Replay attacks: Static responses can be cached
5.2 Cryptographic Challenge-Response Protocol
We propose a dynamic challenge-response protocol based on cryptographic commitments:
import hashlib
import hmac
import secrets
import time
class IntegrityVerifier:
    def __init__(self, agent_id: str, secret_key: bytes):
        self.agent_id = agent_id
        self.secret_key = secret_key
        self.challenge_history = set()  # Outstanding nonces; prevents replay
    def generate_challenge(self) -> tuple[str, str]:
        """Generate a time-bound, non-repeating challenge."""
        nonce = secrets.token_hex(16)
        timestamp = int(time.time())
        # Challenge includes context the agent should know
        challenge_data = f"{self.agent_id}:{timestamp}:{nonce}"
        # Expected response is a truncated HMAC of the challenge
        expected = hmac.new(
            self.secret_key,
            challenge_data.encode(),
            hashlib.sha256
        ).hexdigest()[:16]
        # Store challenge to prevent replay
        self.challenge_history.add(nonce)
        return challenge_data, expected
    def verify_response(self, challenge: str, response: str,
                        expected: str, max_age_seconds: int = 30) -> bool:
        """Verify agent response with freshness and replay constraints."""
        # Check timestamp freshness
        parts = challenge.split(":")
        challenge_time = int(parts[1])
        if time.time() - challenge_time > max_age_seconds:
            return False  # Challenge expired
        # Check replay: nonce must be outstanding, and is consumed on use
        nonce = parts[2]
        if nonce not in self.challenge_history:
            return False  # Unknown or already-used challenge
        self.challenge_history.discard(nonce)
        # Verify response (constant-time comparison)
        return secrets.compare_digest(response.strip(), expected)
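A minimal usage sketch of the verifier follows. The Worker-side HMAC computation is simulated locally here; in the architecture it is produced by the Worker from the key embedded in its system prompt, and the agent_id value is arbitrary.
import hashlib
import hmac
import secrets
key = secrets.token_bytes(32)  # per-session key (see mitigations below)
verifier = IntegrityVerifier(agent_id="worker-01", secret_key=key)
challenge, expected = verifier.generate_challenge()
# Simulated Worker-side response computation:
worker_response = hmac.new(key, challenge.encode(), hashlib.sha256).hexdigest()[:16]
assert verifier.verify_response(challenge, worker_response, expected)
# Replaying the same challenge fails because the nonce is consumed on first use:
assert not verifier.verify_response(challenge, worker_response, expected)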
Protocol flow:
┌─────────┐ ┌─────────┐ ┌──────────┐
│ Guardian│ │ Worker │ │ Verifier │
└────┬────┘ └────┬────┘ └────┬─────┘
│ │ │
│ 1. Request challenge │ │
│────────────────────────────────────────────────>│
│ │ │
│ 2. Challenge + expected response │
│<────────────────────────────────────────────────│
│ │ │
│ 3. "Complete: {challenge}" │
│───────────────────────>│ │
│ │ │
│ 4. Response (from system prompt embedding) │
│<───────────────────────│ │
│ │ │
│ 5. Verify(challenge, response, expected) │
│────────────────────────────────────────────────>│
│ │ │
│ 6. VALID / INVALID │
│<────────────────────────────────────────────────│
│ │ │
Key properties:
- Freshness: Timestamp prevents replay of old responses
- Uniqueness: Nonce ensures each challenge is used once
- Binding: HMAC binds response to specific agent and context
- Resistance to extraction: Attacker cannot derive secret key from observed challenges
Remaining vulnerability: The Worker must have the secret key embedded in its system prompt to compute the HMAC. A sufficiently sophisticated prompt injection could instruct the agent to reveal this key. Mitigations include:
- Rotating keys frequently (per-session)
- Using TEE-based key storage (see Section 9.4)
- Treating key compromise as container compromise (trigger destruction)
6. Threat Model and Attack Taxonomy
6.1 Threat Model
Adversary capabilities:
- Can submit arbitrary text to the Worker agent
- Can craft multi-turn attack sequences
- Has knowledge of the system architecture (Kerckhoffs’s principle)
- Does NOT have access to secret keys, model weights, or container internals
- Does NOT control the network infrastructure
Security goals:
- Confidentiality: Prevent exfiltration of system prompts, keys, or user data
- Integrity: Prevent unauthorized actions (email forwarding, code execution)
- Availability: Maintain service despite attack attempts (graceful degradation)
6.2 Attack Taxonomy (OWASP-aligned)
Based on OWASP Top 10 for LLM Applications 2025 [1] and recent research [16]:
| Attack Class | Description | BioDefense Mitigation | Residual Risk |
|---|---|---|---|
| Direct Prompt Injection | Explicit instruction override in user input | Worker isolation + Guardian validation | Semantic attacks may evade detection |
| Indirect Prompt Injection | Malicious content in external data (emails, web pages) | Same as direct; treats all input as untrusted | RAG poisoning if retrieval is compromised |
| Multi-turn Attacks | Priming across conversation turns | Ephemeral workers reset state per task | Session-level attacks if tasks share context |
| Jailbreaking | Bypassing model safety constraints | Guardian behavioral analysis | Novel jailbreaks not in pattern database |
| Payload Splitting | Malicious content split across inputs | Per-task isolation; Guardian sees full output | Attacks split across tasks |
| Adversarial Suffixes | Token-level perturbations | Embedding-based detection | Transferable adversarial examples |
| Typoglycemia | Obfuscated text ("ignroe all previosu instrctions") | Regex + semantic analysis | Novel obfuscation techniques |
| Multimodal Injection | Instructions hidden in images/audio | NOT ADDRESSED | Requires multimodal analysis |
| Model Extraction | Repeated queries to steal model behavior | Rate limiting + behavioral monitoring | Determined attackers with many queries |
| Side-Channel Attacks | Timing/resource usage information leakage | Kata Containers reduce kernel sharing | Speculative execution attacks |
6.3 Known Bypasses
We acknowledge the following attack vectors that BioDefense does NOT adequately address:
- Training data poisoning: If the Guardian/Supervisor models contain backdoors, all defenses fail
- Multimodal injection: Image/audio payloads are not analyzed
- Supply chain attacks: Compromised container images or orchestration layer
- Slow-burn social engineering: Attacks that build trust over many interactions
- Covert channels: Information encoding in output structure (not content)
7. Cost Analysis
7.1 API Pricing Model
Based on published pricing as of February 2026 [17][18]:
| Model Class | Example Models | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|---|
| Small/Tiny | GPT-4o-mini, Claude Haiku 3.5 | $0.15 - $0.80 | $0.60 - $4.00 |
| Medium | GPT-4o, Claude Sonnet 4 | $3.00 - $5.00 | $15.00 |
| Large | Claude Opus 4.5, GPT-4 Turbo | $5.00 - $15.00 | $25.00 - $75.00 |
Note: Claude Opus 4.5 ($5/$25 per 1M) supersedes Opus 4 ($15/$75) with 67% cost reduction.
7.2 Per-Task Cost Breakdown
Assumptions:
- Average task: 1,000 input tokens, 500 output tokens
- Challenge-response adds 100 tokens per verification
- Guardian processes all outputs (100% coverage)
- Supervisor escalation rate: 5-15% depending on threat environment
Base case (5% escalation):
| Component | Calls per Task | Token Cost | API Cost |
|---|---|---|---|
| Worker (Haiku) | 1.0 | 1,500 tokens | $0.0015 |
| Challenge-Response | 1.0 | 200 tokens | $0.0002 |
| Guardian (Sonnet) | 1.0 | 800 tokens | $0.012 |
| Supervisor (Opus) | 0.05 | 1,000 tokens | $0.004 |
| Total per task | | | $0.018 |
| Per 1,000 tasks | | | $18.00 |
High-threat case (15% escalation):
| Component | Calls per Task | API Cost |
|---|---|---|
| Worker | 1.0 | $0.0015 |
| Challenge-Response | 1.5 | $0.0003 |
| Guardian | 1.0 | $0.012 |
| Supervisor | 0.15 | $0.012 |
| Total per task | | $0.026 |
| Per 1,000 tasks | | $26.00 |
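The totals above follow from a simple expected-cost computation; the sketch below reproduces that arithmetic. The per-call figures are the estimates from the tables (an Opus-class escalation is priced at roughly $0.08 per escalated task, as implied by the base-case row), not live API prices.
# Reproduces the per-task arithmetic above using the tables' estimated per-call costs.
PER_CALL_COST = {
    "worker": 0.0015,              # Haiku-class, ~1,500 tokens
    "challenge_response": 0.0002,  # ~200 extra tokens per verification
    "guardian": 0.012,             # Sonnet-class, ~800 tokens
    "supervisor": 0.08,            # Opus-class, ~1,000 tokens per escalated task
}
def cost_per_task(escalation_rate: float, challenge_calls: float = 1.0) -> float:
    """Expected API cost of one task for a given Supervisor escalation rate."""
    return (PER_CALL_COST["worker"]
            + challenge_calls * PER_CALL_COST["challenge_response"]
            + PER_CALL_COST["guardian"]
            + escalation_rate * PER_CALL_COST["supervisor"])
print(f"Base case (5% escalation):    ${cost_per_task(0.05):.3f} per task")
print(f"High threat (15% escalation): ${cost_per_task(0.15, challenge_calls=1.5):.3f} per task")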
7.3 Infrastructure Costs
| Component | Estimated Cost | Notes |
|---|---|---|
| Kata Containers overhead | +10-50% compute | 10-15% CPU, 20-50% I/O |
| Container orchestration | $50-200/month | Kubernetes cluster |
| Attack pattern database | $20-50/month | PostgreSQL + pgvector |
| Monitoring (Prometheus/Grafana) | $30-100/month | Depends on retention |
7.4 Cost Comparison
BioDefense vs. alternatives:
| Approach | Setup Complexity | Per-1000 Tasks | Coverage |
|---|---|---|---|
| BioDefense (this proposal) | High | $18-26 | Multi-layer, adaptive |
| LLM Guard (self-hosted) | Medium | $5-10 | Input/output scanning |
| Lakera Guard (API) | Low | $10-20 | Real-time detection |
| Constitutional AI only | Low | $2-5 | Training-time constraints |
| No protection | None | $1-2 | None |
Note: The original estimate of $1.20/1000 tasks was unrealistic. Accurate costs are 15-25x higher but provide substantially stronger guarantees.
8. Comparison with Existing Defenses
8.1 Feature Comparison
| Feature | BioDefense | LLM Guard | NeMo Guardrails | Lakera Guard |
|---|---|---|---|---|
| Hardware isolation | ✓ (Kata/gVisor) | ✗ | ✗ | ✗ |
| Multi-model verification | ✓ (3 layers) | ✗ | Partial | ✗ |
| Cryptographic integrity | ✓ | ✗ | ✗ | ✗ |
| Behavioral fingerprinting | ✓ | ✓ | ✗ | ✓ |
| Adaptive threat memory | ✓ | ✗ | ✗ | ✓ (cloud) |
| Ephemeral execution | ✓ | ✗ | ✗ | ✗ |
| Open source | Proposal only | ✓ | ✓ | ✗ |
| Production-tested | ✗ | ✓ | ✓ | ✓ |
8.2 When to Use What
- LLM Guard/Lakera: Production systems needing immediate protection with low integration effort
- NeMo Guardrails: Applications requiring programmable conversation flows
- BioDefense: High-security environments where defense-in-depth justifies operational complexity
Recommendation: BioDefense is complementary to existing tools. Use LLM Guard for input scanning, then pass to BioDefense architecture for high-risk operations.
9. Limitations and Future Work
9.1 Absence of Empirical Validation
This architecture has not been tested against real-world attacks. Required validation includes:
- Detection rate against HackAPrompt corpus [19]
- False positive rate on benign workloads
- Latency impact measurements
- Red team campaigns with adaptive adversaries
9.2 Training-Time Attacks
If the Guardian or Supervisor models contain backdoors planted during training, BioDefense provides no protection. Mitigations would require:
- Formal verification of model behavior (currently infeasible at scale)
- Multiple models from independent training pipelines
- Continuous behavioral monitoring for drift
9.3 Scalability Concerns
Container creation/destruction overhead may become prohibitive at scale (>1000 requests/second). Potential optimizations:
- Warm container pools (reduces isolation guarantees)
- Unikernel-based execution (e.g., Unikraft)
- Hardware-assisted isolation (AWS Nitro Enclaves)
9.4 Future Work
- TEE integration: Storing integrity verification keys in SGX/TrustZone enclaves
- Federated threat intelligence: Privacy-preserving sharing of attack patterns across organizations
- Formal verification: Proving security properties under bounded adversary models
- Multimodal analysis: Extending detection to image/audio payloads
- Ablation studies: Determining which components provide most value
10. Conclusion
We have presented BioDefense, a multi-layer defense architecture for LLM agent security inspired by biological immune systems. The architecture implements defense-in-depth through ephemeral execution, cryptographic integrity verification, behavioral analysis, and adaptive threat memory. We have explicitly addressed limitations of the original proposal, including vulnerable canary token design, unrealistic cost estimates, and incomplete attack coverage.
This proposal should be understood as a hypothesis, not a production-ready solution. We invite security researchers to:
- Identify additional attack vectors
- Propose improved integrity verification mechanisms
- Conduct empirical validation against attack datasets
- Suggest better biological analogies or alternative frameworks
The fundamental challenge—distinguishing instructions from data in natural language—remains unsolved. BioDefense aims to raise the cost and complexity of successful attacks, not to eliminate them entirely.
References
[1] OWASP. "Top 10 for LLM Applications 2025." https://owasp.org/www-project-top-10-for-large-language-model-applications/
[2] Greshake, K., et al. "Not What You’ve Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection." arXiv:2302.12173, 2023.
[3] Protect AI. "LLM Guard: Security Toolkit for LLM Applications." https://github.com/protectai/llm-guard
[4] NVIDIA. "NeMo Guardrails." https://github.com/NVIDIA/NeMo-Guardrails
[5] Lakera. "Lakera Guard." https://www.lakera.ai/
[6] Rebuff. "Prompt Injection Detector." https://github.com/protectai/rebuff
[7] Bai, Y., et al. "Constitutional AI: Harmlessness from AI Feedback." arXiv:2212.08073, 2022.
[8] Shinn, N., et al. "Reflexion: Language Agents with Verbal Reinforcement Learning." NeurIPS 2023.
[9] Du, Y., et al. "Improving Factuality and Reasoning in Language Models through Multiagent Debate." arXiv:2305.14325, 2023.
[10] Forrest, S., et al. "A Sense of Self for Unix Processes." IEEE S&P, 1996.
[11] Kephart, J. "A Biologically Inspired Immune System for Computers." ALIFE IV, 1994.
[12] Dasgupta, D. "Artificial Immune Systems and Their Applications." Springer, 1999.
[13] Darktrace. "Enterprise Immune System." https://www.darktrace.com/
[14] Aqua Security. "Container Isolation: Understanding Limitations." 2024.
[15] AWS. "Enhancing Kubernetes Workload Isolation Using Kata Containers." 2024.
[16] Perez, F., & Ribeiro, I. "Ignore Previous Prompt: Attack Techniques for Language Models." arXiv:2211.09527, 2022.
[17] Anthropic. "Claude API Pricing." https://www.anthropic.com/pricing
[18] OpenAI. "API Pricing." https://openai.com/pricing
[19] Schulhoff, S., et al. "Ignore This Title and HackAPrompt: Exposing Systemic Vulnerabilities of LLMs Through a Global Prompt Hacking Competition." EMNLP 2023.
Appendix A: Behavioral Fingerprinting Heuristics
The following heuristics are provided as examples, not production recommendations. Thresholds require tuning based on application-specific false positive tolerance.
import re
from dataclasses import dataclass
@dataclass
class TaskContext:
    """Assumed per-task context; only these fields are used in this appendix."""
    expected_length: int
    container_id: str = ""
@dataclass
class BehavioralScore:
    """Anomaly scoring for Worker output analysis."""
    # Weights (tune based on empirical data)
    WEIGHTS = {
        'length_anomaly': 1,
        'exclamation_density': 1,
        'urgency_keywords': 2,
        'instruction_override': 5,
        'external_references': 3,
        'consistency_failure': 3,
    }
    def analyze(self, output: str, context: TaskContext) -> int:
        score = 0
        # Length anomaly (more than twice the expected length for this task type)
        if len(output) > context.expected_length * 2:
            score += self.WEIGHTS['length_anomaly']
        # Exclamation density (social engineering indicator)
        exclamation_ratio = output.count('!') / max(len(output), 1)
        if exclamation_ratio > 0.02:  # More than 2%
            score += self.WEIGHTS['exclamation_density']
        # Urgency keywords
        urgency_terms = ['urgent', 'immediately', 'asap', 'critical',
                         'override', 'emergency', 'must act now']
        for term in urgency_terms:
            if term in output.lower():
                score += self.WEIGHTS['urgency_keywords']
                break
        # Instruction override patterns (scored per matching pattern)
        override_patterns = [
            r'ignore\s+(previous|prior|all)\s+instructions?',
            r'forget\s+(everything|your\s+rules)',
            r'you\s+are\s+now\s+[a-z]+',
            r'your\s+(true|real|actual)\s+instructions?',
        ]
        for pattern in override_patterns:
            if re.search(pattern, output.lower()):
                score += self.WEIGHTS['instruction_override']
        # External references (potential exfiltration)
        if re.search(r'https?://|@[a-z]+\.[a-z]+', output.lower()):
            score += self.WEIGHTS['external_references']
        return score
Thresholds:
- Score ≥ 5: Terminate container immediately
- Score ≥ 3: Escalate to Supervisor
- Score < 3: Pass to next layer
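A sketch of how the Guardian might apply these thresholds follows. Here, terminate_container, escalate_to_supervisor, and release_output are placeholders for the actions described in Section 4, and container_id comes from the assumed TaskContext above.
def guardian_decision(output: str, context: TaskContext) -> str:
    """Map a behavioral score onto the threshold actions listed above."""
    score = BehavioralScore().analyze(output, context)
    if score >= 5:
        terminate_container(context.container_id)  # placeholder: kill switch (Section 4)
        return "TERMINATED"
    if score >= 3:
        escalate_to_supervisor(output, context)    # placeholder: Supervisor escalation
        return "ESCALATED"
    release_output(output)                         # placeholder: pass to next layer
    return "PASSED"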
Appendix B: Attack Pattern Database Schema
-- PostgreSQL with pgvector extension
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE attack_patterns (
id SERIAL PRIMARY KEY,
-- Pattern identification
pattern_hash VARCHAR(64) UNIQUE NOT NULL,
pattern_regex TEXT,
pattern_embedding VECTOR(1536), -- OpenAI ada-002 dimension
-- Classification
attack_class VARCHAR(50) NOT NULL, -- OWASP category
severity INTEGER CHECK (severity BETWEEN 1 AND 10),
confidence FLOAT CHECK (confidence BETWEEN 0 AND 1),
-- Metadata
first_seen TIMESTAMP DEFAULT NOW(),
last_seen TIMESTAMP DEFAULT NOW(),
times_detected INTEGER DEFAULT 1,
false_positive_reports INTEGER DEFAULT 0,
-- Source tracking
source VARCHAR(100), -- 'internal', 'federated', 'research'
-- Indexes for fast lookup
CONSTRAINT valid_pattern CHECK (
pattern_regex IS NOT NULL OR pattern_embedding IS NOT NULL
)
);
CREATE INDEX idx_embedding ON attack_patterns
USING ivfflat (pattern_embedding vector_cosine_ops)
WITH (lists = 100);
CREATE INDEX idx_attack_class ON attack_patterns(attack_class);
CREATE INDEX idx_severity ON attack_patterns(severity DESC);
-- Function for semantic similarity search
CREATE OR REPLACE FUNCTION find_similar_attacks(
query_embedding VECTOR(1536),
threshold FLOAT DEFAULT 0.85,
max_results INTEGER DEFAULT 5
) RETURNS TABLE (
id INTEGER,
attack_class VARCHAR,
severity INTEGER,
similarity FLOAT
) AS $$
BEGIN
RETURN QUERY
SELECT
ap.id,
ap.attack_class,
ap.severity,
1 - (ap.pattern_embedding <=> query_embedding) AS similarity
FROM attack_patterns ap
WHERE 1 - (ap.pattern_embedding <=> query_embedding) > threshold
ORDER BY ap.pattern_embedding <=> query_embedding
LIMIT max_results;
END;
$$ LANGUAGE plpgsql;
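The Guardian can call find_similar_attacks from Python; a minimal sketch using psycopg2 follows. The connection string, the embedding source, and the textual vector literal are assumptions; the 1536-dimension vector mirrors the schema above.
import psycopg2
def query_similar_attacks(conn, query_embedding: list[float],
                          threshold: float = 0.85, max_results: int = 5):
    """Call the find_similar_attacks() SQL function defined above."""
    # pgvector accepts the textual form '[x1,x2,...]' cast to vector
    vector_literal = "[" + ",".join(f"{x:.6f}" for x in query_embedding) + "]"
    with conn.cursor() as cur:
        cur.execute(
            "SELECT * FROM find_similar_attacks(%s::vector, %s, %s)",
            (vector_literal, threshold, max_results),
        )
        return cur.fetchall()  # rows of (id, attack_class, severity, similarity)
# Usage sketch (connection string and embedding function are assumptions):
# conn = psycopg2.connect("dbname=biodefense user=guardian")
# hits = query_similar_attacks(conn, embed(worker_output))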
Changelog
v2.0 (February 2026)
- Replaced static canary tokens with cryptographic challenge-response protocol
- Corrected biological analogies (removed CRISPR as memory metaphor)
- Updated cost estimates from $1.20 to $18-26 per 1000 tasks
- Added comprehensive threat model and attack taxonomy
- Added comparison with existing defense mechanisms (LLM Guard, NeMo, Lakera)
- Expanded limitations section with honest assessment of unaddressed vectors
- Added container isolation recommendations (Kata Containers, gVisor)
- Restructured as academic paper with proper citations
v1.0 (February 2026)
- Initial proposal
This document is released under CC BY-SA 4.0. Build on it, improve it, critique it.