Agent-Driven SRE Investigations: A Practical Deep Dive into Multi-Agent Incident Response

Introduction

I’ve been exploring how far we can push fully autonomous, multi-agent investigations in real SRE environments — not as a theoretical exercise, but using actual Kubernetes clusters and real tooling. Each agent in this experiment operated inside a sandboxed environment with access to Kubernetes MCP for live cluster inspection and GitHub MCP to analyze code changes and even create remediation pull requests.

The goal wasn’t simply to generate summaries, but to observe whether agents can behave like an actual on-call team: investigate an alert, validate each other’s findings, propose mitigation steps, and — when appropriate — submit a PR that teammates can review and approve.

What emerged is a surprisingly capable prototype of an agent-driven incident response workflow, where one agent investigates, another validates the findings, and a third reviews the mitigation plan and decides whether to escalate or authorize automated actions.

The Experiment

The setup is simple in concept. I split the work across three agents:

Receiver → On-call engineer who receives the alert and runs the initial investigation
Reviewer 1 → First on-call teammate, who reviews the investigation
Reviewer 2 → Second on-call teammate, who provides the final assessment and acts as decision maker

This mimics the dynamics of a real SRE team handling an alert: investigation, peer review, and final decision.

Figure: An agent-driven AI SRE team solving an incident together

Experiment Goals

The key questions I want to answer are:

Can autonomous agents meaningfully triage incidents end-to-end?
Do they catch each other’s mistakes, improving reliability compared to a single agent?
Can they reduce alert fatigue by escalating only when it matters?

I expect that they can go surprisingly far — but the gaps will also be obvious once real-world complexity comes into play.

Technical Setup

To ensure complete isolation and safety, I run the agents inside Docker containers. Each agent gets full tool access within the container using --dangerously-skip-permissions, which is safe since the container is isolated from the host system. The agents run Claude in non-interactive mode with different CLAUDE.md prompt files that define their roles.

Agent Communication via Local Files

The agents communicate through a sequential file chain:

Agent 1 (Receiver):

Reads: alert.json (the original alert)
Role: On-call engineer conducting initial investigation using Kubernetes MCP server
Writes: investigation-1.md with findings, root cause analysis, and mitigation recommendations

Agent 2 (Reviewer 1):

Reads: alert.json and investigation-1.md
Role: First teammate reviewing the investigation
Writes: investigation-2.md with validation, corrections, and additional insights

Agent 3 (Reviewer 2):

Reads: alert.json, investigation-1.md, and investigation-2.md
Role: Second teammate providing final assessment
Writes: investigation-3.md with consolidated mitigation plan and escalation decision

Each agent runs with its own CLAUDE.md file that defines its specific role:

CLAUDE-receiver.md — defines the on-call engineer role
CLAUDE-reviewer1.md — defines the first reviewer role
CLAUDE-reviewer2.md — defines the second reviewer role

Container Setup

The containerized environment includes all necessary tools and authentication:

Dockerfile:


FROM node:20-slim

# Install dependencies
RUN apt-get update && apt-get install -y \
    curl \
    git \
    uuid-runtime \
    && rm -rf /var/lib/apt/lists/*

# Install kubectl
RUN curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl" \
    && chmod +x kubectl \
    && mv kubectl /usr/local/bin/

# Install gh CLI
RUN curl -fsSL https://cli.github.com/packages/githubcli-archive-keyring.gpg | dd of=/usr/share/keyrings/githubcli-archive-keyring.gpg \
    && chmod go+r /usr/share/keyrings/githubcli-archive-keyring.gpg \
    && echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/githubcli-archive-keyring.gpg] https://cli.github.com/packages stable main" | tee /etc/apt/sources.list.d/github-cli.list > /dev/null \
    && apt-get update \
    && apt-get install -y gh

# Install Claude Code
RUN npm install -g @anthropic-ai/claude-code

WORKDIR /workspace

CMD ["/bin/bash"]

docker-compose.yml:


version: '3.8'
services:
  sre-agent:
    build: .
    volumes:
      - ./:/workspace
      - ~/.kube:/root/.kube:ro  # Kubernetes config
      - ~/.config/gh:/root/.config/gh:ro  # GitHub CLI auth
      - ~/.config/claude:/root/.config/claude:ro  # Claude auth
    network_mode: host  # Access to local k8s cluster

run-investigation.sh:


#!/bin/bash
# Run agent 1 with an isolated session (prompt read from stdin, report written to stdout)
claude -p --session-id "$(uuidgen)" --dangerously-skip-permissions < CLAUDE-receiver.md > investigation-1.md

# Run agent 2 with an isolated session
claude -p --session-id "$(uuidgen)" --dangerously-skip-permissions < CLAUDE-reviewer1.md > investigation-2.md

# Run agent 3 with an isolated session
claude -p --session-id "$(uuidgen)" --dangerously-skip-permissions < CLAUDE-reviewer2.md > investigation-3.md

Usage:


docker-compose build
docker-compose run sre-agent bash run-investigation.sh

This pattern ensures each agent:

Has the full context of previous work through the sequential file chain
Runs in complete isolation with unique session IDs
Has safe, unrestricted tool access within the container
Can access Kubernetes and GitHub with your local credentials
Maintains a clear audit trail of the investigation process

The containerized approach provides flexibility to:

Control information flow between agents through file-based communication
Expand easily with new tools or checks without affecting the host system
Reuse the structure for other workflows like CI/CD validation or automated rollbacks
Run experiments safely without risking production systems

Experimental Results

⚠️ CRITICAL DISCLAIMER: This is a completely sandboxed experiment. All incidents, investigations, and remediation actions occurred in an isolated test environment with no production impact.

I conducted 7 documented experiments across 3 distinct incident types. All runs were successful in identifying root causes and proposing remediation actions.

Experiment 1: Init Container CrashLoopBackOff

Incident: Pod stuck in perpetual crash state for 16 days

Alert: KubePodNotReady for failing-init-demo-5999d7545c-sh4zk

Investigation:

Agents identified a hardcoded exit 1 in the init container
Verified 4,512 restarts over 16 days
Proposed deletion as primary mitigation

Unexpected Discovery: The final reviewer discovered this was an intentional test deployment - the very test case for the investigation system itself! The agents were investigating their own test data without realizing it.

Key Learning: Agents lack meta-awareness. They didn’t question why a deployment named "failing-init-demo" existed or check project documentation. This revealed a critical gap: investigators need an explicit "context gathering" phase before diving into technical diagnosis.
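
One lightweight form of that context-gathering step is a naming heuristic run before any technical diagnosis. The sketch below is an assumption, not part of the experiment: the function name and keyword list are hypothetical, and a real check would also consult labels, annotations, and project documentation.

```shell
# Hypothetical pre-check: does the workload name suggest an intentional
# test or demo deployment? The keyword list is an assumption.
looks_like_test_workload() {
  case "$1" in
    *demo*|*test*|*example*|*sandbox*) return 0 ;;
    *) return 1 ;;
  esac
}

if looks_like_test_workload "failing-init-demo-5999d7545c-sh4zk"; then
  echo "Name suggests a test deployment: confirm intent before remediating"
fi
```

A match should not end the investigation; it only tells the agent to confirm intent before proposing a deletion.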

Experiment 2: Ingress Misconfiguration (HTTP 404)

Incident: Application returning 404 errors for 11 hours

Root Cause: AWS ALB configured with HTTPS on port 444 instead of the standard port 443

# Incorrect
alb.ingress.kubernetes.io/listen-ports: '[{"HTTP": 80}, {"HTTPS": 444}]'
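
A misconfiguration like this can be caught mechanically. The following is a minimal sketch, assuming the annotation value has already been read into a string; check_https_port is a hypothetical helper and was not part of the experiment.

```shell
# Hypothetical check: flag an ALB listen-ports annotation whose HTTPS
# listener is not on 443. Takes the annotation value as its only argument.
check_https_port() {
  # Pull out the number that follows "HTTPS": in the JSON-like value.
  port=$(printf '%s\n' "$1" |
    sed -n 's/.*"HTTPS"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p')
  if [ -z "$port" ]; then
    echo "no HTTPS listener found"
    return 2
  fi
  if [ "$port" -ne 443 ]; then
    echo "non-standard HTTPS port: $port"
    return 1
  fi
  echo "HTTPS port OK"
}
```

Called with the value shown above, the function reports port 444 and returns a non-zero status. In a live cluster the value could be read with kubectl's jsonpath output, escaping the dots in the annotation key; the ingress name would depend on your deployment.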

Investigation Quality:

Agent 1: Correctly identified the root cause with definitive evidence
Agent 2: Enhanced safety analysis, identified shared ALB risk affecting 5 other services
Agent 3: Authorized auto-remediation with comprehensive monitoring requirements

Output: Final investigation (Agent 3) was 871 lines of comprehensive incident analysis, including a mitigation plan, rollback procedures, monitoring requirements, and long-term prevention strategies (Kyverno policy recommendations).

Experiments 3-7: Database Configuration Error Series

Incident Type: Application crash due to a non-existent database

Root Cause: Database name changed to evershop-update-2025, which doesn’t exist

These experiments tested different system capabilities:

Experiment 3: Investigation without GitHub MCP access

Agents identified the crash but couldn’t trace it to the source code changes
Demonstrated the critical importance of repository access

Experiment 4: JSONL output format testing

Revealed excessive verbosity (large, hard-to-parse files)
Led to format simplification

Experiment 5: Full investigation with all access

Successfully traced the issue to commit 39920dac, merged 11 minutes before the alert
Created remediation PR
99% confidence in root cause

Timeline correlation:

09:14:34 UTC - Commit merged
09:24:26 UTC - Deployment with bad config
09:25:25 UTC - Alert fires
6+ hours     - Service down, 75 pod restarts
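
The "11 minutes" figure follows directly from these timestamps. Here is a minimal sketch of the arithmetic, assuming all three events fall on the same UTC day; to_seconds is a hypothetical helper.

```shell
# Convert HH:MM:SS to seconds since midnight.
to_seconds() {
  h=${1%%:*}; rest=${1#*:}; m=${rest%%:*}; s=${rest#*:}
  # Strip one leading zero so "09" is not read as an invalid octal number.
  echo $(( ${h#0} * 3600 + ${m#0} * 60 + ${s#0} ))
}

commit=$(to_seconds "09:14:34")
deploy=$(to_seconds "09:24:26")
alert=$(to_seconds "09:25:25")

echo "commit -> deploy: $(( deploy - commit ))s"   # 592s (9m52s)
echo "deploy -> alert:  $(( alert - deploy ))s"    # 59s
echo "commit -> alert:  $(( alert - commit ))s"    # 651s (10m51s)
```

The alert fired less than a minute after the bad deployment rolled out, which is what makes the correlation with the commit so strong.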

Experiment 6: Action logging improvement

The investigation was correct, but lacked documentation of HOW agents investigated
Led to prompt changes requiring explicit "Investigation Actions" sections
Now, agents document: "Ran PostgreSQL pod via Kubernetes MCP to verify databases" instead of just "Database verified."

Experiment 7: Silent failure case

Critical reliability issue: Agent 2 (Reviewer 1) produced only a newline, no investigation output
Agent 1 successfully created PR #6 and wrote a complete investigation (23 lines)
Agent 2 failed silently - output file contained only 1 byte (newline character)
Agent 3 ran automatically, but only had Agent 1’s output available (Agent 2’s was empty)
Agent 3 still correctly reviewed PR #6 and provided a valid assessment
A 4th agent was run manually to verify the investigation chain
Root cause: Agent 2 executed but produced no text output (silent failure)
Impact: Subsequent agents in the pipeline lacked intermediate review, though the final result was still valid
Lessons: Identified need for output validation, real-time visibility, and fail-fast behavior
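
The output-validation and fail-fast lessons can be sketched as a small guard that runs after each agent. This is an illustration under assumptions, not the script used in the experiment; require_output is a hypothetical helper.

```shell
# Hypothetical guard: fail when an agent's output file is missing or holds
# nothing but whitespace (the Experiment 7 case: a 1-byte file with a newline).
require_output() {
  if [ ! -f "$1" ]; then
    echo "ERROR: $1 was not created" >&2
    return 1
  fi
  # Delete all whitespace and check whether anything is left.
  if [ -z "$(tr -d '[:space:]' < "$1")" ]; then
    echo "ERROR: $1 is empty or whitespace-only" >&2
    return 1
  fi
}
```

Adding a line such as `require_output investigation-2.md || exit 1` after the second agent would have stopped the pipeline before Agent 3 ran without the intermediate review.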

System Evolution

Through these experiments, the system evolved significantly:

Output Format Evolution

Initial (Experiments 1-4): 7-section verbose format

1. Incident Summary
2. Investigation Findings
3. Root Cause
4. Immediate Mitigation
5. Long-term Prevention
6. Monitoring Plan
7. Pull Requests Links

Current (Experiments 6-7): Concise, action-focused format

1. Investigation Actions (tools/methods used)
2. Root Cause (with file:line references)
3. Immediate Actions (critical steps only)
4. PR Links

Current State: The sandbox script remains simple for experimentation. A production implementation would need the safety features identified in Experiment 7: output validation, real-time visibility, and fail-fast behavior.

Key Findings

Strengths

Root Cause Identification: 100% success rate across all experiments
Tool Proficiency: Excellent use of Kubernetes and GitHub APIs
Multi-Agent Value: Peer review consistently improved quality and safety
Autonomous Remediation: Successfully created appropriate PRs with fixes
Evidence-Based Analysis: Strong technical reasoning with specific code references

Critical Issues ⚠️

Silent Failures: Agents can complete work without producing the required outputs
Meta-Context Blindness: Agents don’t question the investigation’s purpose or check documentation
Verbosity Control: Initial formats consumed excessive tokens
Process Transparency: Agents didn’t document their investigation methods

Production Readiness Gap

This is a sandbox experiment. Significant engineering work remains for production deployment:

Missing Components

Security:

No authentication/authorization layer
No audit logging
Kubernetes access is cluster-admin (overly permissive)

Safety:

No dry-run mode
No blast radius analysis
No automated rollback
No rate limiting or circuit breakers

Integration:

No PagerDuty/incident management integration
No Slack notifications
No metrics/observability for agent performance
No SLA tracking

Reliability:

No retry logic for transient failures
No concurrency control (parallel alerts)
No queue management
No cost controls (unlimited API calls)
No timeout handling

Operations:

No runbooks for agent failures
No monitoring of agent health
No disaster recovery plan
No multi-tenancy support
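
Of the gaps listed above, retry logic is among the simplest to prototype. The sketch below shows one possible shape, assuming a fixed delay between attempts; the retry function is hypothetical and does not exist in the sandbox script.

```shell
# Hypothetical wrapper: retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached.
retry() {
  max=$1; delay=$2; shift 2
  attempt=1
  while :; do
    "$@" && return 0
    if [ "$attempt" -ge "$max" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    echo "attempt $attempt failed, retrying in ${delay}s" >&2
    sleep "$delay"
    attempt=$(( attempt + 1 ))
  done
}
```

Timeout handling could be layered on by passing the coreutils `timeout` command as the wrapped command, for example `retry 3 10 timeout 900 some-agent-step`, where `some-agent-step` stands in for whatever launches an agent. Note that a retry only helps when the failing step exits non-zero; a silent failure like the one in Experiment 7 also needs an output check.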

Cost Analysis (Estimated)

Per-Investigation Estimate (using Claude Sonnet-class pricing):

Simple incidents: ~50-80K tokens total across 3 agents
Complex incidents: ~100-150K tokens total across 3 agents
Actual cost varies by model (Haiku: $0.10-0.30, Sonnet: $0.50-2.00, Opus: $5-15 per investigation)

Sandbox Experience:

This experiment used Claude Sonnet (mid-tier pricing)
Token usage averaged 40-100K per full investigation cycle
Cost per investigation: Estimated $1-5 (sandbox scale)

At Scale Estimate:

100 high-priority alerts/day with automated investigation
Using hybrid model (Haiku for Agent 1, Sonnet for Agent 3)
Estimated: $10-50K/month depending on alert volume and model selection
Critical: Aggressive filtering needed to investigate only high-value alerts
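
The at-scale figure can be sanity-checked with simple multiplication. The sketch below assumes a 30-day month and whole-dollar costs; monthly_cost is a hypothetical helper, and the per-investigation prices are the sandbox estimate ($1-5) and the Opus upper bound ($15) quoted above.

```shell
# monthly_cost ALERTS_PER_DAY COST_PER_INVESTIGATION_USD [DAYS_PER_MONTH]
# Whole-dollar arithmetic only; the 30-day default is an assumption.
monthly_cost() {
  echo $(( $1 * $2 * ${3:-30} ))
}

monthly_cost 100 1    # 3000
monthly_cost 100 5    # 15000
monthly_cost 100 15   # 45000
```

At 100 alerts/day this spans roughly $3K to $45K per month, the same order of magnitude as the estimate above. Where a deployment lands in that range depends mostly on model choice and on how aggressively alerts are filtered.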

Timeline to Production

Estimated effort: 6-12 months with a dedicated team

Months 1-2: Security hardening, authentication, authorization
Months 3-4: Production integrations (PagerDuty, Slack, metrics)
Months 5-6: Reliability engineering (retries, timeouts, queuing)
Months 7-8: Cost controls and optimization
Months 9-10: Multi-tenancy and scale testing
Months 11-12: Beta program with read-only investigations

Statistical Summary

Metric	Value
Total Experiments	7
Unique Incident Types	3
Root Cause Success Rate	100%
PRs Created by Agents	3+ (verified: #2, #5, #6)
Total Investigation Output	4,333 lines
Avg Investigation Time	2-6 hours
Silent Failures Discovered	1
Format Iterations	3

Conclusion

This experiment successfully demonstrates that multi-agent AI systems can autonomously:

✅ Investigate Kubernetes incidents with real cluster access
✅ Identify root causes with 99-100% confidence
✅ Create pull requests with appropriate fixes
✅ Provide valuable peer review and safety validation
✅ Make escalation decisions

However, this is nowhere near production-ready.

The experiment validates technical feasibility but reveals the productionization gap is substantial. The agents are excellent technical investigators but lack the safety controls, reliability engineering, and operational maturity required for production incident response.

Appendix: Experiment Files

All 7 experiments are documented in snapshots/:

snapshots/
├── 1-init-container-crashloop/       # Intentional test case discovery
├── 2-wrong-port-ingress-alb/         # HTTP 404 due to port 444
├── 3-wrong-db-missing-mcp/           # Investigation without GitHub access
├── 4-wrong-db-jsonl-missing-mcp/     # Format testing
├── 5-wrong-db-JSONL/                 # Full successful investigation
├── 6-good-but-not-enough-actions-logs/  # Action logging improvement
└── 7-missing-review1-text-outputs/   # Silent failure case

Each directory contains:

alert.json - Original Prometheus alert
investigation-{1,2,3}.md - Agent outputs
Supporting files (backups, PR references)

GitHub Repository URL: github.com/opsworker-ai/agentic-sre-investigations-poc
