4 min read · 1 day ago
As software architects and service owners, we often obsess over the “Day 1” of our services: the design, the tech stack, the clean code. But the reality of engineering is that 90% of a service’s lifecycle is “Day 2”: operations, maintenance, debugging, and fighting fires.
We build microservices, and then we become slaves to them. We wake up for alerts, we manually grep logs for the same recurring errors, and we context-switch away from high-value work to perform mundane triage.
The industry’s answer has been “Copilots” — AI assistants that wait for you to ask them questions. But Senior Engineers don’t just want a smarter CLI. They want a partner.
We are at a ‘Power Loom’ moment for software engineering. Centuries ago, the introduction of mechanized looms didn’t eliminate the need for weavers; it transitioned them from manual laborers to system overseers. They moved from throwing a shuttle by hand to managing a battery of machines that produced ten times the output with higher precision.
The same is happening to us.
> We are transitioning from ‘Hand-weavers of Code’ to ‘Architects of Agency.’ Our role is expanding from manually grepping logs and triaging exceptions to designing the Service Guardians that perform those tasks on our behalf. Software engineers aren’t being replaced; their leverage is being multiplied.
This article introduces a different architectural pattern, the Service Guardian: an autonomous agent that sits alongside your service, understands its internal logic, and has the agency to investigate, report, and even fix issues without human intervention.
Here is how the Service Guardian is architected using Node.js, Google’s Gemini 2.5 Flash, and the Model Context Protocol (MCP), and why every service owner should build one.
See it in Action
I deployed the Service Guardian to a live environment and introduced a breaking schema change. The following video demonstrates the agent detecting the crash, analyzing the SQL mismatch, and filing a JIRA ticket — completely autonomously.
Operational note: While this demo utilizes a manual trigger for clarity, the architecture is designed for seamless integration with production telemetry (e.g., CloudWatch, Prometheus webhooks) for fully automated incident ingestion.
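To make the webhook path concrete, here is a minimal sketch of what incident ingestion could look like. It assumes an Alertmanager-style JSON body (an `alerts` array whose entries carry `status`, `labels`, `annotations`, and `startsAt`). The `/alerts` route, the `service` label, and the `startIngest` and `toIncidents` names are illustrative assumptions, not part of the demo.

```javascript
// Sketch only: maps an Alertmanager-style webhook body to incident records.
const http = require('node:http');

// Convert one webhook payload into zero or more incidents, keeping only firing alerts.
function toIncidents(payload) {
  const alerts = Array.isArray(payload && payload.alerts) ? payload.alerts : [];
  return alerts
    .filter((a) => a.status === 'firing')
    .map((a) => ({
      name: (a.labels && a.labels.alertname) || 'unknown',
      service: (a.labels && a.labels.service) || 'unknown', // "service" label is an assumption
      summary: (a.annotations && a.annotations.summary) || '',
      startedAt: a.startsAt || null,
    }));
}

// Hypothetical entry point: POST /alerts hands each incident to the agent callback.
function startIngest(port, onIncident) {
  const server = http.createServer((req, res) => {
    if (req.method !== 'POST' || req.url !== '/alerts') {
      res.writeHead(404).end();
      return;
    }
    let body = '';
    req.on('data', (chunk) => { body += chunk; });
    req.on('end', () => {
      try {
        toIncidents(JSON.parse(body)).forEach(onIncident);
        res.writeHead(202).end();
      } catch (err) {
        res.writeHead(400).end(); // malformed JSON or a failing handler
      }
    });
  });
  return server.listen(port);
}
```

Calling `startIngest(8080, (incident) => console.log(incident))` would start the listener. Keeping `toIncidents` as a pure function means the mapping can be checked without opening a port.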
The Paradigm Shift: Ownership vs. Stewardship
Traditional operational tooling is passive: dashboards, log aggregators, alert thresholds. Users act on them. A Service Guardian is active. It acts on the user’s behalf.
Imagine a specialized agent that knows the codebase. When an exception is thrown:
- It doesn’t just page the on-call engineer.
- It spawns a process.
- It pulls the stack trace.
- It reads the relevant source code (leveraging direct file access).
- It identifies that a db.all wrapper was missing in the new commit.
- It drafts a JIRA ticket with the exact fix and Slacks the link.
This isn’t sci-fi. This is a pattern that can be built today with standardized open protocols.
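One step in that list, going from a stack trace to the source files worth reading, is easy to sketch. The helper below assumes V8-style stack frames (`at fn (file:line:col)`). The `appFrames` name and the filtering rules are illustrative choices, not something the demo is known to use.

```javascript
// Sketch only: extract application frames from a V8-style stack trace string.
function appFrames(stack, limit = 3) {
  // Matches "at fn (file:line:col)" as well as the bare "at file:line:col" form.
  const frameRe = /^\s*at (?:.*? \()?(.+?):(\d+):(\d+)\)?$/;
  const frames = [];
  for (const line of String(stack).split('\n')) {
    const m = frameRe.exec(line);
    if (!m) continue; // the error message line and odd frames are ignored
    const file = m[1];
    // Skip dependencies and Node internals; the agent should read our code first.
    if (file.includes('node_modules') || file.startsWith('node:')) continue;
    frames.push({ file, line: Number(m[2]), column: Number(m[3]) });
    if (frames.length === limit) break;
  }
  return frames;
}
```

The agent can then load only those files, and perhaps a window of lines around each frame, rather than the whole repository.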
The Architecture
1. The Hands: Model Context Protocol (MCP)
The biggest barrier to building custom agents used to be “Tool Fatigue.” Connecting an LLM to a specific Postgres DB, JIRA instance, and Slack channel meant writing glue code for weeks.
MCP solves this. It treats tools like microservices.
- Need to give the agent access to a database? Spin up a Postgres MCP server.
- Need to give it access to an internal wiki? Spin up a Confluence MCP server.
In this implementation, effectively zero API integration code was written. The system simply utilizes the atlassian-mcp-server and instructs the agent: "Here are the tools. Use them."
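To illustrate the "tools as microservices" idea without pulling in any dependency, here is a toy in-process model. Each server advertises a set of tools, the agent merges them into one catalog for the model, and calls are routed by name. This is not the MCP SDK or its wire protocol. `ToolRegistry`, the server names, and the tool names are all hypothetical.

```javascript
// Sketch only: an in-process stand-in for "tools as microservices".
// This is NOT the MCP SDK; server names and tool names are hypothetical.
class ToolRegistry {
  constructor() {
    this.tools = new Map(); // qualified tool name -> { description, handler }
  }

  // Register every tool a server advertises, namespaced by server name.
  addServer(serverName, tools) {
    for (const [name, tool] of Object.entries(tools)) {
      this.tools.set(`${serverName}.${name}`, tool);
    }
  }

  // The catalog handed to the model: names and descriptions, no handlers.
  catalog() {
    return [...this.tools].map(([name, t]) => ({ name, description: t.description }));
  }

  // Route a model-issued call to the owning handler.
  async call(name, args) {
    const tool = this.tools.get(name);
    if (!tool) throw new Error(`Unknown tool: ${name}`);
    return tool.handler(args);
  }
}
```

In a real deployment the handlers would be remote calls to MCP servers. The point of the sketch is only that adding a capability means registering another server, not writing another integration.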
2. The Brain: Gemini 2.5 Flash
For a Service Guardian, speed is a feature. You cannot wait 30 seconds for an LLM to ponder the existential implications of a NullPointerException.
Gemini 2.5 Flash was selected for its balance of a massive context window and sub-second latency. This allows the agent to ingest large chunks of logs and code files in a single pass (“YOLO mode”) and reason across them quickly.
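Even with a large context window, a single-pass prompt still needs a budget. The sketch below shows one way to pack a log tail and several source files into one prompt while recording what did not fit. The character budget is a crude stand-in for tokens, and `buildContext` and the section headers are assumptions made for illustration.

```javascript
// Sketch only: pack logs and source files into one prompt under a size budget.
// The budget is in characters, a crude proxy for tokens (an assumption).
function buildContext(logTail, files, maxChars) {
  const parts = [`## Logs\n${logTail}`];
  let used = parts[0].length;
  const skipped = [];
  for (const { path, content } of files) {
    const section = `## File: ${path}\n${content}`;
    // +2 accounts for the blank-line separator added by join() below.
    if (used + 2 + section.length > maxChars) {
      skipped.push(path); // report what was left out instead of silently truncating
      continue;
    }
    parts.push(section);
    used += 2 + section.length;
  }
  return { prompt: parts.join('\n\n'), skipped };
}
```

Returning `skipped` lets the agent mention in its report which files it never saw. The sketch does not trim the log tail itself, so the caller is expected to pass one that already fits.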
3. The Nervous System: The Event Loop
The agent is effectively a Node.js process wrapping the LLM interaction. It creates a “Run Loop” that mirrors how a Senior Engineer thinks:
    // The "Service Owner" Mental Model
    while (goal !== COMPLETE) {
      // 1. Observe (Read Log / Webhook)
      // 2. Orient  (Search Codebase / Check Docs)
      // 3. Decide  (Plan Fix)
      // 4. Act     (Execute Tool)
    }
This loop runs inside the infrastructure, behind the firewall, so the only data that leaves the control boundary is the tokens sent for inference.
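The run loop above can be sketched as runnable Node.js by injecting the four steps as functions. Everything here is hypothetical scaffolding: `runGuardian`, the `steps` object, the `done` action type, and the iteration cap are assumptions. In a real agent the model call would live inside `decide` and the tool calls inside `act`.

```javascript
// Sketch only: the observe / orient / decide / act loop with injected steps.
// `steps` and its four functions are hypothetical placeholders.
async function runGuardian(incident, steps, maxIterations = 5) {
  const history = [];
  for (let i = 0; i < maxIterations; i++) {
    const observation = await steps.observe(incident, history); // read log / webhook
    const context = await steps.orient(observation);            // search codebase / docs
    const action = await steps.decide(context, history);        // plan fix
    if (action.type === 'done') {
      return { status: 'complete', history };
    }
    const result = await steps.act(action);                     // execute tool
    history.push({ action, result });
  }
  // Hard stop so a confused agent cannot loop forever.
  return { status: 'gave_up', history };
}
```

The iteration cap is a deliberate design choice. An autonomous loop with tool access should have a bounded worst case, and a `gave_up` result is a natural point to page a human.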
Why This Matters for Architects
We usually define “architecture” as the structure of the software itself: classes, interfaces, databases.
The argument here is that Automated Operations must become part of the definition of software architecture. If you design a service, you should also design the agent that maintains it.
- Self-Healing: Agents can roll back deployments if metrics deviate.
- Self-Documenting: Agents can update the README when code changes.
- Knowledge Retention: When a senior dev leaves, the agent retains the “tribal knowledge” of how to debug the system because it has access to the runbooks and history.
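As an illustration of the self-healing bullet, the check for "metrics deviate" can start out very simple: compare post-deploy metrics against a baseline and flag any that grew past a ratio limit. The metric names, the thresholds, and the `shouldRollback` name are hypothetical. A production version would likely add time windows and minimum sample sizes.

```javascript
// Sketch only: decide whether post-deploy metrics deviate enough to roll back.
// Metric names and ratio limits are hypothetical.
function shouldRollback(baseline, current, limits = { errorRate: 2.0, p95LatencyMs: 1.5 }) {
  const reasons = [];
  for (const [metric, maxRatio] of Object.entries(limits)) {
    const before = baseline[metric];
    const after = current[metric];
    if (typeof before !== 'number' || typeof after !== 'number') continue; // missing data: no verdict
    // Guard against a zero baseline by dividing by a small floor instead.
    const ratio = after / Math.max(before, 1e-9);
    if (ratio > maxRatio) reasons.push(`${metric} rose ${ratio.toFixed(1)}x (limit ${maxRatio}x)`);
  }
  return { rollback: reasons.length > 0, reasons };
}
```

Returning the reasons alongside the verdict gives the agent something concrete to put in the ticket or Slack message when it does roll back.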
The Results
A demo “Service Guardian” was deployed for a Node.js analytics service. When a breaking schema change was introduced:
- Without Guardian: Detection could take more than 45 minutes.
- With Guardian: Triggered by the alert, the agent caught the crash, analyzed the new SQL query against the old schema, identified the mismatch, and filed a JIRA ticket with the corrected SQL, all in 45 seconds.