Building PagerDuty's SRE Agent
pagerduty.com·1d·

We didn’t try to build a clever agent. We built one that shows up pre‑armed.

The lesson arrived earlier this year, as we began developing the SRE Agent, in a familiar-looking incident at 9:23 p.m. PT: consumer lag in production. Years earlier, we had documented a rare race condition in our runbook: duplicate records created through the REST API. We wrote a safe cleanup, promised ourselves we would add the proper constraint, and moved on. Two refactors and more than two years later, the same failure returned through Kafka consumers. The shape was the same; the door was different. People didn’t immediately connect it to the old notes. The response stretched to almost three hours, and it required extra responders late at night.

This incident and its lessons were fresh in our minds…
