27 Dec, 2025
Waking up to a pager at 3 AM is my least favorite part of working in web development. I love the work, but the reality is that our services don’t have operating hours. They are expected to be up and running every second of every day. Sometimes things just break. That well-engineered service your team spent months refining will eventually stop working in some unexpected way. Yes, sometimes AWS goes down, but most of these wounds are self-inflicted and we know it.
There is no silver bullet for this. Software development is complex, and putting a service out into production for the rest of the world to hammer on inevitably leads to unforeseen consequences. Most teams are already doing RCAs (Root Cause Analysis) for major incidents because leadership expects an explanation when things go sideways. An RCA might prevent that specific issue from happening again, but it is fundamentally reactive.
Even if your team is savvy enough to use the 5 Whys to dig deep, or clever enough to implement auto-remediation, RCAs only focus on the last big disaster. They don’t build a habit of continuous improvement. This is why I advocate for a dedicated Operational Excellence meeting.
A common problem I see is that while teams are great at spotting problems, they are rarely good at planning the work required to fix them. Most teams are buried under deadlines for new features. Technical debt or stability improvements get pushed to the next sprint, and then the sprint after that, until they are forgotten because the priority never shifts.
To fix this, there has to be an agreement that the team will constantly take on improvement work alongside new features. This manages the expectations between the developers and the rest of the company. Most company leaders assume the team is maintaining and improving the service, but they rarely know how to set that expectation formally because it is hard to define. New features are easy to ask for: we want X by date Y. But "make sure the site doesn’t crash at 3 AM" is nebulous. Without a formal process, you end up with a massive backlog of technical debt that eventually brings feature development to a crawl.
Operational Excellence as a routine solves this by making improvement a standard part of the job. When I first started doing these meetings I had a simple goal: stop getting paged early in the morning. After 6 months of doing this meeting we had basically accomplished that.
The Formula
I recommend starting with a regular one-hour meeting every week or two to get everyone into the right mindset. Keep it timeboxed and structured. This is my basic agenda for most of these meetings:
Review recent incidents: Talk about what happened since the last meeting, even the small stuff that didn’t require a full RCA. I usually find the deepest discussion happens here because the problems are still fresh for the team. It also keeps these problems front of mind.
Review a team dashboard: Look at the metrics that actually matter for what your service does. Context is everything here. If something important isn’t on the dashboard, add it.
Create actionable tasks: Any issues discovered during the meeting should be turned into tickets and assigned to upcoming sprints immediately. I usually keep an ongoing To Do list that we will update during the meeting so we don’t lose track of key insights.
Build in a feedback loop: Spend a few minutes discussing how to make the meeting itself more effective. The meeting shouldn’t be rigid; make sure it works for your team.
(Advanced) Check quality controls: Look at your unit testing trends and other forms of QA to see where coverage might be slipping. I only recommend this after getting other areas of operations under control. This is more about shifting left and discovering problems earlier.
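For the advanced item, one way to spot slipping coverage before the meeting is to compare each module's latest reading with the average of its earlier ones. This is a minimal Python sketch and not a prescribed tool: it assumes you can export a coverage percentage per module per week, and the module names, numbers, and the 2-point threshold are all made up for illustration.

```python
from statistics import mean

# Hypothetical weekly coverage percentages per module, oldest first.
coverage_history = {
    "checkout": [82.0, 81.5, 79.0, 76.5],
    "search": [70.0, 70.5, 71.0, 71.5],
    "auth": [90.0, 90.0, 89.5, 89.0],
}


def slipping_modules(history, threshold=2.0):
    """Return modules whose latest coverage is more than `threshold`
    percentage points below the average of their earlier readings."""
    flagged = {}
    for module, readings in history.items():
        if len(readings) < 2:
            continue  # not enough data to call it a trend
        baseline = mean(readings[:-1])
        drop = baseline - readings[-1]
        if drop > threshold:
            flagged[module] = round(drop, 2)
    return flagged


if __name__ == "__main__":
    for module, drop in slipping_modules(coverage_history).items():
        print(f"{module}: down {drop} points vs. earlier average")
```

Anything this flags is a conversation starter for the meeting, not a verdict. A drop may simply mean a large new feature landed and its tests are still on the way.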
I’ve included my simple agenda for this meeting at the bottom of the post. Please copy it and make it your own!
The most important factor in making this successful is ownership. This shouldn’t be a meeting where "operations folks" tell developers what is wrong. Every developer on the team needs to be involved. You should also consider inviting product managers, QA engineers, and security folks to get a full picture of the service’s health.
For this to work, you need to foster psychological safety and a blameless culture. People need to feel comfortable discussing mistakes and weird edge cases openly so you can find real solutions. When you spot a recurring problem, use it as a research opportunity. This fits into a broader topic that I’ll discuss later, but just know that the Operational Excellence meeting is a key driver to making your team more open and effective.
My simple Ops Excellence agenda template
Take this agenda and make it your own. I have typically started with this agenda and then slowly adapted it over the weeks to fit the team’s style. On one of my teams, for instance, we had a weekly 30-minute ops excellence meeting with tight timeboxes on each agenda item.
Operational Excellence Meeting Agenda
Frequency: Every 1 to 2 weeks
Duration: 60 Minutes
Participants: Full Engineering Team, Product Manager, QA, Security
- Review past incidents (15 minutes)
- Review Operational Dashboard (15 minutes)
- Review upcoming releases (10 minutes)
- Review and assign all action items (10 minutes)
- Feedback and meeting improvement (5-10 minutes)
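If you change the timeboxes, it is worth checking that they still add up to the meeting length. Here is a small Python sketch of that arithmetic using the agenda above. The helper names are my own for illustration, and the start times assume every item uses its full timebox.

```python
# Agenda items from the template as (name, min_minutes, max_minutes).
AGENDA = [
    ("Review past incidents", 15, 15),
    ("Review Operational Dashboard", 15, 15),
    ("Review upcoming releases", 10, 10),
    ("Review and assign all action items", 10, 10),
    ("Feedback and meeting improvement", 5, 10),
]
MEETING_MINUTES = 60


def total_range(agenda):
    """Return the (shortest, longest) total duration of the agenda."""
    low = sum(item[1] for item in agenda)
    high = sum(item[2] for item in agenda)
    return low, high


def fits(agenda, meeting_minutes):
    """True if the agenda fits even when every item runs its full timebox."""
    _, high = total_range(agenda)
    return high <= meeting_minutes


def start_offsets(agenda):
    """Start minute of each item, assuming each one uses its full timebox."""
    offsets, clock = [], 0
    for name, _low, high in agenda:
        offsets.append((clock, name))
        clock += high
    return offsets


if __name__ == "__main__":
    low, high = total_range(AGENDA)
    print(f"Agenda needs {low} to {high} of {MEETING_MINUTES} minutes")
    for start, name in start_offsets(AGENDA):
        print(f"{start:02d} min  {name}")
```

With the template as written, the items take 55 to 60 minutes, so the agenda only just fits the hour. If you add an item, something else has to shrink.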
This is a topic I have become really passionate about because it is something I see most teams neglect. If you have feedback on why this works or why it does not, please reach out to me. There are always interesting facets to these problems that I have not encountered yet, and I would love to learn more from your experiences.