For Small Teams
Purpose
This guide defines when to declare an incident, who is responsible for communication, and how to communicate externally during an incident.
It also defines how to act technically during an incident, emphasizing mitigation over fixes to reduce risk and downtime.
The goals are to:
- Restore service as fast as possible
- Maintain customer trust
- Avoid unnecessary risk during unstable situations
This guide assumes:
- A small engineering team
- No dedicated SRE or PR team
- A hands-on CTO or tech lead involved in escalation
Core Principles
1. Not Everything Is an Incident
An incident is declared only when critical parts of the service are affected.
If everything is an incident, nothing is.
Over-declaring incidents causes alert fatigue, customer anxiety, and loss of credibility.
2. Incidents Are About Restoration, Not Root Cause
During an incident:
- The objective is to restore service
- Root cause analysis is explicitly postponed
Incidents are for stabilization. Learning happens after recovery.
3. Fixing Bugs Is a Last Resort During Incidents
During an incident, do not default to writing new code.
The priority order is:
- Mitigate without code changes
- Roll back or disable
- Fix forward only if unavoidable
Bug fixes during incidents increase risk and often make outages longer.
4. Silence Causes Customers to Leave
Downtime happens. Silence is interpreted as loss of control or indifference.
Communication is part of incident response, not an afterthought.
What Qualifies as an Incident
Declare an incident only if at least one of the following is true:
- A core paid action is blocked
- Authentication or authorization is broken
- Data integrity or correctness is at risk
- Billing, payments, or entitlements are impacted
- Multiple tenants or the entire system are affected
These are the only reasons to trigger external incident communication.
What Is Not an Incident
Do not declare incidents for:
- Non-core features being slow
- Analytics dashboards delayed
- A single tenant misconfiguration
- Background jobs running late while users are unblocked
- Internal tooling failures
These are bugs or degradations, not incidents.
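The two lists above can be read as a single yes/no check. The following is a minimal sketch, assuming a hypothetical ImpactReport record that the on-call engineer fills in by hand; the field names are illustrative and are not part of any existing tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImpactReport:
    """Hypothetical triage record; one flag per incident criterion."""
    core_paid_action_blocked: bool = False
    auth_broken: bool = False
    data_integrity_at_risk: bool = False
    billing_impacted: bool = False
    multi_tenant_or_system_wide: bool = False


def should_declare_incident(report: ImpactReport) -> bool:
    """Return True if at least one incident criterion is met."""
    return any((
        report.core_paid_action_blocked,
        report.auth_broken,
        report.data_integrity_at_risk,
        report.billing_impacted,
        report.multi_tenant_or_system_wide,
    ))


# A slow analytics dashboard sets none of the flags, so it stays a bug.
slow_dashboard = ImpactReport()
# Broken login is an incident on its own.
login_down = ImpactReport(auth_broken=True)
```

Anything that does not set one of these flags is handled as a bug or degradation through the normal process.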
Incident Roles (Minimal but Mandatory)
One person may hold multiple roles, but responsibilities must be explicit.
Incident Lead
Owns the incident.
Responsibilities:
- Decision-making
- Prioritizing restoration over diagnosis
- Coordinating mitigation
- Ending the incident
Communication Lead
Owns external communication.
Responsibilities:
- Writing and publishing status updates
- Ensuring updates are timely and consistent
- Avoiding speculation and over-commitment
- Coordinating messaging with the Incident Lead
Only one person communicates externally.
Fix / Mitigation Lead
Owns technical recovery.
Responsibilities:
- Applying mitigations
- Rolling back deployments
- Disabling features
- Stabilizing the system
- Reporting progress to the Incident Lead
CTO / Business Lead (Optional)
Owns trade-offs.
Responsibilities:
- Customer impact decisions
- Degradation vs availability trade-offs
- SLA, credits, or escalation decisions
- Declaring “good enough for now”
Technical Response Strategy During Incidents
Mitigation First, Fix Later
When an incident is active, prefer actions that reduce impact without changing code.
Examples of mitigation:
- Disable feature flags
- Throttle traffic or tenants
- Pause background workers
- Switch to degraded / read-only mode
- Roll back to last known good version
- Reduce load or concurrency
- Block abusive tenants
Mitigation is:
- Faster
- Reversible
- Lower risk
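To illustrate why mitigation is reversible, here is a minimal sketch of an in-memory kill switch. FeatureFlags and the flag name "bulk_export" are hypothetical; a real team would likely use its existing flag service or configuration store instead.

```python
class FeatureFlags:
    """Hypothetical in-memory flag store with an undo history."""

    def __init__(self, initial: dict[str, bool]) -> None:
        self._flags = dict(initial)
        self._history: list[tuple[str, bool]] = []  # (name, previous value)

    def is_enabled(self, name: str) -> bool:
        # Unknown flags are treated as disabled.
        return self._flags.get(name, False)

    def disable(self, name: str) -> None:
        """Turn a feature off and remember its previous state."""
        self._history.append((name, self.is_enabled(name)))
        self._flags[name] = False

    def undo_last(self) -> None:
        """Restore the most recently changed flag, if any."""
        if self._history:
            name, previous = self._history.pop()
            self._flags[name] = previous


flags = FeatureFlags({"bulk_export": True})
flags.disable("bulk_export")   # mitigation: no deploy, no new code
flags.undo_last()              # reversible once the system is stable
```

The point of the sketch is that disabling and restoring are both one-step operations that do not touch application code.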
When Is It Acceptable to Fix a Bug?
Fixing code during an incident is acceptable only if:
- No mitigation or rollback is possible
- The fix is minimal, isolated, and well understood
- The blast radius of the fix is smaller than the blast radius of waiting
- The team is confident the change will not introduce new failure modes
If these conditions are not met, do not ship a fix.
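The four conditions above can be turned into a checklist that names what is still blocking a hotfix. This is a sketch under assumptions: FixProposal and its field names are hypothetical, and the answers are human judgments, not values a tool can compute.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FixProposal:
    """Hypothetical record the Incident Lead fills in before approving a fix."""
    mitigation_or_rollback_possible: bool
    minimal_isolated_understood: bool
    fix_blast_radius_smaller_than_waiting: bool
    confident_no_new_failure_modes: bool


def fix_forward_blockers(proposal: FixProposal) -> list[str]:
    """Return the unmet conditions; an empty list means the fix may ship."""
    blockers = []
    if proposal.mitigation_or_rollback_possible:
        blockers.append("a mitigation or rollback is still possible")
    if not proposal.minimal_isolated_understood:
        blockers.append("the fix is not minimal, isolated, and well understood")
    if not proposal.fix_blast_radius_smaller_than_waiting:
        blockers.append("the fix has a larger blast radius than waiting")
    if not proposal.confident_no_new_failure_modes:
        blockers.append("the team is not confident about new failure modes")
    return blockers
```

Returning the list of blockers, and not only a yes/no answer, gives the Incident Lead something concrete to say when declining a fix.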
Why Fixing During Incidents Is Dangerous
Bug fixes during incidents:
- Are written under stress
- Skip normal review and testing
- Introduce new unknowns
- Often solve the wrong problem
Many prolonged outages are caused by:
“One more quick fix.”
Decision Rule for Incident Leads
Before approving a code change during an incident, ask:
- Can we mitigate this without code?
- Can we roll back instead?
- Are we trying to be correct instead of stable?
- Will this change reduce risk immediately?
If the answer to either of the first two questions is “yes,” do not fix forward.
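The decision rule follows the same priority order as the core principles: mitigate, then roll back, then fix forward only if unavoidable. A minimal sketch, assuming three hypothetical boolean inputs supplied by the Incident Lead:

```python
from enum import Enum


class Response(Enum):
    MITIGATE = "mitigate without code changes"
    ROLL_BACK = "roll back or disable"
    FIX_FORWARD = "fix forward"
    DO_NOT_SHIP = "do not ship a fix"


def choose_response(
    can_mitigate_without_code: bool,
    can_roll_back: bool,
    fix_meets_all_conditions: bool,
) -> Response:
    """Pick the lowest-risk available action, in priority order."""
    if can_mitigate_without_code:
        return Response.MITIGATE
    if can_roll_back:
        return Response.ROLL_BACK
    if fix_meets_all_conditions:
        return Response.FIX_FORWARD
    # Neither safe option exists and the fix is not safe enough to ship.
    return Response.DO_NOT_SHIP
```

Note that a ready and tempting fix never outranks an available mitigation or rollback: the function does not even look at the fix until both earlier options are ruled out.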
Communication Rules
Communicate Early, Even Without the Full Picture
The first message does not require:
- Root cause
- Technical details
- A fix
It must include:
- Acknowledgement
- User-level impact
- Ownership
- Next update time
Silence signals loss of control.
First External Update (Acknowledgement)
Good:
We are aware of an issue affecting some users when performing [core action]. Our team is actively investigating and mitigation is in progress. Next update in 30 minutes.
Bad:
We are investigating an issue.
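One way to make sure the first update never ships without its required elements is a small template. The sketch below is an assumption, not an existing tool: FirstUpdate and its fields are hypothetical, and the acknowledgement wording is fixed in the template so it cannot be left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FirstUpdate:
    """Hypothetical template for the first external message."""
    user_impact: str          # who is affected, in user-level terms
    ownership: str            # what the team is doing about it
    next_update_minutes: int  # when customers will hear from us next

    def render(self) -> str:
        """Build the message, refusing to render if an element is missing."""
        if not self.user_impact.strip():
            raise ValueError("user-level impact is required")
        if not self.ownership.strip():
            raise ValueError("ownership statement is required")
        if self.next_update_minutes <= 0:
            raise ValueError("next update time must be in the future")
        return (
            f"We are aware of an issue affecting {self.user_impact}. "
            f"{self.ownership} "
            f"Next update in {self.next_update_minutes} minutes."
        )


message = FirstUpdate(
    user_impact="some users when performing [core action]",
    ownership="Our team is actively investigating and mitigation is in progress.",
    next_update_minutes=30,
).render()
```

With these inputs, render() reproduces the "Good" example above. The "Bad" example cannot be produced, because it has no impact, no ownership, and no next update time.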
Ongoing Updates
Post an update even if nothing has changed.
“No change” is still information. Predictability matters more than progress.
Avoid speculation at all times.
Degraded Mode Communication
If functionality is intentionally limited, say it explicitly.
Good:
The service is operating in a degraded mode. Core functionality is available, but [feature] is temporarily disabled to ensure stability.
Timing and Update Cadence
- Severe outage: every 15 minutes
- Normal incident: every 30 minutes
- Resolution: immediately
The goal is predictability, not speed.
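The cadence can be encoded so the Communication Lead always knows when the next update is due. A minimal sketch, assuming hypothetical severity labels "severe" and "normal" that map to the intervals above:

```python
from datetime import datetime, timedelta, timezone

# Intervals taken from the cadence above; the labels are illustrative.
CADENCE = {
    "severe": timedelta(minutes=15),
    "normal": timedelta(minutes=30),
}


def next_update_due(severity: str, last_update: datetime) -> datetime:
    """Return the time by which the next status update must be posted."""
    return last_update + CADENCE[severity]


def is_overdue(severity: str, last_update: datetime, now: datetime) -> bool:
    """True if the promised update time has already passed."""
    return now > next_update_due(severity, last_update)


last = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
due = next_update_due("severe", last)  # 12:15 UTC on the same day
```

An unknown severity raises KeyError, which is deliberate in this sketch: the cadence should be chosen explicitly when the incident is declared.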
Never Promise Exact Fix Times
Do not promise exact resolution times during an incident.
Never:
- “10 minutes”
- “Almost fixed”
- “Final checks”
Even if you are confident. Even if a customer asks directly.
What to Promise Instead
Promise update times, not fix times.
Good:
We’ll provide another update in 15 minutes.
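A draft message can be checked for accidental fix-time promises before it is published. The following is a rough heuristic sketch, not a complete filter: the phrase list and the rule "a duration is only allowed in a sentence that mentions an update" are assumptions made for illustration.

```python
import re

# Phrases taken from the "Never" list above.
BANNED_PHRASES = ("almost fixed", "final checks")

# Matches durations such as "10 minutes", "5 min", "2 hours".
DURATION = re.compile(r"\b\d+\s*(?:minutes?|mins?|hours?|hrs?)\b", re.IGNORECASE)


def find_fix_time_promises(draft: str) -> list[str]:
    """Return banned phrases and sentences that look like fix-time promises."""
    problems = []
    lowered = draft.lower()
    for phrase in BANNED_PHRASES:
        if phrase in lowered:
            problems.append(phrase)
    # Split on sentence-ending punctuation followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", draft):
        # A duration is acceptable only when it refers to the next update.
        if DURATION.search(sentence) and "update" not in sentence.lower():
            problems.append(sentence.strip())
    return problems
```

A check like this only catches obvious wording. The Communication Lead still reads every message before it goes out.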
Resolution Communication
Close the loop clearly once stable.
The issue has been resolved and all systems are operating normally. We’re continuing to monitor closely. Thank you for your patience.
After the Incident
Only after stability is restored:
- Investigate root cause
- Write a postmortem
- Implement long-term fixes
Never mix recovery with analysis.
Do / Don’t Summary
Do
- Declare incidents only for critical paths
- Assign a clear Communication Lead
- Prioritize mitigation over fixes
- Roll back before fixing forward
- Communicate early and consistently
- Promise update times, not fix times
- Close the loop explicitly
Don’t
- Declare incidents for non-core issues
- Stay silent while investigating
- Speculate publicly
- Over-explain technical details
- Ship risky fixes under pressure
- Promise exact resolution times
- Combine recovery with root cause analysis
Final Note
Incidents are about restoring trust, not proving technical skill.
Stability beats correctness during incidents. Mitigation beats fixes. Communication beats silence.
Small teams that internalize this operate calmly and credibly under pressure.