Incident Communication Guide

For Small Teams

Purpose

This guide defines when to declare an incident, who is responsible for communication, and how to communicate externally during an incident.

It also defines how to act technically during an incident, emphasizing mitigation over fixes to reduce risk and downtime.

The goals are to:

Restore service as fast as possible
Maintain customer trust
Avoid unnecessary risk during unstable situations

This guide assumes:

A small engineering team
No dedicated SRE or PR team
A hands-on CTO or tech lead involved in escalation

Core Principles

1. Not Everything Is an Incident

An incident is declared only when critical parts of the service are affected.

If everything is an incident, nothing is.

Over-declaring incidents causes alert fatigue, customer anxiety, and loss of credibility.

2. Incidents Are About Restoration, Not Root Cause

During an incident:

The objective is to restore service
Root cause analysis is explicitly postponed

Incidents are for stabilization. Learning happens after recovery.

3. Fixing Bugs Is a Last Resort During Incidents

During an incident, do not default to writing new code.

The priority order is:

1. Mitigate without code changes
2. Roll back or disable
3. Fix forward only if unavoidable

Bug fixes during incidents increase risk and often make outages longer.

4. Silence Causes Customers to Leave

Downtime happens. Silence is interpreted as loss of control or indifference.

Communication is part of incident response, not an afterthought.

What Qualifies as an Incident

Declare an incident only if at least one of the following is true:

Core paid action is blocked
Authentication or authorization is broken
Data integrity or correctness is at risk
Billing, payments, or entitlements are impacted
Multiple tenants or the entire system are affected

These are the only reasons to trigger external incident communication.
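As a rough illustration, the criteria above can be written as a single check. This is a minimal sketch, not part of any real tool: the `ImpactReport` type and its field names are hypothetical and simply mirror the list one to one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImpactReport:
    """Hypothetical snapshot of what is currently broken (names are assumptions)."""

    core_paid_action_blocked: bool = False
    auth_broken: bool = False
    data_integrity_at_risk: bool = False
    billing_impacted: bool = False
    multiple_tenants_affected: bool = False


def qualifies_as_incident(report: ImpactReport) -> bool:
    """Return True if at least one of the guide's criteria holds."""
    return any(
        (
            report.core_paid_action_blocked,
            report.auth_broken,
            report.data_integrity_at_risk,
            report.billing_impacted,
            report.multiple_tenants_affected,
        )
    )


# A delayed analytics dashboard sets none of the flags, so it stays a bug.
slow_dashboard = ImpactReport()
# Broken login trips the authentication criterion.
login_down = ImpactReport(auth_broken=True)
```

Keeping the check this small is deliberate: if a problem does not set one of these flags, it is handled as a bug or degradation, not as an incident.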

What Is Not an Incident

Do not declare incidents for:

Non-core features being slow
Analytics dashboards delayed
A single tenant misconfiguration
Background jobs running late while users are unblocked
Internal tooling failures

These are bugs or degradations, not incidents.

Incident Roles (Minimal but Mandatory)

One person may hold multiple roles, but responsibilities must be explicit.

Incident Lead

Owns the incident.

Responsibilities:

Decision-making
Prioritizing restoration over diagnosis
Coordinating mitigation
Ending the incident

Communication Lead

Owns external communication.

Responsibilities:

Writing and publishing status updates
Ensuring updates are timely and consistent
Avoiding speculation and over-commitment
Coordinating messaging with the Incident Lead

Only one person communicates externally.

Fix / Mitigation Lead

Owns technical recovery.

Responsibilities:

Applying mitigations
Rolling back deployments
Disabling features
Stabilizing the system
Reporting progress to the Incident Lead

CTO / Business Lead (Optional)

Owns trade-offs.

Responsibilities:

Customer impact decisions
Degradation vs availability trade-offs
SLA, credits, or escalation decisions
Declaring “good enough for now”

Technical Response Strategy During Incidents

Mitigation First, Fix Later

When an incident is active, prefer actions that reduce impact without changing code.

Examples of mitigation:

Disable feature flags
Throttle traffic or tenants
Pause background workers
Switch to degraded / read-only mode
Roll back to last known good version
Reduce load or concurrency
Block abusive tenants

Mitigation is:

Faster
Reversible
Lower risk
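To make "reversible" concrete, here is a minimal sketch of a feature-flag kill switch. It assumes a hypothetical in-memory `FeatureFlags` class and an invented flag name, `bulk_export`. A real system would back the flags with configuration or a database, but the shape of the mitigation is the same: switch the feature off, then restore it once the system is stable, with no deployment in between.

```python
class FeatureFlags:
    """Hypothetical in-memory flag store (an assumption, not a real library)."""

    def __init__(self, defaults: dict[str, bool]) -> None:
        self._defaults = dict(defaults)
        self._overrides: dict[str, bool] = {}

    def is_enabled(self, name: str) -> bool:
        # An override wins over the default; unknown flags are treated as off.
        return self._overrides.get(name, self._defaults.get(name, False))

    def disable(self, name: str) -> None:
        """Mitigation: switch a feature off without deploying code."""
        self._overrides[name] = False

    def restore(self, name: str) -> None:
        """Undo the mitigation and fall back to the default."""
        self._overrides.pop(name, None)


flags = FeatureFlags({"bulk_export": True})

flags.disable("bulk_export")  # mitigate during the incident
mitigated = flags.is_enabled("bulk_export")

flags.restore("bulk_export")  # reverse once the system is stable
restored = flags.is_enabled("bulk_export")
```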

When Is It Acceptable to Fix a Bug?

Fixing code during an incident is acceptable only if:

No mitigation or rollback is possible
The fix is minimal, isolated, and well understood
The blast radius of the fix is smaller than the blast radius of waiting
The team is confident the change will not introduce new failure modes

If these conditions are not met, do not ship a fix.

Why Fixing During Incidents Is Dangerous

Bug fixes during incidents:

Are written under stress
Skip normal review and testing
Introduce new unknowns
Often solve the wrong problem

Many prolonged outages are caused by:

“One more quick fix.”

Decision Rule for Incident Leads

Before approving a code change during an incident, ask:

Can we mitigate this without code?
Can we roll back instead?
Are we trying to be correct instead of stable?
Will this change reduce risk immediately?

If the answer to either of the first two questions is “yes,” do not fix forward.
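The rule can be sketched as a small function that walks the priority order: mitigate, then roll back, then fix forward. The names here are assumptions made for illustration. In particular, `fix_is_safe` is a hypothetical flag standing in for all of the conditions under which a fix is acceptable (minimal, isolated, well understood, smaller blast radius than waiting).

```python
from enum import Enum


class Action(Enum):
    MITIGATE = "mitigate without code changes"
    ROLL_BACK = "roll back or disable"
    FIX_FORWARD = "fix forward"
    NO_FIX = "do not ship a fix"


def next_action(can_mitigate: bool, can_roll_back: bool, fix_is_safe: bool) -> Action:
    """Pick the least risky available option, in the guide's priority order."""
    if can_mitigate:
        return Action.MITIGATE
    if can_roll_back:
        return Action.ROLL_BACK
    if fix_is_safe:
        # Only reached when neither mitigation nor rollback is possible.
        return Action.FIX_FORWARD
    return Action.NO_FIX


# A tempting "safe" fix still loses to an available rollback.
choice = next_action(can_mitigate=False, can_roll_back=True, fix_is_safe=True)
```

The ordering of the `if` statements is the whole point: a code fix is never considered while a mitigation or rollback is still available.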

Communication Rules

Communicate Early, Not Accurately

The first message does not require:

Root cause
Technical details
A fix

It must include:

Acknowledgement
User-level impact
Ownership
Next update time

Silence signals loss of control.
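One way to keep a first message from going out incomplete is to build it from the four required parts. The sketch below is illustrative only: the `FirstUpdate` class and its field names are hypothetical, and the wording in the example is placeholder text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FirstUpdate:
    """Hypothetical container for the four required parts of a first message."""

    acknowledgement: str
    user_impact: str
    ownership: str
    next_update_minutes: int

    def missing_fields(self) -> list[str]:
        """List the required parts that are empty or invalid."""
        missing = [
            name
            for name in ("acknowledgement", "user_impact", "ownership")
            if not getattr(self, name).strip()
        ]
        if self.next_update_minutes <= 0:
            missing.append("next_update_minutes")
        return missing

    def render(self) -> str:
        """Join the parts into one message, refusing to render if any is missing."""
        missing = self.missing_fields()
        if missing:
            raise ValueError(f"first update is missing: {', '.join(missing)}")
        return (
            f"{self.acknowledgement} {self.user_impact} {self.ownership} "
            f"Next update in {self.next_update_minutes} minutes."
        )


update = FirstUpdate(
    acknowledgement="We are aware of an issue.",
    user_impact="Some users cannot complete [core action].",
    ownership="Our team is actively investigating and mitigation is in progress.",
    next_update_minutes=30,
)
message = update.render()
```

Note that root cause, technical details, and a fix have no field at all, so there is nowhere to put them in a first message.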

First External Update (Acknowledgement)

Good:

We are aware of an issue affecting some users when performing [core action]. Our team is actively investigating and mitigation is in progress. Next update in 30 minutes.

Bad:

We are investigating an issue.

Ongoing Updates

Update even if nothing changed.

“No change” is still information. Predictability matters more than progress.

Avoid speculation at all times.

Degraded Mode Communication

If functionality is intentionally limited, say it explicitly.

Good:

The service is operating in a degraded mode. Core functionality is available, but [feature] is temporarily disabled to ensure stability.

Timing and Update Cadence

Severe outage: every 15 minutes
Normal incident: every 30 minutes
Resolution: immediately

The goal is predictability, not speed.
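The cadence amounts to simple arithmetic on the time of the last update. A minimal sketch follows, assuming hypothetical severity labels `"severe"` and `"normal"` and an arbitrary example timestamp.

```python
from datetime import datetime, timedelta, timezone

# Intervals taken from the cadence above; the dictionary keys are assumed labels.
UPDATE_INTERVALS = {
    "severe": timedelta(minutes=15),
    "normal": timedelta(minutes=30),
}


def next_update_due(last_update: datetime, severity: str) -> datetime:
    """Return when the next external update is owed."""
    return last_update + UPDATE_INTERVALS[severity]


# Arbitrary example: last update posted at 12:00 UTC during a severe outage.
last = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
due = next_update_due(last, "severe")
```

The due time depends only on when the last update went out, not on whether anything has changed since. That is what makes the cadence predictable.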

Never Promise Exact Fix Times

Do not promise exact resolution times during an incident.

Never:

“10 minutes”
“Almost fixed”
“Final checks”

Even if you are confident. Even if a customer asks directly.

What to Promise Instead

Promise update times, not fix times.

Good:

We’ll provide another update in 15 minutes.
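A Communication Lead could run a draft through a crude check before publishing. The sketch below is a heuristic under stated assumptions, not a reliable filter. It flags any sentence that contains one of the vague phrases listed above, or that mentions a duration without the word "update". Both the phrase list and the duration pattern are illustrative choices.

```python
import re

# Phrases the guide says never to use; matched case-insensitively.
FORBIDDEN_PHRASES = ("almost fixed", "final checks")

# A number followed by a time unit, e.g. "10 minutes" or "2 hrs".
DURATION = re.compile(r"\b\d+\s*(?:minutes?|mins?|hours?|hrs?)\b", re.IGNORECASE)


def find_fix_time_promises(message: str) -> list[str]:
    """Return the sentences that look like a promised fix time."""
    problems = []
    # Split after sentence-ending punctuation followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", message.strip()):
        lowered = sentence.lower()
        if any(phrase in lowered for phrase in FORBIDDEN_PHRASES):
            problems.append(sentence)
        elif DURATION.search(sentence) and "update" not in lowered:
            # A duration is fine only when it refers to the next update.
            problems.append(sentence)
    return problems


ok = find_fix_time_promises("We'll provide another update in 15 minutes.")
bad = find_fix_time_promises("This should be fixed in 10 minutes.")
```

A check like this will produce false positives and miss creative phrasing. It is a reminder for the author, not a substitute for judgment.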

Resolution Communication

Close the loop clearly once stable.

The issue has been resolved and all systems are operating normally. We’re continuing to monitor closely. Thank you for your patience.

After the Incident

Only after stability is restored:

Investigate root cause
Write a postmortem
Implement long-term fixes

Never mix recovery with analysis.

Do / Don’t Summary

Do

Declare incidents only for critical paths
Assign a clear Communication Lead
Prioritize mitigation over fixes
Roll back before fixing forward
Communicate early and consistently
Promise update times, not fix times
Close the loop explicitly

Don’t

Declare incidents for non-core issues
Stay silent while investigating
Speculate publicly
Over-explain technical details
Ship risky fixes under pressure
Promise exact resolution times
Combine recovery with root cause analysis

Final Note

Incidents are about restoring trust, not proving technical skill.

Stability beats correctness during incidents. Mitigation beats fixes. Communication beats silence.

Small teams that internalize this operate calmly and credibly under pressure.
