Incident Postmortem Template

This template follows the Google SRE incident postmortem model. Use it after every production incident (severity P1–P3). Aim to complete within 5 business days of resolution.


Incident Metadata

| Field | Value |
|---|---|
| Incident ID | [FILL IN: e.g., INC-2026-001] |
| Title | [FILL IN: Clear, specific incident description] |
| Date Started | [FILL IN: YYYY-MM-DD HH:MM UTC] |
| Date Resolved | [FILL IN: YYYY-MM-DD HH:MM UTC] |
| Duration | [FILL IN: e.g., 45 minutes] |
| Severity | [FILL IN: P1/P2/P3] |
| Status | [FILL IN: Resolved / Mitigated] |
| Incident Commander | [FILL IN: Name of person who led response] |
| Postmortem Author | [FILL IN: Name and date of writing] |

Summary

[FILL IN: Write a one-paragraph executive summary. Focus on what happened and who was affected, not why (that comes later). Include the primary symptom that customers noticed.]


Impact Statement

Users Affected

[FILL IN: How many users? Percentage of user base? Specific customer segments? Or “internal-only” if this was pre-production.]

Service Degradation

| Metric | During Incident | Normal Baseline | SLO Target |
|---|---|---|---|
| Availability | [FILL IN: %] | [FILL IN: %] | [FILL IN: %] |
| Error Rate (5xx) | [FILL IN: %] | [FILL IN: %] | [FILL IN: %] |
| Latency p95 | [FILL IN: ms] | [FILL IN: ms] | [FILL IN: ms] |
| [Service-specific metric] | [FILL IN] | [FILL IN] | [FILL IN] |
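To fill the latency row, p95 can be computed from raw request latencies. A minimal sketch using the nearest-rank method (the `p95` helper and the sample data are illustrative, not part of this template):

```python
import math

def p95(samples):
    """95th percentile via the nearest-rank method:
    sort the samples, then take the value at rank ceil(0.95 * n)."""
    s = sorted(samples)
    rank = max(1, math.ceil(0.95 * len(s)))
    return s[rank - 1]

latencies_ms = list(range(1, 101))  # illustrative: requests took 1..100 ms
print(p95(latencies_ms))  # → 95
```

Monitoring systems usually compute this for you (e.g., a histogram quantile); the sketch is only useful when reconstructing percentiles from raw logs after the fact.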

SLO Burn

[FILL IN: e.g., “This incident consumed 25% of the weekly error budget. We had 75% remaining as of the incident start.”]
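The arithmetic behind a statement like "consumed 25% of the weekly error budget" is simple. This sketch assumes a request-based SLO; the function name and the numbers are illustrative:

```python
def error_budget_burn(slo_target, total_requests, failed_requests):
    """Fraction of the error budget consumed by a set of failed requests.

    The budget is the number of failures the SLO tolerates over the
    window: (1 - slo_target) * total_requests.
    """
    allowed_failures = (1 - slo_target) * total_requests
    return failed_requests / allowed_failures

# Illustrative: 99.9% weekly SLO, 10M requests/week, 2,500 requests failed
# during the incident -> 2,500 / 10,000 allowed failures = 25% of the budget.
burn = error_budget_burn(0.999, 10_000_000, 2_500)
print(f"{burn:.0%}")  # → 25%
```

For time-based SLOs the same shape applies with downtime minutes in place of failed requests.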

Business Impact

[FILL IN: Quantify if possible — lost transactions, customers unable to complete workflows, revenue impact, reputation impact. Or “no customer impact” if this was caught before users noticed.]


Timeline

| Time (UTC) | Actor(s) | Event | Notes |
|---|---|---|---|
| [FILL IN: HH:MM] | [FILL IN: Name] | [FILL IN: What triggered the incident?] | [FILL IN: e.g., Deploy completed, alert fired, manual testing] |
| [FILL IN] | [FILL IN] | [FILL IN: First sign of problems (error spike, latency spike, etc.)] | [FILL IN: Which dashboard/alert? What metric breached?] |
| [FILL IN] | [FILL IN] | [FILL IN: Detection — alert fired or on-call noticed] | [FILL IN: How long after failure? Was MTTD acceptable?] |
| [FILL IN] | [FILL IN] | [FILL IN: Escalation — escalated to another team?] | [FILL IN: Which team? Why needed?] |
| [FILL IN] | [FILL IN] | [FILL IN: Investigation step — hypothesis tested] | [FILL IN: What dashboard/log was checked? What was ruled out?] |
| [FILL IN] | [FILL IN] | [FILL IN: Mitigation began] | [FILL IN: Temporary fix, rollback, scaling, etc.] |
| [FILL IN] | [FILL IN] | [FILL IN: Mitigation applied] | [FILL IN: Result — did traffic drop? Latency improve?] |
| [FILL IN] | [FILL IN] | [FILL IN: Full recovery confirmed] | [FILL IN: All metrics back to normal. Verified by whom?] |
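When judging whether MTTD was acceptable, the detection and recovery delays fall straight out of the timeline's timestamps. A minimal sketch (the timestamps are illustrative; same-day HH:MM values are assumed):

```python
from datetime import datetime

def minutes_between(start_hhmm, end_hhmm):
    """Elapsed minutes between two same-day HH:MM timestamps."""
    fmt = "%H:%M"
    delta = datetime.strptime(end_hhmm, fmt) - datetime.strptime(start_hhmm, fmt)
    return delta.total_seconds() / 60

# Illustrative timeline: failure began 14:30, detected 14:32, recovered 14:35.
mttd = minutes_between("14:30", "14:32")  # time to detect: 2.0 minutes
mttr = minutes_between("14:30", "14:35")  # time to recover: 5.0 minutes
```

For incidents that cross midnight, use full `YYYY-MM-DD HH:MM` timestamps instead, as the metadata table already does.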

Root Cause Analysis

5 Whys

Level 1: What directly caused the outage?

[FILL IN: e.g., “Database connection pool exhausted after new query pattern was deployed.”]

Level 2: Why did that happen?

[FILL IN: e.g., “The new endpoint was making N queries per request without connection reuse.”]

Level 3: Why was this not caught?

[FILL IN: e.g., “Load testing was only done with 50 concurrent users; production load was 500 users.”]

Level 4: Why didn’t load testing match production?

[FILL IN: e.g., “The load test harness was not configured with realistic request distributions. The new endpoint represents 5% of traffic in production but was 1% in the test.”]

Level 5: Why was the process insufficient?

[FILL IN: e.g., “Load test configs are not validated against production traffic patterns. There is no automatic comparison of test vs. prod.”]

Root Cause Summary

[FILL IN: 1-2 sentences pulling together the 5 Whys into a single clear statement. This is what you prevent going forward.]

Contributing Factors

[FILL IN: Conditions that worsened impact or slowed recovery but were not the root cause — e.g., a stale runbook, alert fatigue, on-call unfamiliarity with the service.]


Action Items

| Action | Type | Owner | Priority | Due Date | Status | Notes |
|---|---|---|---|---|---|---|
| [FILL IN: e.g., “Increase DB max_connections from 50 to 200”] | Prevent | [FILL IN: Name] | [FILL IN: P1/P2/P3] | [FILL IN: Date] | [FILL IN: Open/In Progress/Done] | [FILL IN: e.g., “Reduces queueing under peak load”] |
| [FILL IN: e.g., “Add per-endpoint query count monitoring”] | Detect | [FILL IN: Name] | [FILL IN: P1/P2/P3] | [FILL IN: Date] | [FILL IN: Open/In Progress/Done] | [FILL IN: e.g., “Alerts if a single endpoint makes >1000 queries/sec”] |
| [FILL IN: e.g., “Implement connection pool backpressure”] | Mitigate | [FILL IN: Name] | [FILL IN: P1/P2/P3] | [FILL IN: Date] | [FILL IN: Open/In Progress/Done] | [FILL IN: e.g., “Return 503 instead of queuing when pool is exhausted”] |
| [FILL IN: e.g., “Update runbook with recovery steps”] | Mitigate | [FILL IN: Name] | [FILL IN: P1/P2/P3] | [FILL IN: Date] | [FILL IN: Open/In Progress/Done] | [FILL IN: e.g., “Document which command to run and expected recovery time”] |
| [FILL IN: e.g., “Load test with realistic traffic distributions”] | Prevent | [FILL IN: Name] | [FILL IN: P1/P2/P3] | [FILL IN: Date] | [FILL IN: Open/In Progress/Done] | [FILL IN: e.g., “Before deploying any endpoint changes, run load test matching prod traffic patterns”] |
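The example "connection pool backpressure" action refers to failing fast instead of queuing callers when the pool is empty. One possible shape, sketched with a hypothetical `BoundedPool` class (the names are illustrative, not from any specific database driver):

```python
import queue

class PoolExhausted(Exception):
    """No connection is free; the caller should map this to a 503."""

class BoundedPool:
    def __init__(self, size):
        self._conns = queue.Queue(maxsize=size)
        for i in range(size):
            self._conns.put(f"conn-{i}")  # placeholder connection objects

    def acquire(self):
        # get_nowait never blocks: instead of queuing the request while the
        # pool is exhausted, surface the overload immediately.
        try:
            return self._conns.get_nowait()
        except queue.Empty:
            raise PoolExhausted("no free connections; return 503")

    def release(self, conn):
        self._conns.put(conn)
```

A request handler that catches `PoolExhausted` and returns 503 keeps queues short and recovery fast, at the cost of deliberately shedding some load during the spike.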

Lessons Learned

What Went Well

[FILL IN: e.g., the alert fired within a minute of the error spike; the rollback procedure worked as documented.]

What Went Poorly

[FILL IN: e.g., the runbook was out of date; escalation took longer than expected.]

Where We Got Lucky

[FILL IN: e.g., the incident began during low-traffic hours; an engineer familiar with the subsystem happened to be online.]


Supporting Information

Relevant Code / Configuration

[FILL IN: Link the commit, PR, or configuration change associated with the incident, if any.]

Alert Definitions

[FILL IN: Copy the alert rule that fired. Include the PromQL query or threshold.]

```yaml
# Example (Prometheus alerting-rule syntax):
- alert: HighDatabaseConnections
  expr: (pg_stat_activity_count / pg_settings_max_connections) > 0.8
  for: 2m
```

Logs from Incident Window

[FILL IN: Paste key log entries (anonymized). Focus on errors, state changes, and timestamps.]

Example:

```text
2026-04-04T14:32:00Z ERROR: Database pool exhausted (50/50 connections in use)
2026-04-04T14:32:03Z ERROR: Request timeout waiting for connection
2026-04-04T14:33:15Z INFO: Database scaled from 50 to 200 max_connections
2026-04-04T14:33:45Z INFO: Connection pool recovered, now 12/200 in use
```

Appendix: Communication Log

Slack / Incident Channel

[FILL IN: Key messages from the incident channel. Timestamps. Who said what.]

```text
14:32:00 @alice: Error rate spike in dashboard. Looking now.
14:32:15 @bob: Is this related to the deploy 30 min ago?
14:32:30 @alice: Checking. Let me review the load test results.
14:33:00 @carlos: I've started the scale-up playbook
14:33:45 @alice: Root cause identified. Bad query pattern in new endpoint. Issue #456.
14:34:00 @carlos: Scale-up complete. Latency recovering.
14:35:00 @alice: All clear. Metrics back to baseline. Opening postmortem.
```

Public Communication

[FILL IN: If customers were notified, paste the communication (status page update, email, etc.). Include timestamps.]

```text
2026-04-04 14:35 UTC
Status update: A brief service degradation occurred from 14:32–14:35 UTC.
Affected: [X]% of requests experienced elevated latency.
Cause: Database connection pool exhaustion.
Resolution: Scaled up connection limits. Service is now fully recovered.
Postmortem coming soon.
```

Document History

| Version | Author | Date | Changes |
|---|---|---|---|
| 1.0 | [FILL IN: Name] | [FILL IN: Date] | Initial postmortem |
| [FILL IN] | [FILL IN] | [FILL IN] | [FILL IN: Any edits after publication] |

Sign-Off

[FILL IN: Names, roles, and dates of the reviewers who approved this postmortem — typically the incident commander and the owning team's lead.]

Notes for Future Postmortems

[FILL IN: Feedback on the postmortem process itself — anything to change in this template or the review process next time.]