Incident Postmortem Template

This template follows the Google SRE incident postmortem model. Use it after every production incident (severity P1–P3). Aim to complete within 5 business days of resolution.

Incident Metadata

Field	Value
Incident ID	[FILL IN: e.g., INC-2026-001]
Title	[FILL IN: Clear, specific incident description]
Date Started	[FILL IN: YYYY-MM-DD HH:MM UTC]
Date Resolved	[FILL IN: YYYY-MM-DD HH:MM UTC]
Duration	[FILL IN: e.g., 45 minutes]
Severity	[FILL IN: P1/P2/P3]
Status	[FILL IN: Resolved / Mitigated]
Incident Commander	[FILL IN: Name of person who led response]
Postmortem Author	[FILL IN: Name and date of writing]

Summary

[FILL IN: Write a one-paragraph executive summary. Focus on what happened and who was affected, not why (that comes later). Include the primary symptom that customers noticed.]

Impact Statement

Users Affected

[FILL IN: How many users? Percentage of user base? Specific customer segments? Or “internal-only” if this was pre-production.]

Service Degradation

Metric	During Incident	Normal Baseline	SLO Target
Availability	[FILL IN: %]	[FILL IN: %]	[FILL IN: %]
Error Rate (5xx)	[FILL IN: %]	[FILL IN: %]	[FILL IN: %]
Latency p95	[FILL IN: ms]	[FILL IN: ms]	[FILL IN: ms]
[Service-specific metric]	[FILL IN]	[FILL IN]	[FILL IN]

SLO Burn

[FILL IN: e.g., “This incident consumed 25% of the weekly error budget. We had 75% remaining as of the incident start.”]

Business Impact

[FILL IN: Quantify if possible — lost transactions, customers unable to complete workflows, revenue impact, reputation impact. Or “no customer impact” if this was caught before users noticed.]

Timeline

Time (UTC)	Actor(s)	Event	Notes
[FILL IN: HH:MM]	[FILL IN: Name]	[FILL IN: What triggered the incident?]	[FILL IN: e.g., Deploy completed, alert fired, manual testing]
[FILL IN]	[FILL IN]	[FILL IN: First sign of problems (error spike, latency spike, etc.)]	[FILL IN: Which dashboard/alert? What metric breached?]
[FILL IN]	[FILL IN]	[FILL IN: Detection — alert fired or on-call noticed]	[FILL IN: How long after failure? Was MTTD acceptable?]
[FILL IN]	[FILL IN]	[FILL IN: Escalation — escalated to another team?]	[FILL IN: Which team? Why needed?]
[FILL IN]	[FILL IN]	[FILL IN: Investigation step — hypothesis tested]	[FILL IN: What dashboard/log was checked? What was ruled out?]
[FILL IN]	[FILL IN]	[FILL IN: Mitigation began]	[FILL IN: Temporary fix, rollback, scaling, etc.]
[FILL IN]	[FILL IN]	[FILL IN: Mitigation applied]	[FILL IN: Result — did traffic drop? Latency improve?]
[FILL IN]	[FILL IN]	[FILL IN: Full recovery confirmed]	[FILL IN: All metrics back to normal. Verified by whom?]

Root Cause Analysis

5 Whys

Level 1: What directly caused the outage?

[FILL IN: e.g., “Database connection pool exhausted after new query pattern was deployed.”]

Level 2: Why did that happen?

[FILL IN: e.g., “The new endpoint was making N queries per request without connection reuse.”]

Level 3: Why was this not caught?

[FILL IN: e.g., “Load testing was only done with 50 concurrent users; production load was 500 users.”]

Level 4: Why didn’t load testing match production?

[FILL IN: e.g., “The load test harness was not configured with realistic request distributions. The new endpoint represents 5% of traffic in production but was 1% in the test.”]

Level 5: Why was the process insufficient?

[FILL IN: e.g., “Load test configs are not validated against production traffic patterns. There is no automatic comparison of test vs. prod.”]

Root Cause Summary

[FILL IN: 1-2 sentences pulling together the 5 Whys into a single clear statement. This is what you prevent going forward.]

Contributing Factors

[FILL IN: e.g., “Monitoring was not set up for the new endpoint.”]
[FILL IN: e.g., “On-call engineer did not have access to database metrics.”]
[FILL IN: e.g., “Circuit breaker timeout was too long, causing 60 seconds of degradation instead of 10.”]

Action Items

Action	Type	Owner	Priority	Due Date	Status	Notes
[FILL IN: e.g., “Increase DB max_connections from 50 to 200”]	Prevent	[FILL IN: Name]	[FILL IN: P1/P2/P3]	[FILL IN: Date]	[FILL IN: Open/In Progress/Done]	[FILL IN: e.g., “Reduces queueing under peak load”]
[FILL IN: e.g., “Add per-endpoint query count monitoring”]	Detect	[FILL IN: Name]	[FILL IN: P1/P2/P3]	[FILL IN: Date]	[FILL IN: Open/In Progress/Done]	[FILL IN: e.g., “Alerts if a single endpoint makes >1000 queries/sec”]
[FILL IN: e.g., “Implement connection pool backpressure”]	Mitigate	[FILL IN: Name]	[FILL IN: P1/P2/P3]	[FILL IN: Date]	[FILL IN: Open/In Progress/Done]	[FILL IN: e.g., “Return 503 instead of queuing when pool is exhausted”]
[FILL IN: e.g., “Update runbook with recovery steps”]	Mitigate	[FILL IN: Name]	[FILL IN: P1/P2/P3]	[FILL IN: Date]	[FILL IN: Open/In Progress/Done]	[FILL IN: e.g., “Document which command to run and expected recovery time”]
[FILL IN: e.g., “Load test with realistic traffic distributions”]	Prevent	[FILL IN: Name]	[FILL IN: P1/P2/P3]	[FILL IN: Date]	[FILL IN: Open/In Progress/Done]	[FILL IN: e.g., “Before deploying any endpoint changes, run load test matching prod traffic patterns”]

Lessons Learned

What Went Well

[FILL IN: e.g., “Alerting on error rate fired within 30 seconds. No delay in detection.”]
[FILL IN: e.g., “On-call engineer correctly identified the connection pool as the bottleneck by checking the right dashboard first.”]
[FILL IN: e.g., “Playbook for scaling the database was up to date and took only 5 minutes to execute.”]
[FILL IN: e.g., “Multi-region setup meant the other region continued serving traffic while we mitigated the primary region.”]

What Went Poorly

[FILL IN: e.g., “Load test config was out of sync with production traffic patterns.”]
[FILL IN: e.g., “Monitoring for the new endpoint was not set up before deployment.”]
[FILL IN: e.g., “On-call engineer did not have write access to the database config, had to wait for an escalation.”]
[FILL IN: e.g., “Alerting threshold for latency p95 was too high — 2 second threshold when SLO is 500ms. We could have been alerted sooner.”]

Where We Got Lucky

[FILL IN: e.g., “The incident happened during a low-traffic period (50 req/s). During peak traffic (500 req/s), the same failure would have caused dropped requests.”]
[FILL IN: e.g., “The cache was still warm, so even with DB latency increased, many reads were served from cache.”]
[FILL IN: e.g., “The bad deploy was to a single region — the other region was not affected, so 50% of users saw no impact.”]
[FILL IN: e.g., “A paying customer did not hit the bug. If they had, we would have had a high-urgency support case.”]

Supporting Information

Links and References

Incident Tracker: [FILL IN: Link to ticket or incident log]
Dashboards Consulted:
- [FILL IN: e.g., “Golden Signals — latency and error rate”]
- [FILL IN: e.g., “Database Metrics — connection pool status”]
- [FILL IN: e.g., “Deployment Timeline — what changed at 14:32?”]
Logs:
- [FILL IN: e.g., “Application logs from 14:30–15:00 UTC”]
- [FILL IN: e.g., “Database slow query log”]
Runbooks:
- [FILL IN: e.g., “/wiki/runbooks/database-scale-up.md”]
On-Call Resources:
- [FILL IN: e.g., “Escalation tree: /wiki/oncall/escalation.md”]
- [FILL IN: e.g., “Database on-call playbook: /wiki/runbooks/db-triage.md”]

Relevant Code / Configuration

Deploy: [FILL IN: Commit hash or deploy ID that triggered the incident]
Config Changes: [FILL IN: If config was changed, link to the diff or change request]
Monitoring Config: [FILL IN: Link to alert definition or dashboard JSON]

Alert Definitions

[FILL IN: Copy the alert rule that fired. Include the PromQL query or threshold.]

# Example:
rule_name: HighDatabaseConnections
expr: (pg_stat_activity_count / pg_settings_max_connections) > 0.8
for: 2m

Logs from Incident Window

[FILL IN: Paste key log entries (anonymized). Focus on errors, state changes, and timestamps.]

Example:
2026-04-04T14:32:00Z ERROR: Database pool exhausted (45/50 connections in use)
2026-04-04T14:32:03Z ERROR: Request timeout waiting for connection
2026-04-04T14:33:15Z INFO: Database scaled from 50 to 200 max_connections
2026-04-04T14:33:45Z INFO: Connection pool recovered, now 12/200 in use

Appendix: Communication Log

Slack / Incident Channel

[FILL IN: Key messages from the incident channel. Timestamps. Who said what.]

32:00 @alice: Error rate spike in dashboard. Looking now.
32:15 @bob: Is this related to the deploy 30 min ago?
32:30 @alice: Checking. Let me review the load test results.
33:00 @carlos: I've started the scale-up playbook
33:45 @alice: Root cause identified. Bad query pattern in new endpoint. Issue #456.
34:00 @carlos: Scale-up complete. Latency recovering.
35:00 @alice: All clear. Metrics back to baseline. Opening postmortem.

Public Communication

[FILL IN: If customers were notified, paste the communication (status page update, email, etc.). Include timestamps.]

2026-04-04 14:35 UTC
Status update: A brief service degradation occurred from 14:32–14:35 UTC.
Affected: [X]% of requests experienced elevated latency.
Cause: Database connection pool exhaustion.
Resolution: Scaled up connection limits. Service is now fully recovered.
Postmortem coming soon.

Document History

Version	Author	Date	Changes
1.0	[FILL IN: Name]	[FILL IN: Date]	Initial postmortem
[FILL IN]	[FILL IN]	[FILL IN]	[FILL IN: Any edits after publication]

Sign-Off

Incident Commander: _________ Date: _____
Engineering Lead: _________ Date: _____
On-Call Manager: _________ Date: _____

Notes for Future Postmortems

Blameless: This postmortem focuses on process and system failures, not individual mistakes. No blame is assigned.
Action Items Tracked: All action items are tracked in [FILL IN: Your issue tracker]. Progress is reviewed weekly until complete.
Feedback: Questions or additions? Comment on the incident ticket or email [FILL IN: DL or person responsible].
Confidentiality: This postmortem is internal. Do not share outside the team without approval from [FILL IN: Manager or public communications lead].