URL Shortener — Incident Response Playbook

In Case of Emergency, open this document.

This is the master operational reference for on-call engineers responding to production incidents on the URL Shortener service. It ties together monitoring, alert rules, runbooks, and failure modes into a single process document.

Important: All commands in this playbook should be run from the project root directory.

1. Quick Access

Monitoring URLs (all running locally):

Service	URL	Purpose
Grafana Dashboards	http://localhost:3000	Primary monitoring dashboard (user: admin, password: admin)
Prometheus Alerts	http://localhost:9090/alerts	View active alerts, alert rules, and history
Alertmanager	http://localhost:9093	Alert routing, silencing, and grouping
Jaeger Traces	http://localhost:16686	Distributed tracing for request diagnostics
App Health Check	http://localhost/health	Liveness check (200 = running)
App Readiness	http://localhost/health/ready	Readiness check (200 = DB connected)
Prometheus Metrics	http://localhost/metrics	Raw metrics endpoint (for queries)
Loki Logs	http://localhost:3100	Log aggregation (queried via Grafana)

Remember: Grafana is your primary dashboard. Open it first.

2. Incident Severity Definitions

Severity	Definition	Examples	Response Time	SLO Impact
P1 - Critical	Service completely down or user-facing data loss risk	All replicas down, Postgres unreachable, disk full	Immediate (page on-call)	Violates 99.9% availability target
P2 - Major	Service degraded but functioning; increased errors/latency	High error rate (>5%), p95 latency >500ms, single replica crash	< 15 minutes	May violate SLO if sustained
P3 - Minor	Warning condition; service operating below targets	Redis down (circuit breaker active), CPU warning, elevated memory	< 1 hour	Doesn’t violate SLO

Severity assignment decision tree:

Is the app returning 5xx for all requests? → P1
Is /health/ready failing? → P1 (or P2 if single replica)
Is error rate >5%? → P2
Is latency consistently >500ms p95? → P2
Is a non-critical service (Redis, Jaeger) down? → P3
Is there a capacity warning (CPU/memory high)? → P3

3. Incident Response Flow

Follow this sequence for every incident:

1. Detect (0 min)

Alert fires. Notification sent via email (Alertmanager configured in alertmanager.yml).

2. Acknowledge (1 min)

Open Grafana → URL Shortener - Golden Signals dashboard.

View: Request rate (traffic), error rate (5xx %), latency (p95), and saturation (CPU/memory/connections).
Confirm the alert is real and not a false positive.

3. Assess (2-3 min)

Determine severity based on golden signals:

Error rate: % of 5xx responses
Latency: p95 response time
Traffic: requests per second
Saturation: CPU, memory, DB connections

Assign severity using the table in section 2.

4. Diagnose (5-10 min)

Consult the Alert Quick Reference (section 4) for the specific alert. Jump to the per-alert runbook in section 8 of this playbook.

Example paths:

ServiceDown firing? → Runbook: ServiceDown
HighErrorRate firing? → Runbook: HighErrorRate
HighLatency firing? → Runbook: HighLatency
RedisDown firing? → Runbook: RedisDown

5. Mitigate (10-20 min)

Apply the fix from the runbook. Verify by:

Checking health endpoints: curl http://localhost/health/ready
Monitoring Grafana for metrics returning to baseline
Watching Prometheus for the alert to clear (may take 1-3 min)

6. Communicate (during mitigation)

Update the team via GitHub issue or email:

At detection: “URL Shortener P2 alert: HighErrorRate. Investigating.”
During mitigation: “Root cause found: Redis OOM kill. Restarting Redis.”
At resolution: “Resolved. Error rate back to <1%. Redis recovered. No data loss.”

7. Postmortem (after resolution)

Create a postmortem document in docs/Incident Response/rca/ using the template below.

When to write postmortem:

All P1 incidents (mandatory)
P2 incidents lasting >1 hour (mandatory)
P3 incidents (optional, but recommended)

Postmortem template:

# Incident Report: [Title]

**Date**: YYYY-MM-DD HH:MM UTC
**Severity**: P1 / P2 / P3
**Duration**: X minutes
**Timeline**:
  - HH:MM: Alert fired
  - HH:MM: Root cause identified
  - HH:MM: Mitigation applied
  - HH:MM: Service recovered

**Root cause**: [1-2 sentence explanation]

**Impact**:
  - Downtime: X minutes
  - Users affected: ~Y
  - Data loss: Yes/No

**Remediation**:
  1. Immediate (within 24 hours): [fix applied]
  2. Follow-up (within 1 week): [prevent recurrence]

**Action items**:
  - [ ] PR for config change / code fix
  - [ ] Update docs / runbook
  - [ ] Discuss in team standup

4. Alert Quick Reference

Alert	Severity	Condition	Fires After	Expected Time to Fire	Runbook
ServiceDown	Critical	`up == 0`	1 minute	~70 seconds	Runbook
HighErrorRate	Warning	5xx rate > 5%	30 seconds	~45 seconds	Runbook
HighLatency	Warning	p95 > 500ms	30 seconds	~45 seconds	Runbook
RedisDown	Warning	Redis connection errors > 0	1 minute	~70 seconds	Runbook

Key insight: Alerts have a “wait period” (the for: Xm duration in monitoring/prometheus/alerts.yml). This prevents noise from brief blips. If you see a spike in Grafana but the alert didn’t fire, it may have already recovered.

5. SLO Targets (at a glance)

These are your targets. Sustained violations trigger postmortems.

Signal	Target	Measurement Window	Where to Check
Availability	99.9% uptime	Rolling 7 days	Grafana: “Availability” panel (via `/health/ready` checks)
Read Latency (p95)	< 500 ms	5-minute rolling window	Grafana: “Request Duration” panel
Write Latency (p95)	< 1 s	5-minute rolling window	Grafana: “Request Duration” panel (POST/PUT)
Error Rate (5xx)	< 1%	5-minute rolling window	Grafana: “Error Rate” panel

Error budget for 99.9% availability: ~10 minutes of downtime per week.

6. Common Remediation Commands

Quick reference for the most common fixes. Run these from the machine hosting the Docker services.

Service Status & Logs

# Check all service status
docker compose ps

# View app logs (last 50 lines, follow in real-time)
docker compose logs app --tail 50 -f

# View Postgres logs
docker compose logs postgres --tail 50

# View Redis logs
docker compose logs redis --tail 50

# View Nginx logs
docker compose logs nginx --tail 50

Scaling & Restarting

# Scale app replicas up (for high latency/error rate under load)
docker compose up -d --scale app=3

# Scale app replicas down (to reduce memory pressure)
docker compose up -d --scale app=1

# Restart a specific service (use after fixing config)
docker compose restart app        # Restart app
docker compose restart postgres   # Restart Postgres
docker compose restart redis      # Restart Redis

# Rebuild and restart (if code was updated)
docker compose up -d --build app

# Force hard restart (kills container immediately, no graceful shutdown)
docker compose kill app && docker compose up -d app

Health Checks

# Check app readiness (must return HTTP 200)
curl -s http://localhost/health/ready | jq .

# Check Postgres connectivity
docker compose exec postgres pg_isready -U postgres

# Check Redis connectivity
docker compose exec redis redis-cli ping

# Check Postgres connection pool
docker compose exec postgres psql -U postgres -c "SELECT count(*) FROM pg_stat_activity;"

# Check Redis memory usage
docker compose exec redis redis-cli info memory

Diagnostics

# Check system memory pressure
docker stats

# Check Prometheus for active alerts
curl -s http://localhost:9090/api/v1/alerts | jq '.data.alerts'

# Check recent Docker restarts
docker compose logs --timestamps | grep "Restarting\|Started"

# View container resource usage over time
docker compose stats --no-stream

Rollback (if the latest deployment is bad)

# View update history
docker compose inspect app | grep -A20 "UpdateStatus"

# Note: docker-compose doesn't have built-in rollback, so either:
# Option 1: Revert the code, rebuild, and redeploy
git revert HEAD && docker compose up -d --build app

# Option 2: Restart with the previous image (requires tagging)
docker compose up -d app  # Uses previous image if still available

7. Escalation Path

Time Elapsed	Action	Who
0-5 min	Alert received, acknowledge via GitHub issue or email	On-call engineer
5-15 min	Diagnose using runbook	On-call engineer
15-30 min	If not resolved, loop in team lead	On-call + team lead
30+ min	Contact backup engineer, consider rollback or code change	Team
1+ hour	Post incident update, coordinate postmortem	Team lead

Notification channels:

GitHub: Open an issue at https://github.com/RETR0-OS/MLH_PE_URL_Shortener/issues
Email: gunbirsingh2006@gmail.com / ajinda17@asu.edu (configured in ALERT_EMAIL_TO)

8. Runbook Summaries

These are brief summaries of each alert. Full details are below.

ServiceDown

Alert fires: App instance unreachable for >1 minute. Severity: P1 — Service completely down. Escalate if not resolved in: 15 minutes → page team lead.

Step 1 — Confirm the alert is real (0–2 min):

curl -s -o /dev/null -w "%{http_code}" http://localhost/health
# If 200: alert is stale (already recovered). Monitor for 2 min, then silence if stable.
# If non-200 or connection refused: proceed to Step 2.

Step 2 — Identify the failed component (2–5 min):

# Check all service states
docker compose ps

# Check app crash logs
docker compose logs app --tail 50

# Check if Postgres is the blocker
docker compose ps postgres
docker compose exec postgres pg_isready -U postgres

Step 3 — Mitigation by root cause (5–10 min):

If Postgres is down:

docker compose restart postgres
# Wait 15s for health check, then:
docker compose restart app
curl -s http://localhost/health/ready | jq .

If app is OOM-killed:

docker stats --no-stream | grep app
# If memory > 384MB per replica, scale down:
docker compose up -d --scale app=1
docker compose logs app --tail 30

If port conflict or startup crash:

# Run attached to see startup error:
docker compose stop app
docker compose up app   # NOT -d, so you see the error
# Ctrl-C once you see the cause, fix it, then:
docker compose up -d app

Step 4 — Verify recovery:

curl -s http://localhost/health/ready | jq .
# Must return: {"status": "ok"}
docker compose ps | grep app
# Must show 2+ "Up" entries

Expected recovery time: 5–15 seconds after Postgres or dependencies recover. Escalate if not resolved in 15 minutes: page team lead with current docker compose ps output and last 100 lines of docker compose logs app.

HighErrorRate

Alert fires: 5xx error rate exceeds 5% for >30 seconds. Severity: P2 — User-facing errors. Escalate if not resolved in: 20 minutes → page team lead + consider rollback.

Step 1 — Identify which endpoint is failing (0–3 min):

# Check error logs with context
docker compose logs app --tail 100 | grep -E '"status": 5'

# Check active DB connections (should be < 80 out of 100 max)
docker compose exec postgres psql -U postgres -c \
  "SELECT count(*) FROM pg_stat_activity WHERE state != 'idle';"

# Check container health
docker compose ps

Step 2 — Match to root cause:

If DB connection errors in logs:

# Check connection pool saturation
docker compose exec postgres psql -U postgres -c \
  "SELECT count(*), state FROM pg_stat_activity GROUP BY state;"
# If > 80 active: connection pool is exhausted
docker compose restart app   # Forces pool reconnection

If unhandled exception / code bug:

# Get full stack traces
docker compose logs app --tail 200 | grep -A 10 "Traceback\|ERROR"
# If from a recent deploy → rollback:
cd /root/MLH_PE_URL_Shortener
git log --oneline -5   # Note the previous commit SHA
git revert HEAD --no-edit
docker compose up -d --build

If traffic spike:

# Scale up replicas
docker compose up -d --scale app=3
# Wait 30s for new replicas to warm up, then re-check error rate in Grafana

Step 3 — Verify recovery:

# Poll error rate for 60 seconds
for i in $(seq 1 6); do
  curl -s "http://localhost:9090/api/v1/query?query=sum(rate(flask_http_request_total{status=~'5..'}[30s]))/sum(rate(flask_http_request_total[30s]))*100" \
    | jq '.data.result[0].value[1]' 2>/dev/null || echo "0"
  sleep 10
done
# All values should be below 1.0 (1%)

Expected recovery time: 2–5 minutes after restart or scale-up. Escalate if not resolved in 20 minutes: share current error rate (from Prometheus query above) and last 200 log lines with team lead.

HighLatency

Alert fires: p95 latency exceeds 500ms for >30 seconds. Severity: P2 — Performance degradation, SLO at risk. Escalate if not resolved in: 20 minutes → review capacity plan, consider droplet upgrade.

Step 1 — Confirm and characterise (0–3 min):

# Check current p95 latency
curl -s "http://localhost:9090/api/v1/query?query=histogram_quantile(0.95,sum(rate(flask_http_request_duration_seconds_bucket[5m]))by(le))*1000" \
  | jq '.data.result[0].value[1]'
# If < 500: alert is clearing. Monitor for 2 min.

# Check if it's traffic-driven
curl -s "http://localhost:9090/api/v1/query?query=sum(rate(flask_http_request_total[5m]))" \
  | jq '.data.result[0].value[1]'
# Compare to baseline: ~100 req/s is normal, > 400 req/s may require scaling

# Check container CPU and memory
docker stats --no-stream

Step 2 — Match to root cause:

If traffic spike (request rate > 400 req/s):

# Scale up replicas (max 3 on 2 GB droplet)
docker compose up -d --scale app=3
# Wait 30s for new replica to warm up

If Redis is down (cache misses causing DB load):

docker compose ps redis
docker compose exec redis redis-cli ping
# If Redis is down, it will auto-restart. Wait 30s for recovery.
# Latency will be elevated (~200ms p95) until Redis is back and cache refills.

If Postgres is slow (high DB latency):

# Find slow queries
docker compose exec postgres psql -U postgres -d hackathon_db \
  -c "SELECT query, calls, mean_exec_time FROM pg_stat_statements ORDER BY mean_exec_time DESC LIMIT 5;"
# If a specific query is slow, check EXPLAIN ANALYZE

Step 3 — Verify recovery:

# Watch p95 latency trend over 2 minutes
for i in $(seq 1 6); do
  echo -n "$(date +%H:%M:%S) p95: "
  curl -s "http://localhost:9090/api/v1/query?query=histogram_quantile(0.95,sum(rate(flask_http_request_duration_seconds_bucket[2m]))by(le))*1000" \
    | jq -r '.data.result[0].value[1]' 2>/dev/null || echo "no data"
  sleep 20
done
# Should trend downward toward < 100ms

Expected recovery time: 1–3 minutes after scaling or Redis recovery. Escalate if not resolved in 20 minutes: the bottleneck is hardware (see docs/CAPACITY_PLAN.md Phase 2 — droplet upgrade required).

RedisDown

Alert fires: Redis connection errors detected for >1 minute. Severity: P3 — No user-facing errors (circuit breaker active), but latency is elevated. Escalate if not resolved in: 30 minutes → investigate OOM kill or hardware issue.

IMPORTANT: The circuit breaker is active. Reads fall back to PostgreSQL automatically. Users see higher latency (~200ms p95) but zero errors. This is NOT an emergency.

Step 1 — Assess Redis state (0–2 min):

# Check Redis container status
docker compose ps redis

# Try to ping Redis
docker compose exec redis redis-cli ping
# Expected: PONG (if running), or connection refused (if down)

# Check Redis logs for crash reason
docker compose logs redis --tail 50

# Check if it was OOM-killed (look for exit code 137)
docker inspect $(docker compose ps -q redis) | jq '.[0].State.ExitCode'
# 137 = OOM kill, 1 = crash, 0 = clean exit

Step 2 — Mitigation:

If Redis is down and NOT auto-restarting:

# Force restart
docker compose restart redis
# Wait 15s then verify:
docker compose exec redis redis-cli ping
# Expected: PONG

If Redis keeps OOM-killing (memory leak):

# Check memory usage before it crashes again
docker compose exec redis redis-cli info memory | grep used_memory_human
# If > 120MB out of 128MB limit, it's about to OOM again

# Immediate: flush cache to buy time (safe — app will refill from DB)
docker compose exec redis redis-cli flushall
# This causes a temporary spike in DB load as cache is cold

# Permanent: increase Redis memory limit (requires docker-compose.yml edit)
# Change: command: --maxmemory 128mb   →   --maxmemory 256mb
# Then: docker compose up -d redis

Step 3 — Verify recovery:

docker compose exec redis redis-cli ping
# Must return: PONG

# Verify circuit breaker has closed (app reconnected to Redis)
docker compose logs app --tail 20 | grep -i redis
# Should see no "Redis unavailable" messages after recovery

# Check latency returned to baseline (< 50ms p95)
curl -s "http://localhost:9090/api/v1/query?query=histogram_quantile(0.95,sum(rate(flask_http_request_duration_seconds_bucket[5m]))by(le))*1000" \
  | jq '.data.result[0].value[1]'

Expected recovery time: 30 seconds (auto-restart). No manual intervention needed unless crash-looping. Escalate if not resolved in 30 minutes: Redis is crash-looping. Share docker compose logs redis --tail 100 output with team lead.

9. Golden Signals Dashboard Reference

Grafana “URL Shortener - Golden Signals” dashboard shows four key metrics:

Metric	What It Means	Baseline	Warning	Critical
Request Rate (RPS)	Requests per second	50-150 rps	>250 rps	>500 rps
Error Rate (%)	% of 5xx responses	<1%	>5%	>10%
Latency (p95 ms)	95th percentile response time	<200 ms	>500 ms	>3000 ms
Saturation (%)	Highest of (CPU%, Mem%, DB connections%)	<50%	>70%	>85%

Usage:

Open http://localhost:3000 (Grafana)
Click “URL Shortener - Golden Signals” dashboard
Scan the four panels
Compare current values to baselines above
Use trends (hovering over time series) to spot degradation

10. Detailed Failure Mode Reference

These describe expected behavior under failure:

Database Down

Symptom: /health/ready returns 503, all CRUD operations fail
Recovery: Postgres auto-restarts and connection pool reconnects. App doesn’t need restart.
Docker behavior: Healthcheck fails, Docker restarts container per restart: unless-stopped policy

Redis Down

Symptom: Connection errors in logs, cache circuit breaker activates
Impact: Cache bypassed, reads go direct to DB. No user-facing errors (latency increases)
Recovery: Redis auto-restarts, cache refills organically

Container Crash / OOM Kill

Symptom: docker compose ps shows task in Failed/Shutdown state
Impact: Docker Compose routing removes dead replica. In-flight requests timeout; Nginx retries on other replicas
Recovery: Docker restart: unless-stopped auto-restarts the container. Recovery time: 5-15s

Bulk Import Failure

Symptom: POST /users/bulk returns 500
Impact: Entire import rolls back. No partial data committed
Recovery: Fix CSV, re-upload. Existing IDs skipped via ON CONFLICT IGNORE

Network Partition (Postgres unreachable)

Symptom: Connection pool timeout errors
Impact: All read/write ops fail (same as Database Down)
Recovery: When network heals, pool reconnects. Stale connections evicted after 300s

Disk Full (Postgres volume)

Symptom: Write operations fail
Impact: Reads still work; new data creation fails
Recovery: Free disk space or expand volume. Postgres resumes automatically

11. Memory & Capacity Limits

Your hardware has 2 GB RAM total; ~1.8 GB available for services.

Service Memory Budget

Service	Limit	Current Config
App (per replica)	384 MB	2-5 replicas
Postgres	768 MB	Fixed 1 instance
Redis	160 MB	Fixed 1 instance
Nginx	64 MB	Fixed 1 instance
Prometheus	512 MB	Fixed 1 instance
Loki	256 MB	Fixed 1 instance
Grafana	256 MB	Fixed 1 instance
Alertmanager	128 MB	Fixed 1 instance
Jaeger	256 MB	Fixed 1 instance
Promtail	128 MB	Fixed 1 instance
Autoscaler	64 MB	Fixed 1 instance

Formula: Total = (app_replicas * 384MB) + 2592MB (fixed services)

Note: The combined Docker memory limits exceed the 2 GB droplet RAM. Docker limits are upper bounds; actual RSS usage is lower. The system runs on the 2 GB droplet because limits are rarely hit simultaneously, and Linux overcommits memory. Monitor with docker stats to see actual usage.

2 replicas: ~3.36 GB limits (~1.8–2.2 GB actual RSS typical)
3 replicas: ~3.74 GB limits (tight — watch actual RSS closely)
4 replicas: ~4.13 GB limits (high risk of OOMKill)

Action: If actual RSS usage (from docker stats) exceeds 1.8 GB, scale app down to 1 replica or restart services.

12. Troubleshooting Decision Tree

Start here if you’re not sure which runbook to use.

Alert received?
├─ ServiceDown (up == 0)
│  └─ Go to: Runbook "ServiceDown"
│
├─ HighErrorRate (5xx > 5%)
│  └─ Go to: Runbook "HighErrorRate"
│
├─ HighLatency (p95 > 500ms)
│  └─ Go to: Runbook "HighLatency"
│
├─ RedisDown (redis_connection_errors > 0)
│  └─ Go to: Runbook "RedisDown"
│
└─ No alert, but service is slow/broken?
   ├─ Can you curl http://localhost/health?
   │  ├─ Yes (200) → Check Runbook "HighLatency" or "HighErrorRate"
   │  └─ No → Check Runbook "ServiceDown"
   │
   └─ Are there error messages in logs?
      ├─ "database connection" → Restart Postgres: `docker compose restart postgres`
      ├─ "Redis unavailable" → Wait 30s for auto-restart (expected behavior)
      ├─ OutOfMemory → Scale down replicas: `docker compose up -d --scale app=1`
      └─ Application error → Check git log, consider rollback

Keep these docs handy:

Document	Purpose	Location
Incident Playbook	This document — master operational guide	`docs/Incident Response/runbooks/INCIDENT-PLAYBOOK.md`
RCA: Redis Failure	Past incident analysis	`docs/Incident Response/rca/RCA-001-redis-failure.md`
Postmortem Template	Incident report structure	`docs/Incident Response/rca/POSTMORTEM-TEMPLATE.md`
Alert Rules	Prometheus alert definitions	`monitoring/prometheus/alerts.yml`
Alertmanager Config	Email routing and grouping	`monitoring/alertmanager/alertmanager.yml`
Design Decisions	Track 3 rationale and evidence	`docs/Incident Response/INCIDENT_RESPONSE_ENGINEERING_DESIGN_DECISIONS.md`

14. Pre-Incident Checklist

Before going on-call, verify:

You can access Grafana (http://localhost:3000, user: admin, password: admin)
You can access Prometheus (http://localhost:9090)
You can access Alertmanager (http://localhost:9093)
You can SSH/access the server hosting Docker
You have docker-compose commands aliased or in your PATH
You’ve read the runbook summaries in section 8 and the failure modes in section 10
You have a way to be notified (email or GitHub)
You know how to reach the backup on-call engineer
You have the postmortem template URL bookmarked

15. After-Incident Checklist

When an incident is resolved:

Stabilize: Confirm metrics returned to baseline. Alert cleared in Prometheus.
Communicate: Post final status update. “Resolved at HH:MM UTC. Error rate back to <1%. No data loss.”
Document: Create postmortem in docs/Incident Response/rca/ with timeline, root cause, and action items.
Review: Share postmortem in team channel within 24 hours.
Follow up: Ensure action items are tracked in your issue tracker (GitHub, Jira, etc.).
Prevent: Close any related PRs that prevent recurrence.

16. Quick Reference: Critical Thresholds

Keep these numbers in your head:

Metric	Alert Threshold	SLO Target	Your Hardware Ceiling
Error Rate (5xx)	>5% for 30s	<1%	20% (saturation)
Latency p95	>500ms for 30s	<500ms (reads), <1s (writes)	~3s with 3 replicas
Availability	N/A	99.9% (10m downtime/week)	~99.5% if Postgres frequently restarts
Memory	N/A	N/A	~3.3 GB (3 app replicas max)
CPU	N/A	<75%	~180% per vCPU (2 total) with 3 replicas

17. Incident Response Communication Template

Copy-paste these messages to keep communication consistent:

Initial Alert:

P[1/2/3] Alert: [AlertName]
- Service: URL Shortener
- Status: INVESTIGATING
- Grafana: http://localhost:3000
- Last update: [timestamp]

Root Cause Found:

Root cause identified: [brief description]
Mitigation: [what we're doing]
ETA to resolution: [estimate]

Resolved:

RESOLVED - URL Shortener [AlertName]
- Duration: X minutes
- Root cause: [brief description]
- Impact: [what was affected]
- Postmortem: [link to RCA]
- Next steps: [action items for follow-up]

18. On-Call Handoff Template

When passing on-call to the next engineer:

# On-Call Handoff: [Date]

## Current Status
- [ ] All services healthy (check Grafana)
- [ ] No active alerts in Prometheus
- [ ] Recent incident? If yes, postmortem status: [pending/in-progress/completed]

## Recent Changes
- Code deployed: [commit hash, time]
- Infrastructure changes: [if any]
- Known issues: [if any]

## Monitoring
- Primary dashboard: http://localhost:3000 (URL Shortener - Golden Signals)
- Alert channel: gunbirsingh2006@gmail.com / ajinda17@asu.edu
- Escalation: [@devgunnu](https://github.com/devgunnu) → [@RETR0-OS](https://github.com/RETR0-OS)

## Key Docs
- Incident playbook: `docs/Incident Response/runbooks/INCIDENT-PLAYBOOK.md`
- RCA archive: `docs/Incident Response/rca/`
- Design decisions: `docs/Incident Response/INCIDENT_RESPONSE_ENGINEERING_DESIGN_DECISIONS.md`

Questions for the outgoing engineer: [ask here]

19. Critical Emergency Numbers

If this is a production incident affecting real users:

Role	Contact	Backup
@devgunnu (Gunbir Singh)	gunbirsingh2006@gmail.com	@devgunnu
@RETR0-OS (Aaditya Jindal)	ajinda17@asu.edu	@RETR0-OS
@aryankhanna2004 (Aryan Khanna)	—	@aryankhanna2004
@SinghOPS (Sahajpreet Singh)	—	@SinghOPS

20. Final Reminders

Stay calm. Incidents happen. You’re trained for this.
Document as you go. Start the postmortem timeline immediately.
Don’t guess. Check logs, check dashboards, verify with curl/psql/redis-cli before acting.
Escalate early. If unsure after 15 minutes, loop in the team lead.
Test your fix. After mitigation, wait 2-3 minutes and verify metrics are back to baseline.
Communicate status every 5-10 minutes. Updates prevent panic and align the team.
Automate the fix. After the postmortem, write code or config that prevents recurrence.

Last Updated: 2026-04-04 Maintained By: [Team Name] Review Cycle: Quarterly or after any P1 incident

URL Shortener — Incident Response Playbook

1. Quick Access

2. Incident Severity Definitions

3. Incident Response Flow

1. Detect (0 min)

2. Acknowledge (1 min)

3. Assess (2-3 min)

4. Diagnose (5-10 min)

5. Mitigate (10-20 min)

6. Communicate (during mitigation)

7. Postmortem (after resolution)

4. Alert Quick Reference

5. SLO Targets (at a glance)

6. Common Remediation Commands

Service Status & Logs

Scaling & Restarting

Health Checks

Diagnostics

Rollback (if the latest deployment is bad)

7. Escalation Path

8. Runbook Summaries

ServiceDown

HighErrorRate

HighLatency

RedisDown

9. Golden Signals Dashboard Reference

10. Detailed Failure Mode Reference

Database Down

Redis Down

Container Crash / OOM Kill

Bulk Import Failure

Network Partition (Postgres unreachable)

Disk Full (Postgres volume)

11. Memory & Capacity Limits

Service Memory Budget

12. Troubleshooting Decision Tree

13. Related Documentation

14. Pre-Incident Checklist

15. After-Incident Checklist

16. Quick Reference: Critical Thresholds

17. Incident Response Communication Template

18. On-Call Handoff Template

19. Critical Emergency Numbers

20. Final Reminders