Incident Response — Track 3 Deliverables
Complete incident response infrastructure for the URL Shortener service. This folder contains all Track 3 deliverables organized by tier.
Architecture
App (/metrics) Prometheus Alertmanager Email
┌──────────┐ scrape ┌──────────┐ alert ┌──────────┐ notify ┌─────────┐
│ Flask + │ ──────────>│ Metrics │ ────────>│ Severity │ ────────>│ Inbox │
│ Gunicorn │ 15s │ + Rules │ eval │ Routing │ SMTP │ (on-call)│
└──────────┘ └──────────┘ └──────────┘ └─────────┘
│ │
│ JSON logs │ critical: instant
v │ warning: 10s group
┌──────────┐ ┌──────────┐ │
│ Docker │ ──────────>│ Loki │ │
│ stdout │ ingest │ (3100) │ │
└──────────┘ └──────────┘ │
│ │
v v
┌──────────────────────────────┐
│ Grafana Dashboard │
│ "URL Shortener - Golden │
│ Signals" │
│ 8 panels + alert annotations │
│ http://localhost:3000 │
└──────────────────────────────┘
Deliverables by Tier
Bronze — The Watchtower
| Requirement | Implementation | Location |
|---|---|---|
| Structured JSON logging | pythonjsonlogger with timestamp, level, message |
app/logging_config.py |
/metrics endpoint |
Prometheus Flask Exporter (request count, latency histogram, memory) | app/__init__.py |
| View logs without SSH | Docker JSON log driver + Loki + Grafana Logs panel | docker compose logs app |
Verify:
docker compose logs app --tail 5 # JSON structured logs
curl -s http://localhost:5000/metrics # Prometheus metrics
Silver — The Alarm
| Requirement | Implementation | Location |
|---|---|---|
| Service Down alert | up == 0 for 1m → fires in ~70s |
monitoring/prometheus/alerts.yml |
| High Error Rate alert | 5xx rate > 5% for 30s → fires in ~45s | Same file |
| Email notification | Alertmanager SMTP with severity routing | monitoring/alertmanager/alertmanager.yml |
| Fire within 5 minutes | group_wait: 10s (warnings), 0s (critical) |
Same file |
| Chaos testing script | Automated failure injection + alert verification | chaos-test.sh |
Verify (live demo):
# Dry run first
./scripts/chaos-test.sh --service-down --dry-run
# Live demo — stops app, waits for alert, restores, verifies recovery
./scripts/chaos-test.sh --service-down
# Run all scenarios
./scripts/chaos-test.sh --all
Alert routing:
- Critical (ServiceDown):
group_wait: 0s— instant email with subjectCRITICAL: ... - Warning (HighErrorRate, HighLatency, RedisDown):
group_wait: 10s— batched email with subjectWARNING: ...
Gold — The Command Center
| Requirement | Implementation | Location |
|---|---|---|
| Grafana dashboard (4+ metrics) | 8 panels: Uptime, Alerts, Request Rate, Error Rate, Latency, Memory, CPU, Logs | monitoring/grafana/dashboards/url-shortener.json |
| Runbook | Per-alert mitigation + master incident playbook | INCIDENT-PLAYBOOK.md |
| RCA exercise | Redis failure diagnosis using dashboard + logs | RCA-001-redis-failure.md |
| Postmortem template | Google SRE format with 5 Whys | POSTMORTEM-TEMPLATE.md |
Dashboard panels (Golden Signals):
| Panel | Signal | PromQL |
|---|---|---|
| Service Uptime | Availability | up{job="app"} |
| Active Alerts | Alerts | count(ALERTS{alertstate="firing"}) |
| Request Rate | Traffic | sum(rate(flask_http_request_total[5m])) by (method) |
| Error Rate (5xx %) | Errors | sum(rate(flask_http_request_total{status=~"5.."}[5m])) / sum(rate(flask_http_request_total[5m])) * 100 |
| Latency p50/p95/p99 | Latency | histogram_quantile(0.X, sum(rate(flask_http_request_duration_seconds_bucket[5m])) by (le)) |
| Memory Usage | Saturation | process_resident_memory_bytes |
| CPU Usage | Saturation | rate(process_cpu_seconds_total{job="app"}[5m]) * 100 |
| Application Logs | Investigation | Loki: {job="app"} |
Monitoring Stack Access
| Service | URL | Credentials |
|---|---|---|
| Grafana | http://localhost:3000 | admin / admin |
| Prometheus | http://localhost:9090 | — |
| Alertmanager | http://localhost:9093 | — |
| Jaeger | http://localhost:16686 | — |
| App Health | http://localhost/health | — |
| App Readiness | http://localhost/health/ready | — |
| Metrics | http://localhost/metrics | — |
Folder Structure
docs/Incident Response/
├── README.md # This file
├── rca/
│ ├── RCA-001-redis-failure.md # Root Cause Analysis narrative (Gold requirement)
│ └── POSTMORTEM-TEMPLATE.md # Reusable incident postmortem template
├── runbooks/
│ └── INCIDENT-PLAYBOOK.md # Master incident response playbook
└── screenshots/
├── metrics-endpoint.png # /metrics endpoint output
├── prometheus-alert-rules.png # Prometheus alert rules page
├── alertmanager-ui.png # Alertmanager UI
├── alertmanager-config.png # Alertmanager configuration/status
├── grafana-golden-signals-dashboard.png # Grafana Golden Signals dashboard
├── grafana-loki-logs.png # Grafana Loki log exploration
└── jaeger-tracing.png # Jaeger distributed tracing UI
scripts/
└── chaos-test.sh # Automated failure injection + alert verification
Related Documentation
- Incident Playbook — Per-alert runbooks, SLO targets, remediation commands, escalation paths
- RCA: Redis Failure — Root cause analysis using dashboard and logs
- Postmortem Template — Google SRE-style postmortem format
- Track 3 Design Decisions — All design decisions with rationale and evidence
- Track 3 Requirements — Original quest specification