Incident Response — Track 3 Deliverables

Complete incident response infrastructure for the URL Shortener service. This folder contains all Track 3 deliverables organized by tier.


Architecture

  App (/metrics)          Prometheus            Alertmanager           Email
  ┌──────────┐     scrape  ┌──────────┐  alert   ┌──────────┐  notify  ┌─────────┐
  │  Flask +  │ ──────────>│  Metrics  │ ────────>│ Severity │ ────────>│  Inbox  │
  │ Gunicorn  │   15s      │  + Rules  │  eval    │ Routing  │  SMTP   │ (on-call)│
  └──────────┘             └──────────┘          └──────────┘         └─────────┘
       │                                               │
       │ JSON logs                                     │ critical: instant
       v                                               │ warning:  10s group
  ┌──────────┐             ┌──────────┐               │
  │  Docker   │ ──────────>│   Loki   │               │
  │  stdout   │   ingest   │  (3100)  │               │
  └──────────┘             └──────────┘               │
                                │                      │
                                v                      v
                           ┌──────────────────────────────┐
                           │     Grafana Dashboard         │
                           │  "URL Shortener - Golden      │
                           │         Signals"              │
                           │  8 panels + alert annotations │
                           │     http://localhost:3000     │
                           └──────────────────────────────┘

Deliverables by Tier

Bronze — The Watchtower

Requirement Implementation Location
Structured JSON logging pythonjsonlogger with timestamp, level, message app/logging_config.py
/metrics endpoint Prometheus Flask Exporter (request count, latency histogram, memory) app/__init__.py
View logs without SSH Docker JSON log driver + Loki + Grafana Logs panel docker compose logs app

Verify:

docker compose logs app --tail 5        # JSON structured logs
curl -s http://localhost:5000/metrics    # Prometheus metrics

Silver — The Alarm

Requirement Implementation Location
Service Down alert up == 0 for 1m → fires in ~70s monitoring/prometheus/alerts.yml
High Error Rate alert 5xx rate > 5% for 30s → fires in ~45s Same file
Email notification Alertmanager SMTP with severity routing monitoring/alertmanager/alertmanager.yml
Fire within 5 minutes group_wait: 10s (warnings), 0s (critical) Same file
Chaos testing script Automated failure injection + alert verification chaos-test.sh

Verify (live demo):

# Dry run first
./scripts/chaos-test.sh --service-down --dry-run

# Live demo — stops app, waits for alert, restores, verifies recovery
./scripts/chaos-test.sh --service-down

# Run all scenarios
./scripts/chaos-test.sh --all

Alert routing:


Gold — The Command Center

Requirement Implementation Location
Grafana dashboard (4+ metrics) 8 panels: Uptime, Alerts, Request Rate, Error Rate, Latency, Memory, CPU, Logs monitoring/grafana/dashboards/url-shortener.json
Runbook Per-alert mitigation + master incident playbook INCIDENT-PLAYBOOK.md
RCA exercise Redis failure diagnosis using dashboard + logs RCA-001-redis-failure.md
Postmortem template Google SRE format with 5 Whys POSTMORTEM-TEMPLATE.md

Dashboard panels (Golden Signals):

Panel Signal PromQL
Service Uptime Availability up{job="app"}
Active Alerts Alerts count(ALERTS{alertstate="firing"})
Request Rate Traffic sum(rate(flask_http_request_total[5m])) by (method)
Error Rate (5xx %) Errors sum(rate(flask_http_request_total{status=~"5.."}[5m])) / sum(rate(flask_http_request_total[5m])) * 100
Latency p50/p95/p99 Latency histogram_quantile(0.X, sum(rate(flask_http_request_duration_seconds_bucket[5m])) by (le))
Memory Usage Saturation process_resident_memory_bytes
CPU Usage Saturation rate(process_cpu_seconds_total{job="app"}[5m]) * 100
Application Logs Investigation Loki: {job="app"}

Monitoring Stack Access

Service URL Credentials
Grafana http://localhost:3000 admin / admin
Prometheus http://localhost:9090
Alertmanager http://localhost:9093
Jaeger http://localhost:16686
App Health http://localhost/health
App Readiness http://localhost/health/ready
Metrics http://localhost/metrics

Folder Structure

docs/Incident Response/
├── README.md                        # This file
├── rca/
│   ├── RCA-001-redis-failure.md    # Root Cause Analysis narrative (Gold requirement)
│   └── POSTMORTEM-TEMPLATE.md      # Reusable incident postmortem template
├── runbooks/
│   └── INCIDENT-PLAYBOOK.md        # Master incident response playbook
└── screenshots/
    ├── metrics-endpoint.png            # /metrics endpoint output
    ├── prometheus-alert-rules.png      # Prometheus alert rules page
    ├── alertmanager-ui.png             # Alertmanager UI
    ├── alertmanager-config.png         # Alertmanager configuration/status
    ├── grafana-golden-signals-dashboard.png  # Grafana Golden Signals dashboard
    ├── grafana-loki-logs.png           # Grafana Loki log exploration
    └── jaeger-tracing.png              # Jaeger distributed tracing UI

scripts/
└── chaos-test.sh                    # Automated failure injection + alert verification