Track 3: Incident Response — Design Decisions & Evidence Map

Project: MLH Production Engineering URL Shortener
Date: April 5, 2026

This document records every design decision made for Track 3 (Incident Response), grounded in the actual codebase. Each decision includes the rationale, the alternatives considered, and the files that implement it.


Table of Contents

  1. Evidence Summary by Submission Field
  2. Visual Evidence (Screenshots)
  3. Bronze Tier Decisions
  4. Silver Tier Decisions
  5. Gold Tier Decisions
  6. Architecture Overview
  7. File Index
  8. Full Architecture Decision Log

Evidence Summary by Submission Field

Bronze

| Submission Requirement | Verifiable Evidence | Key Files |
|---|---|---|
| JSON structured logging includes timestamp and log level fields | pythonjsonlogger.json.JsonFormatter configured with rename_fields={"asctime": "timestamp", "levelname": "level"} and ISO-8601 datefmt. Gunicorn mirrors the same formatter in logconfig_dict. Nginx uses a json_combined log format. Every request emits structured JSON with request_id, method, path, status, latency_ms. | app/logging_config.py (lines 10-13), gunicorn.conf.py (lines 21-26), nginx/nginx.conf (lines 12-24), app/middleware.py (lines 41-51) |
| A /metrics-style endpoint is available and returns monitoring data | prometheus-flask-exporter auto-registers a /metrics endpoint on the Flask app. Prometheus scrapes it every 15s via DNS service discovery. Exposed metrics include flask_http_request_total, flask_http_request_duration_seconds_bucket, process_resident_memory_bytes, process_cpu_seconds_total. | app/__init__.py (lines 5, 12, 43), monitoring/prometheus/prometheus.yml (lines 13-20), pyproject.toml (line 16) |
| Logs can be inspected through tooling without direct server SSH | Three methods available: (1) docker compose logs app reads Docker’s json-file log driver output, (2) Promtail ships container logs to Loki via Docker socket discovery, (3) Grafana’s “Application Logs” panel queries Loki with {job="app"}. All three avoid SSH. | docker-compose.yml (lines 26-30: json-file driver; lines 242-256: promtail; lines 258-279: grafana), monitoring/promtail/promtail-config.yml, monitoring/grafana/dashboards/url-shortener.json (lines 162-177) |

Silver

| Submission Requirement | Verifiable Evidence | Key Files |
|---|---|---|
| Alerting rules are configured for service down and high error rate | Seven Prometheus alert rules in two groups. ServiceDown: up == 0 for 1m (severity: critical). HighErrorRate: 5xx rate / total rate > 0.05 for 30s (severity: warning). Additional rules: HighLatency, RedisDown, HighReplicaCount, HighRequestRate, HighMemoryUsage. | monitoring/prometheus/alerts.yml (lines 1-71), monitoring/prometheus/prometheus.yml (lines 4-5: rule_files) |
| Alerts are routed to an operator channel such as Slack or email | Alertmanager routes to email via Resend SMTP (smtp.resend.com:587). Two receivers: email-critical (instant, group_wait: 0s) and email-warnings (group_wait: 10s). Both use rich HTML templates. Discord webhook receivers are defined in discord-receivers.yml and dynamically merged by entrypoint.sh when DISCORD_WEBHOOK_URL is set. | monitoring/alertmanager/alertmanager.yml (lines 1-92), monitoring/alertmanager/discord-receivers.yml (lines 1-34), monitoring/alertmanager/entrypoint.sh (lines 13-28), docker-compose.yml (lines 212-215) |
| Alerting latency is documented and meets five-minute response objective | ServiceDown (critical): 1m for duration + 0s group_wait = ~60-70s. HighErrorRate (warning): 30s for duration + 10s group_wait = ~40-50s. HighLatency: 30s + 10s = ~40-50s. RedisDown: 1m + 10s = ~70s. All are well within the 5-minute target. Prometheus evaluation interval is 15s (monitoring/prometheus/prometheus.yml line 3). | monitoring/prometheus/alerts.yml (per-rule for values), monitoring/alertmanager/alertmanager.yml (lines 11, 19: group_wait), monitoring/prometheus/prometheus.yml (line 3: evaluation_interval: 15s) |

Gold

| Submission Requirement | Verifiable Evidence | Key Files |
|---|---|---|
| Dashboard evidence covers latency, traffic, errors, and saturation | Grafana dashboard “URL Shortener - Golden Signals” (UID: url-shortener-golden) has 8 panels: Service Uptime (availability), Active Alerts, Request Rate (traffic by HTTP method), Error Rate 5xx % (errors), Latency p50/p95/p99 (latency), Memory Usage RSS (saturation), CPU Usage % (saturation), Application Logs (Loki). Auto-refresh every 10s, alert annotations overlay. | monitoring/grafana/dashboards/url-shortener.json (189 lines), monitoring/grafana/provisioning/datasources/datasources.yml, monitoring/grafana/provisioning/dashboards/dashboards.yml |
| Runbook includes actionable alert-response procedures | INCIDENT-PLAYBOOK.md is a 640-line, 20-section operational playbook covering severity definitions, per-alert remediation commands (ServiceDown, HighErrorRate, HighLatency, RedisDown), SLO targets, escalation paths, communication templates, on-call handoff procedures, and troubleshooting decision trees. | docs/Incident Response/runbooks/INCIDENT-PLAYBOOK.md, docs/Incident Response/README.md |
| Root-cause analysis of a simulated incident is documented | RCA-001-redis-failure.md documents a Redis OOMKill incident using the Grafana dashboard. Walks through 5 specific dashboard panels, includes Loki log queries ({job="app"} \|= "Redis unavailable"), traces the circuit breaker activation in app/utils/cache.py, and records timeline, impact, and resolution. A reusable POSTMORTEM-TEMPLATE.md (Google SRE 5-Whys format) is also provided. | docs/Incident Response/rca/RCA-001-redis-failure.md (356 lines), docs/Incident Response/rca/POSTMORTEM-TEMPLATE.md (271 lines) |

Visual Evidence (Screenshots)

All screenshots are located in docs/Incident Response/screenshots/ and demonstrate the live monitoring stack with real traffic data.

1. /metrics Endpoint — Decision B4

Shows the live /metrics endpoint returning Prometheus metrics (flask_http_request_total, flask_http_request_duration_seconds_bucket, process_resident_memory_bytes, etc.)

Metrics Endpoint

2. Prometheus Alert Rules — Decision S1

All 7 alert rules loaded in Prometheus (ServiceDown, HighErrorRate, HighLatency, RedisDown, HighReplicaCount, HighRequestRate, HighMemoryUsage) with their expressions and `for:` hold durations.

Prometheus Alert Rules

3. Alertmanager UI — Decision S2

Alertmanager is running and processing alerts with severity-based routing.

Alertmanager UI

4. Alertmanager Configuration — Decision S4

Full Alertmanager configuration showing email receivers (Resend SMTP), group_wait: 0s for critical, group_wait: 10s for warnings.

Alertmanager Config

5. Grafana Golden Signals Dashboard — Decision G1

Grafana “Golden Signals” dashboard with all 8 panels showing live data from k6 load testing (50 VUs, ~237 req/s): Uptime, Active Alerts, Request Rate, Error Rate, Latency p50/p95/p99, Memory RSS, CPU %, Application Logs.

Grafana Golden Signals Dashboard

6. Grafana Loki Logs — Decision B5

Centralized log viewing via Grafana + Loki without SSH — structured JSON logs queryable with {job="app"}.

Grafana Loki Logs

7. Jaeger Distributed Tracing — Architecture Overview

Distributed tracing via Jaeger + OpenTelemetry showing request traces across services.

Jaeger Tracing


Bronze Tier Decisions

Decision B1: python-json-logger over stdlib or structlog

Choice: pythonjsonlogger.json.JsonFormatter (from python-json-logger>=3.0)

Rationale:

Alternatives considered:

Implementation:
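To make the formatter behavior concrete, here is a stdlib-only sketch of what JsonFormatter produces once rename_fields and an ISO-8601 datefmt are applied. The real setup in app/logging_config.py uses python-json-logger directly; this class only approximates its output shape for illustration.

```python
import json
import logging
from datetime import datetime, timezone

class JsonishFormatter(logging.Formatter):
    """Stdlib approximation of pythonjsonlogger's JsonFormatter with
    rename_fields={"asctime": "timestamp", "levelname": "level"}."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            # ISO-8601 timestamp, as the real config's datefmt produces
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,    # renamed from levelname
            "message": record.getMessage(),
            "logger": record.name,
        }
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonishFormatter())
logger = logging.getLogger("demo")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info("short URL created")
```

Every line this emits is a single JSON object, which is what lets Promtail and Loki index the stream without regex parsing.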

Decision B2: Per-request structured fields via middleware

Choice: Custom Flask middleware in app/middleware.py that injects request_id, method, path, status, latency_ms, and timing checkpoints into every request log.

Rationale:

Implementation: app/middleware.py (lines 23-52)
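The pattern can be sketched framework-agnostically. The actual code in app/middleware.py uses Flask hooks; this WSGI version is an illustrative stand-in showing the same fields (request_id, method, path, status, latency_ms) being emitted once per request.

```python
import json
import logging
import time
import uuid

logger = logging.getLogger("request")

class RequestLogMiddleware:
    """WSGI sketch of the per-request logging pattern: generate a
    request_id, time the request, and emit one structured record."""

    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        # Reuse an upstream request ID (e.g. from nginx) if present
        request_id = environ.get("HTTP_X_REQUEST_ID") or uuid.uuid4().hex
        start = time.perf_counter()
        status_holder = {}

        def capturing_start_response(status, headers, exc_info=None):
            status_holder["status"] = int(status.split(" ", 1)[0])
            return start_response(status, headers, exc_info)

        try:
            return self.app(environ, capturing_start_response)
        finally:
            logger.info(json.dumps({
                "request_id": request_id,
                "method": environ.get("REQUEST_METHOD"),
                "path": environ.get("PATH_INFO"),
                "status": status_holder.get("status"),
                "latency_ms": round((time.perf_counter() - start) * 1000, 2),
            }))
```

Emitting in a finally block guarantees a log line even when the handler raises, which is what makes these fields reliable for the error-rate queries later in this document.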

Decision B3: Nginx JSON access logs

Choice: Custom log_format json_combined in nginx/nginx.conf producing JSON-structured access logs.

Rationale:

Implementation: nginx/nginx.conf (lines 12-24)
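The general shape of such a format looks like the following sketch. Field names and the exact variable set here are illustrative; the authoritative definition is in nginx/nginx.conf lines 12-24.

```nginx
# Illustrative json_combined format; escape=json makes nginx
# JSON-escape variable values so the output is always valid JSON.
log_format json_combined escape=json
  '{'
    '"timestamp":"$time_iso8601",'
    '"remote_addr":"$remote_addr",'
    '"method":"$request_method",'
    '"path":"$uri",'
    '"status":$status,'
    '"latency_s":$request_time'
  '}';

access_log /var/log/nginx/access.log json_combined;
```

The escape=json modifier matters: without it, a user agent containing a quote character would corrupt the JSON line.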

Decision B4: prometheus-flask-exporter for /metrics

Choice: PrometheusMetrics.for_app_factory() from prometheus-flask-exporter>=0.23

Rationale:

Implementation: app/__init__.py (lines 5, 12, 43, 84, 90, 95)

Visual evidence:

Metrics Endpoint
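What Prometheus actually scrapes from /metrics is plain text in the exposition format. The stdlib sketch below renders a counter and a histogram the way the exporter's flask_http_request_* metrics appear; the rendering logic is illustrative, not the library's code.

```python
# Stdlib sketch of the text exposition format served by /metrics.
# Metric names match what prometheus-flask-exporter exports.

def render_metrics(request_count, bucket_counts, bucket_bounds):
    """Render one counter and one histogram in Prometheus text format.
    Histogram buckets are cumulative, ending with the +Inf bucket."""
    lines = [
        "# TYPE flask_http_request_total counter",
        'flask_http_request_total{method="GET",status="200"} %d'
        % request_count,
        "# TYPE flask_http_request_duration_seconds histogram",
    ]
    cumulative = 0
    for bound, count in zip(bucket_bounds, bucket_counts):
        cumulative += count
        lines.append(
            'flask_http_request_duration_seconds_bucket{le="%s"} %d'
            % (bound, cumulative)
        )
    lines.append(
        'flask_http_request_duration_seconds_bucket{le="+Inf"} %d'
        % cumulative
    )
    return "\n".join(lines) + "\n"

print(render_metrics(1024, [900, 100, 24], ["0.1", "0.5", "1.0"]))
```

The cumulative bucket counts are what make the histogram_quantile() queries in the Gold-tier dashboard possible.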

Decision B5: Loki + Promtail for SSH-free log inspection

Choice: Grafana Loki for log aggregation, Promtail for log shipping, Grafana for the viewing UI.

Rationale:

Alternatives considered:

Implementation:

Visual evidence:

Grafana Loki Logs
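Loki can also be queried directly over HTTP with the same LogQL the Grafana panel uses. The helper below only builds the query_range URL (base URL assumes Loki's default port on localhost); actually sending it requires a running Loki instance.

```python
from urllib.parse import urlencode

def loki_query_url(base="http://localhost:3100",
                   logql='{job="app"}', limit=100):
    """Build a Loki /loki/api/v1/query_range URL for a LogQL query.
    The base URL and limit here are illustrative defaults."""
    return "%s/loki/api/v1/query_range?%s" % (
        base, urlencode({"query": logql, "limit": limit})
    )

# The same stream selector RCA-001 uses to find cache failures:
url = loki_query_url(logql='{job="app"} |= "Redis unavailable"')
```

This is the third SSH-free inspection path alongside docker compose logs and the Grafana panel.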


Silver Tier Decisions

Decision S1: Prometheus + Alertmanager over Grafana Alerting

Choice: Alert rules evaluated by Prometheus, routed by Alertmanager.

Rationale:

Implementation:

Visual evidence:

Prometheus Alert Rules
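For reference, the two headline rules take roughly this shape (grouping, label set, and the job filter here are illustrative; the authoritative seven-rule file is monitoring/prometheus/alerts.yml):

```yaml
groups:
  - name: availability
    rules:
      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
      - alert: HighErrorRate
        expr: |
          sum(rate(flask_http_request_total{status=~"5.."}[5m]))
            / sum(rate(flask_http_request_total[5m])) > 0.05
        for: 30s
        labels:
          severity: warning
```

The severity label is what Alertmanager's routing tree matches on in Decision S2.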

Decision S2: Email (via Resend SMTP) as primary notification channel

Choice: Alertmanager sends email via smtp.resend.com:587 using a Resend API key.

Rationale:

Alternatives considered:

Implementation:

Visual evidence:

Alertmanager UI

Alertmanager Config
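The routing tree follows this general shape (addresses and the secret-injection mechanism are placeholders; the real file is monitoring/alertmanager/alertmanager.yml):

```yaml
global:
  smtp_smarthost: smtp.resend.com:587
  smtp_auth_username: resend
  smtp_auth_password: ${RESEND_API_KEY}   # placeholder for the API key

route:
  receiver: email-warnings
  group_wait: 10s
  routes:
    - matchers:
        - severity="critical"
      receiver: email-critical
      group_wait: 0s

receivers:
  - name: email-critical
    email_configs:
      - to: oncall@example.com            # placeholder address
  - name: email-warnings
    email_configs:
      - to: oncall@example.com            # placeholder address
```

Critical alerts match the child route and bypass batching; everything else falls through to the 10s-batched default receiver.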

Decision S3: Discord as opt-in secondary channel

Choice: Discord receivers defined in a separate file, dynamically merged at container startup only if DISCORD_WEBHOOK_URL is set.

Rationale:

Implementation:

Decision S4: Alert timing tuned for sub-5-minute delivery

Choice: Reduced group_wait to 0s (critical) and 10s (warnings) to ensure all alerts are delivered well within the 5-minute Track requirement.

Rationale:

Implementation:
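The timing budget is simple arithmetic: up to one evaluation interval for Prometheus to notice the condition, the rule's `for:` hold, then Alertmanager's group_wait. This sketch computes the worst-case bound for the four headline alerts (the ~60-70s figures quoted earlier assume a partial evaluation interval; this is the upper bound):

```python
# Worst-case time from fault to notification.
EVAL_INTERVAL = 15  # seconds (prometheus.yml: evaluation_interval)

def worst_case_latency(for_duration, group_wait):
    """Upper bound in seconds: one full evaluation interval, plus the
    rule's `for:` hold, plus Alertmanager's group_wait."""
    return EVAL_INTERVAL + for_duration + group_wait

alerts = {
    "ServiceDown":   worst_case_latency(60, 0),   # 75s worst case
    "HighErrorRate": worst_case_latency(30, 10),  # 55s worst case
    "HighLatency":   worst_case_latency(30, 10),  # 55s worst case
    "RedisDown":     worst_case_latency(60, 10),  # 85s worst case
}
assert all(v < 300 for v in alerts.values())  # all within 5 minutes
```

Even the slowest path leaves more than three minutes of headroom against the five-minute objective.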

Decision S5: Chaos testing script for live demonstration

Choice: A bash script (chaos-test.sh, 611 lines) that automates failure injection, alert verification, and recovery.

Rationale:

Implementation: scripts/chaos-test.sh


Gold Tier Decisions

Decision G1: Single “Golden Signals” dashboard with 8 panels

Choice: One pre-provisioned Grafana dashboard covering all four golden signals in a single view.

Rationale:

Panel mapping to golden signals:

| Golden Signal | Panel(s) | PromQL / Query |
|---|---|---|
| Latency | Latency p50/p95/p99 | `histogram_quantile(0.X, sum(rate(flask_http_request_duration_seconds_bucket[5m])) by (le))` |
| Traffic | Request Rate (req/s) | `sum(rate(flask_http_request_total[5m])) by (method)` |
| Errors | Error Rate (5xx %) | `sum(rate(flask_http_request_total{status=~"5.."}[5m])) / sum(rate(flask_http_request_total[5m])) * 100` |
| Saturation | Memory Usage (RSS), CPU Usage (%) | `process_resident_memory_bytes`, `rate(process_cpu_seconds_total{job="app"}[5m]) * 100` |
| Availability | Service Uptime | `up{job="app"}` |
| Operational | Active Alerts | `count(ALERTS{alertstate="firing"}) OR vector(0)` |
| Investigation | Application Logs | Loki: `{job="app"}` |

Implementation: monitoring/grafana/dashboards/url-shortener.json

Visual evidence:

Grafana Golden Signals Dashboard
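The latency panels rely on histogram_quantile() over cumulative bucket rates. A simplified stdlib version of the interpolation Prometheus performs makes the mechanics concrete (this mirrors, but is not, the PromQL implementation):

```python
def histogram_quantile(q, buckets):
    """Simplified PromQL-style quantile estimate.
    `buckets` is a sorted list of (upper_bound, cumulative_count)
    pairs, ending with (float("inf"), total). Interpolates linearly
    within the bucket where the target rank falls."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                return prev_bound  # fall back to last finite bound
            in_bucket = count - prev_count
            if in_bucket == 0:
                return bound
            return prev_bound + (bound - prev_bound) * (
                rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return prev_bound

# 900 requests under 100ms, 100 more under 500ms, 24 more under 1s:
buckets = [(0.1, 900), (0.5, 1000), (1.0, 1024), (float("inf"), 1024)]
p95 = histogram_quantile(0.95, buckets)
```

This is also why p99 accuracy depends on bucket boundaries: the estimate can only interpolate within the bucket the rank lands in.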

Decision G2: Grafana auto-provisioning via file-based providers

Choice: Dashboard JSON and datasource YAML are mounted into Grafana at startup via provisioning directories.

Rationale:

Implementation:
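Grafana's file-based datasource provisioning takes this shape (hostnames assume the Docker Compose service names; the real file is monitoring/grafana/provisioning/datasources/datasources.yml):

```yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
```

Because both datasources and the dashboard JSON are mounted at startup, a fresh `docker compose up` reproduces the full Grafana state with no manual clicking.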

Decision G3: Incident Playbook structured as a 20-section operational reference

Choice: A single comprehensive playbook (INCIDENT-PLAYBOOK.md) rather than multiple scattered runbook files.

Rationale:

Implementation: docs/Incident Response/runbooks/INCIDENT-PLAYBOOK.md (640 lines, 20 sections)

Decision G4: RCA documented against actual Grafana panels

Choice: The RCA (RCA-001-redis-failure.md) references specific Grafana dashboard panels and actual PromQL queries, rather than abstract descriptions.

Rationale:

Implementation: docs/Incident Response/rca/RCA-001-redis-failure.md (356 lines)

Decision G5: Google SRE postmortem template

Choice: A reusable POSTMORTEM-TEMPLATE.md following Google’s SRE postmortem format.

Rationale:

Implementation: docs/Incident Response/rca/POSTMORTEM-TEMPLATE.md (271 lines)


Architecture Overview

                                    ┌──────────────────────┐
                                    │       Grafana        │
                                    │    :3000             │
                                    │  Dashboard + Logs UI │
                                    └─────┬──────┬─────────┘
                                          │      │
                              ┌───────────┘      └────────────┐
                              v                                v
                   ┌──────────────────┐            ┌──────────────────┐
                   │   Prometheus     │            │      Loki        │
                    │   :9090          │            │    :3100         │
                    │  Metrics + Rules │            │  Log Aggregation │
                   └──────┬───────────┘            └──────┬───────────┘
                          │                                │
              ┌───────────┤                                │
              │           │                                │
   ┌──────────v───┐  ┌────v──────────┐           ┌────────v──────────┐
   │ Alertmanager │  │   App :5000   │           │     Promtail      │
   │   :9093      │  │  /metrics     │           │  Docker SD        │
   │  Email +     │  │  JSON logs    │           │  → Loki ingest    │
   │  Discord     │  │  OTEL traces  │           └───────────────────┘
   └──────────────┘  └──────┬────────┘
                            │
                   ┌────────v─────────┐
                   │     Jaeger       │
                   │   :16686         │
                   │  Distributed     │
                   │  Tracing         │
                   └──────────────────┘

Data flows:

  1. App → Prometheus: /metrics scraped every 15s via DNS service discovery.
  2. Prometheus → Alertmanager: Alert rules evaluated every 15s; firing alerts forwarded to Alertmanager.
  3. Alertmanager → Email/Discord: Severity-based routing; critical = instant, warning = 10s batched.
  4. App (stdout) → Docker json-file driver → Promtail → Loki: Container logs ingested with 5s refresh.
  5. Grafana → Prometheus + Loki: Dashboards query both datasources. Logs panel uses Loki, all other panels use Prometheus.
  6. App → Jaeger: OpenTelemetry traces exported via OTLP/gRPC.

Visual evidence:

Jaeger Tracing


File Index

All files contributing to Track 3, organized by function:

Logging

| File | Role |
|---|---|
| app/logging_config.py | JSON formatter setup (pythonjsonlogger) |
| app/middleware.py | Per-request structured log emission |
| gunicorn.conf.py | Gunicorn JSON log config |
| nginx/nginx.conf | Nginx JSON access log format |

Metrics

| File | Role |
|---|---|
| app/__init__.py | Prometheus Flask Exporter integration, /metrics endpoint |
| monitoring/prometheus/prometheus.yml | Scrape config (DNS SD, 15s interval) |

Alerting

| File | Role |
|---|---|
| monitoring/prometheus/alerts.yml | 7 alert rules (2 groups) |
| monitoring/alertmanager/alertmanager.yml | Email routing (Resend SMTP) |
| monitoring/alertmanager/discord-receivers.yml | Discord webhook receivers |
| monitoring/alertmanager/entrypoint.sh | Dynamic config merge at startup |

Log Aggregation

| File | Role |
|---|---|
| monitoring/loki/loki-config.yml | Loki storage and retention (72h, TSDB) |
| monitoring/promtail/promtail-config.yml | Docker SD log shipping to Loki |

Dashboards

| File | Role |
|---|---|
| monitoring/grafana/dashboards/url-shortener.json | Golden Signals dashboard (8 panels) |
| monitoring/grafana/provisioning/datasources/datasources.yml | Prometheus + Loki datasources |
| monitoring/grafana/provisioning/dashboards/dashboards.yml | File-based dashboard provider |

Incident Response Documentation

| File | Role |
|---|---|
| docs/Incident Response/README.md | Index of all Track 3 deliverables |
| docs/Incident Response/runbooks/INCIDENT-PLAYBOOK.md | Master operational playbook (640 lines) |
| docs/Incident Response/rca/RCA-001-redis-failure.md | Root cause analysis narrative |
| docs/Incident Response/rca/POSTMORTEM-TEMPLATE.md | Reusable postmortem template |
| scripts/chaos-test.sh | Automated chaos testing script |

Tracing (supplementary to Track 3)

| File | Role |
|---|---|
| app/tracing.py | OpenTelemetry SDK initialization, OTLP export to Jaeger |

Infrastructure

| File | Role |
|---|---|
| docker-compose.yml | All monitoring services (Prometheus, Alertmanager, Loki, Promtail, Grafana, Jaeger) |


Full Architecture Decision Log

This section records every significant technical choice made during the project, with alternatives considered, reasoning, and trade-offs accepted. Decisions covering the observability and incident response stack are recorded in detail above; this log captures the remaining application, data, infrastructure, and testing decisions.

Each entry follows this structure:


Application Layer

ADR-001: Flask over FastAPI or Django

Choice: Flask 3.0

Alternatives considered:

Why this:

Trade-offs accepted:


ADR-002: Gunicorn (gthread) over gevent, uvicorn, or async workers

Choice: Gunicorn 22.0 with worker_class = "gthread", 2 workers × 4 threads per replica.

Alternatives considered:

Why this:

Trade-offs accepted:
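Since gunicorn configuration files are plain Python, the choice above amounts to a few lines. The bind address and timeout are illustrative additions; the authoritative values live in gunicorn.conf.py.

```python
# Sketch of the worker model described above (see gunicorn.conf.py
# for the real file; bind and timeout here are illustrative).
workers = 2                # 2 worker processes per replica
threads = 4                # x 4 threads each = 8 concurrent requests
worker_class = "gthread"   # threaded workers suit I/O-bound handlers
bind = "0.0.0.0:5000"      # assumed app port (App :5000 in the diagram)
timeout = 30               # illustrative request timeout in seconds
```

With gthread workers the GIL serializes CPU-bound work within a process, which is the trade-off noted in ADR-002; the database and Redis waits that dominate this app release the GIL.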


ADR-003: Peewee ORM over SQLAlchemy or raw psycopg2

Choice: Peewee 3.19 with PooledPostgresqlDatabase

Alternatives considered:

Why this:

Trade-offs accepted:


ADR-004: uv as Python package manager over pip+venv, Poetry, or Pipenv

Choice: uv (Astral)

Alternatives considered:

Why this:

Trade-offs accepted:


Data Layer

ADR-005: PostgreSQL 16 over MySQL or SQLite

Choice: PostgreSQL 16

Alternatives considered:

Why this:

Trade-offs accepted:


ADR-006: Redis 7 with allkeys-lfu eviction and no TTL

Choice: Redis 7, allkeys-lfu, no TTL on cache entries, 128 MB memory cap

Alternatives considered:

Why this:

Trade-offs accepted:
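The cache settings above reduce to two Redis directives (whether they are set in redis.conf or on the container command line is a deployment detail):

```
maxmemory 128mb
maxmemory-policy allkeys-lfu
```

With no TTLs, LFU eviction alone decides what stays cached, which is why every write path must explicitly invalidate its key, the trade-off recorded in the summary table.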


ADR-007: synchronous_commit = off on PostgreSQL

Choice: PostgreSQL synchronous_commit = off

Alternatives considered:

Why this:

Trade-offs accepted:


Infrastructure Layer

ADR-008: Nginx over HAProxy, Traefik, or Caddy

Choice: Nginx 1.25

Alternatives considered:

Why this:

Trade-offs accepted:


ADR-009: Docker Compose over Kubernetes or Docker Swarm

Choice: Docker Compose (with custom autoscaler sidecar)

Alternatives considered:

Why this:

Trade-offs accepted:


ADR-010: GitHub Actions for CI/CD over CircleCI, Jenkins, or GitLab CI

Choice: GitHub Actions

Alternatives considered:

Why this:

Trade-offs accepted:


ADR-011: DigitalOcean over AWS, GCP, or Heroku

Choice: DigitalOcean Droplet (2 vCPU, 2 GB RAM, $12/month)

Alternatives considered:

Why this:

Trade-offs accepted:


Testing Strategy

ADR-016: k6 for load testing over Locust or JMeter

Choice: k6 (Grafana)

Alternatives considered:

Why this:

Trade-offs accepted:


ADR-017: 70% coverage floor over 100% or no coverage gate

Choice: --cov-fail-under=70 in CI (actual coverage: 91%)

Alternatives considered:

Why this:

Trade-offs accepted:


Summary Table

| ADR | Decision | Key Trade-off |
|---|---|---|
| ADR-001 | Flask over FastAPI/Django | No async, no built-in validation |
| ADR-002 | Gunicorn gthread over gevent | GIL limits CPU parallelism per worker |
| ADR-003 | Peewee over SQLAlchemy | Smaller ecosystem, less mature migrations |
| ADR-004 | uv over pip/Poetry | Younger project, fewer resources |
| ADR-005 | PostgreSQL over MySQL/SQLite | Slightly higher memory footprint |
| ADR-006 | Redis LFU, no TTL | Must explicitly invalidate on every write |
| ADR-007 | synchronous_commit=off | ~200ms crash window on event writes |
| ADR-008 | Nginx over HAProxy/Traefik | Verbose config, manual SIGHUP for upstream changes |
| ADR-009 | Docker Compose over Kubernetes | No cross-node scheduling |
| ADR-010 | GitHub Actions over CircleCI | Minute limits on private repos |
| ADR-011 | DigitalOcean over AWS/GCP | No managed autoscaling across nodes |
| ADR-016 | k6 over Locust/JMeter | JS not primary team language |
| ADR-017 | 70% coverage floor | Coverage ≠ assertion quality |