Problem: A fintech API platform had CloudWatch alarms on CPU and memory — infrastructure metrics that gave no signal on user impact. Incidents were discovered via support tickets, not monitoring.
Solution: Defined SLIs as HTTP 5xx rate and P99 latency per endpoint group. Pyrra computed burn rate against the 99.95% SLO. A 14× burn rate for 1 hour triggered an immediate page. Multi-region failover (us-east-1 + eu-west-1) validated in a game day that found 3 cross-region config inconsistencies before they caused a real incident.
Technology: EKS · OpenTelemetry · Prometheus · Pyrra · Route53 · Terraform
Optimisation pattern: infra-metrics-to-slo-burn-rate-alerting
Outcomes:
MTTD: 4.2 hours → 6 minutes. 7 of 8 incidents detected before customer impact. 99.95% availability sustained for 6 months. 3 config inconsistencies fixed pre-incident.