Multi-region EKS at 99.95% with SLO burn-rate alerting

Problem: A fintech API platform had CloudWatch alarms on CPU and memory — infrastructure metrics that gave no signal on user impact. Incidents were discovered via support tickets, not monitoring.

Solution: Defined SLIs as HTTP 5xx rate and P99 latency per endpoint group. Pyrra computed burn rate against the 99.95% SLO. A 14× burn rate for 1 hour triggered an immediate page. Multi-region failover (us-east-1 + eu-west-1) validated in a game day that found 3 cross-region config inconsistencies before they caused a real incident.

Technology: EKS · OpenTelemetry · Prometheus · Pyrra · Route53 · Terraform

Optimisation pattern: infra-metrics-to-slo-burn-rate-alerting

Outcomes:

MTTD: 4.2 hours → 6 minutes. 7 of 8 incidents detected before customer impact. 99.95% availability sustained for 6 months. 3 config inconsistencies fixed pre-incident.

Your privacy, your call

Multi-region EKS platform sustaining 99.95% availability with burn-rate alerting