Problem: Manual kubectl apply meant the applied manifest might differ from what was in Git. No record of who deployed what or when. Rollbacks required finding the previous image tag in Slack. Change failure rate 18% (industry benchmark ~15% for this scale).
Solution: Service-by-service ArgoCD migration over 10 weeks. Direct kubectl access to production revoked — GitOps sync became the only deployment path. Caught 14 config drift incidents in the first 2 weeks. Argo Rollouts: 5% canary for 10 minutes, Prometheus analysis on error rate and P99 latency, automatic abort on threshold breach. Slack message with analysis results win or lose.
Technology: ArgoCD · Argo Rollouts · Kubernetes · Prometheus · Helm · GitHub
Optimisation pattern: manual-kubectl-to-argocd-gitops-with-prometheus-gated-canary-rollout
Outcomes:
Change failure rate: 18% → 3%. Config drift incidents: eliminated — ArgoCD self-heals within 3 minutes. 100% of deployments are PRs with reviewer and timestamp. MTTR on bad deploy: 22 min → 4 min.