Problem: ALTER TABLE on 2.4 TB with 800 connections = full-table lock for ~14 hours. Platform GMV was £2.1M/hour at peak. No maintenance window was commercially acceptable.
Solution: Expand-contract across 4 releases over 4 weeks. Release 1: add new nullable columns, deploy dual-write code. Background backfill at 10% peak IOPS using Postgres advisory locks, completing in 11 days. Release 2: switch reads to new columns (dual-write stays). Release 3: drop old columns (near-instant in Postgres — catalog update only). Release 4: table split using same pattern on now-smaller table.
Technology: Postgres · Flyway · Python · Testcontainers
Optimisation pattern: expand-contract-dual-write-with-throttled-background-backfill
Outcomes:
Zero seconds of downtime across all 4 releases. No measurable connection count increase during backfill. Documented as company-wide migration standard. Used for 3 subsequent large-table changes by the internal team.