Situations where a real constraint — scale, latency, compliance, cost — had to break. Each write-up ties to a number leadership cared about.
A job-board affiliate aggregating XML feeds from Adzuna, Talroo, Monster, and 12 other providers needed to parse, deduplicate, and delta-sync to Elasticsearch inside a 30-minute feed window. Java StAX/SAX consumed the entire window on parsing alone. MySQL LOAD XML rewrote the numbers.
A payments platform routing transactions through 12 PSPs had no reliable way to handle network timeouts. A timeout meant the client could not know whether the charge had succeeded. Operations staff spent 4 hours daily reconciling mismatches.
A last-mile logistics platform's Python queue processor saturated at 120 jobs/second under peak load. CPython's GIL prevented true concurrency on CPU-bound route scoring. Driver assignment delays were causing visible UX degradation.
A multi-tenant HR platform sharing one Postgres database had experienced 3 cross-tenant data leaks in 12 months due to missing WHERE tenant_id clauses. Application-layer filtering was a convention, not an enforcement mechanism.
A claims processor running 50,000 daily claims through a 6-step RabbitMQ pipeline lost an average of 12 claims per day when workers crashed mid-processing. Re-processing from step 1 also double-billed a third-party enrichment API. Temporal replaced the queue with durable workflows that resume from the last completed step.
A retailer with stores, e-commerce, wholesale, and loyalty channels had no single customer view. Six source systems used different customer keys. Monday morning reconciliation took one analyst 4 hours. A dbt-modelled BigQuery warehouse with identity resolution consolidated everything with 47-minute end-to-end latency.
Three regulators required different things from the audit trail — SEC wanted immutability and 7-year retention; FINRA wanted sub-second query on the last 90 days; an EU regulator wanted tamper-evidence per record. Three separate audit tables with a sync ETL had caused two compliance findings. One append-only log replaced all three.
An e-commerce platform needed to split a 2.4 TB orders table into three tables. A direct ALTER TABLE would lock the table for an estimated 14 hours — equivalent to £29.4M in risk at peak GMV. The expand-contract pattern executed across four releases with zero downtime.
A documentation platform's LCP averaged 6.8 seconds after integrating a headless CMS that returned a 12 MB JSON payload per page at request time. ISR, payload splitting, React Server Components, and a CMS webhook for on-demand revalidation brought LCP to 1.1 s without touching the CMS API.
A B2B billing console had 3.4-second FCP because the entire page awaited a billing API with P95 latency of 2.8 seconds. Partial prerendering served the static shell instantly from the edge while the dynamic usage summary streamed in. A Redis cache on the aggregation endpoint cut API P95 from 2.8 s to 220 ms.
Field engineers worked in intermittent-3G areas. The previous app overwrote local work on reconnection, losing completed data. Engineers photographed screens as backup. WatermelonDB with server-reconciled conflict resolution made connectivity optional.
An iOS health app was rejected twice by Apple's privacy review team: once for requesting HealthKit entitlements on launch without user intent, and once for writing derived data back without per-type consent. A full codebase audit found four additional issues. The third submission was approved without reviewer questions.
A lifestyle app with 4M MAU had no automated safety net — bad releases reached 100% of users before engineers noticed. Sentry release health gated staged rollouts automatically; the New Architecture migration simultaneously resolved 40% of crash volume.
A support team handled 1,400 tickets/day, 60% of which were answerable from docs. A first RAG prototype scored 61% factuality, worse than a search bar, because retrieval returned chunks that were semantically similar but contextually wrong.
An operations platform received 8,000 daily support requests across email, Slack, and web. Manual triage consumed 1.5 FTE. A two-stage classifier — fine-tuned DistilBERT for high-confidence cases, Claude API fallback for low-confidence ones — routed to 17 queues at 94% overall precision.
An AI infrastructure provider needed to serve a 70B open-weight model to 12 enterprise tenants with contractual data-isolation requirements. vLLM's default KV cache sharing was incompatible with isolation. Per-tenant namespace prefixing with prefix caching maintained 79% cache hit rate per tenant while satisfying isolation audits.
A fintech API platform had CloudWatch alarms on CPU and memory — infrastructure metrics that gave no signal on user impact. SLO burn-rate alerting on user-facing SLIs caught 7 of 8 incidents before customers noticed.
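As a rough illustration of burn-rate alerting, the sketch below computes how fast an error budget is being consumed and pages only when both a long and a short window exceed a threshold. The 99.9% target, the 14.4 threshold, and the function names are illustrative assumptions, not values from this engagement.

```python
def burn_rate(bad_events: int, total_events: int, slo_target: float) -> float:
    """Rate of error-budget consumption; 1.0 means exactly on budget."""
    if total_events == 0:
        return 0.0
    error_rate = bad_events / total_events
    error_budget = 1.0 - slo_target  # e.g. 0.001 for a 99.9% SLO
    return error_rate / error_budget


def should_page(long_window: tuple, short_window: tuple,
                slo_target: float = 0.999, threshold: float = 14.4) -> bool:
    """Page only if both windows burn faster than the threshold.

    Each window is a (bad_events, total_events) pair. Requiring the short
    window as well stops the alert firing long after the problem has ended.
    """
    return (burn_rate(*long_window, slo_target) >= threshold
            and burn_rate(*short_window, slo_target) >= threshold)
```

With these assumed values, 20 failed requests out of 1,000 against a 99.9% target is a burn rate of about 20, which would page if the short window agrees.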
A growth-stage SaaS was deploying 200 microservices via manual kubectl apply from engineers' laptops. Deployments were undocumented, unrepeatable, and caused production incidents 18% of the time. ArgoCD with Argo Rollouts and SLO-gated canaries moved everything to GitOps with automatic rollback.
A healthcare SaaS had 340 static credentials distributed across env vars, config files, and a shared password manager. An offboarding incident revealed a contractor's credentials were still valid 6 weeks after departure because nobody had a complete inventory. Vault dynamic secrets eliminated the concept of a long-lived credential.
An RTMP-based camera platform had 8–14 second live-view latency. Each RTMP hop (camera → media server → CDN → client) added 1.5–3 seconds. Operators were watching events that had already happened.
Storing raw 4 kHz vibration data from 80,000 sensors was arithmetically impossible: 47 TB/day uncompressed. No analytical value was lost by computing RMS/peak/FFT statistics instead of storing raw samples.
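A minimal sketch of reducing one window of raw samples to summary statistics might look like the following. The function names and window size are hypothetical, and the naive DFT here is only a dependency-free stand-in for a real FFT library.

```python
import math


def summarize_window(samples: list) -> dict:
    """Reduce one window of raw vibration samples to RMS, peak and crest factor."""
    if not samples:
        raise ValueError("window is empty")
    rms = math.sqrt(sum(x * x for x in samples) / len(samples))
    peak = max(abs(x) for x in samples)
    return {"rms": rms, "peak": peak, "crest_factor": peak / rms if rms else 0.0}


def dominant_frequency_hz(samples: list, sample_rate_hz: float) -> float:
    """Return the frequency of the strongest non-DC bin (naive O(n^2) DFT)."""
    n = len(samples)
    best_bin, best_power = 0, 0.0
    for k in range(1, n // 2 + 1):
        re = sum(x * math.cos(2 * math.pi * k * i / n) for i, x in enumerate(samples))
        im = sum(x * math.sin(2 * math.pi * k * i / n) for i, x in enumerate(samples))
        power = re * re + im * im
        if power > best_power:
            best_bin, best_power = k, power
    return best_bin * sample_rate_hz / n


# Usage: 0.1 s of a synthetic 50 Hz tone sampled at 4 kHz (400 raw samples)
window = [math.sin(2 * math.pi * 50 * i / 4000) for i in range(400)]
stats = summarize_window(window)
stats["dominant_hz"] = dominant_frequency_hz(window, 4000)
```

In this example 400 raw samples collapse to four numbers per window, which is where the storage saving comes from.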
A fleet telematics provider's Python UDP ingestion dropped points during cellular handoffs, causing vehicles to 'disappear' for 8–12 minutes on fleet maps — a compliance issue for regulated cargo. A Go edge collector with outbox pattern and sequence-number reconciliation achieved exactly-once GPS history.
A global manufacturer needed to replace a 12-year-old ADFS estate blocking cloud adoption. 200,000 employees, 14 countries, 180 SAML-integrated applications, and zero tolerance for a login outage. A prior attempt was abandoned after causing a 4,000-user login loop.
Manual CSV user imports took 3 days per enterprise customer. Deprovisioning was by email to an irregularly checked mailbox — a SOC 2 audit found 34 accounts active 30+ days after the user had left the customer organisation. SCIM 2.0 endpoints automated both sides.
A DeFi yield protocol was two weeks from mainnet launch with $24M in committed TVL. A cross-function reentrancy path bypassed the standard single-function guard — exploitable to drain the protocol in a single transaction.
A property investment platform wanted retail investors to buy fractional commercial property ownership from £500. Regulatory compliance required KYC-gated transfers, investor limits per property, and a compliant secondary market. Standard ERC-20 tokens could not enforce transfer restrictions. ERC-1400 security tokens enforced compliance at the contract level.
The initial firmware ran all peripherals continuously. The PPG sensor, MCU idle current, BLE advertising, and LCD backlight together produced a 3.1-day battery life against a 14-day clinical requirement.
An industrial gateway running a super-loop firmware stalled CAN bus processing for 15–60 seconds whenever the LTE modem became unresponsive — causing safety shutdowns on connected field devices. FreeRTOS task isolation with watchdog supervision made each subsystem independently restartable.
A previous OTA update bricked 3,400 smart meters when a firmware bug caused a boot loop on a specific hardware revision. Physical truck rolls to recover cost £340,000. A dual-bank bootloader with cryptographic verification and automatic rollback made failed updates self-recovering.
The API gateway accepted both RS256 and HS256 JWT algorithms. An attacker could forge valid tokens for any user by signing an HS256 token with the public key as the HMAC secret. Automated scanners did not detect it — only manual JWT manipulation found the flaw.
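The sketch below reproduces the shape of this algorithm-confusion attack using only the standard library, then shows one common mitigation: pinning each key to a single algorithm and rejecting any token whose header disagrees before signature verification runs. The function names and the placeholder key are hypothetical; real signature verification would still be done by a JWT library.

```python
import base64
import hashlib
import hmac
import json


class AlgorithmNotAllowed(Exception):
    """Raised when a token's header names an algorithm the key is not pinned to."""


def b64url_encode(raw: bytes) -> str:
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def b64url_decode(part: str) -> bytes:
    return base64.urlsafe_b64decode(part + "=" * (-len(part) % 4))


def forge_hs256_token(claims: dict, public_key_pem: bytes) -> str:
    """Reproduce the attack: HS256-sign using the public key bytes as the HMAC secret."""
    header = b64url_encode(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = b64url_encode(json.dumps(claims).encode())
    signing_input = f"{header}.{payload}".encode("ascii")
    signature = hmac.new(public_key_pem, signing_input, hashlib.sha256).digest()
    return f"{header}.{payload}.{b64url_encode(signature)}"


def require_algorithm(token: str, expected_alg: str) -> dict:
    """Reject the token unless its header names the one algorithm this key is for."""
    header = json.loads(b64url_decode(token.split(".")[0]))
    if header.get("alg") != expected_alg:
        raise AlgorithmNotAllowed(f"expected {expected_alg}, got {header.get('alg')!r}")
    return header


# Usage: the forged token is refused because the key is pinned to RS256.
pem = b"-----BEGIN PUBLIC KEY-----\nhypothetical\n-----END PUBLIC KEY-----\n"
forged = forge_hs256_token({"sub": "admin"}, pem)
try:
    require_algorithm(forged, "RS256")
    outcome = "accepted"
except AlgorithmNotAllowed:
    outcome = "rejected"
```

The design point is that the verifier, not the token, decides which algorithm applies to a given key.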
A Kubernetes cluster that had grown organically had a three-step privilege escalation path from a compromised pod to cluster-admin via an overly permissive ServiceAccount and a misconfigured admission policy. The path was closed before the SOC 2 Type II audit window opened.
A 3-person team spent 4 hours daily exporting from SAP to Salesforce via CSV/Excel. Formula errors caused roughly 2 incorrect quotes per week. Inventory data in Salesforce was 24 hours stale.
Stripe's at-least-once delivery guarantee was causing duplicate order fulfillments, double notification emails, and incorrect account credits. A Redis-based idempotency layer using Stripe event IDs as deduplication keys, with a 48-hour TTL covering Stripe's retry window, eliminated all duplicates.
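A minimal sketch of such a deduplication layer is shown below. It assumes a store whose `set(name, value, nx=True, ex=seconds)` behaves like Redis `SET ... NX EX`, which is the shape redis-py exposes. The key prefix, the handler names, and the in-memory stand-in are hypothetical and exist only so the example runs without a Redis server.

```python
import time

DEDUP_TTL_SECONDS = 48 * 60 * 60  # the 48-hour TTL described above


class InMemoryStore:
    """Stand-in that mimics SET NX EX semantics for local testing."""

    def __init__(self):
        self._expiry = {}

    def set(self, name, value, nx=False, ex=None):
        now = time.monotonic()
        expires = self._expiry.get(name)
        if nx and expires is not None and expires > now:
            return None  # key already present and not expired
        self._expiry[name] = now + (ex if ex is not None else float("inf"))
        return True


def handle_event(store, event: dict, fulfil) -> bool:
    """Run `fulfil` at most once per event ID; return True if it ran."""
    key = f"stripe:event:{event['id']}"  # hypothetical key naming
    first_delivery = store.set(key, "1", nx=True, ex=DEDUP_TTL_SECONDS)
    if not first_delivery:
        return False  # duplicate delivery: acknowledge and skip side effects
    fulfil(event)
    return True


# Usage: the second delivery of evt_1 is ignored.
store, fulfilled = InMemoryStore(), []
results = [handle_event(store, {"id": eid}, fulfilled.append)
           for eid in ("evt_1", "evt_1", "evt_2")]
```

One design caveat: claiming the key before running the handler means a crash mid-handler would suppress the retry, so a production version would need to release or mark the key on failure.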
A 5-year big-bang modernisation programme stalled after 18 months with nothing in production. It was restructured around incremental delivery, and the first production service went live in month 4 of the revised programme.
A 15-year-old PHP monolith had accumulated enough technical debt that feature delivery averaged 6 weeks. A big-bang rewrite was rejected — 18 months with no new features was unacceptable. The strangler fig pattern extracted modules incrementally behind a proxy, shrinking the monolith by 67% while keeping the API contract unchanged.
A clinic network needed a unified intake portal for 14 specialties, each with different form requirements, wired into AWS HealthLake with full HIPAA compliance. A shared form engine driven by JSON configuration, a clinician dashboard, and a patient mobile app were all live for the first clinic in 6 weeks.
A smart agriculture platform had deployed firmware updates for 3 years without versioning the telemetry schema. Field name changes, unit changes, and silent ingest failures had made 40% of historical sensor data unqueryable. Schema archaeology and a versioned ingestion pipeline recovered the data without touching 8,000 deployed sensors.
A computer vision company was randomly sampling from 2M images for annotation — spending most labelling budget on easy, redundant non-defect examples. An active learning loop directed effort to uncertain and high-information images, reaching the same model accuracy with 660,000 annotations instead of 2 million.
A production credit scoring model was evaluated quarterly. Between evaluations, silent feature distribution drift caused the Gini coefficient to degrade. A PSI-based drift monitor running daily detected shifts within 48 hours and triggered automatic retraining, preventing two periods where the model would have breached its accuracy SLA.
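For reference, the population stability index compares the binned distribution of a feature at training time with its current distribution: PSI = Σ (aᵢ − eᵢ) · ln(aᵢ / eᵢ), where eᵢ and aᵢ are the expected and actual bin proportions. The sketch below is a minimal version under assumed inputs (pre-binned counts); the function name and the epsilon floor are illustrative choices, not the production monitor.

```python
import math


def population_stability_index(expected_counts, actual_counts, eps=1e-6):
    """PSI between two binned distributions given as per-bin counts."""
    if len(expected_counts) != len(actual_counts):
        raise ValueError("both distributions must use the same bins")
    expected_total = sum(expected_counts)
    actual_total = sum(actual_counts)
    psi = 0.0
    for e, a in zip(expected_counts, actual_counts):
        # Floor the proportions so an empty bin does not produce log(0).
        e_pct = max(e / expected_total, eps)
        a_pct = max(a / actual_total, eps)
        psi += (a_pct - e_pct) * math.log(a_pct / e_pct)
    return psi


# Usage: training-time bins versus a shifted current distribution
baseline = [25, 25, 25, 25]
current = [10, 20, 30, 40]
drift = population_stability_index(baseline, current)
```

A commonly quoted rule of thumb treats values above roughly 0.25 as a significant shift, though the threshold that triggers retraining is a tuning decision and is not stated here.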
A pharmaceutical distributor needed DSCSA-compliant end-to-end drug provenance across 14 partners with incompatible ERPs. EDI file exchange was reconciled after the fact, could be altered, and had different formats per partner. A Hyperledger Fabric permissioned network gave all parties a shared tamper-proof ledger.