Problem: vLLM's paged attention shares KV cache blocks across requests for efficiency, but sharing across tenant boundaries violated contractual isolation guarantees. Naive namespace partitioning dropped the KV cache hit rate from 82% to 34%, cutting throughput unacceptably.
Solution: A custom request router prefixed each tenant's system prompt with a tenant-specific token sequence, guaranteeing that cached KV blocks are partitioned per tenant. Each tenant's static system prompt prefix was pre-loaded and cached at engine startup in its own namespace, restoring the hit rate to 79% for the static portion. Tensor parallelism spread the model across 4×H100 GPUs.
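The routing idea above can be sketched in a few lines. This is a minimal illustration, not the production router: the names (`Tenant`, `TenantRouter`, the `<|tenant:…|>` marker) are hypothetical, and in practice the marker would be a reserved token sequence rather than plain text. The key property is that prefix caching shares KV blocks only between requests whose token prefixes are identical, so a per-tenant sentinel at position zero makes cross-tenant sharing impossible while same-tenant requests still share the static system-prompt prefix.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tenant:
    tenant_id: str
    system_prompt: str  # static portion, pre-loaded at engine startup


class TenantRouter:
    """Prepends a tenant-specific sentinel so prefix-cache keys never
    collide across tenants: prompts from different tenants differ from
    the very first token, while prompts from the same tenant share the
    sentinel + static system prompt and can reuse its cached KV blocks."""

    def __init__(self, tenants):
        self._tenants = {t.tenant_id: t for t in tenants}

    def namespace_prefix(self, tenant_id: str) -> str:
        # Illustrative marker; a real deployment would use a dedicated
        # token sequence guaranteed not to appear in user input.
        return f"<|tenant:{tenant_id}|>\n"

    def build_prompt(self, tenant_id: str, user_message: str) -> str:
        tenant = self._tenants[tenant_id]
        return (
            self.namespace_prefix(tenant_id)
            + tenant.system_prompt
            + "\n"
            + user_message
        )
```

The pre-load step then amounts to issuing one warm-up request per tenant at startup so each namespaced static prefix is resident in the cache before live traffic arrives.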
Technology: vLLM · NVIDIA H100 · Kubernetes · OIDC · Python
Optimisation pattern: shared-kv-cache-to-prefix-namespaced-cache-with-per-tenant-preload
Outcomes:
Aggregate throughput: 1,800 tokens/second
Isolation overhead vs no-isolation baseline: 14%
Isolation audit: passed
GPU utilisation: 87% average
Infrastructure cost vs managed API: 61% lower at sustained load
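As a back-of-envelope check (an assumption on my part, not a figure from the source): if the 14% overhead is measured against aggregate throughput, the implied no-isolation baseline is:

```python
# Implied baseline throughput, assuming
#   isolated = baseline * (1 - overhead)
isolated_tps = 1_800
overhead = 0.14
baseline_tps = isolated_tps / (1 - overhead)
print(round(baseline_tps))  # 2093 tokens/second
```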