Problem: vLLM's paged attention shares KV cache blocks across requests for efficiency, but sharing across tenant boundaries violated contractual isolation guarantees. Naive namespace partitioning dropped the KV cache hit rate from 82% to 34%, cutting throughput unacceptably.
Solution: A custom request router prefixed each tenant's system prompt with a tenant-specific token sequence, guaranteeing that cached KV blocks are partitioned per tenant. Each tenant's static system prompt prefix was pre-loaded and cached at engine startup in its own namespace, restoring the hit rate to 79% for the static portion. Tensor parallelism spread the model across 4×H100 GPUs.
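The routing idea above can be sketched in a few lines. This is a minimal illustration, not the production router: the names (`Tenant`, `TenantRouter`, the `<|tenant:…|>` marker) are hypothetical, and in practice the marker would be a reserved token sequence rather than plain text. The key property is that prefix caching shares KV blocks only between requests whose token prefixes are identical, so a per-tenant sentinel at position zero makes cross-tenant sharing impossible while same-tenant requests still share the static system-prompt prefix.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tenant:
    tenant_id: str
    system_prompt: str  # static portion, pre-loaded at engine startup


class TenantRouter:
    """Prepends a tenant-specific sentinel so prefix-cache keys never
    collide across tenants: prompts from different tenants differ from
    the very first token, while prompts from the same tenant share the
    sentinel + static system prompt and can reuse its cached KV blocks."""

    def __init__(self, tenants):
        self._tenants = {t.tenant_id: t for t in tenants}

    def namespace_prefix(self, tenant_id: str) -> str:
        # Illustrative marker; a real deployment would use a dedicated
        # token sequence guaranteed not to appear in user input.
        return f"<|tenant:{tenant_id}|>\n"

    def build_prompt(self, tenant_id: str, user_message: str) -> str:
        tenant = self._tenants[tenant_id]
        return (
            self.namespace_prefix(tenant_id)
            + tenant.system_prompt
            + "\n"
            + user_message
        )
```

The pre-load step then amounts to issuing one warm-up request per tenant at startup so each namespaced static prefix is resident in the cache before live traffic arrives.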
Technology: vLLM · NVIDIA H100 · Kubernetes · OIDC · Python
Optimisation pattern: shared-kv-cache-to-prefix-namespaced-cache-with-per-tenant-preload
Outcomes:
Aggregate throughput: 1,800 tokens/second
Isolation overhead vs no-isolation baseline: 14%
Isolation audit: passed
GPU utilisation: 87% average
Infrastructure cost vs managed API: 61% lower at sustained load
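As a back-of-envelope check (an assumption on my part, not a figure from the source): if the 14% overhead is measured against aggregate throughput, the implied no-isolation baseline is:

```python
# Implied baseline throughput, assuming
#   isolated = baseline * (1 - overhead)
isolated_tps = 1_800
overhead = 0.14
baseline_tps = isolated_tps / (1 - overhead)
print(round(baseline_tps))  # 2093 tokens/second
```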