vLLM Blog·Infra·6d ago·~3 min read

vLLM x Novita AI: PegaFlow for Production-Grade External KV Cache
May 18, 2026 · 13 min read

TL;DR: In collaboration with Novita AI, PegaFlow integrates with vLLM as an external KV cache service for LLM inference, implemented as a standalone Rust process and connected through the external KV connector interface. It moves KV cache lifetime out of the vLLM worker process, pools cache across local instances and remote nodes, and combines pinned host memory, RDMA-accessible remote memory, and SSD into a three-level cache hierarchy.

In production-oriented evaluations, this design delivered:

- 2.15x faster vLLM startup when a 500 GiB host KV pool was already owned by the external cache service.
- 56% higher throughput for eight Qwen3-8B instances sharing one host cache instead of eight isolated caches.
- 72% higher throughput for DeepSeek-V3.2 MLA with TP8 by storing logical KV once instead of once per TP rank.
- 194 GB/s average remote-read throughput for large prefix pulls in an internal RDMA cluster with 8 x 400 Gbps NICs per node.

The core idea is simple: KV cache should be a long-lived serving asset, not temporary state tied to one inference process.

For vLLM users, the important part is that this integration is exposed through the existing kv_transfer_config path. PegaFlow can be used as an external cache backend without modifying vLLM source code or carrying a long-lived fork.

Why KV cache needs a process boundary

KV cache is one of the most expensive runtime assets in production LLM serving. It can occupy hundreds of GiB per host, takes time to allocate and warm, and often outlives the request pattern that originally created it. In a conventional in-process design, that asset is tightly coupled to the inference engine process.

This coupling becomes painful during engine crashes, rolling upgrades, and model switches. When an engine restarts, the host KV pool disappears with it. When a serving fleet switches from one model deployment to another, hundreds of GiB of pinned memory may need to be reallocated and warmed before the instance can serve traffic again.

PegaFlow addresses this by moving the KV cache runtime into a standalone daemon on each machine. The PegaFlow server owns the host KV pool, SSD cache, topology metadata, RDMA resources, indexing state, and background tasks. vLLM workers connect to the local PegaFlow process through CUDA IPC on the data path and gRPC on the local control path.

This design was built around a production requirement: one cache server should be able to serve multiple engines and multiple models on the same host. Different models, tensor-parallel configurations, and engine versions can coexist under one PegaFlow process with namespace isolation, while sharing the same memory pool, SSD capacity, and cross-node network bandwidth.

The resulting failure domains are cleaner. A vLLM process can crash, upgrade, or switch models while the cache service remains alive. Conversely, cache-layer issues do not have to bring down the inference engine process.

Faster restarts with external cache ownership

To isolate the startup-path impact of host KV pool ownership, we measured an 8 x RTX…
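As a rough illustration of the kv_transfer_config path mentioned in the TL;DR, the sketch below builds the JSON payload that vLLM accepts through its `--kv-transfer-config` option and assembles a `vllm serve` command line around it. The field names (`kv_connector`, `kv_connector_module_path`, `kv_role`, `kv_connector_extra_config`) are vLLM's KV transfer config fields. The connector class name and module path are hypothetical placeholders, since this excerpt does not name them, and the helper functions are illustrative only. The sketch only constructs the command and does not launch anything.

```python
import json
import shlex

# Hypothetical placeholders: the excerpt does not name PegaFlow's connector
# class or Python module, so substitute the values from the PegaFlow docs.
PEGAFLOW_CONNECTOR = "PegaFlowConnector"
PEGAFLOW_MODULE = "pegaflow.connector"


def build_kv_transfer_config(connector=PEGAFLOW_CONNECTOR,
                             module_path=PEGAFLOW_MODULE,
                             role="kv_both",
                             extra=None):
    """Return a dict shaped like vLLM's KV transfer config.

    role is one of vLLM's KV roles: kv_producer, kv_consumer, or kv_both.
    extra holds connector-specific options (left empty here, because the
    options an external cache service expects are not given in the text).
    """
    return {
        "kv_connector": connector,
        "kv_connector_module_path": module_path,
        "kv_role": role,
        "kv_connector_extra_config": dict(extra or {}),
    }


def build_serve_command(model, config):
    """Assemble the argv list for `vllm serve` with the config as JSON."""
    return ["vllm", "serve", model, "--kv-transfer-config", json.dumps(config)]


if __name__ == "__main__":
    cfg = build_kv_transfer_config()
    # Print a shell-quoted command line for inspection; nothing is executed.
    print(shlex.join(build_serve_command("Qwen/Qwen3-8B", cfg)))
```

Because the connector is selected purely through configuration, the same unmodified vLLM build can be pointed at a different cache backend by changing these values, which is the sense in which no fork is needed.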

#inference

