# Serving Agentic Workloads at Scale with vLLM x Mooncake

May 6, 2026 · 10 min read
TL;DR: Agentic workloads generate massive shared prefixes that are often recomputed across turns. By integrating Mooncake's distributed KV cache store into vLLM, we achieve 3.8x higher throughput, 46x lower TTFT, and 8.6x lower end-to-end latency on realistic agentic traces, while scaling nearly linearly to 60 GB200 GPUs.

## Agentic workloads are reshaping LLM serving

With the rise of LLM agents such as Claude Code and OpenClaw, inference workloads are undergoing a fundamental shift. As Jensen Huang highlighted in his GTC 2026 keynote, LLMs are moving beyond simple chatbots toward autonomous, long-running systems that plan, reason, and act toward complex goals.

What makes agentic workloads unique is their structure. They typically consist of long-horizon, multi-turn loops that alternate between a reasoning step, where the model processes context and produces intermediate thoughts, and an action step, where the model issues tool calls and receives external outputs.

To quantify this behavior, we collected and analyzed traces from Codex and GPT-5.4 on the SWE-bench Pro dataset. We have also open-sourced the dataset here to encourage broader community study of agentic serving workloads.

Figure 1 summarizes the Codex/SWE-bench Pro traces and shows a representative agentic session. The pattern is striking: by turn 30, context length grows to roughly 80K tokens, and the longest contexts exceed 180K tokens. Yet each turn typically introduces only a few hundred to a few thousand new tokens; the rest is prefix the model has already seen. Across the dataset, the average input-to-output token ratio is roughly 131:1. If we can cache those prefixes, prefill for the cached portion becomes essentially free, and the true per-turn cost is only the new delta.

Across the Codex/SWE-bench Pro dataset, comprising 610 traces with a median of 33 turns per trace, we observe:

- 94.2% cache hit rate
- 131:1 input-to-output ratio
- Average context growth of roughly 2,242 tokens per turn
- Median per-trace context growth from 12K to 80K tokens
- Inter-turn delays ranging from a 5.2s median to 81.4s at P99

The obvious way to exploit this reuse is to keep KV caches around between turns, spilling from GPU memory to local CPU DRAM or disk when needed. However, local KV cache offloading runs into two major limitations for agentic workloads:

- Limited capacity and eviction. A 100K-token context can occupy GBs of storage (e.g., ~3.8 GB for Kimi-2.5 FP8 KV caches, or roughly 38 KB of KV per token). On a busy instance serving many long-running sessions, these large prefix caches can quickly saturate local capacity and trigger eviction.
- Cross-instance misses. To balance load, the router may not always schedule the next turn of a session on the same vLLM instance. If the session is migrated to a different instance, that instance has never seen the prefix and must recompute it from scratch.

Takeaway: we can no longer treat an inference service as a set of isolated vLLM replicas. For agentic workloads, instances need to share a distributed KV cache pool that provides both larger aggregate capacity and cross-instance cache hits.

## Distributed KV cache pool with Mooncake Store

Mooncake is an open-source, high-performance library for KV cache transfer and distributed storage. …
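To make the cross-instance story concrete, below is a minimal sketch of the lookup pattern a connector to a shared KV pool has to implement: hash the prompt into fixed-size token blocks (the same chained-hash scheme vLLM's prefix caching uses), probe the pool for the longest cached prefix, prefill only the remaining delta, and publish the new blocks so other instances can hit them. The `pool` object and its `get`/`put` methods are hypothetical stand-ins for illustration, not Mooncake's actual API.

```python
import hashlib

BLOCK_SIZE = 16  # tokens per KV block; matches vLLM's default block size


def block_hashes(token_ids: list[int]) -> list[str]:
    # Chained hashes over fixed-size blocks: each block's key folds in its
    # parent's hash, so equal keys imply equal prefixes.
    hashes, parent = [], b""
    full = len(token_ids) // BLOCK_SIZE * BLOCK_SIZE
    for start in range(0, full, BLOCK_SIZE):
        block = token_ids[start:start + BLOCK_SIZE]
        digest = hashlib.sha256(parent + repr(block).encode()).hexdigest()
        hashes.append(digest)
        parent = digest.encode()
    return hashes


def prefill_with_pool(token_ids, pool, run_prefill):
    """Reuse whatever prefix the shared pool already holds; compute and
    publish only the delta. pool.get/pool.put are hypothetical APIs."""
    hashes = block_hashes(token_ids)
    cached = []
    for h in hashes:                 # longest contiguous prefix hit
        kv = pool.get(h)
        if kv is None:
            break
        cached.append(kv)
    cached_tokens = len(cached) * BLOCK_SIZE
    # Real prefill runs only over tokens past the cached prefix, attending
    # to the fetched KV blocks; it returns KV for the newly computed blocks.
    new_blocks = run_prefill(token_ids, cached_tokens, cached)
    for h, kv in zip(hashes[len(cached):], new_blocks):
        pool.put(h, kv)              # make the delta visible cluster-wide
    return cached_tokens
```

At the 94.2% hit rate observed in the traces above, nearly all of a turn's prefill reduces to pool fetches rather than compute, which is why the speed of the underlying transfer engine matters so much.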
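Sizing such a pool starts from the per-token KV footprint. The helper below is a back-of-the-envelope sketch for a generic grouped-query-attention layout; the dimensions are illustrative placeholders, not Kimi-2.5's published configuration (the ~3.8 GB per 100K-token context cited above works out to roughly 38 KB of FP8 KV per token).

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, dtype_bytes: int) -> int:
    # Each layer caches K and V: num_kv_heads * head_dim values apiece.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


# Illustrative GQA config (placeholder values, not any real model card):
per_token = kv_bytes_per_token(num_layers=40, num_kv_heads=4,
                               head_dim=128, dtype_bytes=1)  # FP8
context = 100_000
print(f"{per_token / 1024:.0f} KiB/token, "
      f"{per_token * context / 1e9:.1f} GB for a {context:,}-token session")
# -> 40 KiB/token, 4.1 GB for a 100,000-token session: the same order of
#    magnitude as the ~3.8 GB figure above for Kimi-2.5 FP8 KV caches.
```

Multiply that by hundreds of concurrent long-running sessions and it is clear why a single instance's DRAM cannot hold the working set, while an aggregated pool across instances can.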