$ timeahead_
vLLM Blog · Tutorial · 1d ago · ~3 min read

DeepSeek V4 in vLLM: Efficient Long-context Attention
Apr 24, 2026 · 17 min read
A first-principles walkthrough of DeepSeek V4's long-context attention, and how we implemented it in vLLM.

DeepSeek V4 in vLLM: Efficient Long-context Attention

We are excited to announce that vLLM now supports the DeepSeek V4 family of models (deepseek-ai/DeepSeek-V4-Pro and deepseek-ai/DeepSeek-V4-Flash). These models feature an efficient long-context attention mechanism, purpose-built for tasks involving up to one million tokens. While the new attention design may appear intricate on first reading, its underlying principles are straightforward once examined systematically.

This blog post is organized into three sections:
- Quickstart guide for serving DeepSeek V4 on vLLM
- First-principles explanation of DeepSeek V4's new architectural design
- Overview of our implementation approach and optimization challenges for this model on vLLM: hybrid KV cache, kernel fusion, and disaggregated serving

This represents our initial release of model support, and further optimizations are actively underway. We hope the technical explanation that follows helps the open-source community understand both the attention mechanism itself and the rationale behind our current implementation decisions.

Running DeepSeek V4 on vLLM

DeepSeek V4 comes in two variants: the 1.6T-parameter DeepSeek-V4-Pro and the 285B-parameter DeepSeek-V4-Flash. Both models support up to 1 million tokens of context, and vLLM's implementation of the new attention mechanism is designed to scale to that context length.

DeepSeek-V4-Pro

Here we highlight a single-node deployment optimized for easy testing and prototyping, with several optional optimizations such as the FP4 indexer and MTP. The following command is runnable on 8xB200 or 8xB300. For more deployment strategies, including disaggregated serving and other GPU architectures, please refer to the recipes.

DeepSeek-V4-Flash

Here we highlight a single-node deployment optimized for easy testing and prototyping, with several optional optimizations such as the FP4 indexer and MTP. The following command is runnable on 4xB200 or 4xB300.
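The deployment commands did not survive extraction from the original post. As a rough sketch only (the flags shown are generic vLLM serving options, not the official recipe; consult the recipes for the real invocations, including the FP4 indexer and MTP settings):

```shell
# Hypothetical single-node launch for DeepSeek-V4-Pro on 8xB200/B300.
# Flag values are assumptions, not the official recipe.
vllm serve deepseek-ai/DeepSeek-V4-Pro \
    --tensor-parallel-size 8 \
    --max-model-len 1000000

# Hypothetical single-node launch for DeepSeek-V4-Flash on 4xB200/B300.
vllm serve deepseek-ai/DeepSeek-V4-Flash \
    --tensor-parallel-size 4 \
    --max-model-len 1000000
```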
For more deployment strategies, including disaggregated serving and other GPU architectures, please refer to the recipes.

DeepSeek V4's Attention Mechanism Explained

Long-context inference faces two main challenges:
- KV cache memory growth: The KV cache scales linearly with context length. While DeepSeek-style models use Multi-head Latent Attention (MLA), which is substantially more memory-efficient than standard Multi-head Attention (MHA) or Multi-Query Attention (MQA), scaling to one million tokens remains difficult given the limited capacity of GPU memory.
- Attention computation cost: Computing attention over long contexts is expensive. Even with prior techniques such as DeepSeek Sparse Attention (DSA), the computation remains a significant bottleneck.

To address these challenges, the DeepSeek team designed a new attention mechanism aimed at both compressing the KV cache and reducing attention computation time:
- Share key and value vectors (2x memory savings). For correctness, an inverse RoPE operation is applied to the attention output.
- Compress the KV cache across multiple tokens (4x to 128x memory savings). In DeepSeek V4, there are two ways to do this:
  - c4a: compresses the KV cache to roughly 1/4 of its size. One compressed token is a weighted sum of 8 uncompressed tokens, with a stride of 4.
  - c128a: compresses the KV cache to roughly 1/128 of its size. One compressed token is a weighted sum of 128 uncompressed tokens, with a stride of 128.
- DeepSeek Sparse Attention…
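To make the c4a/c128a compression concrete, here is a minimal NumPy sketch of strided weighted-sum compression. It is illustrative only: in the actual model the combination weights are learned per token, whereas this sketch uses uniform weights as a stand-in, and it ignores heads, layers, and paging.

```python
import numpy as np

def compress_kv(kv: np.ndarray, window: int, stride: int) -> np.ndarray:
    """Compress a [T, d] KV sequence: each output token is a weighted
    sum of `window` consecutive input tokens, taken every `stride` tokens.
    Uniform weights here; the real model learns them."""
    T, d = kv.shape
    weights = np.ones(window) / window
    starts = range(0, T - window + 1, stride)
    return np.stack([weights @ kv[s:s + window] for s in starts])

kv = np.random.randn(1024, 64)              # 1024 tokens, head dim 64
c4 = compress_kv(kv, window=8, stride=4)    # c4a-style: ~1/4 the tokens
c128 = compress_kv(kv, window=128, stride=128)  # c128a-style: ~1/128
print(kv.shape, c4.shape, c128.shape)       # (1024, 64) (255, 64) (8, 64)
```

With stride 4 the output has roughly T/4 tokens (255 for T=1024) and with stride 128 exactly T/128 (8 for T=1024), which is where the "roughly 1/4" and "roughly 1/128" memory savings come from.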

#inference
read full article on vLLM Blog
// discussion (0)
no comments yet
// related
Simon Willison Blog · 14h
GPT-5.5 prompting guide
25th April 2026 - Link Blog GPT-5.5 prompting guide. Now that GPT-5.5 is available in the API, OpenA…
Simon Willison Blog · 1d
It's a big one
24th April 2026 This week's edition of my email newsletter (aka content from this blog delivered to …
Simon Willison Blog · 1d
Millisecond Converter
24th April 2026 LLM reports prompt durations in milliseconds and I got fed up of having to think abo…
NVIDIA Developer Blog · 1d
Build with DeepSeek V4 Using NVIDIA Blackwell and GPU-Accelerated Endpoints
DeepSeek just launched its fourth generation of flagship models with DeepSeek-V4-Pro and DeepSeek-V4…
Cohere Blog · 1d
Learn more
We’re joining forces with Aleph Alpha to provide the world with an independent, enterprise-grade sov…