# DeepSeek V4 in vLLM: Efficient Long-context Attention

*Apr 24, 2026 · 17 min read*

A first-principles walkthrough of DeepSeek V4's long-context attention, and how we implemented it in vLLM.
We are excited to announce that vLLM now supports the DeepSeek V4 family of models (deepseek-ai/DeepSeek-V4-Pro and deepseek-ai/DeepSeek-V4-Flash). These models feature an efficient long-context attention mechanism, purpose-built for tasks involving up to one million tokens. While the new attention design may appear intricate on first reading, its underlying principles are straightforward once examined systematically.

This blog post is organized into three sections:

- Quickstart guide for serving DeepSeek V4 on vLLM (see the short sketch below)
- First-principles explanation of DeepSeek V4's new architectural design
- Overview of our implementation approach and optimization challenges for this model in vLLM: hybrid KV cache, kernel fusion, and disaggregated serving

This is our initial release of model support, and further optimizations are actively underway. We hope the technical explanation that follows can help the open-source community understand both the attention…
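Before the full quickstart, here is a minimal sketch of offline inference with the standard vLLM Python API. The model names come from the announcement above; the `tensor_parallel_size` and `max_model_len` values are illustrative placeholders, not the recommended settings for DeepSeek V4.

```python
from vllm import LLM, SamplingParams

# Illustrative configuration only: adjust parallelism and context length
# to your hardware and workload. DeepSeek-V4-Flash is used here as the
# smaller of the two announced checkpoints.
llm = LLM(
    model="deepseek-ai/DeepSeek-V4-Flash",
    tensor_parallel_size=8,        # placeholder value, assumption
    max_model_len=1_000_000,       # long-context window, up to one million tokens
)

params = SamplingParams(temperature=0.6, max_tokens=256)

# Generate a completion for a single prompt.
outputs = llm.generate(["Summarize the following document: ..."], params)
print(outputs[0].outputs[0].text)
```

The same model can also be exposed as an OpenAI-compatible endpoint via vLLM's serving entrypoint; the quickstart section below covers the server-side flags in detail.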