$ timeahead_
★ TOP STORY · vLLM Blog · Tutorial · 3d ago

DeepSeek V4 in vLLM: Efficient Long-context Attention · Apr 24, 2026 · 17 min read
A first-principles walkthrough of DeepSeek V4's long-context attention, and how we implemented it in vLLM.

We are excited to announce that vLLM now supports the DeepSeek V4 family of models (deepseek-ai/DeepSeek-V4-Pro and deepseek-ai/DeepSeek-V4-Flash). These models feature an efficient long-context attention mechanism, purpose-built for tasks involving up to one million tokens. While the new attention design may appear intricate on first reading, its underlying principles are straightforward once examined systematically. This blog post is organized into three sections:
- Quickstart guide for serving DeepSeek V4 on vLLM
- First-principles explanation of DeepSeek V4's new architectural design
- Overview of our implementation approach and optimization challenges for this model on vLLM: hybrid KV cache, kernel fusion, and disaggregated serving
This represents our initial release of model support, and further optimizations are actively underway. We hope the technical explanation that follows can help the open-source community understand both the attention…
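The quickstart the post promises amounts to pointing vLLM at one of the new checkpoints. A minimal sketch using the offline Python API (model IDs are from the announcement; the long-context and parallelism settings below are assumptions, not taken from the post):

    # Minimal serving sketch for DeepSeek V4 on vLLM (settings are illustrative).
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="deepseek-ai/DeepSeek-V4-Flash",
        max_model_len=1_000_000,   # the family targets contexts up to one million tokens
        tensor_parallel_size=8,    # adjust to your hardware
    )
    params = SamplingParams(temperature=0.6, max_tokens=512)
    out = llm.generate(["Summarize the repository below: ..."], params)
    print(out[0].outputs[0].text)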
vLLM Blog · 25 articles
5d ago · Tutorial · #inference · #fp8 · #kv_cache
The State of FP8 KV-Cache and Attention Quantization in vLLM · Apr 22, 2026 · 21 min read

Introduction. Long-context LLM serving is increasingly memory-bound: for standard full-attention decoders, the KV cache often dominates GPU memory at 128k+ contexts, and each decode step must read a large fraction of that cache. Halving KV-cache storage with FP8 can therefore translate into substantially higher concurrency or longer supported contexts at the same hardware cost, provided accuracy holds up. vLLM's --kv-cache-dtype fp8 flag quantizes the KV-cache and runs the entire attention computation (the QK and ScoreV matrix multiplications) in FP8 (e4m3 is the format used throughout this post). This feature has been available in vLLM for some time, but how does it perform under stress tests across both prefill-heavy and decode-heavy workloads? We conducted a comprehensive validation across decoder-only and MoE models, and across Hopper and Blackwell architectures. We identified and…
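For reference, the CLI flag named in the excerpt has a direct equivalent in the offline Python API. A minimal sketch (the model ID is illustrative, not from the post):

    # Enable FP8 (e4m3) KV-cache quantization; equivalent to --kv-cache-dtype fp8.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="meta-llama/Llama-3.1-8B-Instruct",  # illustrative model choice
        kv_cache_dtype="fp8",      # store K/V in FP8, roughly halving cache memory
        max_model_len=131072,      # the long-context regime where the KV cache dominates
    )
    out = llm.generate(["Explain paged attention in two sentences."],
                       SamplingParams(max_tokens=128))
    print(out[0].outputs[0].text)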
13d ago · Infra · #inference
vLLM Korea Meetup 2026 Wrap-Up · Apr 14, 2026 · 7 min read

Hosted by the vLLM KR Community, with support from Rebellions, SqueezeBits, Red Hat APAC, and PyTorch Korea, the vLLM Korea Meetup 2026 was held in Seoul on April 2nd. This meetup proved to be much more than a standard tech event. Not only did it see strong turnout on the day, but the post-event survey recorded an impressive ~75% response rate — a testament to the active engagement of the attendees. Results reflected high overall satisfaction, confirming that the meetup delivered both in-depth practical content and a genuine community experience. Field engineers from a wide range of companies and research institutions gathered to share real-world deployment stories and infrastructure strategies for running LLMs in production. As AI moves beyond the research phase and into full-scale services, handling inference workloads efficiently has become a central challenge.…
20d ago · Tutorial · #inference · #disaggregation
Next-Level Inference: Why Your Single-Node vLLM Setup Needs Prefill-Decode Disaggregation · Apr 7, 2026 · 22 min read

TL;DR: Prefill and decode fight over the same GPUs, causing ITL spikes under load. We show how to disaggregate them on a single 8-GPU MI300X node using AMD's MORI-IO connector — achieving 2.5x higher goodput compared to standard collocated serving on the same 8 GPUs, with stable token generation. Benchmark uses Qwen3-235B-A22B-FP8 at 8 req/s with 2000-token prompts and 1000-token outputs — see Table 3 and Experimental Details for full configuration.

Introduction. In our previous exploration of MoE optimization [1], we walked through distributing a massive model across an 8-GPU AMD Instinct MI300X node using Tensor, Pipeline, Data, and Expert Parallelism. In this blog, we show how Prefill-Decode disaggregation — enabled by AMD's MORI-IO — addresses this bottleneck, delivering higher goodput and more predictable performance without requiring a multi-node cluster.…
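Disaggregation in vLLM is driven by a KV-transfer configuration that assigns each engine instance a producer (prefill) or consumer (decode) role and a connector for moving KV blocks between them. A rough sketch of one such instance, with the connector name left hypothetical since the post's MORI-IO integration details are not reproduced here:

    # Sketch of one disaggregated instance. "MoRIIOConnector" is a hypothetical
    # name standing in for AMD's MORI-IO connector; the decode process would use
    # kv_role="kv_consumer" and the remaining GPUs of the node.
    from vllm import LLM
    from vllm.config import KVTransferConfig

    prefill_llm = LLM(
        model="Qwen/Qwen3-235B-A22B-FP8",
        tensor_parallel_size=4,                  # e.g. half of the 8-GPU MI300X node
        kv_transfer_config=KVTransferConfig(
            kv_connector="MoRIIOConnector",      # assumption; see the post for the real name
            kv_role="kv_producer",               # this instance only runs prefill
        ),
    )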
25d ago · Hardware · #inference
Announcing Gemma 4 on vLLM: Byte for byte, the most capable open models · Apr 2, 2026 · 3 min read

Elevating Open Models with Advanced Reasoning and Multimodal Capabilities. With the debut of Gemma 4, vLLM introduces immediate support for Google's most sophisticated open model lineup, spanning multiple hardware backends, with first-ever Day 0 support on Google TPUs, AMD GPUs, and Intel XPUs. Purpose-built for advanced reasoning and agentic workflows, Gemma 4 delivers an unprecedented level of intelligence-per-parameter, now accessible to the vLLM community under a commercially permissive Apache 2.0 license. Built from the same world-class research and technology as Gemini 3, the Gemma 4 family includes four versatile sizes designed for diverse hardware environments: Effective 2B (E2B), Effective 4B (E4B), 26B Mixture of Experts (MoE), and 31B Dense. (Figure: open model performance vs. size on Arena.ai's chat arena as of 2/1; additional benchmarks in the model card.) Powerful,…
28d ago · Infra · #inference
Extracting hidden states from vLLM · Mar 30, 2026 · 8 min read

PR #33736 (included in vllm>=v0.18.0) introduced a new hidden states extraction system to vLLM. This blog post explores the motivation, design, usage, and future direction of this feature, and its usage in vLLM's Speculators (a library for creating and training speculative decoding models).

Motivation. Hidden states are the model's internal intermediate representations of the token sequence. They provide insight into the model's internal state and are used heavily in speculative decoding.

Speculative Decoding Recap. Speculative decoding typically combines a "verifier" model—the large LLM you are trying to serve—with a small "draft" model. The draft model produces draft tokens that the verifier model then verifies in parallel. This can significantly speed up decoding (up to 2-5x depending on methodology), particularly in lower batch size scenarios, where model performance is memory-bound. Researchers have found that providing…
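To make the recap concrete, here is a toy, framework-free sketch of the draft-then-verify loop described above (greedy acceptance only; this is illustrative and is not the vLLM or EAGLE implementation):

    # Toy draft-then-verify step. `drafter(ctx)` and `verifier(ctx)` each return the
    # next token greedily; the verifier list comprehension stands in for a single
    # batched forward pass over all k draft positions.
    def speculative_step(verifier, drafter, prefix, k=4):
        # 1) the cheap drafter proposes k tokens autoregressively
        draft, ctx = [], list(prefix)
        for _ in range(k):
            t = drafter(ctx)
            draft.append(t)
            ctx.append(t)
        # 2) the verifier checks every draft position "in parallel"
        verified = [verifier(list(prefix) + draft[:i]) for i in range(k)]
        # 3) keep drafted tokens up to the first disagreement, then fall back
        out = list(prefix)
        for d, v in zip(draft, verified):
            if d == v:
                out.append(d)
            else:
                out.append(v)   # first mismatch: use the verifier's token instead
                break
        return out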
34d ago · Infra · #inference · #engineering
Model Runner V2: A Modular and Faster Core for vLLM · Mar 24, 2026 · 8 min read

We are excited to announce Model Runner V2 (MRV2), a ground-up re-implementation of the vLLM model runner. MRV2 delivers a cleaner, more modular, and more efficient execution core—with no API changes. The goal is simple: better code and better performance. Like the vLLM V1 release last year, this is an architectural upgrade driven by hard-earned lessons from vLLM's large user base and feedback from the community. We revisited persistent batching, async scheduling, input preparation, and sampling, then rebuilt the model runner around three core principles:
- Be modular. Isolate model-specific logic from the common execution path.
- Be GPU-native. Move bookkeeping off the CPU and onto the GPU.
- Be async-first. Treat overlapped CPU/GPU execution as a design constraint, not a retrofit.
MRV2 is not yet feature-complete, but you can…
45d ago
P-EAGLE: Faster LLM inference with Parallel Speculative Decoding in vLLM · Mar 13, 2026 · 12 min read

EAGLE is the state-of-the-art method for speculative decoding in large language model (LLM) inference, but its autoregressive drafting creates a hidden bottleneck: the more tokens you speculate, the more sequential forward passes the drafter needs. Eventually that overhead eats into your gains. P-EAGLE removes this ceiling by generating all K draft tokens in a single forward pass, delivering up to 1.69x speedup over vanilla EAGLE-3 on real workloads on NVIDIA B200. You can unlock this performance gain by downloading (or training) a parallel-capable drafter head and adding "parallel_drafting": true to your vLLM serving configuration. Pre-trained P-EAGLE heads are already available on HuggingFace for GPT-OSS 120B, GPT-OSS 20B, and Qwen3-Coder 30B, so you can start today! In this post, we explain how P-EAGLE works, how we integrated it into vLLM…
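The one-line recipe from the excerpt maps onto vLLM's speculative-decoding configuration roughly as follows. A sketch under the assumption that "parallel_drafting" is accepted as a key of the speculative config; the drafter repo ID shown is a placeholder:

    # Sketch: enabling a parallel-capable EAGLE drafter head. The drafter repo ID is
    # hypothetical, and "parallel_drafting" is taken from the excerpt; the exact
    # schema may differ in the actual release.
    from vllm import LLM

    llm = LLM(
        model="openai/gpt-oss-20b",
        speculative_config={
            "method": "eagle3",
            "model": "your-org/p-eagle-gpt-oss-20b",   # hypothetical drafter head
            "num_speculative_tokens": 4,
            "parallel_drafting": True,                 # the flag highlighted in the post
        },
    )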
47d ago
Run Highly Efficient and Accurate Multi-Agent AI with NVIDIA Nemotron 3 Super Using vLLM · Mar 11, 2026 · 5 min read

We are excited to support the newly released NVIDIA Nemotron 3 Super model on vLLM. Nemotron 3 Super, part of the Nemotron 3 family of open models, is optimized for complex multi-agent applications. Agentic AI systems today rely on multiple models to plan, reason, and execute complex, multi-step tasks. These models must possess both the necessary depth for solving intricate technical challenges and the efficiency required for continuous operation at scale. Nemotron 3 Super is an open, hybrid Mixture-of-Experts (MoE) model featuring 120 billion parameters, yet it activates only 12 billion at inference. This design achieves high compute efficiency and leading accuracy, particularly for complex multi-agent applications. It addresses two major challenges in large-scale agent systems:
- The "Context Explosion" Problem: Multi-agent systems often generate excessive…
48d ago
vLLM Semantic Router v0.2 Athena: ClawOS, Model Refresh, and the System Brain · Mar 10, 2026 · 23 min read

Since v0.1 Iris, vLLM Semantic Router has made a large jump. In one release cycle, the project rebuilt its model stack, expanded routing into safety, semantic caching, memory, retrieval, and long-context signal handling, and started pushing toward a broader ambition: turning semantic routing into the system brain for mixture-of-models and multi-agent deployments. Athena is where that shift becomes visible. v0.2 ships a complete model refresh and a much stronger routing runtime, but one of its boldest new bets is ClawOS: an experimental operating layer where Semantic Router can orchestrate multiple OpenClaw systems through routing, memory, safety, and chat-driven team management. If Iris established the bridge between users and models, Athena starts turning that bridge into an operating surface for model teams. Why Athena? In Greek mythology, Athena represents…
54d ago · Tutorial · #inference · #triton
vLLM Triton Attention Backend Deep Dive · Mar 4, 2026 · 10 min read

This article is adapted from a Red Hat hosted vLLM Office Hours session with Burkhard Ringlein from IBM Research, featuring a deep technical walkthrough of the vLLM Triton attention backend.…
234d ago · Featured · Infra · #inference
Inside vLLM: Anatomy of a High-Throughput LLM Inference System · Sep 5, 2025 · 41 min read

Note: Originally posted on Aleksa Gordic's website. From paged attention, continuous batching, prefix caching, specdec, etc. to multi-GPU, multi-node dynamic serving at scale. In this post, I'll gradually introduce all of the core system components and advanced features that make up a modern high-throughput LLM inference system. In particular I'll be doing a breakdown of how vLLM [1] works. This post is the first in a series. It starts broad and then layers in detail (following an inverse-pyramid approach) so you can form an accurate high-level mental model of the complete system without drowning in minutiae. Later posts will dive into specific subsystems. This post is structured into five parts:
- LLM engine & engine core: fundamentals of vLLM (scheduling, paged attention, continuous batching, etc.)
- Advanced features: chunked prefill, prefix…
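As a pocket illustration of the paged-attention idea that post opens with: the KV cache is carved into fixed-size physical blocks, and each sequence keeps a block table mapping its logical token positions onto whatever blocks it has been allocated. A toy sketch of that bookkeeping (not vLLM code):

    # Toy block-table bookkeeping in the spirit of paged attention.
    BLOCK_SIZE = 16

    class PagedKVCache:
        def __init__(self, num_blocks):
            self.free_blocks = list(range(num_blocks))  # pool of physical blocks
            self.block_table = {}                       # seq_id -> list of physical block ids
            self.seq_len = {}                           # seq_id -> tokens stored so far

        def append_token(self, seq_id):
            """Reserve a slot for one more token, allocating a new physical block
            only when the sequence's current block is full."""
            n = self.seq_len.get(seq_id, 0)
            if n % BLOCK_SIZE == 0:
                self.block_table.setdefault(seq_id, []).append(self.free_blocks.pop())
            self.seq_len[seq_id] = n + 1
            block = self.block_table[seq_id][n // BLOCK_SIZE]
            return block, n % BLOCK_SIZE                # where this token's K/V would land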
387d ago
# multimodal ( 9 )

Streaming Requests & Realtime API in vLLM
Large language model inference has traditionally operated on a simple premise: the user submits a complete prompt (request), the model processes it, and returns a response (either streaming or at... (a minimal sketch of that traditional flow follows this list)

Also under this tag (9 posts):
- We are thrilled to announce a major performance update for vLLM-Omni. Modern Large Multimodal Models (LMMs) introduce a unique serving-time bottleneck: before any text generation can begin, all images must be processed by a visual encoder (e.g., ViT). This encoder...
- We are excited to announce the official release of vLLM-Omni, a major extension of the vLLM ecosystem designed to support the next generation of AI: omni-modality models.
- Introducing Shared Memory IPC…
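That traditional one-shot flow is what the Streaming Requests & Realtime API post contrasts against. A minimal sketch of it against a vLLM OpenAI-compatible server (the server URL and model name are placeholders):

    # Complete prompt in, streamed tokens out, via vLLM's OpenAI-compatible API.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
    stream = client.chat.completions.create(
        model="Qwen/Qwen2.5-7B-Instruct",   # placeholder; whatever the server is serving
        messages=[{"role": "user", "content": "Describe the image pipeline briefly."}],
        stream=True,
    )
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            print(delta, end="", flush=True)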
461d ago · Tutorial · #inference
# large-scale-serving ( 10 )

Driving vLLM WideEP and Large-Scale Serving Toward Maturity on Blackwell (Part I)
Building on our previous work achieving 2.2k tok/s/H200 decode throughput with wide-EP, the vLLM team has continued performance optimization efforts targeting NVIDIA's GB200 platform. This blog...
472d ago · Tutorial · #inference · #coding
# developer ( 5 )

Tracing Hanging and Complicated GPU Kernels Down To The Source Code
Several months ago, we published a blog post about CUDA Core Dump: An Effective Tool to Debug Memory Access Issues and Beyond, introducing a powerful technique for debugging illegal memory access...
551d ago · Tutorial · #inference
# hardware ( 13 )

Beyond Porting: How vLLM Orchestrates High-Performance Inference on AMD ROCm
For a long time, enabling AMD support meant "porting"; i.e. just making code run. That era is over.

Also under this tag (13 posts):
- DeepSeek-V3.2 (NVFP4 + TP2) has been successfully and smoothly run on GB300 (SM103 - Blackwell Ultra). Leveraging FP4 quantization, it achieves a single-GPU throughput of 7360 TGS (tokens / GPU /...
- Building on our previous work achieving 2.2k tok/s/H200 decode throughput with wide-EP, the vLLM team has continued performance optimization efforts targeting NVIDIA's GB200 platform. This blog...
- TL;DR: In collaboration with the open-source community, vLLM + NVIDIA has achieved significant performance milestones on the gpt-oss-120b model running on NVIDIA's Blackwell GPUs. Through deep...
- We are working on building the System Level Intelligence for…
557d ago · Tutorial · #inference · #coding
# speculative-decoding ( 5 )

Also under this tag (5 posts, alongside the hidden-states, P-EAGLE, and Inside vLLM posts above):
- Speculative decoding serves as an optimization to improve inference performance; however, training a unique draft model for each LLM can be difficult and time-consuming, while production-ready...
641d ago · Tutorial · #inference
# community ( 5 )

vLLM Korea Meetup 2026 Wrap-Up (covered above), plus:
- For a long time, vllm.ai simply redirected to the vLLM GitHub page. Thanks to our community, we now have a brand-new vllm.ai website, drawing inspiration from the PyTorch website.
- The first vLLM meetup in Korea was held on August 19, 2025, in Seoul, hosted by Rebellions and Red Hat with support from PyTorch Korea User Group and SqueezeBits.
- The vLLM community achieved remarkable growth in 2024, evolving from a specialized inference engine to become the…
643d ago · Tutorial · #inference
# model-support ( 16 )

DeepSeek V4 in vLLM: Efficient Long-context Attention (covered above), plus the Gemma 4 and Nemotron 3 Super announcements and:
- Jan 28th Update: NVIDIA just released their Nemotron 3 Nano model in NVFP4 precision. This model is supported by vLLM out of the box and it uses a new method called Quantization-Aware Distillation...
- We are excited to release NVIDIA Nemotron Nano 2 VL, supported by vLLM. This open vision language model (VLM) is built for video understanding…