The State of FP8 KV-Cache and Attention Quantization in vLLM

Apr 22, 2026 · 21 min read
## Introduction

Long-context LLM serving is increasingly memory-bound: for standard full-attention decoders, the KV cache often dominates GPU memory at 128k+ contexts, and each decode step must read a large fraction of that cache. Halving KV-cache storage with FP8 can therefore translate into substantially higher concurrency or longer supported contexts at the same hardware cost, provided accuracy holds up.

vLLM's `--kv-cache-dtype fp8` flag quantizes the KV cache and runs the entire attention computation (the QK and ScoreV matrix multiplications) in FP8 (e4m3 is the format used throughout this post). The feature has been available in vLLM for some time, but how does it perform under stress tests across both prefill-heavy and decode-heavy workloads?

We conducted a comprehensive validation across decoder-only and MoE models, and across the Hopper and Blackwell architectures. We identified and fixed critical accuracy and performance issues in the Flash Attention 3 (FA3) backend (see the example in Figure 1). On the paths validated in this post, FP8 preserves near-baseline accuracy while reducing decode cost and KV-cache memory usage. The main caveats are hybrid-attention models with small sliding-window layers, where skipping those layers is often better, and large-head-dimension models (`head_dim = 256`), where prefill can still regress.

For head dimensions 64 and 128, FP8 also offers speedups on both prefill and decoding; for memory-bound decoding, the per-token cost of reading the KV cache can be reduced to 54% of its BF16 counterpart in the best cases. For a large head dimension such as 256, FP8 still reduces the ITL, but default prefill performance currently remains slightly worse than with BF16.
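To make the memory argument concrete, here is a back-of-the-envelope KV-cache sizing sketch. The model shape below is illustrative (roughly an 8B-class decoder with grouped-query attention), not a configuration taken from this post's benchmarks:

```python
# Bytes of KV cache stored per token:
# 2 (K and V tensors) x layers x KV heads x head_dim x bytes per element.
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes):
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

# Illustrative shape: 32 layers, 8 KV heads, head_dim 128.
layers, kv_heads, head_dim = 32, 8, 128
bf16 = kv_cache_bytes_per_token(layers, kv_heads, head_dim, 2)  # BF16 = 2 bytes
fp8 = kv_cache_bytes_per_token(layers, kv_heads, head_dim, 1)   # FP8 e4m3 = 1 byte

ctx = 128 * 1024  # a 128k-token context
print(f"BF16: {bf16 * ctx / 2**30:.1f} GiB per 128k-token sequence")
print(f"FP8:  {fp8 * ctx / 2**30:.1f} GiB per 128k-token sequence")
```

For this shape the BF16 cache costs 16 GiB per 128k-token sequence, and FP8 storage halves that, which is where the extra concurrency at fixed GPU memory comes from. (Per-tensor FP8 scales add a negligible amount on top and are ignored here.)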
## Table of Contents

- The Problems We Found
- Kernel and vLLM Improvements
- Performance Benchmarking
- Accuracy Benchmarking
- When to Avoid FP8 KV-Cache
- Quick Start

## The Problems We Found

Although `--kv-cache-dtype fp8` has been available in vLLM for some time, our stress tests revealed two categories of issues:

- **Accuracy:** On Hopper GPUs, the FP8 Flash Attention 3 kernel suffered from accumulation precision loss at long contexts. On a 128k needle-in-a-haystack task, FP8 accuracy dropped from 91% (BF16 baseline) to just 13%, a regression traced to imprecise FP32 accumulation in the Tensor Cores (see the two-level accumulation fix below).
- **Performance:** The FP8 ITL slope for models with sliding-window attention layers (e.g., gpt-oss-20b) was nearly identical to BF16 (96% of the BF16 slope), meaning users gained almost no decoding speedup despite halving memory. The break-even point exceeded 700k tokens, well beyond most practical context lengths.

The following section describes the improvements we shipped to address these issues.

## Kernel and vLLM Improvements

During our investigations, we shipped several improvements that add flexibility to the quantization schemes, fix accuracy issues, and improve performance. We briefly describe them here.

**Two-level accumulation:** Hopper's FP8 Tensor Cores are documented as accumulating into FP32 registers, but in practice the intermediate accumulation loses precision when the contraction dimension is large, a known hardware-level issue also encountered during DeepSeek-V3 training (see Figure 7(b) in…
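The idea behind two-level accumulation can be sketched numerically on the CPU. The snippet below is an illustration only: it uses float16 as the "narrow" accumulator (rather than the Tensor Cores' degraded intra-instruction accumulation), and the chunk size of 128 is an arbitrary choice for the demo:

```python
import numpy as np

def naive_sum(values):
    """Accumulate everything in float16. Once the running sum is large
    relative to float16's precision, small addends round away to nothing."""
    acc = np.float16(0.0)
    for v in values:
        acc = np.float16(acc + v)
    return float(acc)

def two_level_sum(values, chunk=128):
    """Two-level scheme: accumulate short chunks in float16 (the narrow
    level), then promote each partial sum into a float64 accumulator
    (the wide level) so error cannot build up across the long axis."""
    total = 0.0  # wide accumulator
    for i in range(0, len(values), chunk):
        partial = np.float16(0.0)
        for v in values[i:i + chunk]:
            partial = np.float16(partial + v)
        total += float(partial)
    return total

# 81920 copies of 2**-13 (exactly representable in float16); true sum = 10.0.
vals = [np.float16(2.0 ** -13)] * 81920
print(naive_sum(vals))      # stalls far below the true sum of 10.0
print(two_level_sum(vals))  # recovers the true sum
```

The analogue in the attention kernel is to keep the per-instruction FP8 accumulation short and periodically fold partial results into full-FP32 accumulators, so precision loss no longer grows with context length.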
