Build with DeepSeek V4 Using NVIDIA Blackwell and GPU-Accelerated Endpoints
DeepSeek just launched its fourth generation of flagship models, DeepSeek-V4-Pro and DeepSeek-V4-Flash, both targeted at highly efficient million-token context inference. DeepSeek-V4-Pro is the largest model in the family, with 1.6T total parameters and 49B active parameters. DeepSeek-V4-Flash is a smaller 284B-parameter model with 13B active parameters, designed for higher-speed, higher-efficiency workloads. Both models support up to a 1M-token context window, opening new possibilities for long-context coding, document analysis, retrieval, and agentic AI workflows.

Architectural innovations for long-context inference

The V4 family builds on the DeepSeek MoE architecture, with an increased focus on optimizing the attention component of the transformer. These innovations are designed to achieve a 73% reduction in per-token inference FLOPs and a 90% reduction in KV cache memory footprint compared with DeepSeek-V3.2.

That matters because long context is becoming a core requirement for agentic applications. Agents store more than a single prompt and response: they carry system instructions, tool outputs, retrieved context, code, logs, memory, and multi-step reasoning traces across a workflow. As context windows grow, attention and the KV cache become major bottlenecks.

The core architectural solution to this challenge is hybrid attention, which blends together:

- Compressed Sparse Attention (CSA): Uses dynamic sequence compression to shrink KV entries, reducing the KV cache memory footprint, and then applies DeepSeek Sparse Attention (DSA) to sparsify the attention matrices and reduce computational overhead.
- Heavily Compressed Attention (HCA): Applies far more aggressive compression by consolidating KV entries across sets of tokens into a single compressed entry, yielding a significant reduction in KV cache size.
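To see why KV cache compression matters at this scale, here is a back-of-envelope sketch of KV cache sizing at a 1M-token context. The layer count, KV head count, head dimension, and dtype below are illustrative assumptions, not published DeepSeek-V4 specifications; only the 90% reduction figure comes from the announcement.

```python
# Back-of-envelope KV cache sizing. All model dimensions here are
# illustrative assumptions; only the 90% reduction is from the source.

def kv_cache_bytes(tokens, layers=60, kv_heads=8, head_dim=128, dtype_bytes=2):
    """Estimate KV cache size: per token, each layer stores one key and
    one value vector per KV head (2 * kv_heads * head_dim elements)."""
    return tokens * layers * 2 * kv_heads * head_dim * dtype_bytes

CONTEXT = 1_000_000  # 1M-token context window
baseline = kv_cache_bytes(CONTEXT)
compressed = baseline * (1 - 0.90)  # claimed 90% KV cache reduction

print(f"baseline:   {baseline / 2**30:.1f} GiB")
print(f"compressed: {compressed / 2**30:.1f} GiB")
```

Even for these modest assumed dimensions, an uncompressed 1M-token KV cache runs to hundreds of GiB, which is why a 90% reduction changes what fits on a single node.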
DeepSeek-V4’s architectural innovations signal a shift from basic chat toward multi-turn, long-context inference and agentic systems. This new paradigm stresses the entire stack, from software and memory to compute and networking, fundamentally altering the economics of inference. As open models reach the frontier of intelligence, the enterprise focus is pivoting from model selection to infrastructure strategy. In this landscape, the ultimate competitive advantage is the ability to deploy and scale these high-performance models at the lowest token cost.

Out-of-the-box NVIDIA Blackwell performance insights

Whether developers are deploying the 1.6T Pro model for advanced reasoning or the 284B Flash model for high-speed efficiency, Blackwell provides the scale and low-latency performance required for a new era of 1M long-context inference and trillion-parameter intelligence. The NVIDIA Blackwell platform is built for this class of workload. Out-of-the-box tests of DeepSeek-V4-Pro on NVIDIA GB200 NVL72 demonstrate over 150 tokens/sec/user. In addition to these initial tests, the NVIDIA team leveraged vLLM’s Day 0 NVIDIA Blackwell B300 recipe to produce a snapshot of out-of-the-box performance across the Pareto frontier (Figure 2). Expect this performance to climb even higher as we optimize our entire extreme co-design stack: Dynamo, NVFP4, optimized CUDA kernels, advanced parallelization techniques, and beyond.

Build with NVIDIA GPU-accelerated endpoints

Developers can start building with DeepSeek V4 through NVIDIA GPU-accelerated endpoints on build.nvidia.com as part of the NVIDIA Developer Program. Hosted endpoints provide a fast way to prototype with the latest models before moving to self-hosted…
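The hosted endpoints on build.nvidia.com expose an OpenAI-compatible chat completions API. Below is a minimal standard-library sketch of calling one; the model identifier `deepseek-ai/deepseek-v4-flash` is a hypothetical placeholder (check the model card for the exact name), and an `NVIDIA_API_KEY` environment variable is assumed to hold your API key.

```python
import json
import os
import urllib.request

# OpenAI-compatible chat completions endpoint for NVIDIA hosted models.
INVOKE_URL = "https://integrate.api.nvidia.com/v1/chat/completions"
MODEL_ID = "deepseek-ai/deepseek-v4-flash"  # assumed identifier, not confirmed

def build_request(prompt: str, api_key: str) -> urllib.request.Request:
    """Build a chat completion request for an NVIDIA hosted endpoint."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 512,
        "temperature": 0.2,
    }
    return urllib.request.Request(
        INVOKE_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
            "Accept": "application/json",
        },
    )

if __name__ == "__main__":
    req = build_request("Summarize the tradeoffs of sparse attention.",
                        os.environ["NVIDIA_API_KEY"])
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    print(body["choices"][0]["message"]["content"])
```

Because the API follows the OpenAI chat completions shape, the same request works through any OpenAI-compatible client by pointing its base URL at `https://integrate.api.nvidia.com/v1`.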

