Notes on DeepSeek-V4's training system
4/24/2026
DeepSeek-V4 is interesting less for any single benchmark number than for the shape of the system around it. The paper shows architecture, routing, reward modeling, reasoning modes, distillation, and agent execution all becoming part of the training loop. The useful takeaway for training infrastructure is obvious: fixed recipes are not enough. Researchers increasingly need programmable loops, while the platform handles distributed execution, inference integration, checkpointing, and scaling underneath. Supporting that flexibility is the core design principle behind the Fireworks Training API.

DeepSeek-V4 alternates between Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA). CSA compresses KV entries and then does sparse top-k selection over them. HCA compresses more aggressively but keeps dense attention over the compressed memory. The point is not just "longer context." It is model/runtime co-design: attention pattern, KV layout, precision, sparse selection, and inference kernels all have to line up. Training has the same problem. If serving uses custom kernels and compressed caches, evaluation during training needs to be close enough to serving that we are not optimizing against the wrong system.

Training platform design note: This is where training shapes, checkpoint promotion, and weight sync into deployments become relevant. Architecture-specific work is not just a loss function; the platform has to launch the right trainer, save usable checkpoints, and evaluate the same model/runtime combination that will serve users.

The most interesting pretraining trick is Anticipatory Routing. DeepSeek reports that loss spikes were tied to MoE outliers and routing. Their fix decouples features from routes: at step t, features are computed with the current weights, but routing indices come from older weights theta_{t-delta}. To avoid running the model twice, they prefetch a future batch and compute its routing decisions early; by the time those cached routes are consumed, the router that produced them is the older theta_{t-delta}. They report about 20% overhead while this mode is active, and only turn it on after a spike detector triggers a rollback.

This is not a clean new objective. It is a conditional runtime intervention: detect instability, roll back, change routing behavior, cache side-channel data, then return to normal training.

Training platform design note: Fireworks has adjacent primitives in its rollout/training stack: rollout sampling can return per-token logprobs, MoE rollout paths can carry routing metadata such as `routing_matrices`, and training datums can carry model inputs plus side-channel fields. That is not DeepSeek's full historical-router system, but it points in the same direction: routing decisions sometimes need to become data in the training loop.

DeepSeek-V4 exposes three modes from the same weights: Non-think, Think High, and Think Max. These are trained with different RL configurations, length penalties, context windows, and response formats. Think Max also gets an explicit system instruction pushing exhaustive reasoning. This makes "reasoning effort" less mysterious. It is not just a runtime flag; it is a behavior contract backed by data, reward design, formatting, and evaluation.

Training platform design note: A programmable loop can treat modes as training conditions: vary prompt format, response template, sampling budget, reward shaping, loss weights, and…
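A few sketches to make the above concrete. First, the CSA/HCA contrast as a minimal single-head sketch. The mean-pooling compression, the scoring rule, and the ratios are my assumptions for illustration, not details from the paper, whose actual kernels are co-designed with the serving stack:

```python
# Minimal single-head sketch of the two attention modes, under assumed
# mechanics: compression is mean-pooling over consecutive KV entries, and
# the ratios/top-k values are invented.
import torch
import torch.nn.functional as F

def compress_kv(k, v, ratio):
    # Pool each run of `ratio` consecutive KV entries into one compressed entry.
    t = (k.shape[0] // ratio) * ratio
    k_c = k[:t].view(-1, ratio, k.shape[-1]).mean(dim=1)
    v_c = v[:t].view(-1, ratio, v.shape[-1]).mean(dim=1)
    return k_c, v_c

def csa(q, k, v, ratio=4, top_k=8):
    # CSA: compress KV, then attend sparsely to the top-k compressed entries.
    k_c, v_c = compress_kv(k, v, ratio)
    scores = q @ k_c.T / k_c.shape[-1] ** 0.5
    top = scores.topk(min(top_k, scores.shape[-1]), dim=-1)
    sparse = torch.full_like(scores, float("-inf"))
    sparse.scatter_(-1, top.indices, top.values)   # mask everything but top-k
    return F.softmax(sparse, dim=-1) @ v_c

def hca(q, k, v, ratio=16):
    # HCA: compress more aggressively, keep dense attention over the result.
    k_c, v_c = compress_kv(k, v, ratio)
    scores = q @ k_c.T / k_c.shape[-1] ** 0.5
    return F.softmax(scores, dim=-1) @ v_c

q = torch.randn(2, 64); k = torch.randn(4096, 64); v = torch.randn(4096, 64)
print(csa(q, k, v).shape, hca(q, k, v).shape)  # torch.Size([2, 64]) twice
```

The shared compression step is the part that has to agree between training-time evaluation and serving; the two modes differ only in what they do after compression, which is exactly the co-design point above.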
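Second, the Anticipatory Routing description maps onto a small control loop. In this sketch every interface (`forward_with_fixed_routes`, `route_only`, the spike detector, the checkpointer) is hypothetical and delta is fixed at one step; only the mechanism itself, routes from an older router, prefetched so nothing runs twice, enabled only after rollback, comes from the paper:

```python
# Control-flow sketch of the conditional intervention, not DeepSeek's code.
class AnticipatoryRoutingLoop:
    def __init__(self, model, spike_detector, checkpointer):
        self.model = model
        self.spike_detector = spike_detector
        self.checkpointer = checkpointer
        self.cached_routes = None   # routes decided early, consumed next step
        self.anticipatory = False   # normal training until a spike triggers it

    def step(self, batch, next_batch):
        if self.anticipatory and self.cached_routes is not None:
            # Features use the current weights; the routing indices were
            # computed a step earlier, so they come from theta_{t-delta}.
            loss = self.model.forward_with_fixed_routes(batch, self.cached_routes)
        else:
            loss = self.model.forward(batch)

        if self.anticipatory:
            # Prefetch: decide next_batch's routes now and cache them, so the
            # model never runs twice on the same batch (the paper reports
            # roughly 20% overhead while this mode is active).
            self.cached_routes = self.model.route_only(next_batch)

        if self.spike_detector.triggered(loss):
            # Detect instability, roll back, then switch routing behavior.
            self.checkpointer.restore_last_good(self.model)
            self.cached_routes = None
            self.anticipatory = True
        return loss
```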
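Third, what "routing decisions become data" could look like at the datum level. This is the pattern, assuming a dict-shaped record; it is not the actual Fireworks datum schema, and every field name and value here is a placeholder:

```python
# Schematic datum: model inputs plus side-channel fields traveling together
# through the training loop. Not the real Fireworks schema.
datum = {
    "messages": [{"role": "user", "content": "..."}],   # model inputs
    "completion": "...",                                # sampled rollout text
    "logprobs": [-0.12, -1.40, -0.03],                  # per-token, from rollout sampling
    "routing_matrices": [[0, 3], [1, 7], [0, 5]],       # per-token MoE expert choices
    "extras": {"router_version": "step-12340"},         # hypothetical side-channel field
}
```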
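Finally, a sketch of modes as training conditions: one table that a programmable loop reads to pick formats, budgets, and reward shaping per mode. Every concrete value below is invented for illustration; only the knobs themselves (length penalty, context window, response format, and Think Max's explicit exhaustive-reasoning instruction) come from the paper:

```python
# Hypothetical mode table; the knobs mirror what the paper varies per mode.
MODES = {
    "non_think":  {"context_window": 32_768,  "length_penalty": 1.0,
                   "response_format": "answer_only",       "system_suffix": None},
    "think_high": {"context_window": 131_072, "length_penalty": 0.5,
                   "response_format": "think_then_answer", "system_suffix": None},
    "think_max":  {"context_window": 262_144, "length_penalty": 0.0,
                   "response_format": "think_then_answer",
                   # Think Max adds an explicit instruction pushing exhaustive reasoning.
                   "system_suffix": "Reason exhaustively before answering."},
}

def rl_config(mode: str) -> dict:
    """Turn a mode name into the pieces an RL loop varies per condition."""
    m = MODES[mode]
    return {
        "sampling": {"max_tokens": m["context_window"] // 4},  # sampling budget
        "reward":   {"length_penalty": m["length_penalty"]},   # reward shaping
        "prompt":   {"format": m["response_format"],           # behavior contract
                     "system_suffix": m["system_suffix"]},
    }

print(rl_config("think_max")["sampling"])   # {'max_tokens': 65536}
```

The design point is that the mode is just another key in the loop's configuration, so adding a fourth mode is a data change, not a new recipe.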
