4/24/2026 Notes on DeepSeek-V4's training system
DeepSeek-V4 is interesting less for any single benchmark number than for the shape of the system around it. The paper shows architecture, routing, reward modeling, reasoning modes, distillation, and agent execution all becoming part of the training loop.

The takeaway for training infrastructure is clear: fixed recipes are not enough. Researchers increasingly need programmable loops, while the platform handles distributed execution, inference integration, checkpointing, and scaling underneath. Supporting that flexibility is the core design principle behind the Fireworks Training API (a sketch of what such a loop might look like appears at the end of these notes).

DeepSeek-V4 alternates between Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA). CSA compresses KV entries and then does sparse top-k selection over them. HCA compresses more aggressively but keeps dense attention over the compressed memory. The point is not just "longer context." It is model/runtime co-design: attention pattern, KV layout, precision, sparse selection, and inference kernels all designed together.
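To make the two modes concrete, here is a minimal PyTorch sketch. The block mean-pooling compression, the top-k routing, and every parameter value are assumptions for illustration only; the paper's actual compression scheme, causal masking, and fused kernels are more involved.

```python
import torch
import torch.nn.functional as F

def compress_kv(k, v, block):
    """Compress KV by mean-pooling contiguous blocks (an assumed scheme)."""
    nb = k.shape[0] // block
    k_c = k[: nb * block].reshape(nb, block, -1).mean(dim=1)
    v_c = v[: nb * block].reshape(nb, block, -1).mean(dim=1)
    return k_c, v_c

def csa(q, k, v, block=16, top_k=4):
    """Compressed Sparse Attention: route each query to its top-k compressed
    blocks, then attend densely over just the tokens in those blocks."""
    d = q.shape[-1]
    k_c, _ = compress_kv(k, v, block)
    sel = (q @ k_c.T / d**0.5).topk(top_k, dim=-1).indices        # (Tq, top_k)
    # Expand selected block ids back to original token positions.
    tok = (sel.unsqueeze(-1) * block + torch.arange(block)).reshape(len(q), -1)
    k_sel, v_sel = k[tok], v[tok]                                 # (Tq, S, d)
    scores = (q.unsqueeze(1) @ k_sel.transpose(1, 2)).squeeze(1) / d**0.5
    return (F.softmax(scores, dim=-1).unsqueeze(1) @ v_sel).squeeze(1)

def hca(q, k, v, block=64):
    """Heavily Compressed Attention: plain dense attention, but over a far
    more aggressively compressed KV memory."""
    d = q.shape[-1]
    k_c, v_c = compress_kv(k, v, block)
    return F.softmax(q @ k_c.T / d**0.5, dim=-1) @ v_c

q, k, v = torch.randn(8, 64), torch.randn(1024, 64), torch.randn(1024, 64)
out_sparse, out_dense = csa(q, k, v), hca(q, k, v)   # both (8, 64)
```

Even this toy version shows where the co-design pressure comes from: the block size fixes the KV layout, the top-k gather determines the sparse memory-access pattern, and both dictate what an efficient inference kernel has to look like.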
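Back to the programmable-loop point. The sketch below is not the real Fireworks Training API surface; `TrainingClient`, `sample`, `step`, and `checkpoint` are hypothetical names standing in for the division of labor described above, where the researcher owns the loop and the reward logic while the platform owns distributed inference, updates, and checkpointing.

```python
class TrainingClient:
    """Hypothetical platform handle; not the real Fireworks Training API."""

    def sample(self, prompts, n):
        # Placeholder for distributed inference: n rollouts per prompt.
        return [[f"{p} :: rollout {i}" for i in range(n)] for p in prompts]

    def step(self, samples, rewards):
        # Placeholder for a distributed policy update (RL or SFT).
        print(f"update on {len(samples)} samples, "
              f"mean reward {sum(rewards) / len(rewards):.2f}")

    def checkpoint(self, tag):
        # Placeholder for managed checkpointing.
        print(f"saved checkpoint {tag}")

def reward(sample: str) -> float:
    # Researcher-defined scoring: a reward model, verifier, or heuristic.
    return float("rollout" in sample)

def train(client, prompts, epochs=2):
    # The researcher writes the loop; the platform executes each call.
    for epoch in range(epochs):
        groups = client.sample(prompts, n=4)
        flat = [s for group in groups for s in group]
        client.step(flat, [reward(s) for s in flat])
        client.checkpoint(tag=f"epoch-{epoch}")

train(TrainingClient(), ["prove 1+1=2", "sort a list in O(n log n)"])
```

The design point this illustrates: because sampling, scoring, and updating are ordinary function calls, the same loop can be rewired for distillation, reward-model training, or agent rollouts without waiting for a new fixed recipe.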