Frontier RL Is Cheaper Than You Think
3/23/2026
The conventional wisdom on RL infrastructure is wrong, and it is costing teams that could be competing at the frontier. The entire mega-cluster narrative rests on a single assumption: that you have to ship 1 TB of weights every time you update your rollout fleet. You do not.

Researchers have spent the last year writing about asynchronous RL and rollout-training disaggregation in systems like AReaL. Teams like Kimi and MiniMax have also published engineering notes on RL parameter updates and asynchronous scheduling. We have been running that pattern in production.

The mega-cluster instinct comes from pretraining, where the main systems problem is keeping one huge synchronous training job saturated. RL is a different problem. The question is not just how to run the trainer. It is also how to keep a large rollout fleet generating data from a fresh enough policy without constantly stalling on full checkpoint transfers.

An RL training run has two jobs. The trainer needs dense, tightly coupled hardware; the rollout fleet needs inference throughput across many parallel requests. Pretraining only has the first job. RL has both, which is why the infrastructure question is different.

A typical frontier checkpoint is around 1 TB. If every policy refresh required shipping that full checkpoint to the rollout fleet, the natural conclusion would be that RL needs one giant co-located cluster with RDMA-class internal networking: keep trainer and inference on the same fabric, avoid long-distance transfers, and treat remote capacity as second class. That is the mega-cluster story. It makes frontier RL look like a market only a handful of companies can enter, because everyone else gets boxed out by infrastructure economics before they even get to compete on algorithms or product execution.

But the premise is wrong. You do not need to move the full 1 TB on every update. Between nearby RL checkpoints, most weights change only a little, which makes it practical to send a compressed delta against the previous checkpoint instead of resending the full 1 TB. Last year, we empirically observed that more than 98% of weights in bf16 format remain bit-equivalent between consecutive checkpoints, and the unchanged fraction is even higher at lower precision.

Our intuition was that post-training updates are extremely fine-grained: RL delivers a very sparse learning signal, on the order of a few bits per rollout. In practice that means RL training uses a fairly small learning rate, and most parameters move only slightly in fp32. Those changes often do not cross the threshold required to alter their 16-bit or lower-precision representation. A recently published paper, Understanding and Exploiting Weight Update Sparsity for Communication-Efficient Distributed RL, provides a theoretical foundation for the same phenomenon and reports similarly high sparsity, often around 99% in practical RL settings.

In the sample setup behind this post, a full checkpoint is 1024 GiB. The average delta between adjacent checkpoints is 20.3 GiB, or 1.98% of the full model. Over the 50-step window shown below, that cuts cross-region transfer volume by about…
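The rounding argument is easy to sanity-check in a few lines. The sketch below is a toy illustration, not our training stack: the weight scale, gradient scale, and learning rate are all made up, and the exact fraction it prints depends entirely on those choices. The point is only the mechanism, namely that a small fp32 step usually fails to move the stored bf16 bit pattern.

```python
import torch

# Toy illustration of the rounding argument: with a small learning rate, one
# fp32 optimizer step usually moves a weight by far less than one bf16 quantum
# (roughly 1/256 of the weight's own magnitude), so the 16-bit pattern that
# would be shipped to rollout workers never changes. All scales are made up.

torch.manual_seed(0)
w_fp32 = torch.randn(1_000_000) * 0.02      # pretend master weights kept in fp32
grad = torch.randn(1_000_000)               # pretend gradient
lr = 1e-6                                   # assumed post-training-scale step size
w_next_fp32 = w_fp32 - lr * grad            # one SGD-style update in fp32

# Compare the raw bit patterns of the bf16 copies, not float equality.
before = w_fp32.to(torch.bfloat16).view(torch.int16)
after = w_next_fp32.to(torch.bfloat16).view(torch.int16)
unchanged = (before == after).float().mean().item()
print(f"bf16 bit-identical after one step: {unchanged:.2%}")
```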
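And here is a minimal sketch of the delta encoding itself, under hypothetical tensor names: compare the 16-bit patterns of two adjacent checkpoints, ship a change mask plus only the values that moved, and patch the previous copy on the rollout side. The byte accounting below is an uncompressed upper bound; the compressed deltas behind the 20.3 GiB figure apply further compression and sharding that this sketch does not model.

```python
import torch

def encode_delta(prev: dict[str, torch.Tensor], curr: dict[str, torch.Tensor]):
    """Return {name: (bool change mask, changed bf16 values)} between two checkpoints."""
    delta = {}
    for name, new in curr.items():
        old_bits = prev[name].flatten().view(torch.int16)
        new_bits = new.flatten().view(torch.int16)
        mask = old_bits != new_bits                  # bit-level comparison, not float ==
        delta[name] = (mask, new.flatten()[mask])
    return delta

def apply_delta(prev: dict[str, torch.Tensor], delta):
    """Rebuild the fresh policy on a rollout worker from its previous copy plus the delta."""
    out = {}
    for name, old in prev.items():
        mask, values = delta[name]
        merged = old.flatten().clone()
        merged[mask] = values
        out[name] = merged.view(old.shape)
    return out

def delta_bytes(delta) -> int:
    """Rough wire size: 1 bit per weight for the mask plus 2 bytes per changed bf16
    value, before any further compression."""
    return sum(m.numel() // 8 + v.numel() * 2 for m, v in delta.values())

if __name__ == "__main__":
    # Tiny smoke test with a fabricated one-tensor "checkpoint"; real checkpoints are sharded.
    torch.manual_seed(0)
    prev = {"w": (torch.randn(2048, 2048) * 0.02).to(torch.bfloat16)}
    curr = {"w": prev["w"].clone()}
    flat = curr["w"].view(-1)
    touched = torch.rand(flat.numel()) < 0.02        # pretend ~2% of weights moved
    flat[touched] = (flat[touched].float() + 1e-2).to(torch.bfloat16)

    delta = encode_delta(prev, curr)
    full = sum(t.numel() * 2 for t in curr.values())
    print(f"delta is {delta_bytes(delta) / full:.1%} of a full bf16 transfer (uncompressed)")
    assert torch.equal(apply_delta(prev, delta)["w"], curr["w"])
```

Whether a dense mask or an explicit index list is smaller depends on how sparse the delta is; at around 2% changed they are within a small factor of each other, and either representation compresses well before going over the wire.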
