Scaling and Optimizing Frontier Model Training
How Fireworks scales frontier model training and offers the broadest set of fine-tunable MoE models on any platform.

Training trillion-parameter Mixture-of-Experts (MoE) models has historically been bottlenecked by memory walls and complex cluster orchestration. Earlier this month, Cursor released Composer 2, a frontier coding model that tops CursorBench at 61.3, SWE-bench Multilingual at 73.7, and Terminal-Bench at 61.7. Fireworks powers the Reinforcement Learning (RL) inference infrastructure behind it, proving that these bottlenecks can be overcome at scale.

We have written about delta-compressed weight sync and multi-region rollout fleets, and about why numerical parity between training and inference is especially hard for MoE models. Those posts cover the inference half of the RL loop: rollouts, weight transfer, and numerical alignment. This post covers the last missing piece: the trainer itself. Our Training SDK provides the model catalog, parallelism stack, precision kernels, and memory optimizations that make it possible to fine-tune trillion-parameter MoE models on current hardware.

Our Training Shapes catalog supports both LoRA and full-parameter training across models in the Fireworks catalog. Customers pick a shape ID and call resolve_training_profile(); the Training SDK and API backend handle GPU layout, parallelism, and deployment bring-up automatically (a sketch of this workflow appears at the end of this post). Teams that want to start with managed fine-tuning and graduate to custom training loops can do so on the same platform. Both policy trainer and forward-only reference shapes are available for every model, supporting full RL workflows with separate policy and reference deployments. This is, to our knowledge, the broadest set of fine-tunable frontier MoE models available on any training platform.

The two training modes present very different engineering challenges. LoRA freezes most of the model and updates a small set of low-rank adapters; the question is whether the full model even fits on a single node. Full-parameter training updates every weight; the question is how to distribute a trillion parameters, their gradients, their optimizer states, and their activations across a GPU cluster while keeping utilization high. We built the engine to handle both.

LoRA fine-tuning of a 1T MoE model sounds like it should be easy, since only a fraction of parameters are trainable. But the frozen base model still has to live in GPU memory. Kimi K2.5 has 384 MoE experts; in bfloat16, those experts alone consume the majority of an 8-GPU node's memory before a single gradient is computed.

Low-precision expert quantization makes it fit. We store frozen expert weights in a reduced-precision packed format, cutting expert memory by roughly 4x. The experts are dequantized to bf16 on the fly during the forward pass; because they are frozen, there is no loss of gradient precision. For Kimi K2.5, this is the difference between needing multiple nodes and fitting on a single 8-GPU node.

Optimizer state offloading between CPU and GPU reclaims significant additional memory headroom. On a Qwen3-30B MoE model (128 experts, 8 H200 GPUs), it reduces peak GPU memory by over 40% with no loss in throughput.

Training results are…
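To ground the pieces above, here is a minimal sketch of the shape-resolution workflow. Only resolve_training_profile() is named in this post; the import path, shape ID, arguments, and profile fields below are illustrative assumptions, not the documented SDK surface.

```python
# Hypothetical sketch of the Training Shapes workflow. Only
# resolve_training_profile() is named in the post; the import path,
# shape ID, and profile fields are assumptions for illustration.
from fireworks.training import resolve_training_profile  # assumed path

# Pick a shape ID from the Training Shapes catalog (hypothetical ID).
profile = resolve_training_profile("qwen3-30b-moe-lora")

# The SDK and API backend resolve GPU layout, parallelism, and
# deployment bring-up from the shape; none of this is hand-configured.
print(profile.gpu_layout)    # e.g. 1 node x 8 GPUs (assumed field)
print(profile.parallelism)   # e.g. expert/tensor/data degrees (assumed field)
```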
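For readers less familiar with the mechanics of LoRA, a minimal PyTorch sketch of a LoRA-wrapped linear layer (the standard formulation, not Fireworks' internal implementation) shows why only a small set of parameters trains while the base stays frozen:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Standard LoRA formulation (not Fireworks' internal code): the frozen
    base projection plus a trainable low-rank update scaled by alpha/r."""

    def __init__(self, base: nn.Linear, r: int = 16, alpha: float = 32.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)            # freeze the base weights
        # Trainable adapters: A is (r, in), B is (out, r); B starts at zero
        # so training begins from the unmodified base model.
        self.lora_a = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Only the low-rank path receives gradients; the frozen base
        # contributes activations but accumulates no optimizer state.
        return self.base(x) + (x @ self.lora_a.T @ self.lora_b.T) * self.scale
```

The memory problem the post describes lives in self.base: for a 1T MoE model those frozen weights dominate the footprint, which is what the expert quantization sketched next addresses.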
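The post does not specify the packed format. A minimal sketch using per-row 4-bit absmax quantization, two values packed per byte (consistent with the roughly 4x saving over bf16), illustrates the store-packed, dequantize-on-the-fly pattern:

```python
import torch
import torch.nn as nn

def quantize_int4(w: torch.Tensor):
    """Per-row absmax quantization of a weight matrix to 4 bits, packed
    two values per byte (~4x smaller than bf16). Assumes an even column
    count. A sketch; the actual packed format is not specified in the post."""
    scale = w.abs().amax(dim=1, keepdim=True) / 7.0              # int4 range [-8, 7]
    q = (torch.clamp((w / scale).round(), -8, 7) + 8).to(torch.uint8)  # shift to [0, 15]
    packed = q[:, 0::2] | (q[:, 1::2] << 4)                      # two nibbles per byte
    return packed, scale

def dequantize_int4(packed: torch.Tensor, scale: torch.Tensor,
                    dtype=torch.bfloat16) -> torch.Tensor:
    """Unpack nibbles, undo the shift, and rescale to the requested dtype."""
    lo = (packed & 0x0F).to(torch.int16) - 8
    hi = (packed >> 4).to(torch.int16) - 8
    q = torch.stack([lo, hi], dim=2).flatten(1)                  # restore column order
    return q.to(dtype) * scale.to(dtype)

class FrozenQuantExpert(nn.Module):
    """A frozen expert projection stored packed, dequantized each forward."""

    def __init__(self, weight: torch.Tensor):
        super().__init__()
        packed, scale = quantize_int4(weight)
        self.register_buffer("packed", packed)   # uint8, ~4x smaller than bf16
        self.register_buffer("scale", scale)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Dequantize to bf16 on the fly. The expert is frozen, so no
        # gradients are taken with respect to these weights; gradients
        # with respect to x flow through the bf16 matmul at full precision.
        w = dequantize_int4(self.packed, self.scale, dtype=x.dtype)
        return x @ w.t()
```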
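The offloading mechanism is likewise not described in the post. One common pattern, sketched here under the assumption of an AdamW-style optimizer, keeps fp32 master weights and moments in pinned CPU memory and touches the GPU only to fetch gradients and write back updated weights:

```python
import torch

class CPUOffloadAdamW:
    """Sketch of CPU optimizer-state offloading (not Fireworks' code):
    fp32 master params and AdamW moments live in pinned CPU memory; each
    step copies gradients to CPU, updates there, and writes weights back."""

    def __init__(self, params, lr=1e-4, betas=(0.9, 0.95), eps=1e-8, wd=0.1):
        self.gpu_params = [p for p in params if p.requires_grad]
        self.cpu_params = [p.detach().float().cpu().pin_memory()
                           for p in self.gpu_params]
        self.exp_avg = [torch.zeros_like(p) for p in self.cpu_params]
        self.exp_avg_sq = [torch.zeros_like(p) for p in self.cpu_params]
        self.lr, self.betas, self.eps, self.wd, self.t = lr, betas, eps, wd, 0

    @torch.no_grad()
    def step(self):
        self.t += 1
        b1, b2 = self.betas
        for p_gpu, p_cpu, m, v in zip(self.gpu_params, self.cpu_params,
                                      self.exp_avg, self.exp_avg_sq):
            g = p_gpu.grad.detach().float().cpu()      # gradient to CPU
            m.mul_(b1).add_(g, alpha=1 - b1)           # moments update on CPU
            v.mul_(b2).addcmul_(g, g, value=1 - b2)
            m_hat = m / (1 - b1 ** self.t)             # bias correction
            v_hat = v / (1 - b2 ** self.t)
            p_cpu.mul_(1 - self.lr * self.wd)          # decoupled weight decay
            p_cpu.addcdiv_(m_hat, v_hat.sqrt().add_(self.eps), value=-self.lr)
            p_gpu.copy_(p_cpu.to(p_gpu.dtype))         # updated weights to GPU
```

Production implementations typically bucket the transfers and overlap them with backward compute to hide the PCIe latency; the sketch keeps the copies synchronous for clarity, so it trades step time for the memory savings the post describes.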
