Unlock Exascale Performance on NVIDIA GB200 NVL72 with Slurm Topology-Aware Job Scheduling
As AI models grow in scale and complexity, realizing the full performance of modern accelerated infrastructure depends as much on how workloads are placed as on the hardware itself. NVIDIA GB200 NVL72 delivers exascale compute in a single rack, unlocking real-time trillion-parameter models. Yet capturing that performance in a shared cluster requires schedulers that understand the system architecture and align jobs with its network topology.

This post explains how Slurm topology-aware job scheduling works on NVIDIA GB200 NVL72 and provides scheduling recommendations for optimal GPU occupancy.

How does NVIDIA GB200 NVL72 deliver exascale compute?
NVIDIA GB200 NVL72 is an exascale computer in a single rack. Its 72 NVIDIA Blackwell GPUs are interconnected by NVIDIA NVLink, the largest production scale-up compute fabric, which provides 130 terabytes per second (TB/s) of low-latency GPU communication bandwidth for AI and high-performance computing (HPC) workloads.

Multiple GB200 NVL72 systems combined in a cluster create a hierarchical network topology with large domains of very high networking bandwidth. An AI training job benefits most from this bandwidth when it is scheduled to maximize use of the NVLink fabric.

Recent results show that GB200 NVL72 delivers significant performance improvements across AI workloads: training (>2.6x in recent MLPerf Training results), inference (real-time inference for trillion-parameter models, >1.5 million tokens/second for the OpenAI gpt-oss model, state-of-the-art disaggregated serving), and reasoning.

In a shared cluster running multiple training jobs, a resource-efficient scheduler must account for varying network bandwidth requirements.

What is topology-aware job scheduling?
Topology-aware job scheduling allows a job scheduler such as Slurm to make resource allocation decisions based on the cluster’s physical network layout, such as the hierarchy of switches and racks. The scheduler should preserve locality, keeping workloads within the same NVLink domain whenever possible. In addition, because multiple training or inference jobs can fit in a group of NVL72 racks, the scheduler must bin-pack efficiently to avoid resource fragmentation.

The longstanding Slurm topology/tree plugin provides topology-aware scheduling for large clusters, but its best-effort approach often fragments jobs across leaf switches to reduce queue time. This compromise between start time and performance was acceptable for traditional InfiniBand fabrics, but the advent of rack-scale systems like GB200 NVL72 and GB300 NVL72 necessitated a change.

In response, NVIDIA and SchedMD collaborated to launch the new topology/block plugin in Slurm 23.11, designed specifically for these modern architectures. The plugin's configuration identifies groups of nodes that belong to the same NVL72 domain, which lets the scheduler align Slurm jobs with NVL72 domain boundaries.

To learn more about the block topology plugin and how segment sizes are scheduled, see Achieving Peak System and Workload Efficiency on NVIDIA GB200 NVL72 with Slurm Block Scheduling.

How do cluster segmentation and job scheduling work on GB200 NVL72?
As clusters grow in scale and complexity, managing GPU resources becomes critical for achieving both high utilization and predictable performance. The GB200 NVL72 system introduces larger AI job segment sizes and fine-grained scheduling control, enabling…
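
The locality and bin-packing goals described for topology-aware scheduling can be illustrated with a small toy model. The sketch below is not Slurm's algorithm or the topology/block plugin's implementation. It is a minimal best-fit placement under stated assumptions: each NVL72 domain is modeled as a block of 18 schedulable nodes (4 GPUs per node), a job is split into equal segments, and every segment must land entirely inside one block. The names `Block`, `place_job`, and `NODES_PER_BLOCK` are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass

# Assumption for this sketch: one NVL72 domain is exposed to the
# scheduler as 18 nodes with 4 GPUs each (72 GPUs in total).
NODES_PER_BLOCK = 18


@dataclass
class Block:
    """A toy stand-in for one NVL72 domain (hypothetical type)."""
    name: str
    free: int = NODES_PER_BLOCK


def place_job(blocks, num_nodes, segment_size):
    """Place a job as whole segments, each inside a single block.

    Returns a list of (block_name, nodes) pairs, or None when at least
    one segment cannot fit. The blocks are only updated on success.
    """
    if num_nodes <= 0 or segment_size <= 0 or num_nodes % segment_size:
        raise ValueError("num_nodes must be a positive multiple of segment_size")

    free = {b.name: b.free for b in blocks}  # tentative copy of the state
    placement = []
    for _ in range(num_nodes // segment_size):
        candidates = [name for name, n in free.items() if n >= segment_size]
        if not candidates:
            return None  # keep the job queued instead of splitting a segment
        # Best fit: choose the fullest block that still holds the segment,
        # which leaves emptier blocks intact for larger jobs.
        best = min(candidates, key=lambda name: (free[name], name))
        free[best] -= segment_size
        placement.append((best, segment_size))

    for b in blocks:  # commit the tentative state
        b.free = free[b.name]
    return placement


if __name__ == "__main__":
    racks = [Block("rack1"), Block("rack2")]
    print(place_job(racks, 16, 16))  # one 16-node segment
    print(place_job(racks, 4, 2))    # two 2-node segments
    print(place_job(racks, 18, 18))  # needs a fully free rack
```

In this toy run, the 16-node job fills most of the first rack, the two small segments use the leftover nodes there before touching the second rack, and the 18-node job is refused because no rack is completely free. That refusal is the contrast with a best-effort approach, which would start the job sooner by spreading it across racks at the cost of NVLink locality.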

