NVIDIA Developer Blog·Hardware·7d ago·by Felix Abecassis·~3 min read

Achieving Peak System and Workload Efficiency on NVIDIA GB200 NVL72 with Slurm Block Scheduling

NVIDIA GB200 NVL72 introduces a fundamentally new way to build GPU clusters by extending NVIDIA NVLink coherence across an entire rack. This design enables exascale performance, but it also changes the assumptions that many scheduling systems were built on: "rack-scale locality" becomes a hard constraint. When workloads cross domain boundaries, performance drops sharply, and a scheduler that treats the network fabric as a best-effort tree topology will fragment allocations in ways that increase queue times and degrade application performance. To address this, the Slurm workload manager introduced the topology/block plugin and continues to expand its capabilities with segmented scheduling. The plugin enables administrators and users to express application-specific NVLink requirements as atomic blocks rather than loosely optimized allocations. This post explains what makes the NVIDIA GB200 NVL72 architecture unique, how Slurm block scheduling helps optimize placement and performance, and how to configure topology.yaml, --segment, and related features so you can move from prototype clusters to production-grade rack-scale orchestration.

How is NVIDIA GB200 NVL72 architecture unique?

NVIDIA GB200 NVL72 is an exascale computer in a single rack that represents a new paradigm in GPU cluster design. While previous generations of servers used NVIDIA NVLink within a single chassis, GB200 NVL72 extends this coherent memory domain across an entire rack: 72 NVIDIA Blackwell GPUs spanning 18 compute trays, unified with fifth-generation NVLink. All communication within the rack operates at NVLink speeds: NVIDIA GB200 NVL72 provides 1.8 TB/s of bidirectional throughput per GPU, for a total of 130 TB/s aggregate bandwidth. Communication that crosses domain boundaries faces a steep performance drop, typically to 50 GB/s (400 Gb/s) over InfiniBand or Ethernet.
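To make the block concept concrete, here is a minimal configuration sketch for the topology/block plugin. The node and block names are hypothetical, and the exact file format depends on your Slurm version (newer releases also accept an equivalent topology.yaml); consult the documentation for the release you run.

```
# slurm.conf: select the block topology plugin
TopologyPlugin=topology/block

# topology.conf: each BlockName groups the nodes of one NVLink
# domain (one GB200 NVL72 rack = 18 compute trays / nodes).
# Node ranges below are illustrative.
BlockName=rack1 Nodes=node[001-018]
BlockName=rack2 Nodes=node[019-036]

# BlockSizes lists the allocation granularities the scheduler
# may use; the base size matches one NVLink domain.
BlockSizes=18,36
```

With this in place, the scheduler treats each 18-node block as an atomic placement unit rather than a soft locality hint.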
Operating GB200 NVL72 clusters at scale requires new workload scheduling algorithms that treat NVLink domains as hard boundaries for jobs. While these new algorithms are essential for workload efficiency, they also require administrative awareness of system fragmentation. The topology/block Slurm plugin helps users and administrators with both.

How does block scheduling work in Slurm?

Slurm has long supported topology-aware job scheduling: the topology/tree plugin has been the standard for large-scale clusters. It models the network fabric as a hierarchical tree of switches and nodes. While the primary objective of topology/tree is to minimize the number of switches a job spans, it is a best-effort attempt: the job might end up heavily fragmented across leaf switches in order to start sooner. For clusters with an InfiniBand fabric connecting all compute nodes, this trade-off makes sense. A job split across multiple leaf switches might run slightly slower than one placed under a single leaf switch, but the trade-off between start time and performance is generally considered acceptable.

The introduction of GB200 NVL72 and GB300 NVL72 required a new approach. Through a joint effort between NVIDIA and SchedMD, the new topology/block plugin was introduced in the Slurm 23.11 release to support rack-scale architectures such as GB200 NVL72. If a job submits an allocation request that fits within a single block (18 nodes or fewer), the nodes will always be allocated from one block and the…
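From the user's side, the --segment flag mentioned above lets a multi-block job tell the scheduler how to partition its allocation along NVLink-domain boundaries. A hedged command-line sketch follows; the job script names and node counts are illustrative, not from the original post.

```
# A request of 18 nodes or fewer fits in one block, so
# topology/block places it entirely within one NVLink domain.
sbatch --nodes=12 my_job.sh

# A larger job can ask Slurm to carve its allocation into
# 18-node segments, so each segment lands wholly inside one
# GB200 NVL72 rack instead of straddling domain boundaries.
sbatch --nodes=36 --segment=18 train.sh
```

The key design point: the segment size matches the NVLink domain, so intra-segment traffic stays on NVLink while only inter-segment traffic crosses the slower InfiniBand or Ethernet fabric.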
