Introducing AutoSP
Large language models (LLMs) are increasingly being trained for extremely long-context tasks, where token counts can exceed 100k. At these lengths, out-of-memory (OOM) issues start to surface even when scaling device counts with conventional training techniques such as ZeRO/FSDP. To circumvent these issues, a commonly used technique is sequence parallelism (SP): partitioning the input tokens across devices so that context length can grow with GPU count. However, implementing SP is notoriously difficult, requiring invasive code changes to existing libraries such as DeepSpeed or HuggingFace. These changes typically involve partitioning the input token context (and intermediate activations), inserting communication collectives, and overlapping communication with computation, all of which must be done for both the forward and backward passes. As a result, researchers who want to experiment with long-context capabilities spend significant effort engineering the systems stack, and must repeat that effort for each hardware vendor.

To avoid this complexity, we introduce AutoSP: a compiler-based solution that automatically converts easy-to-write training code into multi-GPU sequence-parallel code that uses GPUs efficiently to train on longer input contexts, while composing with existing parallel strategies (such as ZeRO). This removes the cumbersome need for developers to repeatedly modify training pipelines for long-context training. Users can now simply import AutoSP and compile arbitrary models using the AutoSP backend, giving the power of long-context training to anyone. Moreover, because the technology lives in the compiler, our approach is performance-portable: highly performant SP can be realised on diverse hardware.

We structure this post as follows: (1) AutoSP and how model scientists can use it to enable long-context training; (2) key design decisions of AutoSP; (3) key AutoSP results, demonstrating its ease of use and impact; and (4) some limitations and things AutoSP cannot do.

AutoSP Usage

A key design philosophy of AutoSP is simplicity: abstracting most of the complexity of programming multiple GPUs away from the user. To do this, we implement AutoSP within DeepCompile, a compiler ecosystem within DeepSpeed for programmatically enabling diverse optimisations for deep neural network training. With this, any DeepSpeed user can enable sequence parallelism with almost zero hassle. We take a look at an example next.

```python
import deepspeed

# We instantiate a DeepSpeed config.
# Assume 8 GPUs with 2 DP ranks and 4 SP ranks.
config = {
    # Global batch = micro batch (1) x grad accumulation (1) x 2 DP ranks.
    "train_micro_batch_size_per_gpu": 1,
    "train_batch_size": 2,
    "steps_per_print": 1,
    "optimizer": {
        "type": "Adam",
        "params": {"lr": 1e-4},
    },
    "zero_optimization": {
        "stage": 1,  # AutoSP interoperates with ZeRO 0/1.
    },
    # Simply turn on DeepCompile and set
    # the AutoSP pass to be triggered on.
    "compile": {
        "deepcompile": True,
        "passes": ["autosp"],
    },
    "sequence_parallel_size": 4,
    "gradient_clipping": 1.0,
}

# Initialise DeepSpeed with the model.
model, _, _, _ = deepspeed.initialize(config=config, model=model)

# Compiles the model and automatically applies the AutoSP passes.
model.compile(compile_kwargs={"dynamic": True})

for idx, batch in enumerate(train_loader):
    # Custom function that we expose within:
    # deepspeed/compile/passes/sp_compile.
    inputs, labels, positions, mask = prepare_auto_sp_inputs(batch)
    outputs = model(
        input_ids=inputs,
        labels=labels,
        position_ids=positions,
        attention_mask=mask,
    )
    loss = outputs.loss
    # Backwards pass, optimiser step etc. via the DeepSpeed engine.
    model.backward(loss)
    model.step()
```
As seen in the example above, users take existing training code…
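To build intuition for what AutoSP automates, start with the data: SP partitions the input tokens across devices, so input preparation conceptually amounts to slicing each tensor along the sequence dimension by SP rank. The sketch below illustrates the idea; `shard_sequence` is a hypothetical helper for illustration only, not the actual `prepare_auto_sp_inputs` implementation.

```python
import torch

def shard_sequence(x: torch.Tensor, sp_rank: int, sp_size: int) -> torch.Tensor:
    # x: [batch, seq_len, ...]. Return this SP rank's contiguous slice
    # of the sequence dimension. Hypothetical helper, illustration only.
    seq_len = x.shape[1]
    assert seq_len % sp_size == 0, "seq_len must divide evenly across SP ranks"
    chunk = seq_len // sp_size
    return x[:, sp_rank * chunk : (sp_rank + 1) * chunk]

# Example: a 16-token sequence over 4 SP ranks -> 4 tokens per rank.
ids = torch.arange(16).unsqueeze(0)               # [1, 16]
print(shard_sequence(ids, sp_rank=1, sp_size=4))  # tensor([[4, 5, 6, 7]])
```

Labels, position IDs, and the attention mask must be sliced consistently with the tokens, which is why the usage example routes the whole batch through a single preparation function.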
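The harder part, and the part that makes hand-written SP so invasive, is communication: each rank holds only a shard of the sequence, yet attention needs to see all tokens. A common hand-written remedy (a DeepSpeed-Ulysses-style layout) is an all-to-all that re-shards activations from sequence-partitioned to head-partitioned, so every rank sees the full sequence for a subset of attention heads. The sketch below shows that collective under these assumptions; the function name and tensor layout are illustrative, not AutoSP's generated code.

```python
import torch
import torch.distributed as dist

def seq_to_head_all_to_all(x: torch.Tensor, sp_group) -> torch.Tensor:
    # Re-shard [batch, seq/P, heads, dim] -> [batch, seq, heads/P, dim]
    # across the P ranks of sp_group. Illustrative sketch only.
    p = dist.get_world_size(group=sp_group)
    b, s_local, h, d = x.shape
    assert h % p == 0, "num_heads must be divisible by the SP degree"
    # Bucket the heads into P groups and move the bucket index to the
    # front, so all_to_all_single exchanges one bucket per rank.
    x = x.reshape(b, s_local, p, h // p, d).permute(2, 0, 1, 3, 4).contiguous()
    out = torch.empty_like(x)
    dist.all_to_all_single(out, x, group=sp_group)
    # out[i] now holds rank i's sequence shard for our head bucket;
    # stitch the shards back together along the sequence dimension.
    return out.permute(1, 0, 2, 3, 4).reshape(b, p * s_local, h // p, d)
```

A mirror all-to-all converts back after attention, both collectives must be differentiated in the backward pass (hand-written versions typically wrap them in a `torch.autograd.Function`), and for performance they are usually issued asynchronously and overlapped with independent computation. This is exactly the plumbing described in the introduction, and it is what AutoSP inserts into the compiled graph on the user's behalf.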
