PyTorch Blog·Hardware·19d ago·by Vasiliy Kuznetsov (Meta) and Sayak Paul (Hugging Face)·~3 min read

Faster Diffusion on Blackwell: MXFP8 and NVFP4 with Diffusers and TorchAO


Diffusion models for image and video generation have surged in popularity, delivering strikingly realistic visual media. However, their adoption is often constrained by steep memory and compute requirements, so quantization is essential for serving these models efficiently. In this post, we demonstrate reproducible end-to-end inference speedups of up to 1.26x with MXFP8 and 1.68x with NVFP4 using Diffusers and TorchAO on the Flux.1-Dev, QwenImage, and LTX-2 models on NVIDIA B200. We also outline how we used selective quantization, CUDA Graphs, and LPIPS as a quality metric to iterate toward the best accuracy and performance for these models. The code to reproduce the experiments in this post is here.

Table of contents:

- Background on MXFP8 and NVFP4
- Basic Usage with Diffusers and TorchAO
- Benchmark Results
- Technical Considerations

Background on MXFP8 and NVFP4

MXFP8 and NVFP4 are microscaling formats supported natively by NVIDIA’s Blackwell architecture (e.g., B200 GPUs). Unlike standard quantization, which scales an entire tensor, microscaling groups elements into small blocks (e.g., 16 or 32 values) that share a high-precision scale factor. This allows for significantly lower bit-depths while preserving dynamic range and accuracy.

- MXFP8 (OCP Microscaling FP8): An 8-bit industry-standard format (E4M3/E5M2) from the Open Compute Project (OCP). It uses a block size of 32 with 8-bit scaling. It provides a “sweet spot” balance, delivering faster inference than BF16 with virtually no loss in visual quality (lower LPIPS), and it often achieves the lowest latency at smaller batch sizes.
- NVFP4 (NVIDIA FP4): A 4-bit floating-point format (E2M1) uniquely accelerated by Blackwell Tensor Cores. It uses a block size of 16 with FP8 scaling factors. It offers the highest theoretical throughput and lowest memory footprint (approx. 3.5x smaller than BF16), making it ideal for high-batch, compute-bound workloads.

Refer to this post to learn more.
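To build intuition for what blockwise microscaling does, here is a minimal NumPy sketch. It is an illustration, not the MX spec or torchao's implementation: the function names are ours, the mantissa rounding is a crude stand-in for real E4M3 encoding, and only the block structure (a shared power-of-two scale per block) mirrors the formats described above. The `effective_bits` helper also checks the "approx. 3.5x smaller than BF16" claim for NVFP4's 4-bit payload plus an FP8 scale per 16 elements.

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def mx_quantize(x, block_size=32):
    """Toy microscaling quantization (illustration only).

    Each block of `block_size` values shares one power-of-two scale,
    mimicking MXFP8's shared 8-bit exponent scale; scaled values are
    then rounded to a few mantissa bits to mimic an E4M3 payload.
    """
    x = np.asarray(x, dtype=np.float64).reshape(-1, block_size)
    amax = np.abs(x).max(axis=1, keepdims=True)
    # power-of-two scale mapping each block's amax just inside E4M3 range
    scale = 2.0 ** np.ceil(np.log2(np.maximum(amax, 1e-300) / E4M3_MAX))
    q = x / scale
    # crude stand-in for FP8 rounding: keep 4 fractional mantissa bits
    m, e = np.frexp(q)                    # q = m * 2**e, m in [0.5, 1)
    q = np.ldexp(np.round(m * 16) / 16, e)
    return q, scale

def mx_dequantize(q, scale):
    return (q * scale).reshape(-1)

def effective_bits(payload_bits, scale_bits, block_size):
    """Storage cost per element: payload plus amortized per-block scale."""
    return payload_bits + scale_bits / block_size
```

Despite the shared scale, the per-element relative error stays small across blocks of very different magnitudes, which is the point of microscaling. For NVFP4, `effective_bits(4, 8, 16)` gives 4.5 bits per element, and 16 / 4.5 ≈ 3.56x, consistent with the "approx. 3.5x" footprint reduction versus BF16; MXFP8 lands at 8.25 bits.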
Basic Usage with Diffusers and TorchAO

Prerequisites

NVFP4 requires a CUDA compute capability of at least 10.0, so make sure you have a GPU that fits the bill. The benchmarks presented in this post were conducted on a B200 machine (B200 DGX). For the virtual environment, you can use conda:

```shell
conda create -n nvfp4 python=3.11 -y
conda activate nvfp4
pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu130
pip install --pre torchao --index-url https://download.pytorch.org/whl/nightly/cu130
pip install --pre mslk --index-url https://download.pytorch.org/whl/nightly/cu130
pip install diffusers transformers accelerate sentencepiece protobuf av imageio-ffmpeg
```

At the time of writing, the nightlies were 2.12.0.dev20260315+cu130, 0.17.0.dev20260316+cu130, and 2026.3.15+cu130 for PyTorch, TorchAO, and MSLK, respectively. Some models require users to be authenticated on the Hugging Face Hub, so please run `hf auth login` before running the examples, if you have not already done so.

Basic Usage

Using the NVFP4 quantization config from TorchAO is straightforward with its native integration in Diffusers:

```python
import torch
from diffusers import DiffusionPipeline, TorchAoConfig, PipelineQuantizationConfig
from torchao.prototype.mx_formats.inference_workflow import (
    NVFP4DynamicActivationNVFP4WeightConfig,
)

config = NVFP4DynamicActivationNVFP4WeightConfig(
    use_dynamic_per_tensor_scale=True,
    use_triton_kernel=True,
)
pipe_quant_config = PipelineQuantizationConfig(
    quant_mapping={"transformer": TorchAoConfig(config)}
)

pipe = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    torch_dtype=torch.bfloat16,
    quantization_config=pipe_quant_config,
).to("cuda")
pipe.transformer.compile_repeated_blocks(fullgraph=True)

pipe_call_kwargs = {
    "prompt": "A cat holding a sign that says hello world",
    "height": …
```
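When reproducing end-to-end speedups like those reported above (1.26x for MXFP8, 1.68x for NVFP4), it matters that you warm up before timing, since the first calls after `compile_repeated_blocks` pay compilation cost, and that you flush asynchronous GPU work before stopping the clock. Below is a small, generic timing helper; the function names are ours, not part of Diffusers or TorchAO. On GPU you would pass `sync=torch.cuda.synchronize` and `fn=lambda: pipe(**pipe_call_kwargs)`.

```python
import time

def median_latency(fn, warmup=3, iters=10, sync=lambda: None):
    """Median wall-clock latency of fn() in seconds.

    `sync` should flush pending async work (e.g. torch.cuda.synchronize
    on GPU); it defaults to a no-op so the helper also works for plain
    CPU callables. Warmup runs absorb one-time compilation overhead.
    """
    for _ in range(warmup):
        fn()
    sync()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        sync()
        samples.append(time.perf_counter() - start)
    samples.sort()
    return samples[len(samples) // 2]

def speedup(baseline_s, quantized_s):
    """E.g. baseline BF16 latency over MXFP8/NVFP4 latency."""
    return baseline_s / quantized_s
```

The median (rather than the mean) keeps a single slow outlier iteration, such as a stray recompile, from skewing the reported number.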

#multimodal #gpu
read full article on PyTorch Blog