NVIDIA Developer Blog · Hardware · by Ruixiang Wang · ~3 min read

Model Quantization: Post-Training Quantization Using NVIDIA Model Optimizer

Model quantization is an effective method to reduce VRAM usage and improve inference performance on consumer devices such as NVIDIA GeForce RTX GPUs. By lowering computational and memory requirements while preserving model quality, quantization helps AI models run more efficiently in resource-constrained environments. This post walks through how to use NVIDIA Model Optimizer to quantize a CLIP model in FP8 format with the post-training quantization (PTQ) method. For a general introduction to model quantization, see Model Quantization: Concepts, Methods, and Why It Matters.

What is NVIDIA Model Optimizer?

The NVIDIA Model Optimizer (ModelOpt) library incorporates state-of-the-art model optimization techniques to compress and accelerate AI models. These techniques include quantization, distillation, pruning, speculative decoding, and sparsity. ModelOpt accepts Hugging Face, PyTorch, or ONNX format models as input and provides Python APIs for users to easily combine different optimization techniques to produce optimized checkpoints. ModelOpt supports highly performant quantization formats such as FP4, FP8, INT8, and INT4, and advanced algorithms including SmoothQuant, AWQ, SVDQuant, and Double Quantization. It supports both PTQ and quantization-aware training (QAT).

What is CLIP?

CLIP (Contrastive Language-Image Pretraining), introduced by OpenAI in 2021, is a foundation vision-language model (VLM) that learns a shared embedding space for images and text through contrastive learning on large sets of image-text pairs. Its ability to produce semantically aligned representations has made it a core building block across modern multimodal systems. The CLIP text encoder is widely reused as a conditioning module for text-to-image (Stable Diffusion, for example) and text-to-video (AnimateDiff, for example) synthesis. Its vision encoder serves as the visual backbone in multimodal LLMs, such as LLaVA, and in open-vocabulary perception models, such as OWL-ViT. Successors such as OpenCLIP and SigLIP scale the data and refine the objective but preserve the dual-encoder contrastive paradigm.
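To make the dual-encoder setup concrete, here is a minimal zero-shot classification sketch using the Hugging Face transformers CLIP API. It is not from the original post; the checkpoint name, image path, and labels are illustrative assumptions.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Illustrative checkpoint; the post quantizes CLIP-ViT-L-14-laion2B-s32B-b82K
ckpt = "openai/clip-vit-large-patch14"
model = CLIPModel.from_pretrained(ckpt).eval()
processor = CLIPProcessor.from_pretrained(ckpt)

image = Image.open("example.jpg")  # any local image (assumed path)
labels = ["a photo of a cat", "a photo of a dog"]

# Both encoders map their inputs into the shared embedding space
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds scaled image-text similarities; softmax yields probabilities
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))

The same image-text similarity score drives both the zero-shot classification and zero-shot retrieval benchmarks used in the evaluation below.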
Quantization recipe

The following quantization recipe is used in this post as a step-by-step guide for running CLIP model quantization with ModelOpt and understanding how the process works.

First, prepare the corresponding models and datasets:

- Base CLIP model: CLIP-ViT-L-14-laion2B-s32B-b82K
- Calibration dataset for quantization: a 10K subset of MS-COCO
- Model accuracy evaluation focuses on three tasks from CLIP_benchmark:
  - cifar100 (zero-shot classification)
  - imagenet1k (zero-shot classification)
  - mscoco_captions (zero-shot retrieval)

How to run PTQ with ModelOpt

The following code sample shows how to run PTQ for the CLIP model in FP8 using ModelOpt (the article's listing is truncated here; a sketch of the remaining steps follows the code):

import torch
from torch.utils.data import DataLoader, Subset
from transformers import CLIPModel, CLIPTokenizer, CLIPImageProcessor
from transformers.models.clip.modeling_clip import CLIPAttention

import modelopt.torch.opt as mto
import modelopt.torch.quantization as mtq
from modelopt.torch.quantization.plugins.diffusion.diffusers import _QuantAttention

# FP8 (E4M3) per-tensor static quantization
FP8_CFG = {
    "quant_cfg": {
        "*weight_quantizer": {"num_bits": (4, 3), "axis": None, "trt_high_precision_dtype": "Half"},
        "*input_quantizer": {"num_bits": (4, 3), "axis": None, "trt_high_precision_dtype": "Half"},
        "*[qkv]_bmm_quantizer": {"num_bits": (4, 3), "axis": None, "trt_high_precision_dtype": "Half"},
        "*bmm2_output_quantizer": {"num_bits": (4, 3), "axis": None, "trt_high_precision_dtype": "Half"},
        "default": {"enable": False},
    },
    "algorithm": "max",
}

mto.enable_huggingface_checkpointing()
mtq.QuantModuleRegistry.register({CLIPAttention: "CLIPAttention"})(_QuantAttention)

model = CLIPModel.from_pretrained(args.model_ckpt, attn_implementation="sdpa").half().eval().cuda()
tokenizer = CLIPTokenizer.from_pretrained(args.model_ckpt)
processor = CLIPImageProcessor.from_pretrained(args.model_ckpt)

calib_set = Subset(CLIP_COCO_dataset(ANN, IMG_DIR, tokenizer, processor), range(8192))
loader = DataLoader(calib_set, batch_size=512, num_workers=4)

# Calibration: 8K MS-COCO image-text pairs
def calibrate(m):
    …
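Since the listing cuts off at the calibration function, here is a hedged sketch of how a ModelOpt PTQ flow of this shape typically finishes, assuming the calibration dataset yields tensor batches accepted by CLIPModel.forward. The forward-loop body, the mtq.quantize call, and the output path shown here are assumptions for illustration, not the post's verbatim code.

# Sketch of the remaining PTQ steps (assumed, not from the original post)
def calibrate(m):
    # Run calibration batches through the model so the "max" algorithm
    # can record activation ranges for the enabled quantizers
    with torch.no_grad():
        for batch in loader:
            batch = {k: v.cuda() for k, v in batch.items()}  # assumes tensor-only batches
            m(**batch)

# Insert quantizers per FP8_CFG and run the calibration forward loop once
model = mtq.quantize(model, FP8_CFG, forward_loop=calibrate)

# With enable_huggingface_checkpointing(), save_pretrained also stores the
# ModelOpt quantizer state alongside the Hugging Face checkpoint
model.save_pretrained("clip-vit-l-14-fp8")  # output path is illustrative

From there, the quantized checkpoint can be evaluated on the CLIP_benchmark tasks listed above to measure accuracy retention against the FP16 baseline.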

Tags: inference, training, gpu
Read the full article on the NVIDIA Developer Blog.