vLLM Blog·Infra·10d ago·~3 min read

Announcing VeRL-Omni: Easy, Fast, and Stable RL Training for Diffusion and Omni-Modality Models

May 14, 2026 · 7 min read

We are excited to announce the pre-release of VeRL-Omni, a general reinforcement learning (RL) post-training framework focused on multimodal generative models, built on top of verl and vllm-omni.

Why VeRL-Omni?

RL has become a powerful method for aligning large generative models with human preferences and downstream task rewards. While the LLM RL stack has evolved rapidly over the past year, multimodal generative RL, which covers diffusion and omni-modality models for image/video/audio understanding and generation, still has critical unmet needs:

- Diffusion and omni-modality extension: Extending verl's flexibility and performance to multimodal and non-autoregressive RL training, covering diffusion transformer backbones (Qwen-Image), mixed AR-DiT architectures (Qwen-Omni), and unified understanding & generation models (BAGEL, HunyuanImage3.0).
- Heterogeneous rollout pipelines: Rollouts are denoising trajectories in a continuous latent space rather than token sequences, and a single rollout may invoke multiple heterogeneous model components and multi-stage pipelines (e.g., text encoder → DiT → VAE).
- Complex workload scheduling: Orchestrating complex multimodal RL training workflows, where reward functions are themselves multimodal models (VLM judges, OCR scorers, etc.) and multimodal generation rollouts have higher memory peaks than text generation.

Key Features

- Efficient multimodal rollout: We integrate vLLM-Omni for its high-throughput async serving of multimodal generation while maintaining accuracy on par with diffusers. VeRL-Omni works with vLLM-Omni to continuously optimize rollout efficiency via step-wise continuous batching, embedding caching, etc.
- Flexible reward engine: Supports both rule-based rewards and model-based rewards (e.g., VLM-as-judge for OCR). vLLM is integrated for efficient VLM and LLM reward model inference. Reward computation is overlapped with ongoing rollout and training processes to reduce end-to-end latency.
- Modular training backends: Provides several trainers (DiffusersFSDP/Megatron/VeOmni) with built-in optimizations for diffusion and omni-modal models, allowing easy integration of different parallelism strategies (FSDP/USP/TP).
- Broad hardware compatibility: Supports both NVIDIA GPUs and Ascend NPUs, allowing flexible deployment across diverse hardware backends.
- E2E training recipes and benchmarks: Recipes ship with reference performance results and achieve high training throughput thanks to the features above.

Algorithm and Model Support

Getting Started

Installation

Check out our Installation Doc for details.

Training diffusion models

Check out our examples directory for specific scripts to launch different RL algorithm trainers for image/audio/video understanding and generation tasks. You can track the training performance and results via wandb.

Demo: Qwen-Image FlowGRPO Post-training

In the flowgrpo example, we train Qwen-Image on the OCR reward task. The reward model is Qwen3-VL-8B-Instruct, which scores generated images by reading the rendered text and comparing it against the dataset ground truth.

Algorithm Review

FlowGRPO Demonstration

FlowGRPO is an online RL method for flow-matching models. It employs multi-step SDE sampling with a diffusion policy model to enable effective RL exploration, and adopts model-based rewards to assess generation quality. The training workflow mainly consists of four key stages:

- Rollout Generation: The diffusion policy model generates sample rollouts, collecting trajectories of log probabilities and generated images.
- Reward Model Scoring: The reward model scores each generated sample,…
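
To make the OCR reward in the demo more concrete, here is a minimal sketch of how text read from a generated image could be compared against the dataset ground truth. It is an illustration under stated assumptions, not VeRL-Omni's actual reward code: the function names `edit_distance` and `ocr_reward`, the lowercase/whitespace normalization, and the choice of a normalized Levenshtein score are all hypothetical, and the step where a VLM such as Qwen3-VL-8B-Instruct reads the image is assumed to have already produced `recognized_text`.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling one-row dynamic program."""
    prev = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb (or match)
            ))
        prev = curr
    return prev[-1]


def ocr_reward(recognized_text: str, ground_truth: str) -> float:
    """Hypothetical OCR reward in [0, 1]; 1.0 means an exact match.

    Assumption: both strings are lowercased and whitespace-collapsed before
    comparison. The real recipe may normalize and score differently.
    """
    rec = " ".join(recognized_text.lower().split())
    ref = " ".join(ground_truth.lower().split())
    if not ref:
        # Empty target: reward only an empty reading.
        return 1.0 if not rec else 0.0
    dist = edit_distance(rec, ref)
    # Clamp at 0 so very long wrong readings do not produce negative rewards.
    return max(0.0, 1.0 - dist / len(ref))
```

A score like this gives partial credit for nearly correct renderings, which tends to provide a denser learning signal than an exact-match check; whether the actual recipe uses partial credit is not stated in the post.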
