NVIDIA Nemotron 3 Nano Omni Powers Multimodal Agent Reasoning in a Single Efficient Open Model
Agentic systems often reason across screens, documents, audio, video, and text within a single perception-to-action loop. However, they still rely on fragmented model chains: separate stacks for vision, audio, and text. This increases inference hops and orchestration complexity, driving up inference costs while weakening cross-modal context consistency.

NVIDIA Nemotron 3 Nano Omni, a new addition to the Nemotron 3 family, brings unified multimodal reasoning into a single, highly efficient open model. Built to replace fragmented vision-language-audio stacks, Nemotron 3 Nano Omni functions as the multimodal perception and context sub-agent within agentic systems. With it, agents can perceive and reason across visual, audio, and textual inputs within a single shared perception-to-action loop, improving convergence while reducing orchestration complexity and inference cost.

The model delivers best-in-class accuracy on document intelligence leaderboards such as MMLongBench-Doc and OCRBenchV2, while also leading video and audio understanding benchmarks, including WorldSense, DailyOmni, and VoiceBench. Beyond accuracy, MediaPerf, an open industry benchmark that evaluates video understanding models on real media data and production tasks across quality, cost, and throughput, shows Nemotron 3 Nano Omni achieving the highest throughput across every task and the lowest inference cost for video-level tagging.

Built on a 30B-A3B hybrid mixture-of-experts (MoE) architecture, Nemotron 3 Nano Omni activates only the experts required for each task and modality, delivering high throughput and strong multimodal performance at scale. With fully open weights, datasets, and recipes, developers can customize, deploy, and integrate multimodal sub-agents across local, cloud, and enterprise environments. Read this post to learn more.

Video 1.
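The 30B-A3B naming indicates roughly 30B total parameters with about 3B active per token. A minimal sketch of the top-k expert routing that makes this sparsity possible is shown below; the expert count, dimensions, and top-k value here are illustrative assumptions, not the model's actual configuration:

```python
import numpy as np

def top_k_moe(x, gate_w, experts, k=2):
    """Route a token through only k of the available experts.

    x: (d,) token hidden state
    gate_w: (d, n_experts) router weights
    experts: list of n_experts callables, each mapping (d,) -> (d,)
    """
    logits = x @ gate_w                # router score for every expert
    top = np.argsort(logits)[-k:]     # indices of the k highest-scoring experts
    weights = np.exp(logits[top])
    weights /= weights.sum()          # softmax over the selected experts only
    # Only the chosen experts execute, so compute scales with k, not n_experts.
    return sum(w * experts[i](x) for w, i in zip(weights, top))

rng = np.random.default_rng(0)
d, n_experts = 16, 8
gate_w = rng.standard_normal((d, n_experts))
# Each "expert" is a tiny linear map standing in for a feed-forward block.
expert_mats = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(n_experts)]
experts = [lambda x, m=m: m @ x for m in expert_mats]

x = rng.standard_normal(d)
y = top_k_moe(x, gate_w, experts, k=2)
print(y.shape)  # (16,)
```

Because only 2 of the 8 experts run per token here, the per-token FLOPs track the active parameter count rather than the total, which is the property the 30B-A3B design exploits.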
NVIDIA Nemotron 3 Nano Omni unifies video, audio, image, and text in an open MoE architecture

Best-in-class efficiency and accuracy

Nemotron 3 Nano Omni supports hardware-aware, optimized inference across multiple GPU architectures, including the NVIDIA Ampere, NVIDIA Hopper, and NVIDIA Blackwell GPU families, and popular inference engines, including vLLM and NVIDIA TensorRT-LLM. It supports FP8 and NVFP4 quantization, efficient video sampling, and NVIDIA-optimized kernels to deliver predictable, low-latency inference. Combined with 3D convolution-based temporal-spatial processing, these optimizations enable sustained multimodal perception at lower compute cost across GPUs, from workstations to data center and cloud deployments.

Designed to power sub-agents, Nemotron 3 Nano Omni handles perception, context maintenance, and multimodal understanding within larger agent systems. It integrates cleanly with execution and planning models, such as NVIDIA Nemotron 3 Super and NVIDIA Nemotron 3 Ultra, keeping agent architectures modular, efficient, and scalable.

The following benchmarks evaluate performance under a fixed interactivity threshold: the point at which each user still experiences responsive, real-time interaction. Rather than maximizing raw concurrency, the evaluations hold per-user throughput (tokens per second per user) constant on the x-axis and measure how much total system throughput can be sustained without degrading the user experience.

For video reasoning at the same interactivity threshold, Nemotron 3 Nano Omni sustains higher aggregate throughput, translating into up to ~9.2× greater effective system capacity compared to alternative open omni models. For multi-document reasoning at the same interactivity threshold, it sustains up to ~7.4× greater effective system capacity. On…
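One way to read these capacity multipliers: at a fixed interactivity threshold, effective system capacity is simply aggregate throughput divided by the per-user rate. The sketch below illustrates the arithmetic with hypothetical numbers chosen only to reproduce the ~9.2× ratio, not the benchmark's actual measurements:

```python
def effective_capacity(aggregate_tps: float, per_user_tps: float) -> float:
    """Concurrent users sustainable without dropping below the
    fixed interactivity threshold (tokens/s/user)."""
    return aggregate_tps / per_user_tps

# Hypothetical values for illustration only.
per_user_tps = 20.0                                           # fixed threshold
omni_capacity = effective_capacity(3680.0, per_user_tps)      # 184 users
baseline_capacity = effective_capacity(400.0, per_user_tps)   # 20 users
print(omni_capacity / baseline_capacity)                      # 9.2
```

Holding tokens/s/user constant is what makes the comparison fair: both systems deliver the same responsiveness per user, and the ratio measures how many more such users one system can serve.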

