$ timeahead_
AWS Machine Learning Blog·Tutorial·1d ago·by Dan Ferguson·~3 min read

NVIDIA Nemotron 3 Nano Omni model now available on Amazon SageMaker JumpStart

Today, we are excited to announce the day-zero availability of NVIDIA Nemotron 3 Nano Omni on Amazon SageMaker JumpStart. This multimodal model from NVIDIA combines video, audio, image, and text understanding in a single, efficient architecture, enabling enterprise customers to build intelligent applications that can see, hear, and reason across modalities in one inference pass. In this post, we walk through the model architecture and key capabilities of Nemotron 3 Nano Omni, explore the enterprise use cases it unlocks, and show you how to deploy and run inference using Amazon SageMaker JumpStart.

Overview of NVIDIA Nemotron 3 Nano Omni

NVIDIA Nemotron 3 Nano Omni is an open multimodal large language model with 30 billion total parameters and 3 billion active parameters (30B A3B). It is built on a Mamba2-Transformer hybrid Mixture of Experts (MoE) architecture, combining three core components:

- Nemotron 3 Nano LLM as the language backbone
- CRADIO v4-H as the vision encoder for image and video understanding
- Parakeet as the speech encoder for audio transcription and comprehension

This unified architecture accepts video, audio, images, and text as input and generates text as output. It supports a 131K-token context length, chain-of-thought reasoning, tool calling, JSON output, and word-level timestamps for transcription tasks. The model is available in FP8 precision on SageMaker JumpStart, delivering an optimal balance of accuracy and efficiency for enterprise workloads. It is licensed for commercial use under the NVIDIA Open Model Agreement.

Enterprise agent workflows are inherently multimodal. Agents must interpret screens, documents, audio, video, and text, often within the same reasoning loop. Today, most agentic systems stitch together separate models for vision, speech, and language.
This approach increases latency through repeated inference passes, complicates orchestration and error handling, fragments context across modalities, and amplifies cost and failure modes over time. Nemotron 3 Nano Omni solves this by functioning as the multimodal perception and context sub-agent in a system of agents. It gives the agent system eyes and ears: reading screens, interpreting documents, transcribing speech, and analyzing video, all while maintaining a converged multimodal context across reasoning loops. Because Nano Omni understands screens, documents, audio, and video in a single reasoning loop, it replaces fragmented model stacks and significantly simplifies agent workflow design. For anyone building agentic architectures, this collapses inference hops, orchestration logic, and cross-model synchronization overhead into a single model call. The model accepts video, audio, image, and text input types.

Enterprise use cases

The multimodal capabilities of Nemotron 3 Nano Omni make it a powerful, flexible model choice for enterprise use cases.

Computer use agents

Nemotron 3 Nano Omni powers the perception loop for agents navigating graphical user interfaces. It reads screens, understands UI state over time, and validates outcomes, while execution agents handle the actions. This collapses vision and reasoning into a single loop, eliminating the need for split perception pipelines. Practical applications include incident management dashboards, agentic search, browser automation, and email workflow agents. Document…
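The article's deployment walkthrough is truncated in this excerpt, but the shape of a SageMaker JumpStart deployment plus a multimodal request can be sketched. Everything below is an assumption, not code from the article: the model ID is a placeholder, and the OpenAI-style content-part schema (`text`, `image_url`, `audio_url`) is the convention vLLM-backed endpoints commonly accept, not a confirmed contract for this endpoint.

```python
# Hedged sketch of (1) deploying via the SageMaker Python SDK and
# (2) assembling a multimodal chat payload. Model ID and payload
# schema are illustrative assumptions, not values from the article.

def build_multimodal_payload(text, image_url=None, audio_url=None):
    """Assemble an OpenAI-style chat request mixing text, image, and audio parts."""
    content = [{"type": "text", "text": text}]
    if image_url:
        content.append({"type": "image_url", "image_url": {"url": image_url}})
    if audio_url:
        content.append({"type": "audio_url", "audio_url": {"url": audio_url}})
    return {
        "messages": [{"role": "user", "content": content}],
        "max_tokens": 512,
    }


def deploy_nemotron(instance_type="ml.g5.12xlarge"):
    """Deploy the model from JumpStart (requires AWS credentials and quota)."""
    # Imported here so the sketch stays inert unless actually deployed.
    from sagemaker.jumpstart.model import JumpStartModel

    # Placeholder model ID -- look up the real ID in the JumpStart catalog.
    model = JumpStartModel(model_id="nvidia-nemotron-3-nano-omni")
    return model.deploy(instance_type=instance_type)
```

A perception sub-agent would call `build_multimodal_payload` once per reasoning step, attaching a screenshot or audio clip alongside the instruction, and send the result to the deployed endpoint, one inference pass instead of separate vision, speech, and language hops.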

#inference #gpu
read full article on AWS Machine Learning Blog
// related
Wired AI · 1d
The Bloomberg Terminal Is Getting an AI Makeover, Like It or Not
For its famous intractability, the Bloomberg Terminal has long inspired devotion, bordering on obses…
Wired AI · 1d
The Race Is on to Keep AI Agents From Running Wild With Your Credit Cards
Between malware, online impersonation, and account takeovers, there are enough digital security prob…
Wired AI · 1d
‘It’s Undignified’: Hundreds of Workers Training Meta’s AI Could Be Laid Off
Hundreds of workers in Ireland tasked with refining Meta’s AI models have been told that their jobs …
Wired AI · 1d
Elon Musk Testifies That He Started OpenAI to Prevent a ‘Terminator Outcome’
Elon Musk and Sam Altman appeared in a federal courtroom together for the first time on Tuesday as t…
Wired AI · 1d
OpenAI Really Wants Codex to Shut Up About Goblins
OpenAI has a goblin problem. Instructions designed to guide the behavior of the company’s latest mod…
vLLM Blog · 1d
Run Highly Efficient Multimodal Agentic AI with NVIDIA Nemotron 3 Nano Omni Using vLLM
Apr 28, 2026 · 7 min read · We are excited to support the newly released NVIDIA Nemotron 3 Nano Omni model on vLLM.