How to Build Vision AI Pipelines Using NVIDIA DeepStream Coding Agents
Developing real-time vision AI applications presents a significant challenge for developers, often demanding intricate data pipelines, countless lines of code, and lengthy development cycles. NVIDIA DeepStream 9 removes these development barriers with coding agents, such as Claude Code or Cursor, that help you create deployable, optimized code and bring your vision AI applications to life faster. This new approach simplifies building complex multi-camera pipelines that ingest, process, and analyze massive volumes of real-time video, audio, and sensor data. Built on GStreamer and part of the NVIDIA Metropolis vision AI development platform, DeepStream accelerates a developer's journey from concept to actionable insight across industries.

Video 1. How to use the NVIDIA DeepStream coding agents to generate complete vision AI pipelines from natural language prompts with Claude Code.

Using NVIDIA Cosmos Reason 2 to build a video analytics app

With NVIDIA Cosmos Reason 2, the most accurate open reasoning VLM for physical AI, you can build a video analytics app that concurrently ingests hundreds of camera streams and analyzes them with a vision language model (VLM). The application scales dynamically, with no wasted redeployment time to add cameras or swap models and no guessing at bottlenecks. The coding agent understands your hardware and generates an application optimized for it. With just a few lines, a prompt can generate a complete production-grade microservice with REST APIs, health monitoring, deployment automation, and Kafka integration, all in one development session.

How to generate a VLM-powered vision AI application:

Step 1: Install the DeepStream coding agent skill for Claude Code or Cursor. You can generate code anywhere, but deployment requires the minimum hardware listed on GitHub.
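As a rough sketch of Step 1, the commands below assume the Claude Code convention of installing agent skills under `~/.claude/skills/`; the repository URL is a placeholder, since the real one is listed on the DeepStream GitHub page, and the skill directory name is illustrative:

```shell
# Hypothetical install sketch: replace SKILL_REPO with the repository URL
# from the DeepStream GitHub page before running.
SKILL_REPO="https://github.com/<org>/<deepstream-coding-agent-skill>"  # placeholder
SKILL_DIR="${HOME}/.claude/skills/deepstream-coding-agent"

# Ensure the Claude Code skills directory exists.
mkdir -p "${SKILL_DIR%/*}"

# Clone only once the placeholder has been replaced with the real URL.
case "$SKILL_REPO" in
  *"<"*) echo "Replace SKILL_REPO with the URL from the DeepStream GitHub page" ;;
  *)     git clone "$SKILL_REPO" "$SKILL_DIR" ;;
esac
```

For Cursor, the skill files would go wherever your Cursor configuration expects agent rules or skills; consult the repository's README for the exact layout.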
Step 2: Paste the prompt below into your agent to generate a scalable VLM pipeline with dynamic N-stream ingestion and per-stream batching.

Implement a Python application that uses a multi-modal VLM to summarize video frames and sends summaries to a remote server via Kafka.

Architecture:
1. DeepStream Pipeline: Use DeepStream pyservicemaker APIs to receive N RTSP streams, decode video, and convert frames to RGB format. Process each stream independently; do not mux streams together.
2. Frame Sampling & Batching: Use MediaExtractor to sample frames at a configurable interval (e.g. 1 frame every 10 seconds). When the VLM supports multi-frame input, batch sampled frames over a configurable duration (e.g. 1 minute) before sending to the model. Each batch must contain frames from a single stream only.
3. VLM Backend: Implement a module that receives a batch of decoded video frames and returns a text summary from the multi-modal VLM.
4. Kafka Output: Send each text summary to a remote server using Kafka.

Constraints:
- Scalable to hundreds of RTSP streams across multiple GPUs on a single node. Distribute processing load across all available GPUs.
- Never mix frames from different RTSP streams in a single batch.
- Store output in the rtvi_app…
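To make the prompt's sampling and batching constraints concrete, here is a minimal sketch of the per-stream logic the agent would generate. Everything in it is hypothetical illustration rather than DeepStream API: `StreamBatcher`, `summarize_and_publish`, and the injected `vlm_fn`/`publish_fn` callables are made-up names, and a real application would wire `publish_fn` to a Kafka producer's `send()`. The key invariant it demonstrates is the one the prompt insists on: a batch never mixes frames from different streams.

```python
import time
from collections import defaultdict


class StreamBatcher:
    """Samples frames per stream at a fixed interval and emits a batch once
    a configurable batch duration has elapsed.

    Hypothetical sketch; not part of the DeepStream pyservicemaker API.
    """

    def __init__(self, sample_interval_s=10.0, batch_duration_s=60.0):
        self.sample_interval_s = sample_interval_s
        self.batch_duration_s = batch_duration_s
        self._batches = defaultdict(list)  # stream_id -> sampled frames
        self._batch_start = {}             # stream_id -> batch start time
        self._last_sample = {}             # stream_id -> last sample time

    def add_frame(self, stream_id, frame, now=None):
        """Sample the frame if the per-stream interval has elapsed.

        Returns a completed (stream_id, frames) batch, or None. Frames
        from different streams are kept in separate batches by keying
        all state on stream_id.
        """
        now = time.monotonic() if now is None else now
        last = self._last_sample.get(stream_id)
        if last is not None and now - last < self.sample_interval_s:
            return None  # inside the sample interval: drop this frame
        self._last_sample[stream_id] = now
        if stream_id not in self._batch_start:
            self._batch_start[stream_id] = now
        self._batches[stream_id].append(frame)
        if now - self._batch_start[stream_id] >= self.batch_duration_s:
            frames = self._batches.pop(stream_id)
            del self._batch_start[stream_id]
            return stream_id, frames  # single-stream batch, as required
        return None


def summarize_and_publish(batch, vlm_fn, publish_fn):
    """Run the injected VLM on a single-stream batch and publish the
    text summary (e.g. publish_fn could wrap KafkaProducer.send)."""
    stream_id, frames = batch
    summary = vlm_fn(frames)
    publish_fn(stream_id, summary)
```

In the generated application, `add_frame` would be driven by per-stream decode callbacks, `vlm_fn` by the Cosmos Reason 2 backend, and `publish_fn` by a `kafka-python` `KafkaProducer`; injecting the two callables keeps the batching logic testable without a GPU or a broker.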

