NVIDIA Developer Blog·Agents·6d ago·by Matej Kosec·~3 min read

Streaming Tokens and Tools: Multi-Turn Agentic Harness Support in NVIDIA Dynamo

An agentic exchange must preserve a structured interaction: assistant turns interleave reasoning with one or more tool calls, and subsequent user turns return the corresponding tool results to the model context. Reasoning replay is model- and turn-dependent: some reasoning should be retained, while some should be dropped. The inference engine is responsible for supporting this more expressive interaction model and for producing correctly segmented API results. Tool-call parsing and reasoning parsing need to happen before the attached harness consumes the response.

High-value agentic workflows such as coding also depend on a responsive harness experience: reasoning segments, tool-call events, and request metadata need to stream back as the turn unfolds instead of arriving only after a final text response.

This post covers lessons from running real agentic clients against NVIDIA Dynamo: how we hardened parser and API coverage, improved streaming behavior, and extracted those parser layers into standalone reusable crates. These changes build on the performance considerations outlined in our first post, which focused on the serving architecture underneath agentic inference: the frontend, router, and KV cache management. This follow-up focuses on correctness, user-experience equivalence, and performance.

Agentic harnesses are still evolving quickly. Claude Code, Codex, and OpenClaw expose many of the same pressure points through different API surfaces, so the examples below focus on the core behaviors that custom serving stacks need to reproduce.

Harness-facing Dynamo settings

Our experiments used the newly released nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 model, though the same issues apply across models, reasoning parsers, and tool-call parsers. To reproduce our results, configure the frontend with the Anthropic-compatible API and the flags that preserve prompt, reasoning, and tool state:

--enable-anthropic-api exposes the Anthropic Messages API to harnesses. Many harnesses can fall back to the default Messages API, but the experience is degraded.
--strip-anthropic-preamble removes the Anthropic billing header that can destabilize KV reuse.
--enable-streaming-tool-dispatch lets complete tool calls start executing as soon as they are decoded, rather than waiting for the end of the turn.

Putting all of this together:

    python -m dynamo.frontend \
      --http-port 8000 \
      --enable-anthropic-api \
      --strip-anthropic-preamble \
      --enable-streaming-tool-dispatch

On the worker side, the important settings in this deployment are:

--dyn-tool-call-parser <parser> and --dyn-reasoning-parser <parser> reconstruct tool calls and reasoning blocks in the model-specific format the harness expects. Those parsers also control whether reasoning from previous turns should be retained, transformed, or dropped.

Prompt stability is key for cache reuse

Claude Code sends thousands of tokens of reusable prompt scaffolding, much of which is intended to remain identical across users and sessions. However, each prompt begins with a session-specific billing header that causes cache misses when requests are routed to custom endpoints that do not strip it out:

    x-anthropic-billing-header: cc_version=0.2.93; cch=abc123def456==;
    You are Claude Code, an interactive CLI tool...

These headers poison the KV cache and prevent it from being reused, even across sessions by the same user. A varying line at position zero means every new session starts with a different token prefix, so the stable instructions and tool definitions behind it never line up cleanly for reuse. To restore…
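
The effect of a varying first line on prefix reuse can be illustrated with a small sketch. This is not Dynamo's implementation of --strip-anthropic-preamble: the helper names strip_billing_header and shared_prefix_len are hypothetical, the header is assumed to occupy exactly the first line of the system prompt (as in the example above), and whitespace splitting stands in for a real tokenizer.

```python
# Illustrative sketch only; names and header layout are assumptions.
HEADER_PREFIX = "x-anthropic-billing-header:"


def strip_billing_header(system_prompt: str) -> str:
    """Drop a leading billing-header line if present; otherwise return the prompt unchanged."""
    first_line, _sep, rest = system_prompt.partition("\n")
    if first_line.startswith(HEADER_PREFIX):
        return rest.lstrip("\n")
    return system_prompt


def shared_prefix_len(a: list, b: list) -> int:
    """Count the leading items that two token sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


# Stable scaffolding that is identical across sessions (shortened for the example).
STABLE = "You are Claude Code, an interactive CLI tool.\nUse the tools provided."

# Two sessions that differ only in the session-specific header value.
session_a = "x-anthropic-billing-header: cc_version=0.2.93; cch=abc123def456==;\n" + STABLE
session_b = "x-anthropic-billing-header: cc_version=0.2.93; cch=zzz999yyy888==;\n" + STABLE

# Whitespace splitting is a crude stand-in for tokenization.
raw_shared = shared_prefix_len(session_a.split(), session_b.split())
clean_shared = shared_prefix_len(
    strip_billing_header(session_a).split(),
    strip_billing_header(session_b).split(),
)

print("shared prefix with header:", raw_shared)
print("shared prefix after stripping:", clean_shared)
```

With the header in place, the two prompts diverge within the first few tokens, so a prefix-based cache has nothing substantial to reuse. Once the header is removed, the whole stable scaffold matches from position zero.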



