Scalable voice agent design with Amazon Nova Sonic: multi-agent, tools, and session segmentation
Artificial Intelligence Scalable voice agent design with Amazon Nova Sonic: multi-agent, tools, and session segmentation Design patterns for scalable voice agents matter for organizations that need to deliver fast, natural, and reliable voice experiences. Many teams face challenges like high latency, managing real-time audio, and coordinating multiple agents in complex workflows. In this post, you’ll learn how to use Amazon Nova Sonic, Amazon Bedrock AgentCore, and Strands BidiAgent to build scalable, maintainable voice agents that handle these challenges efficiently, resulting in more responsive and intelligent customer interactions. We’ll explore three popular architectural patterns for voice agents, highlighting their trade-offs and best practices for minimizing latency. The building blocks Before diving deeper into the architecture patterns, here’s a quick overview of the three key components used as the sample solution in this post. Amazon Nova Sonic is a foundation model that creates natural, human-like speech-to-speech conversations for generative AI applications. Users can interact with AI through voice in real time, with capabilities for understanding tone, natural conversational flow, and performing actions. Amazon Bedrock AgentCore Runtime is a serverless hosting environment for AI agents. You package your agent as a container, deploy to AgentCore Runtime, and it handles scaling, session isolation, and billing. For voice agents, it provides bidirectional WebSocket streaming with SigV4 auth, microVM-level session isolation to avoid noisy-neighbor latency spikes, AgentCore Gateway for shared tool hosting using the Model Context Protocol (MCP) open source protocol, persistent memory across sessions, and telemetry for voice-specific metrics like time-to-first-audio. Strands Agents is an open source framework for building AI agents. Its BidiAgent class is one integration option between Nova Sonic and your application. It manages the bidirectional stream lifecycle, routes tool calls, and handles session management, simplifying the voice agent application through the model SDK interface. Three integration patterns: tool, agent-as-tool (sub-agent), and session segmentation Instead of building one all-powerful agent, modern voice systems are increasingly composed of tool-driven agents, sub-agents acting as tools and session segmentation strategies that isolate prompts, memory, and permissions. These patterns allow teams to decompose large assistants into smaller, specialized, and reusable components while maintaining clear security boundaries. Before running the samples in the following sections, install Python and the required dependencies, including strands-agents and boto3, and make sure your IAM setup has the necessary permissions for the required services. For the full example, refer to the GitHub repository. Pattern 1: AgentCore Gateway – tool selection for low latency A tool call is when a voice agent sends input to an external function or service, which processes it and returns output. It lets the agent perform tasks like querying a database or triggering a service quickly and securely, without extra reasoning steps. With AgentCore Gateway, you expose your existing business logic as tools, discrete functions that Nova Sonic can call directly during a conversation. The voice model selects which tool to invoke, passes parameters, gets a result, and speaks it back. There’s no intermediate reasoning layer between the model and the tool. AgentCore Gateway hosts MCP servers as…

