Announcing OpenAI-compatible API support for Amazon SageMaker AI endpoints
Artificial Intelligence Announcing OpenAI-compatible API support for Amazon SageMaker AI endpoints Today, Amazon SageMaker AI introduces OpenAI-compatible API support for real-time inference endpoints. If you use the OpenAI SDK, LangChain, or Strands Agents, you can now invoke models on SageMaker AI by changing only your endpoint URL. You don’t need a custom client, a SigV4 wrapper, or code rewrites. Overview With this launch, SageMaker AI endpoints expose an /openai/v1 path that accepts Chat Completions requests and returns responses as is from the container, including streaming. OpenAI endpoints are turned on for all endpoints and inference components using standard SageMaker AI APIs and SDK. SageMaker AI routes based on the endpoint name in the URL, so any OpenAI-compatible client works out of the box. You can now create time-limited bearer tokens for your endpoints and use them with your OpenAI clients. For a working example that includes deployment and invocation, see the accompanying notebook on GitHub. “We run AI coding agents that use multiple LLM providers through an LLM gateway (Bifrost) speaking the OpenAI chat completions protocol. The bearer token feature lets us add SageMaker as a drop-in OpenAI-compatible inference endpoint — no custom SigV4 signing — so it works natively with our gateway, Vercel AI SDK, and standard OpenAI clients.” says Giorgio Piatti (AI/ML Engineer – Caffeine.AI) Use cases Agentic workflows on owned infrastructure If you build multi-step AI agents with frameworks like Strands Agents or LangChain, you can now run those workflows entirely on your own SageMaker AI endpoints. Your agents call models using the same OpenAI-compatible interface they were built on, but inference runs on dedicated GPU instances in your own account. Multi-model hosting with a single interface If you run multiple models—for example, Llama for general tasks, a fine-tuned Mistral for domain-specific work, and a smaller model for classification—you can host all of them on a single SageMaker AI endpoint using inference components. Each model gets its own resource allocation, and every one is callable through the same OpenAI SDK. You don’t need separate API clients or routing logic in application code. Serving fine-tuned models without code changes If you fine-tune open source models for your specific use case, you can deploy them on SageMaker AI and call them through the same OpenAI-compatible interface that your applications already use. The only change is the endpoint URL. The rest of the application—the SDK calls, the streaming logic, the prompt formatting—stays the same. Solution overview In this post, we walk through the following: - How bearer token authentication works with SageMaker AI endpoints. - Deploying and invoking a single-model endpoint. - Deploying and invoking inference components for multi-model deployments. - Integration with the Strands Agents framework. Prerequisites To follow along with this walkthrough, you must have the following: - An AWS account with permissions to create SageMaker AI endpoints. - The SageMaker Python SDK ( pip install sagemaker ). - The OpenAI Python SDK ( pip install openai ). - A model stored in Amazon Simple Storage Service (Amazon…

