SocialReasoning-Bench: Measuring whether AI agents act in users’ best interests
At a glance

- AI agents are moving into social contexts. When agents manage calendars, negotiate purchases, or interact with other agents on a user’s behalf, they need more than task competence—they need social reasoning.
- SocialReasoning-Bench evaluates that ability. The benchmark tests whether an agent can negotiate for a user in two realistic settings: Calendar Coordination and Marketplace Negotiation.
- The benchmark measures both outcomes and process: it scores agents on outcome optimality (how much value they secure for the user) and due diligence (whether they follow a competent decision-making process).
- Current frontier models often leave value on the table. They usually complete the task, but they frequently accept suboptimal meeting times or poor deals instead of advocating effectively for the user.
- Prompting helps, but it is not enough. Even with explicit guidance to act in the user’s best interest, performance remains well below what a trustworthy delegate should achieve.

As AI agents take on more real-world tasks, they are increasingly operating in social contexts. With the right integrations, agents like Claude Cowork and Google Gemini can manage email and calendar workflows. In these settings, the agent must interact with others on your behalf. This requires social reasoning: understanding what you want, what the counterparty wants, and what information to reveal, protect, or push back on.

Our previous research suggests that today’s frontier models lack social reasoning. In our simulated multi-agent marketplace, agents accepted the first proposal they received up to 93% of the time without exploring alternatives. When red-teaming a social network of agents, a single malicious message spread through the system and led agents to disclose private data before passing the message along.

This kind of relationship has a long history outside AI.
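To make the outcome side of the scoring concrete, here is a minimal sketch of an outcome-optimality metric. The function name and the normalization are our own illustrative assumptions, not the benchmark's exact formula:

```python
def outcome_optimality(achieved: float, worst: float, best: float) -> float:
    """Normalize the value the agent secured for the user to [0, 1].

    `achieved` is the user value of the deal the agent accepted;
    `worst` and `best` bound what was attainable in the scenario.
    (Illustrative normalization, not the benchmark's exact formula.)
    """
    if best == worst:
        return 1.0  # degenerate scenario: every outcome is equally good
    return (achieved - worst) / (best - worst)

# An agent that settles for a middling deal scores 0.5,
# even though it technically "completed" the task.
print(outcome_optimality(achieved=0.5, worst=0.0, best=1.0))  # 0.5
```

Due diligence, by contrast, is a process score: it asks whether the agent explored alternatives and pushed back where appropriate, independent of the final outcome.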
In economics and law it is called a principal-agent relationship: an agent acts on a principal’s behalf in interactions with others whose interests differ. Attorneys, real-estate agents, and financial advisors all operate in this mode, and the duties they owe—care, loyalty, confidentiality—are codified in centuries of professional norms. AI agents acting on a user’s behalf should ultimately be held to similar standards.

To measure and drive progress in social reasoning, we built SocialReasoning-Bench: a benchmark for testing whether agents can reason and negotiate on a user’s behalf against a counterparty with independent goals, private information, and potentially adversarial intent.

Introducing SocialReasoning-Bench

SocialReasoning-Bench evaluates social reasoning in two domains: Calendar Coordination and Marketplace Negotiation. In each, an agent advocates for its user against a counterparty and is scored on both the outcome it reached and the process it followed. We find that frontier models complete most tasks but consistently leave value on the table for the user.

Calendar coordination

In calendar coordination, an assistant agent manages a user’s calendar on a single day and fields a meeting request from another agent. We assume the agent has access to a value function over time slots that captures the user’s scheduling preferences, with values between 0.0 and 1.0. This function could be provided explicitly by the…
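A minimal sketch of such a value function, using hypothetical slot names and preference scores (the benchmark's actual representation may differ):

```python
# Hypothetical user preferences over a day's time slots (0.0 = worst, 1.0 = best).
user_value = {
    "09:00": 0.2,  # user dislikes early meetings
    "11:00": 0.9,  # ideal mid-morning slot
    "14:00": 0.6,
    "16:00": 0.4,
}

def best_slot(proposed: list[str]) -> str:
    """Pick the counterparty-acceptable slot that maximizes user value.

    An agent that simply accepts the first proposal may leave value on
    the table; a good delegate compares proposals against preferences.
    """
    return max(proposed, key=lambda slot: user_value.get(slot, 0.0))

# The counterparty proposes 09:00 first, but 14:00 is also on the table:
print(best_slot(["09:00", "14:00"]))  # prints 14:00
```

Accepting the first proposed slot here would secure only 0.2 of the user's possible value; choosing among all acceptable slots recovers 0.6.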

