Hugging Face Blog·Research·6d ago·~3 min read

The Open Agent Leaderboard

How good are general purpose AI agents? We built an open evaluation framework to find out.

Most evaluations in AI report a simple result: what score each model got on which benchmarking task. When you deploy an agent, you're not just choosing a model. You're choosing a full system: what tools the agent can use, how it plans its steps, what it remembers between actions, how it recovers when something goes wrong. Change any of those and the same model can produce very different results at very different costs. How well an AI agent works depends on how it's built, not just the model inside it.

Today we're launching the Open Agent Leaderboard, an open benchmark for comparing full agent systems, not just the models inside them. It reports both quality and cost, so you can see not just what works, but what's worth deploying. The leaderboard is paired with the Exgentic framework for running and reproducing evaluations, and a paper describing the full methodology and results. Everything is open from day one.

Can we measure generality?

AI agents are getting really useful when carefully tailored to a specific job, like coding in a familiar repository or handling customer service with a known set of tools. But the harder question is whether the same agent can handle many different jobs, each with its own tools, rules, and constraints, without being manually customized for each one. A more general agent is one you can drop into a new setting and have it just work. That's what we mean by generality, and it's best understood as a spectrum, not a binary label.

Of course, generality that only works in theory isn't useful. What matters is whether an agent stays capable as the range of jobs and settings grows, and whether it does so at a reasonable cost. A system that handles everything but costs a fortune to run isn't general in any way that matters.

This leaderboard measures exactly that: how general your agent actually is. It evaluates agents across diverse, unfamiliar settings, each with different tools, rules, and constraints, and reports both quality and cost. So you can see not just how well a system performs, but whether it's worth actually deploying.

It doesn't cover every capability a general agent will eventually need. But it's a much stronger test of how well agents work across different situations than anything previously available. And by treating the full agent system, not just the model, as the thing being measured, it makes visible what's actually driving the results.

What we built

We assembled six benchmarks, each testing a different kind of realistic task. Together they aim to capture a broad range of working settings: coding, customer service, technical support, personal assistance, and research.

- SWE-Bench Verified -- fixing real bugs in real code repositories
- BrowseComp+ -- researching complex questions across the web
- AppWorld -- completing personal tasks across hundreds of apps and actions
- tau2-Bench Airline & Retail -- customer service following company policies
- tau2-Bench Telecom…
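
To make the "quality and cost" framing concrete, here is a minimal sketch of one way to read such a two-axis report: keep only the agent systems that no other system beats on both quality and cost (a Pareto frontier). This is an illustration, not the leaderboard's or Exgentic's actual method. The `AgentResult` type, the agent names, the metric definitions, and all numbers are hypothetical and made up for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentResult:
    """One full agent system (model + tools + planning + memory). Hypothetical type."""
    name: str        # hypothetical agent-system label
    quality: float   # assumed metric: mean success rate across benchmarks, 0..1
    cost_usd: float  # assumed metric: mean cost per task in USD


def dominates(a: AgentResult, b: AgentResult) -> bool:
    """True if a is at least as good as b on both axes and strictly better on one."""
    no_worse = a.quality >= b.quality and a.cost_usd <= b.cost_usd
    strictly_better = a.quality > b.quality or a.cost_usd < b.cost_usd
    return no_worse and strictly_better


def pareto_frontier(results: list[AgentResult]) -> list[AgentResult]:
    """Return the systems not dominated by any other, sorted by ascending cost."""
    frontier = [r for r in results if not any(dominates(o, r) for o in results)]
    return sorted(frontier, key=lambda r: r.cost_usd)


# Made-up numbers for illustration only; these are NOT leaderboard results.
results = [
    AgentResult("agent-a", quality=0.62, cost_usd=1.80),
    AgentResult("agent-b", quality=0.60, cost_usd=0.40),
    AgentResult("agent-c", quality=0.45, cost_usd=0.90),
    AgentResult("agent-d", quality=0.30, cost_usd=0.10),
]

for r in pareto_frontier(results):
    print(f"{r.name}: quality={r.quality:.2f}, cost=${r.cost_usd:.2f}")
```

In this toy data, `agent-c` drops out because `agent-b` is both better and cheaper, while `agent-a` stays on the frontier despite its cost because nothing else matches its quality. Whether that last bit of quality is worth the extra cost is the deployment decision a two-axis report is meant to surface.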

#agents #benchmark
