Inside VAKRA: Reasoning, Tool Use, and Failure Modes of Agents
We recently introduced VAKRA, a tool-grounded, executable benchmark for evaluating how well AI agents reason and act in enterprise-like environments. Unlike traditional benchmarks that test isolated skills, VAKRA measures compositional reasoning across APIs and documents, using full execution traces to assess whether agents can reliably complete multi-step workflows.

VAKRA provides an executable environment where agents interact with over 8,000 locally hosted APIs backed by real databases spanning 62 domains, along with domain-aligned document collections. Tasks can require 3–7-step reasoning chains that combine structured API interaction with unstructured retrieval under natural-language tool-use constraints. As can be seen below, models perform poorly on VAKRA. In this blog, we include additional dataset details about the tasks in VAKRA and present an analysis of the failure modes we observed on different tasks.

Task Description

As shown below, the VAKRA benchmark comprises four tasks, each testing a different set of capabilities.

Fig 1: Representative examples of each capability in the VAKRA benchmark

Capability 1: API Chaining using Business Intelligence APIs

This capability includes 2,077 test instances across 54 domains, requiring the use of tools from the SLOT-BIRD and SEL-BIRD collections (Elder et al., 2026). Compared to the setup in Elder et al., the tool universe in SLOT-BIRD and SEL-BIRD is expanded to cover a larger number of domains. Each domain is restricted to one tool collection, and tasks involve chaining 1–12 tool calls to arrive at the final answer.
{
  "query": "Which football team has a build-up play speed of 31, build-up play dribbling of 53, and build-up play passing of 32?",
  "tool_calls": [
    { "name": "get_data", "arguments": {"tool_universe_id": "486ea46224d1-aeb8037c5e78"}, "label": "retrieved_data_1" },
    { "name": "select_data_equal_to", "arguments": {"data_label": "retrieved_data_1", "key_name": "play_speed", "value": 31}, "label": "FILTERED_DF_0" },
    { "name": "select_data_equal_to", "arguments": {"data_label": "FILTERED_DF_0", "key_name": "play_dribble", "value": 53}, "label": "FILTERED_DF_1" },
    { "name": "select_data_equal_to", "arguments": {"data_label": "FILTERED_DF_1", "key_name": "play_passing", "value": 32}, "label": "FILTERED_DF_2" },
    { "name": "get_team_name", "arguments": {"data_label": "FILTERED_DF_2", "n": 1} }
  ],
  "answer": "FC Barcelona"
}

Fig 2: Data sample from the SEL-BIRD collection

As shown above, each instance has an associated JSON data source from which the answer must be derived. The MCP servers supporting this task include a special tool, get_data(tool_universe_id=id), which must be called at the beginning of each instance. This tool initializes the data source, returns a lightweight preview of the data (see Figure 3 below), and stores the full dataset server-side, preventing the inefficient transfer of large data over the MCP protocol. The call also configures the MCP server to expose the appropriate tool set based on the tool_universe_id and aligns the data source with the domain-specific database for the instance. The SLOT-BIRD collection provides a global set of 7 tools for generic data manipulation (e.g., filtering, sorting), inspired by systems like Tableau and Google Analytics.
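To make the server-side data-handling pattern concrete, here is a minimal sketch of how a tool chain like the one in Fig 2 could execute: get_data stores the full dataset in server memory and returns only a preview, while later tools reference stored results by label. The tool names follow the sample above, but the storage scheme, preview format, and dataset contents are illustrative assumptions, not the actual VAKRA implementation.

```python
# Sketch (assumed, not the real VAKRA server): datasets keyed by
# tool_universe_id, with a hypothetical "demo-universe" entry.
DATASETS = {
    "demo-universe": [
        {"team": "FC Barcelona", "play_speed": 31, "play_passing": 32},
        {"team": "Real Madrid", "play_speed": 45, "play_passing": 60},
    ]
}

STORE = {}  # label -> full rows, kept server-side between tool calls


def get_data(tool_universe_id, preview_rows=1):
    """Initialize the data source; only a small preview crosses the wire."""
    rows = DATASETS[tool_universe_id]
    STORE["retrieved_data_1"] = rows
    return {"label": "retrieved_data_1",
            "preview": rows[:preview_rows],
            "num_rows": len(rows)}


def select_data_equal_to(data_label, key_name, value, out_label):
    """Filter a stored dataset by equality and store the result under a new label."""
    filtered = [r for r in STORE[data_label] if r.get(key_name) == value]
    STORE[out_label] = filtered
    return {"label": out_label, "num_rows": len(filtered)}


# A two-step chain in the style of Fig 2.
preview = get_data("demo-universe")
result = select_data_equal_to("retrieved_data_1", "play_speed", 31,
                              "FILTERED_DF_0")
```

Because every intermediate result stays in STORE and is addressed by label, the agent only ever exchanges small JSON payloads with the server, mirroring the design goal described above.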
The SEL-BIRD collection extends this by introducing more specialized tools: some are shared with SLOT-BIRD, while others are derived by flattening categorical arguments into separate functions (e.g., sort_data with argument ascending: bool = False becomes sort_data_ascending and sort_data_descending ). Additionally,…
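The argument-flattening idea can be sketched as follows: a generic tool with a categorical or boolean argument is split into one specialized tool per value, so the agent selects behavior by choosing a tool name rather than by filling in an argument. The function names follow the sort_data example above; the implementation is an illustrative assumption.

```python
# Generic SLOT-BIRD-style tool: one function, behavior set by an argument.
def sort_data(data, key_name, ascending=False):
    """Sort a list of row dicts by key_name; descending by default."""
    return sorted(data, key=lambda row: row[key_name], reverse=not ascending)


# SEL-BIRD-style flattened variants: the boolean is fixed per function,
# shrinking each tool's signature at the cost of a larger tool set.
def sort_data_ascending(data, key_name):
    return sort_data(data, key_name, ascending=True)


def sort_data_descending(data, key_name):
    return sort_data(data, key_name, ascending=False)


rows = [{"team": "B", "rank": 2}, {"team": "A", "rank": 1}]
```

Flattening trades a small, argument-heavy tool universe for a larger set of simpler tools, which is exactly the axis along which SEL-BIRD extends SLOT-BIRD.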

