How good are general purpose AI agents?
We test them across diverse domains.
No hand-tuning. No shortcuts. Fairly ranked.
Head-to-head performance across diverse benchmarks.
Which agents deliver the most for the least?
Three findings that challenge conventional assumptions about AI agents.
Software engineering, customer service, web research, and everyday digital tasks, all handled without manual per-domain customization or training.
Model choice drives 28% of task variance, but agent choice still swings results by up to 11 percentage points.
On half the tested benchmarks, general agents match or beat the top published scores from domain-specific systems, with none of the per-domain engineering.
Common questions about Exgentic's evaluation methodology and results.
Many agent and model whitepapers rely on heavy prompt optimization to maximize performance on specific benchmarks. Exgentic intentionally avoids prompt optimization to provide a more neutral, comparable evaluation across agents.
In addition, we report results on 100 sampled tasks per benchmark. For some benchmarks this is a subset of the full dataset, but we believe this is not the primary factor driving differences in results.
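One simple way to make such a fixed-size subset reproducible is to sample with a fixed seed. The sketch below is illustrative only; the function name `sample_tasks` and the exact sampling procedure are our assumptions, not Exgentic's documented method:

```python
import random

def sample_tasks(task_ids, n=100, seed=0):
    """Draw a reproducible subset of benchmark tasks.

    If the benchmark has n tasks or fewer, keep them all.
    """
    if len(task_ids) <= n:
        return list(task_ids)
    rng = random.Random(seed)  # fixed seed -> identical subset on every run
    return sorted(rng.sample(list(task_ids), n))

# Example: pick 100 of 500 synthetic task IDs.
subset = sample_tasks([f"task-{i}" for i in range(500)], n=100, seed=42)
print(len(subset))  # 100
```

Because the seed is pinned, two independent runs of the pipeline score exactly the same tasks, which keeps subset results comparable across agents.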
No. We do not modify the benchmarks or the agent implementations. In some cases we needed to slightly adapt the interface to fit the unified protocol, for example by externalizing prompts embedded inside a benchmark (e.g., TAU-Bench) or by adding task instructions where the benchmark specification requires them (e.g., SWE-Bench Verified). These adjustments do not change the task itself.
We selected agents that represent commonly used general-agent architectures and that can operate across multiple benchmarks with minimal task-specific customization. Our goal is to evaluate general-purpose agents, not agents tailored to a single benchmark. We will add more agents, and we welcome contributions from the community.
Since Exgentic focuses on evaluating generality, we chose benchmarks that cover a diverse range of domains and task types, including coding, tool use, reasoning, and interactive environments.
We started with a small set of widely used models to establish the initial leaderboard. We plan to expand the leaderboard soon with many open-weight models as well as additional closed models.
Yes. Exgentic is fully open. The evaluation pipeline, agent implementations, and configuration are available in the repository so results can be reproduced. Minor variations may occur due to model version changes or nondeterminism in LLM outputs.
Everything is open. Jump in.
Clone the repo. Evaluate your agent against the field. Results in hours, not weeks.
Plug your evaluation environment into the Unified Protocol. Instant cross-agent testing.
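A unified protocol like this typically reduces to a small interface that every agent and every benchmark task implement. The sketch below is hypothetical: the names `Agent`, `Task`, and `run_task` are ours for illustration and are not Exgentic's actual API:

```python
from dataclasses import dataclass
from typing import Protocol

@dataclass
class Task:
    task_id: str
    instruction: str

class Agent(Protocol):
    """Anything that maps a task instruction to an answer string."""
    def act(self, task: Task) -> str: ...

class EchoAgent:
    """Trivial agent, used here only to exercise the interface."""
    def act(self, task: Task) -> str:
        return f"answered:{task.instruction}"

def run_task(agent: Agent, task: Task, expected: str) -> bool:
    """Run one task and score it pass/fail against a reference answer."""
    return agent.act(task) == expected

task = Task("demo-1", "say hi")
print(run_task(EchoAgent(), task, "answered:say hi"))  # True
```

Once an agent satisfies the interface, every benchmark that speaks the same protocol can evaluate it without per-benchmark glue code, which is what makes cross-agent testing immediate.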
Research updates, benchmarks, and findings from the Exgentic team.
Why pass/fail scores hide what matters most about agent systems.
How good are general purpose AI agents? We built an open evaluation framework to find out.