How good are general purpose AI agents?
We test them across diverse domains.
No hand-tuning. No shortcuts. Fairly ranked.
Head-to-head performance across diverse benchmarks.
Which agents deliver the most for the least?
Three findings that challenge conventional assumptions about AI agents.
Software engineering, customer service, web research, and everyday digital tasks, all handled without manual per-domain customization or training.
Model choice drives 28% of task variance, but agent choice still swings results by up to 11 percentage points.
On half the tested benchmarks, general agents match or beat the top published scores from domain-specific systems, with none of the per-domain engineering.
Common questions about Exgentic's evaluation methodology and results.
Many agent and model whitepapers rely on heavy prompt optimization to maximize performance on specific benchmarks. Exgentic intentionally avoids prompt optimization to provide a more neutral, comparable evaluation across agents.
In addition, we report results on 100 sampled tasks per benchmark. For some benchmarks this is a subset of the full dataset, but we believe this is not the primary factor driving differences in results.
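One simple way to make such a fixed-size subset reproducible is to sample with a fixed seed. The sketch below is illustrative only; the function name `sample_tasks` and the exact sampling procedure are our assumptions, not Exgentic's documented method:

```python
import random

def sample_tasks(task_ids, n=100, seed=0):
    """Draw a reproducible subset of benchmark tasks.

    If the benchmark has n tasks or fewer, keep them all.
    """
    if len(task_ids) <= n:
        return list(task_ids)
    rng = random.Random(seed)  # fixed seed -> identical subset on every run
    return sorted(rng.sample(list(task_ids), n))

# Example: pick 100 of 500 synthetic task IDs.
subset = sample_tasks([f"task-{i}" for i in range(500)], n=100, seed=42)
print(len(subset))  # 100
```

Because the seed is pinned, two independent runs of the pipeline score exactly the same tasks, which keeps subset results comparable across agents.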
No. We do not modify the benchmarks or the agent implementations. In some cases we needed to slightly adapt the interface to fit the unified protocol, for example by externalizing prompts embedded inside a benchmark (e.g., TAU-Bench) or by adding task instructions where the benchmark specification requires them (e.g., SWE-Bench Verified). These adjustments do not change the task itself.
We selected agents that represent commonly used general-agent architectures and that can operate across multiple benchmarks with minimal task-specific customization. Our goal is to evaluate general-purpose agents, not agents tailored to a single benchmark. We will add more agents, and we welcome contributions from the community.
Since Exgentic focuses on evaluating generality, we chose benchmarks that cover a diverse range of domains and task types, including coding, tool use, reasoning, and interactive environments.
We started with a small set of widely used models to establish the initial leaderboard. We plan to expand the leaderboard soon with many open-weight models as well as additional closed models.
Yes. Exgentic is fully open. The evaluation pipeline, agent implementations, and configuration are available in the repository so results can be reproduced. Minor variations may occur due to model version changes or nondeterminism in LLM outputs.
Everything is open. Jump in.
Clone the repo. Evaluate your agent against the field. Results in hours, not weeks.
Plug your evaluation environment into the Unified Protocol. Instant cross-agent testing.
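A unified protocol like this typically reduces to a small interface that every agent and every benchmark task implement. The sketch below is hypothetical: the names `Agent`, `Task`, and `run_task` are ours for illustration and are not Exgentic's actual API:

```python
from dataclasses import dataclass
from typing import Protocol

@dataclass
class Task:
    task_id: str
    instruction: str

class Agent(Protocol):
    """Anything that maps a task instruction to an answer string."""
    def act(self, task: Task) -> str: ...

class EchoAgent:
    """Trivial agent, used here only to exercise the interface."""
    def act(self, task: Task) -> str:
        return f"answered:{task.instruction}"

def run_task(agent: Agent, task: Task, expected: str) -> bool:
    """Run one task and score it pass/fail against a reference answer."""
    return agent.act(task) == expected

task = Task("demo-1", "say hi")
print(run_task(EchoAgent(), task, "answered:say hi"))  # True
```

Once an agent satisfies the interface, every benchmark that speaks the same protocol can evaluate it without per-benchmark glue code, which is what makes cross-agent testing immediate.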
Research updates, benchmarks, and findings from the Exgentic team.
Why pass/fail scores hide what matters most about agent systems.
How good are general purpose AI agents? We built an open evaluation framework to find out.