Agent leaderboard

Covers planning, tool use, and task completion. Rows are sample data; domain-specific boards are optional.

Agent quality depends on the scenario (browser automation, code repositories, enterprise tools). Production data should either be split by scenario or document the weight given to each primary scenario.

Updated: 2026-04-03

Public ranking policy: rows are sorted by composite score in descending order. The composite score is a weighted sum of normalized sub-metrics; ties are broken in favor of the agent with more recent activity.
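
The ranking policy above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the board's actual scoring code. The sub-metric names follow the planning / tool use / completion split mentioned at the top of the page. The weights, the 0-100 scale, the one-decimal rounding used to detect ties, and the `recent_activity` field are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical weights: the page does not publish the real ones.
WEIGHTS: Dict[str, float] = {"planning": 0.4, "tool_use": 0.3, "completion": 0.3}


@dataclass
class AgentResult:
    name: str
    metrics: Dict[str, float]  # each sub-metric assumed already normalized to [0, 1]
    recent_activity: int       # hypothetical activity count, used only for tie-breaks


def composite_score(metrics: Dict[str, float],
                    weights: Dict[str, float] = WEIGHTS) -> float:
    """Weighted sum of normalized sub-metrics, scaled to 0-100 and rounded to one decimal."""
    total_weight = sum(weights.values())
    weighted = sum(weights[key] * metrics[key] for key in weights)
    return round(100 * weighted / total_weight, 1)


def rank(results: List[AgentResult]) -> List[AgentResult]:
    """Sort by composite score (descending); break ties by higher recent activity."""
    return sorted(
        results,
        key=lambda r: (-composite_score(r.metrics), -r.recent_activity),
    )


# Made-up agents, not rows from the table.
agent_a = AgentResult("agent-a", {"planning": 0.9, "tool_use": 0.8, "completion": 0.7}, 5)
agent_b = AgentResult("agent-b", {"planning": 0.81, "tool_use": 0.81, "completion": 0.81}, 12)
agent_c = AgentResult("agent-c", {"planning": 0.95, "tool_use": 0.9, "completion": 0.9}, 1)

ranked_names = [r.name for r in rank([agent_a, agent_b, agent_c])]
```

Here `agent-a` and `agent-b` share the same composite score, so the more active `agent-b` is placed first. Dividing by the total weight keeps the score on the same scale even if the documented weights do not sum to 1.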

Rank	Agent	Platform / team	Primary use case	Score	Notes
1	Codex-Planner	Demo Lab	R&D automation	93.1	Multi-step commits and rollback
2	Sage-Research	Sage	Literature and retrieval	91.7	Traceable citations
3	Relay-Support	Relay	Support and tickets	90.4	Knowledge base integration
4	Harbor-Ops	Harbor	Ops and troubleshooting	89.2	Logs/metrics toolchain
5	Atlas-Browse	Atlas	Browser automation	88.0	Robust web actions
6	Mosaic-Data	Mosaic	Data analysis	86.8	SQL/Notebook
7	Nimbus-Meeting	Nimbus	Meetings and notes	85.5	Multilingual notes
8	Volt-Security	Volt	Security scanning	84.1	Policy compliance checks