Agent leaderboard

Planning, tool use, and completion (sample data); optional domain-specific boards.

Agent quality depends on the scenario (browser automation, code repositories, enterprise tools). Production data should be split by scenario, or the weight given to each primary scenario should be documented.

Updated:

Public ranking policy: rows are sorted by composite score (desc). Composite score is a weighted sum of normalized sub-metrics; ties are broken by higher recent activity.
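The ranking policy above can be sketched roughly as follows. This is a minimal illustration, not the board's actual scoring code: the sub-metric names (`planning`, `tool_use`, `completion`), the weights, the 0-100 scale, the one-decimal rounding, and the `recent_activity` counter are all assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical weights; the page does not publish the real ones.
WEIGHTS = {"planning": 0.4, "tool_use": 0.3, "completion": 0.3}


@dataclass
class AgentRow:
    name: str
    sub_metrics: dict[str, float]  # assumed already normalized to [0, 1]
    recent_activity: int           # assumed: e.g. number of recent runs


def composite_score(row: AgentRow, weights: dict[str, float] = WEIGHTS) -> float:
    """Weighted sum of normalized sub-metrics, scaled to 0-100."""
    total_weight = sum(weights.values())
    weighted = sum(w * row.sub_metrics[name] for name, w in weights.items())
    # Round to one decimal so that displayed scores can actually tie.
    return round(100 * weighted / total_weight, 1)


def rank(rows: list[AgentRow]) -> list[AgentRow]:
    """Sort by composite score (descending), then by recent activity (descending)."""
    return sorted(rows, key=lambda r: (-composite_score(r), -r.recent_activity))


rows = [
    AgentRow("agent-a", {"planning": 0.9, "tool_use": 0.9, "completion": 0.9}, 5),
    AgentRow("agent-b", {"planning": 0.9, "tool_use": 0.9, "completion": 0.9}, 12),
    AgentRow("agent-c", {"planning": 1.0, "tool_use": 0.8, "completion": 0.7}, 40),
]
ranked_names = [r.name for r in rank(rows)]
```

Here `agent-a` and `agent-b` have the same composite score, so `agent-b` is placed first because of its higher recent activity; `agent-c` scores lower and comes last despite being the most active.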

| Rank | Agent | Platform / team | Primary use case | Score | Notes |
|---|---|---|---|---|---|
| 1 | Codex-Planner | Demo Lab | R&D automation | 93.1 | Multi-step commits and rollback |
| 2 | Sage-Research | Sage | Literature and retrieval | 91.7 | Traceable citations |
| 3 | Relay-Support | Relay | Support and tickets | 90.4 | Knowledge base integration |
| 4 | Harbor-Ops | Harbor | Ops and troubleshooting | 89.2 | Logs/metrics toolchain |
| 5 | Atlas-Browse | Atlas | Browser automation | 88.0 | Robust web actions |
| 6 | Mosaic-Data | Mosaic | Data analysis | 86.8 | SQL/Notebook |
| 7 | Nimbus-Meeting | Nimbus | Meetings and notes | 85.5 | Multilingual notes |
| 8 | Volt-Security | Volt | Security scanning | 84.1 | Policy compliance checks |