Covers planning, tool use, and task completion. The rows below are sample data; domain-specific boards are optional.
Agent quality is scenario-dependent (browser automation, code repositories, enterprise tools). Production data should either be split by scenario or document the primary-scenario weights.
Public ranking policy: rows are sorted by composite score in descending order. The composite score is a weighted sum of normalized sub-metrics; ties are broken in favor of the agent with higher recent activity.
| Rank | Agent | Platform / team | Primary use case | Score | Notes |
|---|---|---|---|---|---|
| 1 | Codex-Planner | Demo Lab | R&D automation | 93.1 | Multi-step commits and rollback |
| 2 | Sage-Research | Sage | Literature and retrieval | 91.7 | Traceable citations |
| 3 | Relay-Support | Relay | Support and tickets | 90.4 | Knowledge base integration |
| 4 | Harbor-Ops | Harbor | Ops and troubleshooting | 89.2 | Logs/metrics toolchain |
| 5 | Atlas-Browse | Atlas | Browser automation | 88.0 | Robust web actions |
| 6 | Mosaic-Data | Mosaic | Data analysis | 86.8 | SQL/Notebook |
| 7 | Nimbus-Meeting | Nimbus | Meetings and notes | 85.5 | Multilingual notes |
| 8 | Volt-Security | Volt | Security scanning | 84.1 | Policy compliance checks |
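
The ranking policy stated above the table can be sketched in a few lines of Python. This is only an illustration under stated assumptions. The weights, the sub-metric names (`planning`, `tool_use`, `completion`), the 0 to 100 scale, the rounding to one decimal, and the `recent_activity` field are all hypothetical, since the page does not publish the actual values or definitions. The entries in the usage note are placeholders, not rows from the table.

```python
from dataclasses import dataclass

# Hypothetical weights: the page does not publish the real ones.
# They sum to 1 so the composite stays on the same scale as the inputs.
WEIGHTS = {"planning": 0.4, "tool_use": 0.3, "completion": 0.3}


@dataclass
class Entry:
    name: str
    metrics: dict          # sub-metrics, assumed already normalized to [0, 1]
    recent_activity: int   # assumed activity measure, e.g. runs in a recent window


def composite_score(metrics, weights=WEIGHTS):
    """Weighted sum of normalized sub-metrics, scaled to 0-100 and
    rounded to one decimal (assumed display precision)."""
    raw = sum(weights[key] * metrics[key] for key in weights)
    return round(100 * raw, 1)


def rank(entries):
    """Sort by composite score (descending); break ties by higher recent activity."""
    return sorted(
        entries,
        key=lambda e: (-composite_score(e.metrics), -e.recent_activity),
    )


# Placeholder entries for illustration only.
entries = [
    Entry("agent-a", {"planning": 0.9, "tool_use": 0.9, "completion": 0.9}, 5),
    Entry("agent-b", {"planning": 0.9, "tool_use": 0.9, "completion": 0.9}, 10),
    Entry("agent-c", {"planning": 1.0, "tool_use": 1.0, "completion": 0.9}, 1),
]

ranked = rank(entries)
```

Here `agent-c` ranks first on score, and `agent-b` is placed ahead of `agent-a` because the two tie on score and `agent-b` has higher recent activity. Rounding before comparison is a design choice in this sketch: it makes ties match what a reader sees in the published one-decimal scores rather than depending on floating-point noise.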