Methodology

This page defines the production scoring standards; published scores must remain consistent with this documentation.

Public ranking algorithms

All site lists are rankable and auditable. The formulas below are publicly documented in both the UI and the source code.

Leaderboards (Model / Agent / LLM / Toolchain)

Sort order: composite score descending.
Composite score: normalize each sub-metric with min-max, then compute a weighted sum. Ties are broken by higher recent activity (e.g., commits in the last 30 days).
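
The leaderboard rule above can be sketched in a few lines of Python. This is a minimal illustration, not the site's actual implementation: the metric names (`accuracy`, `speed`), the `commits_30d` field, the weights, and the choice to map a constant column to 0 are all assumptions made for the example.

```python
from typing import Dict, List


def min_max(values: List[float]) -> List[float]:
    """Scale values to [0, 1]; a constant column maps to 0.0 (assumption)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def rank(entries: List[dict], weights: Dict[str, float]) -> List[dict]:
    """Return entries sorted by composite score, ties broken by recent commits."""
    names = list(weights)
    # Normalize each sub-metric across all entries on the board.
    columns = {m: min_max([e["metrics"][m] for e in entries]) for m in names}
    scored = []
    for i, e in enumerate(entries):
        score = sum(weights[m] * columns[m][i] for m in names)
        scored.append(
            {"name": e["name"], "score": score, "commits_30d": e["commits_30d"]}
        )
    # Composite score descending, then higher recent activity first.
    scored.sort(key=lambda r: (-r["score"], -r["commits_30d"]))
    return scored


# Hypothetical entries: "a" and "b" tie on score, "b" has more recent commits.
entries = [
    {"name": "a", "metrics": {"accuracy": 0.9, "speed": 10}, "commits_30d": 5},
    {"name": "b", "metrics": {"accuracy": 0.9, "speed": 10}, "commits_30d": 20},
    {"name": "c", "metrics": {"accuracy": 0.5, "speed": 5}, "commits_30d": 100},
]
ranking = rank(entries, {"accuracy": 0.7, "speed": 0.3})
```

Sorting on the pair (negative score, negative recent commits) keeps the tie-break inside a single sort key, so the published order is reproducible from the data alone.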

Trend groups (GitHub)

Within-group formula: Score = 100 × [0.30·Stars + 0.15·Forks + 0.30·Commits30d + 0.15·Contributors + 0.05·(1-Issues) + 0.05·(1-PRs)].
All terms are min-max normalized within the same group. Issues and PRs are inverse signals (lower is better), so each enters the score as 1 minus its normalized value.
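
As a worked illustration of the within-group formula, the sketch below computes the score for a small group of repositories. The weights come from the formula above; the field names (`stars`, `forks`, `commits_30d`, `contributors`, `issues`, `prs`) and the handling of a constant column are assumptions, and the repository figures are invented for the example.

```python
from typing import Dict, List

# Weights taken from the within-group formula; they sum to 1.
WEIGHTS = {
    "stars": 0.30,
    "forks": 0.15,
    "commits_30d": 0.30,
    "contributors": 0.15,
    "issues": 0.05,
    "prs": 0.05,
}
# Inverse signals: a lower raw value is better, so the term is (1 - x).
INVERSE = {"issues", "prs"}


def trend_scores(repos: List[dict]) -> Dict[str, float]:
    """Score every repo in one group on a 0-100 scale."""
    norm = {}
    for key in WEIGHTS:
        vals = [r[key] for r in repos]
        lo, hi = min(vals), max(vals)
        span = hi - lo
        # Assumption: if all repos share a value, the normalized term is 0.
        norm[key] = [(v - lo) / span if span else 0.0 for v in vals]
    scores = {}
    for i, r in enumerate(repos):
        total = 0.0
        for key, w in WEIGHTS.items():
            x = norm[key][i]
            total += w * ((1.0 - x) if key in INVERSE else x)
        scores[r["name"]] = 100.0 * total
    return scores


# Invented sample group of two repositories.
group = [
    {"name": "top", "stars": 100, "forks": 50, "commits_30d": 40,
     "contributors": 10, "issues": 1, "prs": 0},
    {"name": "bottom", "stars": 10, "forks": 5, "commits_30d": 2,
     "contributors": 1, "issues": 30, "prs": 12},
]
scores = trend_scores(group)
```

Because normalization is done within the group, a score is only comparable to other scores from the same group and the same snapshot.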

Layers & dimensions

Four boards map to Models, Agents, LLMs, and Toolchains; columns may extend independently (vendor, domain, size, coverage, etc.).

Composite scores use configurable weights and normalization; multi-benchmark setups must declare benchmark versions, weights, and missing-value handling.
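
One way to make such a declaration explicit is a small, validated configuration object. The sketch below is hypothetical: the class names, the benchmark names (`bench_a`, `bench_b`), the version strings, and the two missing-value policies (`renormalize` and `zero`) are assumptions chosen for illustration, and it assumes sub-scores are already normalized to [0, 1].

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class BenchmarkSpec:
    version: str   # pinned benchmark version
    weight: float  # share of the composite score


@dataclass(frozen=True)
class CompositeConfig:
    benchmarks: Dict[str, BenchmarkSpec]
    missing: str = "renormalize"  # hypothetical policies: "renormalize" or "zero"

    def validate(self) -> None:
        total = sum(b.weight for b in self.benchmarks.values())
        if abs(total - 1.0) > 1e-9:
            raise ValueError(f"weights must sum to 1, got {total}")
        if self.missing not in ("renormalize", "zero"):
            raise ValueError(f"unknown missing-value policy: {self.missing}")


def composite(config: CompositeConfig,
              scores: Dict[str, Optional[float]]) -> Optional[float]:
    """Weighted sum of normalized scores; None marks a missing benchmark."""
    config.validate()
    present = {}
    for name in config.benchmarks:
        value = scores.get(name)
        if value is not None:
            present[name] = value
    if config.missing == "zero":
        # Missing benchmarks contribute nothing to the sum.
        return sum(config.benchmarks[n].weight * v for n, v in present.items())
    # "renormalize": rescale by the weight that is actually covered.
    covered = sum(config.benchmarks[n].weight for n in present)
    if covered == 0:
        return None
    return sum(config.benchmarks[n].weight * v for n, v in present.items()) / covered


specs = {
    "bench_a": BenchmarkSpec(version="v1", weight=0.5),
    "bench_b": BenchmarkSpec(version="v2", weight=0.5),
}
partial = {"bench_a": 0.8, "bench_b": None}
renormalized = composite(CompositeConfig(specs, missing="renormalize"), partial)
zero_filled = composite(CompositeConfig(specs, missing="zero"), partial)
```

The two policies give different results for the same partial data (about 0.8 versus 0.4 here), which is why the chosen policy needs to be declared alongside the weights.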

Updates & release

Static builds: commit JSON or fetched artifacts, then run the static site generator (SSG).

Scheduled jobs: GitHub Actions may run evaluators or aggregators, write artifacts, and trigger builds; read-only edge storage must be weighed against static-first goals.

Verifiability

The Sources page should list primary sources, fetch times, and versions; caching or sampling must be disclosed. Readers may cross-check Methodology against Sources.

Bias & mitigations

Common issues include benchmark leakage, overfitting to public eval sets, vendor-reported scores vs. independent reproduction, and composite scores masking weak tasks. Mitigations: pin task versions, publish seeds and scripts, disclose per-task scores, and review third-party leaderboard changelogs.

GitHub momentum is gameable; cross-check stars against commits, issues/PRs, and release cadence.