Methodology

How to read today's sample leaderboards and how they map to production data; keep this narrative aligned when you ship real scores.

Public ranking algorithms

Every list on the site is rankable and auditable. The formulas below are publicly documented in both the UI and the source code.

  • Leaderboards (Model / Agent / LLM / Toolchain)

    Sort order: composite score descending.

    Composite score: normalize each sub-metric with min-max, then compute a weighted sum. Ties are broken by higher recent activity (e.g., commits in the last 30 days).

  • Trend groups (GitHub)

    Within-group formula: Score = 100 × [0.30·Stars + 0.15·Forks + 0.30·Commits30d + 0.15·Contributors + 0.05·(1-Issues) + 0.05·(1-PRs)].

    All terms are min-max normalized within the same group; Issues and PRs are inverse signals (lower raw counts score higher).
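
The within-group formula above can be sketched in Python. The field names (`stars`, `commits30d`, etc.) are illustrative assumptions, not the site's actual schema; the weights and the inverse handling of issues/PRs follow the formula as stated.

```python
def min_max(values):
    """Min-max normalize a list to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def trend_scores(repos):
    """Score = 100 * (0.30*Stars + 0.15*Forks + 0.30*Commits30d
    + 0.15*Contributors + 0.05*(1-Issues) + 0.05*(1-PRs)),
    with every term min-max normalized within the group."""
    keys = ["stars", "forks", "commits30d", "contributors", "issues", "prs"]
    norm = {k: min_max([r[k] for r in repos]) for k in keys}
    weights = [0.30, 0.15, 0.30, 0.15, 0.05, 0.05]
    scores = []
    for i in range(len(repos)):
        terms = [
            norm["stars"][i],
            norm["forks"][i],
            norm["commits30d"][i],
            norm["contributors"][i],
            1 - norm["issues"][i],   # inverse signal: fewer open issues is better
            1 - norm["prs"][i],      # inverse signal: fewer open PRs is better
        ]
        scores.append(round(100 * sum(w * t for w, t in zip(weights, terms)), 2))
    return scores
```

Because normalization is within-group, scores are only comparable inside one trend group, not across groups.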

Layers & dimensions

Four boards map to Models, Agents, LLMs, and Toolchains; per-board columns can extend independently (vendor, domain, size, coverage, etc.).

Composite scores come from configurable weights and normalization; multi-benchmark setups should declare benchmark versions, weights, and missing-value handling.
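
A minimal sketch of such a declaration, assuming a "renormalize" policy for missing values (the benchmark names, versions, and weights here are placeholders, not the site's real configuration):

```python
config = {
    "benchmarks": {
        # version pins make the composite reproducible across releases
        "mmlu":      {"version": "v1.0",    "weight": 0.5},
        "humaneval": {"version": "2024-06", "weight": 0.3},
        "gsm8k":     {"version": "v2",      "weight": 0.2},
    },
    # Missing-value handling: redistribute an absent benchmark's weight
    # over the benchmarks that do have scores.
    "missing_value_policy": "renormalize",
}

def composite(scores, config):
    """Weighted sum over normalized scores in [0, 1]; benchmarks missing
    from `scores` are dropped and the remaining weights renormalized."""
    present = {k: b for k, b in config["benchmarks"].items() if k in scores}
    total_w = sum(b["weight"] for b in present.values())
    if total_w == 0:
        return None
    return sum(scores[k] * b["weight"] for k, b in present.items()) / total_w
```

Renormalizing is one reasonable policy; imputing zero or excluding the row entirely are alternatives, which is exactly why the policy should be declared up front.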

Updates & release

Static builds: commit JSON data or fetched artifacts to the repository, then run the static site generator (SSG).

Scheduled jobs: GitHub Actions can run evaluators or aggregators, write artifacts, and trigger builds; read-only D1/KV at the edge is possible but should be weighed against static-first goals.
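
The aggregator step in such a pipeline might look like the sketch below. The output path, row shape (`name`, `score`), and artifact layout are assumptions for illustration; the point is that a scheduled job writes a plain JSON file the SSG consumes at build time.

```python
import datetime
import json
import pathlib

def write_artifact(rows, out_dir="public/data"):
    """Aggregate evaluator output into a JSON artifact for the static build.

    Rows are sorted by score descending and stamped with a UTC timestamp
    so the build (and readers) can verify data freshness.
    """
    path = pathlib.Path(out_dir)
    path.mkdir(parents=True, exist_ok=True)
    artifact = {
        "generated_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "rows": sorted(rows, key=lambda r: r["score"], reverse=True),
    }
    out = path / "leaderboard.json"
    out.write_text(json.dumps(artifact, indent=2))
    return out
```

Committing (or caching) this artifact keeps the site static-first: the edge serves prebuilt pages, and D1/KV reads stay an optional add-on rather than a dependency.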

Verifiability

List primary sources, fetch times, and versions on the Sources page; disclose caching or sampling. Readers can cross-check Methodology ↔ Sources.
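
One possible shape for a Sources-page entry, shown as a Python record; every field value here is a hypothetical example, not real site data:

```python
# Illustrative Sources entry: primary source, fetch time, pinned version,
# and disclosed caching/sampling, so readers can cross-check Methodology.
source_entry = {
    "name": "GitHub REST API (repo metrics)",
    "url": "https://api.github.com/repos/{owner}/{repo}",
    "fetched_at": "2024-01-15T06:00:00Z",  # UTC fetch time
    "version": "2022-11-28",               # pinned API version
    "caching": "6h edge cache",            # disclosed caching policy
    "sampling": None,                      # full fetch, no sampling
}
```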

Bias & mitigations

Common issues include benchmark leakage, overfitting to public eval sets, vendor-reported scores vs. independent reproduction, and composite scores masking weak tasks. Mitigations: pin task versions, publish seeds and scripts, disclose per-task scores, and review third-party leaderboard changelogs.

GitHub momentum is gameable; cross-check star counts against commits, issues/PRs, and release cadence.