LLM leaderboard

Large language model capabilities. The scores below are sample data; replace them with figures aligned to public benchmarks.

Composite scores may decompose into reasoning, coding, multilingual, and safety dimensions; the methodology must cite benchmark versions and the tool-use policy.

Updated:

Public ranking policy: rows are sorted by composite score, descending. The composite score is a weighted sum of normalized sub-metrics; ties are broken in favor of higher recent activity.
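The ranking policy above can be sketched as follows. This is a minimal illustration, not the leaderboard's actual code: the dimension weights, the min-max normalization, and the `recent_activity` field are all assumptions made for the example.

```python
# Hypothetical sketch of the ranking policy: composite score is a weighted
# sum of normalized sub-metrics; ties break on higher recent activity.
# The weights below are assumed for illustration only.
WEIGHTS = {"reasoning": 0.4, "coding": 0.3, "multilingual": 0.2, "safety": 0.1}

def normalize(value, lo, hi):
    # Min-max normalization to [0, 1]; assumed here, the real method may differ.
    return (value - lo) / (hi - lo) if hi > lo else 0.0

def composite(scores, bounds):
    # bounds maps each dimension to its (min, max) over all evaluated models.
    return sum(w * normalize(scores[d], *bounds[d]) for d, w in WEIGHTS.items())

def rank(models, bounds):
    # Sort by composite score descending; equal composites fall back to
    # higher recent activity, per the stated tie-break rule.
    return sorted(
        models,
        key=lambda m: (composite(m["scores"], bounds), m["recent_activity"]),
        reverse=True,
    )
```

A usage sketch: two models with identical sub-metric scores would be ordered by `recent_activity`, while any difference in the weighted composite dominates the tie-break.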

| Rank | Model | Vendor | Size | Score | Notes |
|------|-------|--------|------|-------|-------|
| 1 | Nova-Large-2 | Nova AI | ~400B MoE | 95.0 | Reasoning mode |
| 2 | Summit-Pro | Summit | ~200B | 93.4 | Strong instruction following |
| 3 | DeepLine-R1 | DeepLine | ~70B | 91.9 | Open weights |
| 4 | Cedar-32B | Cedar | 32B | 89.7 | Balanced Chinese/English |
| 5 | Birch-Mini | Birch | 8B | 87.3 | On-device deployment |
| 6 | Fjord-1.5 | Fjord Labs | 14B | 86.1 | Tool calling |
| 7 | Ridge-Code | Ridge | 33B | 85.0 | Code-focused |
| 8 | Willow-Base | Willow | 3B | 82.4 | Very low latency |