EvalKit score: 66.9
Public snapshot workspace
LLM Leaderboard
Compare LLMs on reasoning, coding, speed, context, and price. Every row is citation-backed — sourced from public benchmarks and traceable to its origin.
How EvalKit scores work
The EvalKit Score is a composite of reasoning (40%), coding (35%), and agent capability (15%), blended across public benchmark data. Rows without a direct public metric show an estimate — marked with ~ — derived from provider or model-family medians. Source citations accompany every row.
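The median-based estimate described above can be sketched as follows. This is a minimal illustration, not EvalKit's actual implementation: the row fields (`family`, `provider`), the function names, and the choice to try the model family before the provider are all assumptions, and the sample rows are made up.

```python
from statistics import median

def fill_estimate(rows, target, metric):
    """Return (value, is_estimate) for `metric` on the `target` row.

    Assumption: try the model-family median first, then the provider median.
    The page only says estimates come from "provider or model-family medians".
    """
    if target.get(metric) is not None:
        return target[metric], False  # a direct public metric exists
    for key in ("family", "provider"):  # hypothetical fallback order
        group = target.get(key)
        if group is None:
            continue
        peers = [
            r[metric]
            for r in rows
            if r is not target and r.get(key) == group and r.get(metric) is not None
        ]
        if peers:
            return median(peers), True
    return None, False  # nothing to estimate from

def display(value, is_estimate):
    """Format a cell, prefixing estimates with ~ as the leaderboard does."""
    if value is None:
        return "-"
    return f"~{value:.1f}" if is_estimate else f"{value:.1f}"

# Hypothetical rows, for illustration only.
rows = [
    {"model": "A-1", "family": "A", "provider": "X", "coding": 40.0},
    {"model": "A-2", "family": "A", "provider": "X", "coding": 50.0},
    {"model": "A-3", "family": "A", "provider": "X", "coding": None},
]
value, is_estimate = fill_estimate(rows, rows[2], "coding")
print(display(value, is_estimate))
```

Returning the `is_estimate` flag alongside the value keeps the labeling decision in one place, so an estimated cell cannot be rendered without its ~ marker.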
Top models
By composite EvalKit score
EvalKit score: 65.3
EvalKit score: 64.1
Current leaders
Updated weekly

| Reasoning | Coding | Agent | Code arena | Context | Speed | Pricing $/M | License | ||||
|---|---|---|---|---|---|---|---|---|---|---|---|
| #1 | GPT-5.2 Codex (OpenAI) | 48.4 | 52.1 | 38.9 | 26 | 1,232 | 400K | 204/s | $1.75 |
GPT-5.2 Codex
OpenAI · LLM
Scoring methodology
EvalKit Score = 40% reasoning + 35% coding + 15% agent, normalized to a 0–100 scale. Public benchmark data from 8 independent sources is preferred; estimates fill missing cells and are always labeled.
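A minimal Python sketch of this weighted blend is shown below. The function name `evalkit_score` and the input values are hypothetical. Note that the stated weights sum to 90%, and the page does not say how the result is scaled to 0–100; dividing by the weight total is an assumption made here, so the output should not be expected to match published scores.

```python
# Weights as stated in the methodology; they sum to 0.90, not 1.00.
WEIGHTS = {"reasoning": 0.40, "coding": 0.35, "agent": 0.15}

def evalkit_score(components: dict[str, float]) -> float:
    """Blend 0-100 component scores into a single 0-100 composite.

    Assumption: the weighted sum is divided by the total weight so that
    perfect component scores map to 100. The real normalization step is
    not documented on the page.
    """
    missing = WEIGHTS.keys() - components.keys()
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    weighted = sum(WEIGHTS[name] * components[name] for name in WEIGHTS)
    return round(weighted / sum(WEIGHTS.values()), 1)

# Hypothetical component scores, for illustration only.
example = {"reasoning": 50.0, "coding": 60.0, "agent": 40.0}
print(evalkit_score(example))  # prints 52.2
```

Here the weighted sum is 20 + 21 + 6 = 47, and 47 / 0.9 rounds to 52.2. Without the renormalization the same inputs would give 47.0, which shows how much the undocumented scaling step matters when comparing against the leaderboard.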
Source transparency
Every row links back to its original benchmark source. Replicated rows keep the exact source URL. No score is marked as EvalKit-verified unless run evidence exists. Read citation policy →
Data freshness
The leaderboard refreshes weekly. The snapshot date for each row is shown in the detail panel. Use the source filter to view data from a specific benchmark provider.