LLM Leaderboard
Compare LLMs on reasoning, coding, speed, context, and price. Every row is citation-backed — sourced from public benchmarks and traceable to its origin.
How EvalKit scores work
The EvalKit Score is a composite of reasoning (40%), coding (35%), and agent capability (15%), blended across public benchmark data. Cells without a direct public metric show an estimate, marked with ~, derived from provider or model-family medians. Source citations accompany every row.
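The median-based estimates described above could be sketched roughly as follows. This is only an illustration, not EvalKit's actual pipeline: the row layout, the `family` grouping key, the `estimated` flag, and the sample rows are all hypothetical, and the page does not say when the provider median is used instead of the model-family median.

```python
from statistics import median


def fill_estimates(rows, metric, group_key="family"):
    """Return copies of `rows` where a missing `metric` is replaced by the
    median of observed values in the same group and flagged as estimated.

    Assumption: rows are dicts and a missing metric is stored as None.
    """
    # Collect the directly observed values per group.
    observed = {}
    for row in rows:
        if row.get(metric) is not None:
            observed.setdefault(row[group_key], []).append(row[metric])

    filled = []
    for row in rows:
        new_row = dict(row)
        if new_row.get(metric) is None:
            values = observed.get(new_row[group_key])
            if values:  # leave the gap if the group has nothing to draw on
                new_row[metric] = median(values)
                new_row["estimated"] = set(row.get("estimated", ())) | {metric}
        filled.append(new_row)
    return filled


def display(row, metric):
    """Render a cell, prefixing estimates with ~ as the leaderboard does."""
    value = row.get(metric)
    if value is None:
        return "n/a"
    prefix = "~" if metric in row.get("estimated", ()) else ""
    return f"{prefix}{value:g}"


# Made-up rows for illustration only; these are not leaderboard data.
rows = [
    {"model": "model-a", "family": "fam-x", "coding": 40.0},
    {"model": "model-b", "family": "fam-x", "coding": 50.0},
    {"model": "model-c", "family": "fam-x", "coding": None},
    {"model": "model-d", "family": "fam-y", "coding": None},
]

filled = fill_estimates(rows, "coding")
print(display(filled[2], "coding"))  # ~45
print(display(filled[3], "coding"))  # n/a
```

Keeping the estimate flag next to the value, rather than baking the ~ into the number, lets the same row be sorted numerically and still be labeled when rendered.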
Top models
By composite EvalKit score. Current leaders, updated weekly.

| Rank | Model | EvalKit score | Reasoning | Coding | Agent | Code arena | Context | Speed | Pricing $/M | License |
|---|---|---|---|---|---|---|---|---|---|---|
| #1 | GPT-5.5 · OpenAI | 64.1 | 62.3 | 51 | 75.3 | 1,948 | 1.1M | 175/s | $5 | |
| #2 | GPT-5.5 Pro · OpenAI | 50.4 | 61.6 | ~26.1 | ~44.5 | ~1,613 | ~400K | ~230/s | ~$1.25 | |
| #3 | GPT-5.5 Instant · OpenAI | 48.8 | 42.5 | 43.4 | ~43.7 | 1,361 | 400K | 227/s | $5 | |
| #4 | GPT-5.5 (xhigh) · OpenAI | 39.7 | ~38.6 | ~26.1 | ~34.1 | ~1,613 | ~400K | ~230/s | ~$1.25 | |
| #5 | GPT-5.5 (high) · OpenAI | 39.7 | ~38.6 | ~26.1 | ~34.1 | ~1,613 | ~400K | ~230/s | ~$1.25 | |
| #6 | GPT-5.5 Thinking xHigh Effort · OpenAI | 39.7 | ~38.6 | ~26.1 | ~34.1 | ~1,613 | ~400K | ~230/s | ~$1.25 | |
| #7 | gpt-5.5-high · OpenAI | 39.7 | ~38.6 | ~26.1 | ~34.1 | 1,469 | ~400K | ~230/s | ~$1.25 | |
| #8 | gpt-5.5 · OpenAI | 39.7 | ~38.6 | ~26.1 | ~34.1 | 1,463 | ~400K | ~230/s | ~$1.25 | |
| #9 | gpt-5.5-instant · OpenAI | 39.7 | ~38.6 | ~26.1 | ~34.1 | 1,418 | ~400K | ~230/s | ~$1.25 | |
| #10 | GPT-5.5 · OpenAI | 39.7 | ~38.6 | ~26.1 | ~34.1 | ~1,613 | ~400K | ~230/s | ~$1.25 | |
| #11 | GPT-5.5 Pro · OpenAI | 39.7 | ~38.6 | ~26.1 | ~34.1 | ~1,613 | ~400K | ~230/s | ~$1.25 | |

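
Cells in the table above mix plain numbers with ~-prefixed estimates, thousands separators, K/M suffixes, and unit markers such as `/s` and `$`. A minimal sketch of turning such a cell into a number plus an "estimated" flag might look like the following. The `Cell` type and `parse_cell` helper are hypothetical names, and reading K and M as 1,000 and 1,000,000 is an assumption the page does not confirm.

```python
from dataclasses import dataclass

# Assumption: K and M are decimal multipliers (not 1024-based).
_SUFFIX = {"K": 1_000.0, "M": 1_000_000.0}


@dataclass
class Cell:
    value: float
    estimated: bool  # True when the cell carried a leading ~
    unit: str        # "$", "/s", or "" when no unit marker was present


def parse_cell(text):
    """Parse one leaderboard cell such as '~1,613', '1.1M', '~230/s', '~$1.25'."""
    raw = text.strip()
    estimated = raw.startswith("~")
    body = raw.lstrip("~").strip()

    unit = ""
    if body.startswith("$"):
        unit = "$"
        body = body[1:]
    if body.endswith("/s"):
        unit = "/s"
        body = body[:-2]

    multiplier = 1.0
    if body and body[-1] in _SUFFIX:
        multiplier = _SUFFIX[body[-1]]
        body = body[:-1]

    # Drop thousands separators before converting.
    value = float(body.replace(",", "")) * multiplier
    return Cell(value=value, estimated=estimated, unit=unit)


for sample in ["~1,613", "400K", "~230/s", "~$1.25", "51"]:
    print(sample, "->", parse_cell(sample))
```

Empty cells (for example the License column) are not handled here; a caller would need to skip them before parsing.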
GPT-5.5 · OpenAI · LLM
GPT-5.5 Pro · OpenAI · LLM
GPT-5.5 Instant · OpenAI · LLM
GPT-5.5 (xhigh) · OpenAI · Coding
GPT-5.5 (high) · OpenAI · Coding
GPT-5.5 Thinking xHigh Effort · OpenAI · Coding
gpt-5.5-high · OpenAI · Coding
gpt-5.5 · OpenAI · Coding
gpt-5.5-instant · OpenAI · Coding
GPT-5.5 · OpenAI · RAG
GPT-5.5 Pro · OpenAI · RAG
Scoring methodology
EvalKit Score = 40% reasoning + 35% coding + 15% agent, normalized to a 0–100 scale. Public benchmark data from 8 independent sources is preferred; estimates fill missing cells and are always labeled.
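As a rough illustration of the weighted blend, the sketch below combines three component scores using the stated weights. The stated weights add up to 90%, and the page does not spell out how the result is normalized to 0–100, so dividing by the weight total is an assumption made here. The function name and sample inputs are hypothetical, and the output is not guaranteed to reproduce any published EvalKit score.

```python
# Weights as stated in the methodology text.
WEIGHTS = {"reasoning": 0.40, "coding": 0.35, "agent": 0.15}


def composite_score(scores, weights=WEIGHTS):
    """Weighted average of component scores, each assumed to be on 0-100.

    Assumption: "normalized to a 0-100 scale" is implemented by dividing the
    weighted sum by the total weight (0.90 with the stated weights).
    """
    missing = [name for name in weights if name not in scores]
    if missing:
        raise ValueError(f"missing component scores: {missing}")

    total_weight = sum(weights.values())
    weighted_sum = sum(weights[name] * scores[name] for name in weights)
    return weighted_sum / total_weight


# Made-up component scores for illustration only.
example = {"reasoning": 60.0, "coding": 50.0, "agent": 40.0}
print(round(composite_score(example), 1))  # 52.8
```

With this normalization a model scoring 100 on every component gets a composite of 100, which is the property the 0–100 scale needs; other normalizations (for example min-max scaling across all rows) would satisfy the wording equally well.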
Source transparency
Every row links back to its original benchmark source. Replicated rows keep the exact source URL. No score is marked as EvalKit-verified unless run evidence exists. Read citation policy →
Data freshness
The leaderboard refreshes weekly. The snapshot date for each row is shown in the detail panel. Use the source filter to view data from a specific benchmark provider.