
LLM Leaderboard

Compare LLMs on reasoning, coding, speed, context, and price. Every row is citation-backed — sourced from public benchmarks and traceable to its origin.

How EvalKit scores work

The EvalKit Score is a composite of reasoning (40%), coding (35%), and agent capability (15%), blended across public benchmark data. Rows without a direct public metric show an estimate — marked with ~ — derived from provider or model-family medians. Source citations accompany every row.
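The median-based estimate described above could look roughly like the following. This is only a sketch under stated assumptions: the row layout, the field names (`provider`, `coding`), the helper names `estimate_missing` and `display`, and the sample rows are all hypothetical, and the page does not say how EvalKit chooses between provider and model-family medians. Only the provider case is shown here.

```python
from statistics import median

def estimate_missing(rows, metric, group_key="provider"):
    """Fill a missing metric with the median of its group and flag it as estimated.

    Hypothetical helper: groups rows by `group_key` (provider here) and uses
    the median of the known values in the same group as the estimate.
    """
    known = {}
    for row in rows:
        value = row.get(metric)
        if value is not None:
            known.setdefault(row[group_key], []).append(value)

    filled = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows are left unchanged
        group_values = known.get(row[group_key])
        if new_row.get(metric) is None and group_values:
            new_row[metric] = median(group_values)
            new_row[metric + "_estimated"] = True
        else:
            new_row[metric + "_estimated"] = False
        filled.append(new_row)
    return filled

def display(row, metric):
    """Render a cell, prefixing estimates with ~ as the leaderboard does."""
    value = row.get(metric)
    if value is None:
        return "n/a"
    prefix = "~" if row.get(metric + "_estimated") else ""
    return f"{prefix}{value:.1f}"

# Hypothetical sample rows, not taken from the leaderboard.
rows = [
    {"model": "model-a", "provider": "ExampleAI", "coding": 40.0},
    {"model": "model-b", "provider": "ExampleAI", "coding": 50.0},
    {"model": "model-c", "provider": "ExampleAI", "coding": None},
]
filled = estimate_missing(rows, "coding")
print([display(r, "coding") for r in filled])
```

Keeping the estimated flag next to the value, rather than storing the ~ in the number itself, lets the estimate take part in later calculations while still being labeled wherever it is shown.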

Weekly refresh · 8 public sources · Citation policy

464 tracked rows

8 public sources

462 replicated

414 scored models

Top models

By composite EvalKit score

Claude Opus 4.8 (Anthropic): EvalKit score 66.9

Claude Mythos Preview (Anthropic): EvalKit score 65.3

GPT-5.5 (OpenAI): EvalKit score 64.1

Current leaders

Updated weekly

Best reasoning: Claude Mythos Preview, 72.51

Best value: Gemini 3 Flash, $0.5/M

Longest context: Grok 4 Fast, 2.0M tokens

Best coding: Claude Mythos Preview, 57.83

Fastest: Mistral Small 4, 678.15 c/s

Best open-weight: Kimi K2.6, 58.13

Rank	Model	Score	Reasoning	Coding	Agent	Code arena	Context	Speed	Pricing $/M	License
#1	GPT-5.2 Codex (OpenAI)	48.4	52.1	38.9	26	1,232	400K	204/s	$1.75	

GPT-5.2 Codex

OpenAI · Created by OpenAI · LLM

Score: 48.4

Reasoning: 52.1

Coding: 38.9

Agent: 26

Replicated from public source: LLM Stats

Scoring methodology

EvalKit Score = 40% reasoning + 35% coding + 15% agent, normalized to a 0–100 scale. Public benchmark data from 8 independent sources is preferred; estimates fill missing cells and are always labeled.
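As a rough illustration of the weighted blend, the sketch below combines three sub-scores using the stated weights. The stated weights sum to 90%, so this sketch assumes that "normalized to a 0–100 scale" means dividing by the total weight. That reading is an assumption, as are the names `WEIGHTS` and `composite_score`. The exact normalization is not documented here, so the result is not expected to reproduce published EvalKit scores.

```python
# Weights as stated in the methodology text; they sum to 0.90, not 1.00.
WEIGHTS = {"reasoning": 0.40, "coding": 0.35, "agent": 0.15}

def composite_score(scores, weights=WEIGHTS):
    """Weighted average of sub-scores, rescaled so the result stays on 0-100.

    Assumption: dividing by the sum of the weights is how the blend is
    normalized. The source does not confirm this.
    """
    total_weight = sum(weights.values())
    weighted_sum = sum(weights[name] * scores[name] for name in weights)
    return weighted_sum / total_weight

# Hypothetical usage with sub-scores that are each already on a 0-100 scale.
example = {"reasoning": 80.0, "coding": 60.0, "agent": 40.0}
print(round(composite_score(example), 2))
```

Dividing by the total weight means a model scoring 100 on every component gets a composite of 100, whichever weights are in force.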

Source transparency

Every row links back to its original benchmark source. Replicated rows keep the exact source URL. No score is marked as EvalKit-verified unless run evidence exists. Read citation policy →

Data freshness

The leaderboard refreshes weekly. The snapshot date for each row is shown in the detail panel. Use the source filter to view data from a specific benchmark provider.

