Model catalog
Search the frontier by model, provider, price, context, and evidence.
EvalKit turns public leaderboard rows into model profiles so you can start with the model, then inspect the citations behind every score.
410 models from 410 profiles
Source-scoped metrics. No fake universal score.
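As a rough illustration of what "source-scoped" could mean in practice, the sketch below groups leaderboard rows into one profile per model and keeps every score keyed by its source, so nothing is averaged into a single universal number. The `LeaderboardRow` and `ModelProfile` types, their field names, and `build_profiles` are hypothetical and chosen for this example; they are not EvalKit's actual schema or API.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class LeaderboardRow:
    # One row taken from a public leaderboard (hypothetical field names).
    source: str
    model: str
    provider: str
    metric: str
    value: float


@dataclass
class ModelProfile:
    model: str
    provider: str
    # Scores are keyed by (source, metric) and never merged across sources.
    scores: dict = field(default_factory=dict)

    @property
    def source_count(self) -> int:
        # Number of distinct sources that cite this model.
        return len({source for source, _ in self.scores})


def build_profiles(rows):
    """Group rows into one profile per (provider, model) pair."""
    profiles = {}
    for row in rows:
        key = (row.provider, row.model)
        profile = profiles.setdefault(key, ModelProfile(row.model, row.provider))
        profile.scores[(row.source, row.metric)] = row.value
    return profiles
```

Because each score keeps its source in the key, a reader of a profile can trace every number back to the leaderboard it came from, and two sources that report the same metric name are kept side by side instead of being blended.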
#1 · Anthropic
Claude Opus 4.6
Claude Opus 4.6 by Anthropic appears in 3 sources with Reasoning at 59.45. Best read for Code quality, Coding, High intelligence.
#2 · Google
Gemini 3.1 Pro
Gemini 3.1 Pro by Google appears in 1 source with Reasoning at 59.1. Best read for LLM, Long context, Low cost.
#3 · OpenAI
GPT-5.5
GPT-5.5 by OpenAI appears in 3 sources with Reasoning at 62.25. Best read for Coding, High intelligence, LLM.
#4 · Anthropic
Claude Opus 4.7
Claude Opus 4.7 by Anthropic appears in 3 sources with Reasoning at 62.47. Best read for Code quality, Coding, High intelligence.
#5 · Zhipu AI
GLM-5.1
GLM-5.1 by Zhipu AI appears in 2 sources with Reasoning at 54.18. Best read for Coding, High intelligence, Low cost.
#6 · OpenAI
GPT-5.4
GPT-5.4 by OpenAI appears in 2 sources with Reasoning at 57.57. Best read for Coding, High intelligence, LLM.
#7 · Google
Gemini 3 Flash
Gemini 3 Flash by Google appears in 2 sources with Reasoning at 49.35. Best read for Coding, High intelligence, LLM.
#8 · Anthropic
Claude Sonnet 4.6
Claude Sonnet 4.6 by Anthropic appears in 2 sources with Reasoning at 52.16. Best read for Code quality, Coding, High intelligence.
#9 · Anthropic
Claude Opus 4.5
Claude Opus 4.5 by Anthropic appears in 2 sources with Reasoning at 54.19. Best read for Code quality, Coding, LLM.
#10 · Zhipu AI
GLM-5
GLM-5 by Zhipu AI appears in 2 sources with Reasoning at 51.47. Best read for Code quality, Coding, High intelligence.
#11 · Google
Gemini 3 Pro
Gemini 3 Pro by Google appears in 3 sources with Reasoning at 49.76. Best read for Code quality, Coding, High intelligence.
#12 · Moonshot AI
Kimi K2.6
Kimi K2.6 by Moonshot AI appears in 2 sources with Reasoning at 58.13. Best read for Code quality, Coding, High intelligence.
#13 · OpenAI
GPT-5.2
GPT-5.2 by OpenAI appears in 2 sources with Reasoning at 53.54. Best read for Code quality, Coding, LLM.
#14 · Google
Gemini 3.5 Flash
Gemini 3.5 Flash by Google appears in 1 source with Reasoning at 59.19. Best read for LLM, Long context, Low cost.
#15 · Moonshot AI
Kimi K2.5
Kimi K2.5 by Moonshot AI appears in 1 source with Reasoning at 49.83. Best read for Code quality, Multimodal, Open weights.
#16 · Anthropic
Claude Sonnet 4.5
Claude Sonnet 4.5 by Anthropic appears in 2 sources with Reasoning at 40.2. Best read for Code quality, Coding, LLM.
#17 · OpenAI
GPT-5.4 mini
GPT-5.4 mini by OpenAI appears in 1 source with Reasoning at 45.79. Best read for LLM, Low cost, Multimodal.
#18 · OpenAI
GPT-5.3 Codex
GPT-5.3 Codex by OpenAI appears in 1 source with Reasoning at 56.17. Best read for LLM, Low cost, Multimodal.
#19 · OpenAI
GPT-5.5 Instant
GPT-5.5 Instant by OpenAI appears in 2 sources with Reasoning at 42.54. Best read for Coding, High intelligence, LLM.
#20 · DeepSeek
DeepSeek-V4-Pro-Max
DeepSeek-V4-Pro-Max by DeepSeek appears in 1 source with Reasoning at 56.96. Best read for Code quality, Long context, Low cost.
#21 · OpenAI
GPT-5 High
GPT-5 High by OpenAI appears in 1 source with Reasoning at 47.84. Best read for LLM, Math, Multimodal.
#22 · Alibaba Cloud / Qwen Team
Qwen3.5-397B-A17B
Qwen3.5-397B-A17B by Alibaba Cloud / Qwen Team appears in 2 sources with Reasoning at 48.93. Best read for Code quality, Coding, High intelligence.
#23 · Google
Gemma 4 26B-A4B
Gemma 4 26B-A4B by Google appears in 2 sources with Reasoning at 35.5. Best read for Coding, High intelligence, Low cost.
#24 · OpenAI
GPT-5.2 Codex
GPT-5.2 Codex by OpenAI appears in 1 source with Reasoning at 52.1. Best read for LLM, Low cost, Multimodal.
#25 · Alibaba Cloud / Qwen Team
Qwen3.7 Max
Qwen3.7 Max by Alibaba Cloud / Qwen Team appears in 1 source with Reasoning at 60.25. Best read for Code quality, LLM, Long context.
#26 · Alibaba Cloud / Qwen Team
Qwen3.6 Plus
Qwen3.6 Plus by Alibaba Cloud / Qwen Team appears in 2 sources with Reasoning at 52.14. Best read for Coding, High intelligence, LLM.
#27 · OpenAI
GPT-5.1
GPT-5.1 by OpenAI appears in 2 sources with Reasoning at 47.52. Best read for Code quality, Coding, High intelligence.
#28 · Anthropic
Claude Opus 4.1
Claude Opus 4.1 by Anthropic appears in 1 source with Reasoning at 38.24. Best read for Code quality, LLM, Multimodal.
#29 · OpenAI
GPT-5.1 Medium
GPT-5.1 Medium by OpenAI appears in 1 source with Reasoning at 45.89. Best read for LLM, Low cost, Math.
#30 · xAI
Grok-4.20 Beta Non-Reasoning
Grok-4.20 Beta Non-Reasoning by xAI appears in 1 source with an LLM Stats Code Index (estimated from arena) score of 34.59. Best read for LLM, Multimodal.
#31 · MiniMax
MiniMax M2.7
MiniMax M2.7 by MiniMax appears in 1 source with Reasoning at 53.06. Best read for Low cost, Open weights, Reasoning.
#32 · Zhipu AI
GLM-4.6
GLM-4.6 by Zhipu AI appears in 2 sources with Reasoning at 37.73. Best read for Code quality, Coding, High intelligence.
#33 · OpenAI
GPT-5.1 High
GPT-5.1 High by OpenAI appears in 2 sources with Reasoning at 53.33. Best read for Coding, High intelligence, LLM.
#34 · OpenAI
GPT-5 mini
GPT-5 mini by OpenAI appears in 1 source with Reasoning at 36.61. Best read for LLM, Low cost, Math.
#35 · Google
Gemma 4 31B
Gemma 4 31B by Google appears in 2 sources with Reasoning at 44.8. Best read for Coding, High intelligence, Low cost.
#36 · DeepSeek
DeepSeek-V4-Flash-Max
DeepSeek-V4-Flash-Max by DeepSeek appears in 1 source with Reasoning at 51.86. Best read for Code quality, Long context, Low cost.
#37 · xAI
Grok-3
Grok-3 by xAI appears in 1 source with Reasoning at 39.69. Best read for LLM, Low cost, Math.
#38 · OpenAI
GPT-5 Medium
GPT-5 Medium by OpenAI appears in 1 source with Reasoning at 43.51. Best read for LLM, Math, Multimodal.
#39 · OpenAI
GPT-5.1 Thinking
GPT-5.1 Thinking by OpenAI appears in 1 source with Reasoning at 47.47. Best read for Code quality, LLM, Multimodal.
#40 · Zhipu AI
GLM-4.7
GLM-4.7 by Zhipu AI appears in 2 sources with Reasoning at 43.81. Best read for Code quality, Coding, High intelligence.
#41 · Alibaba Cloud / Qwen Team
Qwen3.6-27B
Qwen3.6-27B by Alibaba Cloud / Qwen Team appears in 1 source with Reasoning at 45.74. Best read for Code quality, Low cost, Multimodal.
#42 · MiniMax
MiniMax M2.5
MiniMax M2.5 by MiniMax appears in 1 source with Reasoning at 52.51. Best read for Code quality, Long context, Low cost.
#43 · OpenAI
GPT-4.1 mini
GPT-4.1 mini by OpenAI appears in 1 source with Reasoning at 15.71. Best read for LLM, Long context, Low cost.
#44 · Meituan
LongCat-Flash-Chat
LongCat-Flash-Chat by Meituan appears in 2 sources with Reasoning at 23.41. Best read for Code quality, Coding, High intelligence.
#45 · Moonshot AI
Kimi K2 0905
Kimi K2 0905 by Moonshot AI appears in 1 source with Reasoning at 25.5. Best read for LLM, Math, Reasoning.
#46 · Google
Gemini 2.5 Pro
Gemini 2.5 Pro by Google appears in 2 sources with Reasoning at 35.05. Best read for Coding, High intelligence, LLM.
#47 · Google
Gemini 3.1 Flash-Lite
Gemini 3.1 Flash-Lite by Google appears in 1 source with Reasoning at 41.75. Best read for LLM, Long context, Low cost.
#48 · DeepSeek
DeepSeek-V3.2 (Non-thinking)
DeepSeek-V3.2 (Non-thinking) by DeepSeek appears in 1 source with an LLM Stats Code Index (estimated from arena) score of 22.77. Best read for Low cost, Open weights.
#49 · MiniMax
MiniMax M2
MiniMax M2 by MiniMax appears in 1 source with Reasoning at 33.6. Best read for Code quality, Long context, Low cost.
#50 · Anthropic
Claude Haiku 4.5
Claude Haiku 4.5 by Anthropic appears in 1 source with Reasoning at 35.4. Best read for Code quality, LLM, Low cost.
#51 · Moonshot AI
Kimi K2-Thinking-0905
Kimi K2-Thinking-0905 by Moonshot AI appears in 1 source with Reasoning at 45.23. Best read for Code quality, Math, Open weights.
#52 · OpenAI
GPT-5.1 Instant
GPT-5.1 Instant by OpenAI appears in 1 source with Reasoning at 48.57. Best read for Code quality, LLM, Low cost.
#53 · Anthropic
Claude Opus 4
Claude Opus 4 by Anthropic appears in 1 source with Reasoning at 35.16. Best read for Code quality, LLM, Multimodal.
#54 · xAI
Grok 4 Fast
Grok 4 Fast by xAI appears in 1 source with Reasoning at 40.93. Best read for LLM, Long context, Low cost.
#55 · Google
Gemini 2.5 Flash
Gemini 2.5 Flash by Google appears in 2 sources with Reasoning at 28.49. Best read for Coding, High intelligence, LLM.
#56 · OpenAI
GPT-5
GPT-5 by OpenAI appears in 1 source with Reasoning at 44.66. Best read for Code quality, LLM, Multimodal.
#57 · Anthropic
Claude Sonnet 4
Claude Sonnet 4 by Anthropic appears in 1 source with Reasoning at 30.27. Best read for Code quality, LLM, Multimodal.
#58 · Mistral AI
Mistral Large 3 (675B Instruct 2512)
Mistral Large 3 (675B Instruct 2512) by Mistral AI appears in 1 source with Reasoning at 10.23. Best read for Low cost, Math, Multimodal.
#59 · MiniMax
MiniMax M2.1
MiniMax M2.1 by MiniMax appears in 1 source with Reasoning at 41.25. Best read for Code quality, Long context, Low cost.
#60 · xAI
Grok 4.3
Grok 4.3 by xAI appears in 2 sources with an LLM Stats Code Index (estimated from arena) score of 25.31. Best read for Coding, High intelligence, LLM.
#61 · OpenAI
GPT-4.1
GPT-4.1 by OpenAI appears in 1 source with Reasoning at 21.09. Best read for LLM, Long context, Low cost.
#62 · DeepSeek
DeepSeek-V3.2-Speciale
DeepSeek-V3.2-Speciale by DeepSeek appears in 1 source with Reasoning at 43.22. Best read for Code quality, Math, Open weights.
#63 · Anthropic
Claude Opus 4.8
Claude Opus 4.8 by Anthropic appears in 1 source with Reasoning at 65.69. Best read for LLM, Long context, Low cost.
#64 · Xiaomi
MiMo-V2-Flash
MiMo-V2-Flash by Xiaomi appears in 1 source with Reasoning at 38.43. Best read for Code quality, Math, Open weights.
#65 · xAI
Grok-4.20 Beta Reasoning
Grok-4.20 Beta Reasoning by xAI appears in 1 source with an LLM Stats Code Index (estimated from arena) score of 20.12. Best read for LLM, Multimodal.
#66 · Meituan
LongCat-Flash-Thinking
LongCat-Flash-Thinking by Meituan appears in 1 source with Reasoning at 35.89. Best read for Code quality, Math, Open weights.
#67 · OpenAI
GPT-5.4 nano
GPT-5.4 nano by OpenAI appears in 1 source with Reasoning at 39.89. Best read for LLM, Low cost, Multimodal.
#68 · Alibaba Cloud / Qwen Team
Qwen3.5-27B
Qwen3.5-27B by Alibaba Cloud / Qwen Team appears in 1 source with Reasoning at 41.95. Best read for Code quality, Low cost, Multimodal.
#69 · Alibaba Cloud / Qwen Team
Qwen3 Max
Qwen3 Max by Alibaba Cloud / Qwen Team appears in 1 source with Reasoning at 28.55. Best read for Code quality, LLM, Math.
#70 · Zhipu AI
GLM-4.7-Flash
GLM-4.7-Flash by Zhipu AI appears in 1 source with Reasoning at 31.38. Best read for Code quality, Math, Open weights.
#71 · Meituan
LongCat-Flash-Lite
LongCat-Flash-Lite by Meituan appears in 1 source with Reasoning at 23. Best read for Code quality, Low cost, Open weights.
#72 · DeepSeek
DeepSeek-V3.2-Exp
DeepSeek-V3.2-Exp by DeepSeek appears in 2 sources with Reasoning at 35.69. Best read for Code quality, Coding, High intelligence.
#73 · Alibaba Cloud / Qwen Team
Qwen3.5-122B-A10B
Qwen3.5-122B-A10B by Alibaba Cloud / Qwen Team appears in 2 sources with Reasoning at 43.05. Best read for Code quality, Coding, High intelligence.
#74 · Zhipu AI
GLM-4.5
GLM-4.5 by Zhipu AI appears in 2 sources with Reasoning at 33.95. Best read for Code quality, Coding, High intelligence.
#75 · OpenAI
GPT-5.1 Codex
GPT-5.1 Codex by OpenAI appears in 1 source with Reasoning at 42.49. Best read for Code quality, LLM, Multimodal.
#76 · StepFun
Step-3.5-Flash
Step-3.5-Flash by StepFun appears in 1 source with Reasoning at 49.2. Best read for Code quality, Low cost, Open weights.
#77 · OpenAI
GPT-5.3 Chat
GPT-5.3 Chat by OpenAI appears in 1 source with an LLM Stats Code Index (estimated from arena) score of 27.85. Best read for LLM, Low cost, Multimodal.
#78 · OpenAI
GPT-5.1 Codex High
GPT-5.1 Codex High by OpenAI appears in 1 source with Reasoning at 44.32. Best read for LLM, Math, Multimodal.
#79 · OpenAI
GPT OSS 120B High
GPT OSS 120B High by OpenAI appears in 1 source with Reasoning at 31.76. Best read for Low cost, Math, Open weights.
#80 · Alibaba Cloud / Qwen Team
Qwen3 VL 235B A22B Instruct
Qwen3 VL 235B A22B Instruct by Alibaba Cloud / Qwen Team appears in 2 sources with Reasoning at 26.03. Best read for Coding, High intelligence, Low cost.
#81 · xAI
Grok-4 Fast Reasoning
Grok-4 Fast Reasoning by xAI appears in 1 source with an LLM Stats Code Index (estimated from arena) score of 26.07. Best read for LLM, Long context, Low cost.
#82 · xAI
Grok-4 Fast Non-Reasoning
Grok-4 Fast Non-Reasoning by xAI appears in 1 source with an LLM Stats Code Index (estimated from arena) score of 23.23. Best read for LLM, Long context, Low cost.
#83 · Alibaba Cloud / Qwen Team
Qwen3 VL 4B Thinking
Qwen3 VL 4B Thinking by Alibaba Cloud / Qwen Team appears in 1 source with Reasoning at 15.02. Best read for Low cost, Multimodal, Open weights.
#84 · xAI
Grok-4.1 Fast Non-Reasoning
Grok-4.1 Fast Non-Reasoning by xAI appears in 1 source with an LLM Stats Code Index (estimated from arena) score of 23.08. Best read for LLM, Long context, Low cost.
#85 · OpenAI
GPT-5 nano
GPT-5 nano by OpenAI appears in 1 source with Reasoning at 24.61. Best read for LLM, Math, Multimodal.
#86 · xAI
Grok-4.20 Multi-Agent Beta
Grok-4.20 Multi-Agent Beta by xAI appears in 1 source with an LLM Stats Code Index (estimated from arena) score of 21.9. Best read for LLM, Multimodal.
#87 · Anthropic
Claude 3.7 Sonnet
Claude 3.7 Sonnet by Anthropic appears in 1 source with Reasoning at 28.92. Best read for Code quality, LLM, Multimodal.
#88 · xAI
Grok-4.1 Fast Reasoning
Grok-4.1 Fast Reasoning by xAI appears in 1 source with an LLM Stats Code Index (estimated from arena) score of 19.21. Best read for LLM, Long context, Low cost.
#89 · xAI
Grok Code Fast 1
Grok Code Fast 1 by xAI appears in 1 source with Reasoning at 31.93. Best read for Code quality, LLM, Low cost.
#90 · Alibaba Cloud / Qwen Team
Qwen3-Coder
Qwen3-Coder by Alibaba Cloud / Qwen Team appears in 1 source with an LLM Stats Code Index (estimated from arena) score of 17.3. Best read for Open weights.
#91 · Mistral AI
Mistral Small 4
Mistral Small 4 by Mistral AI appears in 1 source with Reasoning at 23.7. Best read for Low cost, Math, Multimodal.
#92 · DeepSeek
DeepSeek-V3.2
DeepSeek-V3.2 by DeepSeek appears in 2 sources with Reasoning at 41.8. Best read for Code quality, Coding, High intelligence.
#93 · OpenAI
GPT-4.1 nano
GPT-4.1 nano by OpenAI appears in 1 source with Reasoning at 0.73. Best read for LLM, Long context, Low cost.
#94 · Alibaba Cloud / Qwen Team
Qwen3 VL 235B A22B Thinking
Qwen3 VL 235B A22B Thinking by Alibaba Cloud / Qwen Team appears in 1 source with Reasoning at 31.71. Best read for Math, Multimodal, Open weights.
#95 · DeepSeek
DeepSeek-V3 0324
DeepSeek-V3 0324 by DeepSeek appears in 1 source with Reasoning at 16.18. Best read for Low cost, Math, Open weights.
#96 · OpenAI
GPT OSS 20B
GPT OSS 20B by OpenAI appears in 2 sources with Reasoning at 19.77. Best read for Math, Open weights, RAG.