Benchmark guide
Know what each score is actually testing before you choose a model.
GPQA, AIME, SWE-bench, Code Arena, MMMU, Toolathlon, MCP Atlas, and long-context metrics all answer different product questions.
GPQA Diamond
Graduate-level science questions used to separate frontier reasoning models.
Why it matters: Useful when choosing models for hard expert QA, scientific reasoning, and chained analysis.
GPQA · Reasoning index · Humanity's Last Exam
Source: LLM Stats · retrieved 2026-05-20
AIME 2025
Competition-style math benchmark for exact multi-step problem solving.
Why it matters: Strong signal for symbolic reasoning, contest math, and strict answer checking.
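As a rough illustration of what strict answer checking can look like, here is a minimal sketch of exact-match grading. It relies on the fact that AIME answers are integers from 0 to 999. The function names and the "take the last integer in the output" extraction rule are assumptions made for this example, not the grading code used by LLM Stats or any official harness.

```python
import re


def extract_final_answer(text):
    """Return the last standalone integer in the text, or None if there is none.

    Assumption: the model states its final answer last. Real harnesses
    usually require a stricter, explicitly marked answer format.
    """
    matches = re.findall(r"\b\d+\b", text)
    return int(matches[-1]) if matches else None


def is_correct(model_output, reference):
    """Exact match only: no partial credit and no tolerance for near misses."""
    answer = extract_final_answer(model_output)
    return answer is not None and 0 <= answer <= 999 and answer == reference


def accuracy(outputs, references):
    """Fraction of problems whose extracted answer equals the reference."""
    graded = [is_correct(o, r) for o, r in zip(outputs, references)]
    return sum(graded) / len(graded)
```

Because the check is all-or-nothing, a model that reasons almost correctly but lands on a different final integer scores zero on that problem, which is why this style of benchmark rewards careful multi-step work.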
AIME 2025 · Math index · FrontierMath
Source: LLM Stats · retrieved 2026-05-20
SWE-bench Verified
Software engineering benchmark focused on real issue resolution and code changes.
Why it matters: Best paired with Code Arena and security metrics before choosing an agent model.
SWE-bench Verified · SWE-bench Pro · Code Arena
Source: LLM Stats · retrieved 2026-05-20
Code Arena
Preference-style coding leaderboard comparing model outputs in developer tasks.
Why it matters: Adds practical taste and usefulness signals beyond static coding test scores.
Code Arena · Coding index · Terminal Bench
Source: LLM Stats · retrieved 2026-05-20
MMMU / MMMU-Pro
Multimodal academic and visual reasoning benchmark family.
Why it matters: Helpful for picking models that must reason over diagrams, charts, images, and text together.
MMMU · MMMU-Pro · Vision index
Source: LLM Stats · retrieved 2026-05-20
MCP Atlas
Benchmark built around Model Context Protocol (MCP) style tool use in tool-rich agent workflows.
Why it matters: A useful companion metric for teams evaluating practical MCP and agent orchestration behavior.
MCP Atlas · Apex Agents · OSWorld
Source: LLM Stats · retrieved 2026-05-20
MRCR / Long Context
Long-document retention and retrieval-style benchmark signals.
Why it matters: Critical for legal, research, codebase, and knowledge-base workflows, where an advertised context window size can mislead about how much of a long input the model actually retains.
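One way to see how a context window figure can mislead is to compare the advertised window with the longest input length at which retrieval accuracy still holds up. The sketch below is an illustration only: `effective_context`, the 0.8 threshold, and every number in `scores` are made up for this example and are not MRCR results or its scoring method.

```python
def effective_context(accuracy_by_length, threshold=0.8):
    """Largest tested length whose accuracy, and that of every shorter
    tested length, is at least `threshold`. Returns 0 if even the shortest
    length falls below it.
    """
    effective = 0
    for length in sorted(accuracy_by_length):
        if accuracy_by_length[length] < threshold:
            break
        effective = length
    return effective


# Hypothetical retrieval accuracy at several input lengths (in tokens).
scores = {8_000: 0.97, 32_000: 0.93, 128_000: 0.84, 512_000: 0.61, 1_000_000: 0.42}
advertised_window = 1_000_000  # hypothetical advertised context window

usable = effective_context(scores)
print(f"usable: {usable} tokens ({usable / advertised_window:.1%} of advertised)")
```

With these invented numbers the model accepts a million tokens but only stays above the chosen threshold up to 128,000, which is the kind of gap a long-context benchmark is meant to expose.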
MRCR v2 · Context window · Long context index
Source: LLM Stats · retrieved 2026-05-20
Artificial Analysis Intelligence Index
Public composite intelligence index from Artificial Analysis.
Why it matters: Works as a second-source view when comparing model rankings against LLM Stats or arena data.
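A simple way to check how closely two sources agree is a rank correlation over the models they both cover. The sketch below computes Spearman's rank correlation for two best-first orderings without ties. The function name and the placeholder model names in the usage lines are hypothetical, and nothing here reflects how Artificial Analysis or LLM Stats builds its index.

```python
def spearman(order_a, order_b):
    """Spearman rank correlation between two best-first orderings.

    Assumes both lists contain the same models, each exactly once.
    Returns 1.0 for identical orderings and -1.0 for fully reversed ones.
    """
    if set(order_a) != set(order_b) or len(order_a) < 2:
        raise ValueError("both rankings must cover the same two or more models")
    n = len(order_a)
    position_b = {model: i for i, model in enumerate(order_b)}
    # Sum of squared rank differences across models.
    d_squared = sum((i - position_b[model]) ** 2 for i, model in enumerate(order_a))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Placeholder model names, not real leaderboard entries.
source_one = ["model-a", "model-b", "model-c", "model-d"]
source_two = ["model-a", "model-c", "model-b", "model-d"]
print(spearman(source_one, source_two))
```

A value near 1 suggests the two sources tell the same story. A low or negative value is a prompt to look at which benchmarks each composite weights before trusting either ranking.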
Intelligence Index · Price · Speed
Source: Artificial Analysis · retrieved 2026-05-20
Arena ratings
Human or preference-style arena comparisons across model families and modalities.
Why it matters: Useful to balance benchmark scores with perceived answer quality and preference wins.
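To make the link between preference wins and a rating concrete, here is a minimal sketch of an Elo-style update, one common way to turn pairwise votes into scores. The source does not say how Arena AI computes its ratings, so treat this as an assumption for illustration. The 400-point scale and K = 32 are conventional Elo defaults, not values taken from any arena.

```python
def expected_score(rating_a, rating_b):
    """Probability that A is preferred over B under the Elo model."""
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))


def update(rating_a, rating_b, score_a, k=32):
    """Return new (rating_a, rating_b) after one comparison.

    score_a is 1 if A's answer was preferred, 0 if B's was, 0.5 for a tie.
    """
    expected_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - expected_a)
    # B's change mirrors A's, so the total rating is conserved.
    new_b = rating_b + k * ((1 - score_a) - (1 - expected_a))
    return new_a, new_b


# Two equally rated models; a voter prefers A's answer.
print(update(1000, 1000, 1))
```

Under a scheme like this, beating a much higher-rated model moves the rating more than beating a weaker one, which is why an arena score reflects who a model won against and not just how often it won.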
Arena rating · Category rank · Source overlap
Source: Arena AI · retrieved 2026-05-20