About benchmarkingagents.com
An independent reference for AI agent and LLM benchmarks. Built by Digital Signet. No vendor affiliation, no paid placements, no newsletter capture.
What this site is
benchmarkingagents.com is an independent reference for the public benchmarks used to evaluate large language models and AI agents in 2026. Coverage spans more than twenty of the most-cited benchmarks across knowledge (MMLU, MMLU-Pro, MMMU, Humanity's Last Exam), coding (HumanEval, MBPP, LiveCodeBench, SWE-bench Verified), reasoning (GPQA, GPQA-Diamond, ARC-AGI, ARC-AGI-2, BIG-Bench Hard), agentic capability (SWE-bench Verified, WebArena, AgentBench, OSWorld, Terminal-Bench, Tau-Bench), multimodal (MMMU, MathVista, ChartQA), and human preference (LMSYS Chatbot Arena, MT-Bench).
It also covers the practitioner stack that production teams actually use to evaluate their own systems: custom golden datasets, LLM-as-judge methodology, RAG evaluation frameworks like Ragas, online production monitoring, and the eight evaluation platforms in widest use (Braintrust, Langfuse, LangSmith, Arize Phoenix, DeepEval, Patronus, Helicone, PromptLayer).
It is not a leaderboard that scrapes scores without context. Every quoted score carries its capture date, the N-shot setting, the chain-of-thought flag, the test set version, and a link back to the primary source. Where a benchmark has saturated or has documented contamination risk, the page says so explicitly.
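As a rough illustration of what "a score with context" means in practice, the metadata attached to each quoted score looks something like the sketch below. The shape and field names are hypothetical, not the site's actual data model:

```typescript
// Hypothetical sketch of the metadata carried by every quoted score.
// Field names are illustrative, not the site's actual schema.
interface QuotedScore {
  benchmark: string;            // e.g. "SWE-bench Verified"
  score: number;                // as published, on the benchmark's own scale
  capturedAt: string;           // month of capture, e.g. "2026-05"
  nShot: number | null;         // null when the source omits it
  chainOfThought: boolean | null;
  testSetVersion: string | null;
  sourceUrl: string;            // link back to the primary source
  saturated: boolean;           // true when the benchmark no longer discriminates
  contaminationRisk: boolean;   // true when test data is known to appear in training crawls
}
```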
Why this site exists
The benchmark landscape has three structural problems that most AI coverage ignores or treats as footnotes. First, contamination. MMLU test questions appear verbatim in Common Crawl, the primary web-crawl dataset behind most pre-training corpora. HumanEval problems are near-duplicates of LeetCode solutions that pervade GitHub and Stack Overflow crawls. SWE-bench issues have solutions in the same repositories' public git history. A high score on a contaminated benchmark may reflect memorisation as much as reasoning, and the leaderboard cannot tell you which.
Second, saturation. MMLU saturated in 2024. HumanEval, HellaSwag and WinoGrande saturated in 2023 or 2024. All frontier models now score within statistical noise of each other on these benchmarks. The field moved to MMLU-Pro (10-choice, CoT-required), GPQA-Diamond, ARC-AGI-2 and LiveCodeBench specifically because the previous generation of benchmarks no longer discriminates between frontier models. Most comparison sites still quote the saturated versions, because the numbers are larger and more recognisable.
Third, self-reporting. Model cards are published by the same company that built the model. Common issues: publishing only benchmarks where the model performs well, omitting evaluation setup (N-shot, CoT, harness, test set version), and reporting scores under maximally favourable conditions that independent replications struggle to reproduce. The benchmark-literate reader treats vendor-published scores as a starting point, not a final answer.
This site exists to apply a consistent six-question rubric to every benchmark score it cites, document the methodology behind every number, and refuse to publish vendor-favourable shortcuts when the underlying signal is noisy.
Who builds this
Oliver Wakefield-Smith
Founder, Digital Signet
benchmarkingagents.com is one of a small cluster of Digital Signet reference sites covering the cost, capability and evaluation surface of frontier AI. Sister sites focus on per-token model pricing; this one covers what the resulting models can and cannot do, and how to measure it without being captured by vendor framing.
- Independent reference for Claude API token pricing across models, batch tier and prompt caching.
- Multi-provider AI embedding pricing, vector DB storage cost and RAG scenarios.
- Google Gemini API pricing reference; cross-checks the Vertex AI surface.
- Per-million-token cost calculator across model providers; latency and cost trade-offs.
Editorial position
Independent reference. No vendor affiliation. No paid placements on the tool-comparison pages. No sponsored entries in any benchmark or evaluation framework cited. Provider order in tables is determined alphabetically or by category, not by any commercial relationship.
The site does carry a narrow affiliate disclosure on the /tools-compared page, where some outbound links to evaluation tool vendors may route through affiliate URLs. Affiliate status does not influence the order tools appear, the rubric used to compare them, or which tools are recommended for which use case. The neutral editorial framing is enforced regardless of affiliate placement, and the disclosure is shown in the page footer.
No display advertising. No newsletter capture. No content sponsorships. No vendor lead-routing. The site is a reference, not a lead-generation funnel.
Editorial principles
Source pattern
Every quoted SOTA score on this site traces back to two sources: the official benchmark leaderboard URL (swebench.com, arcprize.org, lmsys.org, Papers with Code) and the original paper. Where vendor model cards and independent leaderboards disagree, the gap is shown and the more independent source is preferred.
Capture dates on every score
Frontier benchmarks move fast. SOTA in November is not SOTA in May. Every score quoted is dated to the month it was captured. Once a benchmark crosses the 90 percent saturation threshold, the page flags it explicitly rather than continuing to headline the highest reported number.
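A minimal sketch of that rule, assuming scores normalised to a 0 to 100 scale; the constant and function names are illustrative, not the site's build code:

```typescript
// Illustrative only: flag a benchmark as saturated once the best frontier
// score crosses the 90 percent threshold described above.
const SATURATION_THRESHOLD = 90;

function isSaturated(bestFrontierScore: number): boolean {
  return bestFrontierScore >= SATURATION_THRESHOLD;
}

// Example: isSaturated(92.3) === true, so the page carries a saturation
// flag instead of headlining the top reported score.
```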
N-shot, CoT and harness disclosure
A 5-shot CoT score and a 0-shot greedy score are not the same number. The site documents the evaluation setup for every benchmark cell. Where vendor reports omit setup detail, the cell is flagged as methodology-incomplete rather than treated as comparable.
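A rough sketch of that flagging rule is below; the type and function names are hypothetical, not the site's actual code:

```typescript
// Hypothetical check: a score cell is only comparable when the evaluation
// setup (N-shot, chain-of-thought, harness, test set version) is disclosed.
interface EvalSetup {
  nShot: number | null;
  chainOfThought: boolean | null;
  harness: string | null;          // e.g. "lm-evaluation-harness"
  testSetVersion: string | null;
}

function isMethodologyComplete(setup: EvalSetup): boolean {
  return (
    setup.nShot !== null &&
    setup.chainOfThought !== null &&
    setup.harness !== null &&
    setup.testSetVersion !== null
  );
}

// Cells that fail this check are rendered with a "methodology-incomplete"
// flag rather than compared directly against fully documented scores.
```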
Saturation flagged
MMLU, HumanEval, MBPP, HellaSwag and WinoGrande are saturated across frontier models in 2026. The site states this on the benchmark page and points readers to current benchmarks (MMLU-Pro, GPQA-Diamond, ARC-AGI-2, LiveCodeBench, SWE-bench Verified) for live frontier comparison.
Contamination flagged
Public benchmarks with high contamination risk (MMLU, HumanEval, HellaSwag) are flagged on the relevant pages. Clean independent verification of vendor-claimed scores on these benchmarks is impossible, because the contamination is structural (the test data sits in the public crawls every lab trains on) rather than a matter of bad actors.
No vendor ranking shortcuts
The site does not produce a single composite leaderboard that purports to rank all frontier models against each other. Different benchmarks measure different things. A composite ranking would hide the trade-offs that matter for production model selection.
Refresh cadence
SOTA benchmark scores are re-verified quarterly against the official leaderboard for each benchmark. Cluster-head queries (the benchmarks the site holds the strongest search position on, currently SWE-bench Verified, MMLU-Pro, GPQA-Diamond, AgentBench, WebArena) are checked monthly. The last full verification pass closed in May 2026.
The verification date is held in a single constant (LAST_VERIFIED_DATE) in src/lib/schema.ts. Footer text, schema dateModified, and visible headings all read from that single source. This is a deliberate design choice: a cosmetic refresh (rolling a date forward without doing the underlying verification work) cannot move the date in one place without moving it everywhere, so the footer, the schema and the visible headings can never drift out of sync.
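In outline, the pattern looks like this. The constant name and file path come from the text above; the consuming functions and the example date value are illustrative:

```typescript
// src/lib/schema.ts -- single source of truth for the verification date.
// The value shown here is an example, not the live site's current date.
export const LAST_VERIFIED_DATE = "2026-05";

// Illustrative consumers: footer text, JSON-LD dateModified and visible
// headings all import the same constant, so the date cannot be rolled
// forward in one surface and not the others.
export function footerText(): string {
  return `Last full verification pass: ${LAST_VERIFIED_DATE}`;
}

export function schemaDateModified(): { dateModified: string } {
  return { dateModified: LAST_VERIFIED_DATE };
}
```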
Disclosures
1. No affiliate parameters on benchmark or vendor URLs cited in editorial content.
2. Some links on the /tools-compared page may route through affiliate URLs to evaluation tool vendors. Tool order on that page is alphabetical by category, not influenced by affiliate status.
3. Not affiliated with Anthropic, OpenAI, Google, Meta, Mistral, Cohere, AWS Bedrock, Azure OpenAI, or any other listed model vendor.
4. Not affiliated with Braintrust, Langfuse, LangSmith, Arize, DeepEval, Patronus, Helicone, PromptLayer, or any other listed eval-tool vendor.
5. Not affiliated with Princeton, Stanford, OpenAI, Allen AI, EleutherAI, LMSYS, HuggingFace, or any benchmark-producing organisation.
Contact and corrections
Corrections welcome. If you find a misquoted score, an out-of-date capture date, a missing N-shot setting, a methodology error, or a citation that does not resolve to the claimed source, email editor@benchmarkingagents.com and the correction will land in the next refresh pass.
Five business days is the target response window. For substantive corrections (not just a typo), the corrected page also carries a short corrigendum note at the bottom and rolls the schema dateModified forward. Cosmetic typo fixes land silently.
For commercial enquiries (sponsored content, paid placements, lead-routing arrangements) the answer is no. The site is a reference, and accepting any of those would compromise the editorial position above.