About benchmarkingagents.com
An independent reference for AI agent and LLM benchmarks. Built by Digital Signet. No vendor affiliation, no paid placements, no newsletter capture.
What this site is
benchmarkingagents.com is an independent reference for the public benchmarks used to evaluate large language models and AI agents in 2026. Coverage spans more than twenty of the most-cited benchmarks across knowledge (MMLU, MMLU-Pro, MMMU, Humanity's Last Exam), coding (HumanEval, MBPP, LiveCodeBench, SWE-bench Verified), reasoning (GPQA, GPQA-Diamond, ARC-AGI, ARC-AGI-2, BIG-Bench Hard), agentic capability (SWE-bench Verified, WebArena, AgentBench, OSWorld, Terminal-Bench, Tau-Bench), multimodal (MMMU, MathVista, ChartQA), and human preference (LMSYS Chatbot Arena, MT-Bench).
It also covers the practitioner stack that production teams actually use to evaluate their own systems: custom golden datasets, LLM-as-judge methodology, RAG evaluation frameworks like Ragas, online production monitoring, and the eight evaluation platforms in widest use (Braintrust, Langfuse, LangSmith, Arize Phoenix, DeepEval, Patronus, Helicone, PromptLayer).
It is not a leaderboard that scrapes scores without context. Every quoted score carries its capture date, the N-shot setting, the chain-of-thought flag, the test set version, and a link back to the primary source. Where a benchmark has saturated or has documented contamination risk, the page says so explicitly.
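As a rough illustration of what "a score with context" means in practice, the metadata attached to each quoted score looks something like the sketch below. The shape and field names are hypothetical, not the site's actual data model:

```typescript
// Hypothetical sketch of the metadata carried by every quoted score.
// Field names are illustrative, not the site's actual schema.
interface QuotedScore {
  benchmark: string;            // e.g. "SWE-bench Verified"
  score: number;                // as published, on the benchmark's own scale
  capturedAt: string;           // month of capture, e.g. "2026-05"
  nShot: number | null;         // null when the source omits it
  chainOfThought: boolean | null;
  testSetVersion: string | null;
  sourceUrl: string;            // link back to the primary source
  saturated: boolean;           // true when the benchmark no longer discriminates
  contaminationRisk: boolean;   // true when test data is known to appear in training crawls
}
```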
Why this site exists
The benchmark landscape has three structural problems that most AI coverage ignores or treats as footnotes. First, contamination. MMLU test questions appear verbatim in Common Crawl, the primary web-crawl dataset behind most pre-training corpora. HumanEval problems are near-duplicates of LeetCode solutions that pervade GitHub and Stack Overflow crawls. SWE-bench issues have solutions in the same repositories' public git history. A high score on a contaminated benchmark may reflect memorisation as much as reasoning, and the leaderboard cannot tell you which.
Second, saturation. MMLU saturated in 2024. HumanEval, HellaSwag and WinoGrande saturated in 2023 or 2024. All frontier models now score within statistical noise of each other on these benchmarks. The field moved to MMLU-Pro (10-choice, CoT-required), GPQA-Diamond, ARC-AGI-2 and LiveCodeBench specifically because the previous generation of benchmarks no longer discriminates between frontier models. Most comparison sites still quote the saturated versions, because the numbers are larger and more recognisable.
Third, self-reporting. Model cards are published by the same company that built the model. Common issues: publishing only benchmarks where the model performs well, omitting evaluation setup (N-shot, CoT, harness, test set version), and reporting scores under maximally favourable conditions that independent replications struggle to reproduce. The benchmark-literate reader treats vendor-published scores as a starting point, not a final answer.
This site exists to apply a consistent six-question rubric to every benchmark score it cites, document the methodology behind every number, and refuse to publish vendor-favourable shortcuts when the underlying signal is noisy.
Who builds this
Oliver Wakefield-Smith
Founder, Digital Signet
benchmarkingagents.com is one of a small cluster of Digital Signet reference sites covering the cost, capability and evaluation surface of frontier AI. Sister sites focus on per-token model pricing; this one covers what the resulting models can and cannot do, and how to measure it without being captured by vendor framing.
- Independent reference for Claude API token pricing across models, batch tier and prompt caching.
- Multi-provider AI embedding pricing, vector DB storage cost and RAG scenarios.
- Google Gemini API pricing reference; cross-checks the Vertex AI surface.
- Per-million-token cost calculator across model providers; latency and cost trade-offs.
Editorial position
Independent reference. No vendor affiliation. No paid placements on the tool-comparison pages. No sponsored entries in any benchmark or evaluation framework cited. Provider order in tables is determined alphabetically or by category, not by any commercial relationship.
The site does carry a narrow affiliate disclosure on the /tools-compared page, where some outbound links to evaluation tool vendors may route through affiliate URLs. Affiliate status does not influence the order tools appear, the rubric used to compare them, or which tools are recommended for which use case. The neutral editorial framing is enforced regardless of affiliate placement, and the disclosure is shown in the page footer.
No display advertising. No newsletter capture. No content sponsorships. No vendor lead-routing. The site is a reference, not a lead-generation funnel.
Editorial principles
Source pattern
Every quoted SOTA score on this site traces back to two sources: the official benchmark leaderboard URL (swebench.com, arcprize.org, lmsys.org, Papers with Code) and the original paper. Where vendor model cards and independent leaderboards disagree, the gap is shown and the more independent source is preferred.
Capture dates on every score
Frontier benchmarks move fast. SOTA in November is not SOTA in May. Every score quoted is dated to the month it was captured. Once a benchmark crosses the 90 percent saturation threshold, the page flags it explicitly rather than continuing to headline the highest reported number.
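A minimal sketch of that rule, assuming scores normalised to a 0 to 100 scale; the constant and function names are illustrative, not the site's build code:

```typescript
// Illustrative only: flag a benchmark as saturated once the best frontier
// score crosses the 90 percent threshold described above.
const SATURATION_THRESHOLD = 90;

function isSaturated(bestFrontierScore: number): boolean {
  return bestFrontierScore >= SATURATION_THRESHOLD;
}

// Example: isSaturated(92.3) === true, so the page carries a saturation
// flag instead of headlining the top reported score.
```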
N-shot, CoT and harness disclosure
A 5-shot CoT score and a 0-shot greedy score are not the same number. The site documents the evaluation setup for every benchmark cell. Where vendor reports omit setup detail, the cell is flagged as methodology-incomplete rather than treated as comparable.
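A rough sketch of that flagging rule is below; the type and function names are hypothetical, not the site's actual code:

```typescript
// Hypothetical check: a score cell is only comparable when the evaluation
// setup (N-shot, chain-of-thought, harness, test set version) is disclosed.
interface EvalSetup {
  nShot: number | null;
  chainOfThought: boolean | null;
  harness: string | null;          // e.g. "lm-evaluation-harness"
  testSetVersion: string | null;
}

function isMethodologyComplete(setup: EvalSetup): boolean {
  return (
    setup.nShot !== null &&
    setup.chainOfThought !== null &&
    setup.harness !== null &&
    setup.testSetVersion !== null
  );
}

// Cells that fail this check are rendered with a "methodology-incomplete"
// flag rather than compared directly against fully documented scores.
```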
Saturation flagged
MMLU, HumanEval, MBPP, HellaSwag and WinoGrande are saturated across frontier models in 2026. The site states this on the benchmark page and points readers to current benchmarks (MMLU-Pro, GPQA-Diamond, ARC-AGI-2, LiveCodeBench, SWE-bench Verified) for live frontier comparison.
Contamination flagged
Public benchmarks with high contamination risk (MMLU, HumanEval, HellaSwag) are flagged on the relevant pages. Clean independent verification of vendor-claimed scores on these benchmarks is impossible, because the contamination is structural (the test data sits in the public crawls every lab trains on) rather than a matter of bad actors.
No vendor ranking shortcuts
The site does not produce a single composite leaderboard that purports to rank all frontier models against each other. Different benchmarks measure different things. A composite ranking would hide the trade-offs that matter for production model selection.
Refresh cadence
SOTA benchmark scores are re-verified quarterly against the official leaderboard for each benchmark. Cluster-head queries (the benchmarks the site holds the strongest search position on, currently SWE-bench Verified, MMLU-Pro, GPQA-Diamond, AgentBench, WebArena) are checked monthly. The last full verification pass closed in May 2026.
The verification date is held in a single constant (LAST_VERIFIED_DATE) in src/lib/schema.ts. Footer text, schema dateModified, and visible headings all read from that single source. This is a deliberate design choice: a cosmetic refresh (rolling a date forward without doing the underlying verification work) cannot move the date in one place without moving it everywhere, so the footer, the schema and the visible headings can never drift out of sync.
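In outline, the pattern looks like this. The constant name and file path come from the text above; the consuming functions and the example date value are illustrative:

```typescript
// src/lib/schema.ts -- single source of truth for the verification date.
// The value shown here is an example, not the live site's current date.
export const LAST_VERIFIED_DATE = "2026-05";

// Illustrative consumers: footer text, JSON-LD dateModified and visible
// headings all import the same constant, so the date cannot be rolled
// forward in one surface and not the others.
export function footerText(): string {
  return `Last full verification pass: ${LAST_VERIFIED_DATE}`;
}

export function schemaDateModified(): { dateModified: string } {
  return { dateModified: LAST_VERIFIED_DATE };
}
```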
Disclosures
1. No affiliate parameters on benchmark or vendor URLs cited in editorial content.
2. Some links on the /tools-compared page may route through affiliate URLs to evaluation tool vendors. Tool order on that page is alphabetical by category, not influenced by affiliate status.
3. Not affiliated with Anthropic, OpenAI, Google, Meta, Mistral, Cohere, AWS Bedrock, Azure OpenAI, or any other listed model vendor.
4. Not affiliated with Braintrust, Langfuse, LangSmith, Arize, DeepEval, Patronus, Helicone, PromptLayer, or any other listed eval-tool vendor.
5. Not affiliated with Princeton, Stanford, OpenAI, Allen AI, EleutherAI, LMSYS, HuggingFace, or any benchmark-producing organisation.
Contact and corrections
Corrections welcome. If you find a misquoted score, an out-of-date capture date, a missing N-shot setting, a methodology error, or a citation that does not resolve to the claimed source, email editor@benchmarkingagents.com and the correction will land in the next refresh pass.
Five business days is the target response window. For substantive corrections (not just a typo), the corrected page also carries a short corrigendum note at the bottom and rolls the schema dateModified forward. Cosmetic typo fixes land silently.
For commercial enquiries (sponsored content, paid placements, lead-routing arrangements) the answer is no. The site is a reference, and accepting any of those would compromise the editorial position above.