AI Benchmarking FAQ - 14 Questions About Evaluating LLMs and Agents
14 questions covering the benchmark landscape, evaluation methodology, and practical guidance for ML engineers building and evaluating AI systems. Each answer links to a dedicated deep-dive page for more detail.
01. What is AI benchmarking and why does it matter?
AI benchmarking is the practice of measuring model or agent performance on standardised test sets to enable objective comparison. Without benchmarks, evaluating models would be purely subjective and unrepeatable. Benchmarks provide shared reference points - a score of 86% on MMLU-Pro in 2026 means something that can be compared across models and over time. The caveat is that benchmarks measure narrow slices of capability, not general intelligence. A model that excels on MMLU-Pro may still produce poor outputs for your specific use case.
Read more: Benchmark reference ->

02. What is the difference between a benchmark and an eval?
A benchmark is a standardised public test set used to compare models across the field. It has a fixed dataset, a defined success criterion, and an official leaderboard. An eval is any measurement of model quality - it may use a public benchmark, a custom golden dataset, LLM-as-judge scoring, or human annotation. Public benchmarks are one kind of eval. Custom evals built for your specific workflow are the other kind, and are often more relevant for production decisions.
Read more: Custom eval guide ->

03. Which benchmarks matter in 2026?
For frontier model comparisons: MMLU-Pro for knowledge breadth (not plain MMLU, which is saturated at 92-94% for all frontier models), GPQA-Diamond for graduate-level reasoning, ARC-AGI-2 for visual abstraction, Humanity's Last Exam for the absolute frontier. For coding: LiveCodeBench (anti-contamination) and SWE-bench Verified (real software engineering). For agentic capability: SWE-bench Verified, WebArena (web navigation), OSWorld (computer use). For human preference: LMSYS Chatbot Arena.
Read more: Full benchmark reference ->

04. Is MMLU still useful?
MMLU is useful for historical comparisons with pre-2024 models. It is not useful for comparing current frontier models against each other, because all frontier models score 92-94% and the variance is within statistical noise. MMLU is also the most contaminated major benchmark - many test questions appear in pre-training corpora. For 2026 frontier comparisons, use MMLU-Pro. For historical analysis, MMLU scores are fine.
Read more: MMLU deep dive ->

05. What is SWE-bench Verified?
SWE-bench Verified is a benchmark of 500 real GitHub issues from 12 Python repositories (Django, SymPy, matplotlib, and others). Each task asks: given the repository state at the time the issue was opened, the issue description, and a failing test, can an AI agent produce a patch that makes the test pass without breaking existing tests? It is the canonical benchmark for coding agents as of 2026. Current SOTA is 74.5% (Claude 4.5 Opus, April 2026).
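The pass/fail criterion above can be sketched as a pure function. SWE-bench groups each task's tests into a FAIL_TO_PASS set (tests that failed before the patch) and a PASS_TO_PASS set (tests that already passed); a minimal sketch of the resolution check, assuming the tests have already been run in the harness:

```python
def task_resolved(fail_to_pass: dict[str, bool],
                  pass_to_pass: dict[str, bool]) -> bool:
    """SWE-bench-style success criterion for one task.

    fail_to_pass: test name -> passed after patch?, for tests failing pre-patch.
    pass_to_pass: test name -> passed after patch?, for tests passing pre-patch.
    """
    # The patch must fix every previously failing test...
    fixed = all(fail_to_pass.values())
    # ...without breaking any test that already passed.
    unbroken = all(pass_to_pass.values())
    return fixed and unbroken

# A patch that fixes the issue but breaks a regression test does not count.
print(task_resolved({"test_issue_123": True}, {"test_existing": False}))  # False
```

The real harness applies the patch in an isolated container and runs the test suite to populate those dictionaries; this sketch only captures the aggregation logic.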
Read more: SWE-bench deep dive ->

06. How do I evaluate my own agent?
Build a golden test set of 30-500 examples from your agent's actual use case. Define a success criterion (exact match, functional test, LLM-as-judge, human rating). Run inference against every model/prompt/version you want to compare. Aggregate into headline metrics (pass rate, mean score, cost per correct answer). Track over time for regression detection. Public benchmarks measure general capability; custom evals measure whether the agent does your specific job well. Both are necessary.
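The steps above fit in a few dozen lines before any tooling is needed. A minimal harness sketch, where `run_agent` stands in for your own inference call and `is_correct` is whatever success criterion you chose (exact match here):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Example:
    input: str
    expected: str

def evaluate(golden_set: list[Example],
             run_agent: Callable[[str], str],         # model/prompt under test
             is_correct: Callable[[str, str], bool],  # success criterion
             cost_per_call: float = 0.01) -> dict:
    """Run the agent over a golden set and aggregate headline metrics."""
    passes = sum(is_correct(run_agent(ex.input), ex.expected) for ex in golden_set)
    total_cost = cost_per_call * len(golden_set)
    return {
        "pass_rate": passes / len(golden_set),
        "cost_per_correct": total_cost / passes if passes else float("inf"),
    }

# Exact-match criterion against a toy "agent" (a real one would call your model).
report = evaluate(
    [Example("2+2", "4"), Example("3+3", "6")],
    run_agent=lambda q: str(eval(q)),
    is_correct=lambda out, exp: out.strip() == exp,
)
print(report["pass_rate"])  # 1.0
```

Persist `report` per model/prompt/version and diff it over time for regression detection.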
Read more: Custom eval guide ->

07. What is LLM-as-a-judge?
LLM-as-judge is using a strong language model to evaluate the outputs of another model. The judge reads an output (or a pair of outputs) and assigns a quality score on a rubric. It is the backbone of most automated eval frameworks including Ragas, DeepEval, Braintrust, Langfuse, and LangSmith. LLM-as-judge is fast and directionally correct for open-ended tasks (summarisation, tone, relevance). It is unreliable for factual correctness on topics the judge does not know well, and for tasks where the judge is weaker than the evaluated model.
Read more: LLM-as-Judge methodology ->

08. How do I evaluate a RAG pipeline?
RAG evaluation covers two parts: retrieval quality (did you fetch the right context?) and generation quality (did the LLM use the context correctly?). The four canonical metrics from the Ragas framework are faithfulness (is the answer grounded in the retrieved context?), answer relevancy (does the answer address the question?), context precision (of retrieved chunks, how many were relevant?), and context recall (of needed information, how much was retrieved?). Build a golden dataset of 30-100 (question, expected answer, expected source chunks) triples before running these metrics.
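The two retrieval-side metrics reduce to set arithmetic once you have relevance labels. A simplified sketch assuming the golden dataset supplies ground-truth relevant chunk IDs (Ragas itself uses an LLM to judge per-chunk relevance rather than labels):

```python
def context_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Of the retrieved chunks, what fraction were actually relevant?"""
    if not retrieved:
        return 0.0
    return sum(c in relevant for c in retrieved) / len(retrieved)

def context_recall(retrieved: list[str], relevant: set[str]) -> float:
    """Of the chunks needed to answer, what fraction were retrieved?"""
    if not relevant:
        return 1.0
    return len(relevant & set(retrieved)) / len(relevant)

retrieved = ["chunk_a", "chunk_b", "chunk_c", "chunk_d"]
relevant = {"chunk_a", "chunk_b", "chunk_e"}
print(context_precision(retrieved, relevant))  # 0.5: 2 of 4 retrieved were relevant
print(context_recall(retrieved, relevant))     # ~0.667: 2 of 3 needed were retrieved
```

Faithfulness and answer relevancy have no closed form like this; they require an LLM judge over the generated answer.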
Read more: RAG evaluation guide ->

09. Which eval tool should I choose?
It depends on your requirements. For OSS self-hosting: Langfuse (most mature) or Arize Phoenix (strong production monitoring). For best developer experience and CI integration: Braintrust. For teams on LangChain: LangSmith. For pytest-style eval with no cloud: DeepEval. If you have fewer than 100 test examples and no production traffic, a Python script and a CSV is fine - you do not need tooling yet.
Read more: Eval tools compared ->

10. Can I trust vendor-published benchmark scores?
Treat vendor-published scores as a starting point, not a final answer. Model cards are produced by the company that built the model - a structural conflict of interest. Common issues: selective reporting (publishing only benchmarks where the model performs well) and unspecified methodology (N-shot, CoT, harness); independent replications frequently diverge from vendor claims by 1-5 percentage points. Cross-reference with independent sources: HuggingFace Open LLM Leaderboard v2, Papers With Code, and the official benchmark leaderboards (swebench.com, arcprize.org, lmsys.org).
Read more: Full critique of benchmark ecosystem ->

11. What is benchmark contamination?
Benchmark contamination occurs when test questions from a benchmark appear in a model's training data. The model may then solve problems by recalling memorised answers rather than reasoning through them. Documented cases: MMLU questions appear verbatim in Common Crawl (the primary pre-training web crawl); HumanEval problems are near-duplicates of LeetCode solutions that appear in GitHub and Stack Overflow crawls; SWE-bench issues have solutions in the same repository's public git history. Contamination inflates scores beyond the model's true capability.
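Contamination checks are typically n-gram overlap scans between the test set and training corpus (13-gram matching on tokenized text is a commonly cited choice). A deliberately simplistic word-level sketch; production decontamination runs on tokenized corpora at scale with normalization:

```python
def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    """All word-level n-grams of a text, lowercased."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(test_item: str, training_doc: str, n: int = 13) -> bool:
    """Flag a test item if any n-gram of it also appears in a training document."""
    return bool(ngrams(test_item, n) & ngrams(training_doc, n))

doc = "the quick brown fox jumps over the lazy dog near the river bank"
item = "a fox jumps over the lazy dog today"
# n lowered to 5 only because the example strings are short:
print(is_contaminated(item, doc, n=5))  # True: "jumps over the lazy dog" in both
```

Exact n-gram matching misses paraphrased leakage, which is why contamination estimates are best treated as lower bounds.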
Read more: Contamination and saturation ->

12. What is the difference between offline and online eval?
Offline eval runs against a fixed golden dataset before you ship - it gates releases and detects regressions. Online eval runs continuously against a sample of live production traffic after you ship - it catches distribution drift and unexpected failures that offline eval missed. Both are necessary. Offline eval is cheaper and more controlled. Online eval is more representative of real user queries. The standard practice is offline eval in CI on every pull request, and online eval as continuous production monitoring.
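The "offline eval in CI" half of that practice is usually a threshold check on the golden-set pass rate. A hedged sketch (the 2-point tolerance is an arbitrary example, not a standard):

```python
def regression_gate(baseline_pass_rate: float,
                    candidate_pass_rate: float,
                    max_drop: float = 0.02) -> bool:
    """True if the candidate is shippable: its offline pass rate has not
    dropped more than max_drop below the current baseline."""
    return candidate_pass_rate >= baseline_pass_rate - max_drop

# In CI you would compute candidate_pass_rate from the golden set on each
# pull request and fail the build when the gate returns False.
print(regression_gate(baseline_pass_rate=0.91, candidate_pass_rate=0.90))  # True
```

On small golden sets, remember that a 2-point drop can be pure sampling noise; widen the tolerance or grow the set accordingly.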
Read more: Production monitoring guide ->

13. How much does human evaluation cost?
Order of magnitude: $0.50-$2.00 per rating using crowdwork platforms (Prolific, MTurk). For a 200-example eval set with 3 raters each (600 ratings), expect $300-$1,200. Domain expert annotation costs $5-$15 per rating and is appropriate for technical domains (medical, legal, code quality). Internal team time costs $8-$25 per rating (fully loaded). Human eval is expensive but provides the most reliable signal for release decisions. The standard hybrid practice is LLM-as-judge for iteration speed and human eval at release milestones.
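The budgeting arithmetic above is worth scripting so it scales with your eval set. A small calculator reproducing the crowdwork figures from the answer:

```python
def human_eval_cost(n_examples: int, raters_per_example: int,
                    cost_per_rating: tuple[float, float]) -> tuple[float, float]:
    """(low, high) total cost for one human eval round."""
    ratings = n_examples * raters_per_example
    low, high = cost_per_rating
    return ratings * low, ratings * high

# 200 examples x 3 raters at $0.50-$2.00 per crowdwork rating:
print(human_eval_cost(200, 3, (0.50, 2.00)))  # (300.0, 1200.0)
```

Swap in the $5-$15 range to budget domain-expert annotation for the same set.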
Read more: Human vs automated evaluation ->

14. Is Chatbot Arena reliable?
Chatbot Arena is statistically reliable for ranking models on open-ended chat preference - it has over 2 million human votes as of April 2026, and the confidence intervals at the top of the leaderboard are approximately +/-5 Elo points. It is not reliable for predicting factual accuracy, coding ability, or agentic performance, which are better measured by dedicated benchmarks. An Elo gap smaller than the combined confidence intervals may not be statistically significant; look for 20+ point gaps before treating the difference as meaningful.
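Elo gaps translate directly into expected head-to-head win rates, which puts those thresholds in perspective. The standard Elo expectation formula as a one-liner:

```python
def elo_win_prob(elo_gap: float) -> float:
    """Expected probability that the higher-rated model wins one head-to-head vote,
    given its Elo rating advantage (standard Elo expectation formula)."""
    return 1 / (1 + 10 ** (-elo_gap / 400))

print(round(elo_win_prob(20), 3))   # 0.529 - a "meaningful" 20-point gap is near a coin flip
print(round(elo_win_prob(100), 3))  # 0.64
```

Even a statistically significant gap can be practically small: a 20-point leader is preferred in only about 53% of votes.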
Read more: Chatbot Arena deep dive ->