Independent reference. Not affiliated with OpenAI, Anthropic, Google DeepMind, Meta, Mistral, xAI, Papers With Code, HuggingFace, Langfuse, LangSmith, Braintrust, Arize, Humanloop, or HoneyHive. Scores cited with source and capture date. Affiliate disclosure.
Last verified April 2026

AI Benchmarking Glossary - 30 Terms for LLM and Agent Evaluation

Clear, concise definitions of key terms in AI benchmarking and LLM evaluation. Each entry includes a 2-4 sentence definition and a link to the relevant deep-dive page. Alphabetically ordered.

Agent benchmark

deep dive ->

A benchmark that measures multi-step task completion with tool use, rather than single-turn question answering. Examples include SWE-bench Verified, WebArena, and OSWorld. Agent benchmarks are harder to design, more expensive to run, and more variable than LLM benchmarks because they require realistic environments and complex success criteria.

Answer relevancy

deep dive ->

A RAG evaluation metric measuring whether the generated answer actually addresses the question asked. Computed by having a judge LLM generate several questions that the answer would address, then computing semantic similarity between those questions and the original question. Low answer relevancy often indicates retrieval failure (wrong context was returned).
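The scoring step described above can be sketched as follows, assuming the judge LLM has already generated the questions and an embedding model has turned both the original and generated questions into vectors (the LLM and embedding calls are out of scope here):

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def answer_relevancy(question_vec, generated_question_vecs):
    """Mean cosine similarity between the original question's embedding
    and the embeddings of questions a judge LLM generated from the answer.
    Embeddings are assumed to be precomputed lists of floats."""
    sims = [cosine(question_vec, v) for v in generated_question_vecs]
    return sum(sims) / len(sims)
```

A score near 1.0 means the answer could plausibly have been written in response to the original question; a low score suggests the answer addresses something else entirely.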

ARC-AGI

Abstraction and Reasoning Corpus, created by Francois Chollet (2019). A benchmark of visual puzzle tasks where the model must infer an abstract transformation rule from a few input-output examples and apply it to a new input. Humans solve 85% of tasks; frontier AI models struggled to exceed 50% until 2024. ARC-AGI-2 (2025) is a harder successor.

Benchmark contamination

deep dive ->

The presence of test questions from a benchmark in a model's pre-training data. A contaminated benchmark inflates scores because the model may recall memorised answers rather than reasoning through them. MMLU, HumanEval, and MBPP have documented contamination from web crawl pre-training data.

Chain-of-thought (CoT)

A prompting technique where the model is instructed to reason step-by-step before giving a final answer. CoT improves performance on reasoning-heavy tasks and is required by some benchmarks (MMLU-Pro uses mandatory CoT). A score with CoT is not comparable to a score without CoT on the same benchmark.

Chatbot Arena

deep dive ->

A crowdsourced human-preference benchmark run by LMSYS at UC Berkeley. Users chat with two anonymous models and vote for which they prefer. Votes are aggregated into an Elo ranking via the Bradley-Terry model. Over 2 million votes as of April 2026. Measures open-ended conversation preference, not factual accuracy or agentic capability.
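Chatbot Arena fits a Bradley-Terry model over the full vote history offline, but a sequential Elo update conveys the same intuition of how a single vote shifts two models' ratings. This is a simplified sketch, not the Arena's actual aggregation code:

```python
def elo_update(rating_a, rating_b, a_won, k=32.0):
    """One Elo-style update after a single pairwise vote.
    a_won is 1.0 if model A was preferred, 0.0 if model B was."""
    # Expected win probability for A under the Elo logistic model.
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    new_a = rating_a + k * (a_won - expected_a)
    new_b = rating_b + k * ((1.0 - a_won) - (1.0 - expected_a))
    return new_a, new_b
```

An upset (a low-rated model beating a high-rated one) moves the ratings more than an expected result, which is why a large vote volume is needed before the ranking stabilises.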

Context precision

deep dive ->

A RAG evaluation metric measuring what fraction of the retrieved context chunks were actually relevant to answering the question. High context precision means the retriever is not returning junk. Low context precision means the retriever is fetching too many off-topic chunks, increasing cost and confusing the generation step.

Context recall

deep dive ->

A RAG evaluation metric measuring what fraction of the information needed to answer the question was present in the retrieved context. High context recall means the retriever successfully found the relevant information. Low context recall means key information was missed, which may cause hallucination in the generation step.
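Both retrieval metrics reduce to simple fractions once a judge has labelled the chunks and facts. A minimal sketch, assuming the judge labels are already available as 0/1 lists (real frameworks such as Ragas additionally weight context precision by chunk rank):

```python
def context_precision(chunk_is_relevant):
    """Fraction of retrieved chunks judged relevant to the question.
    chunk_is_relevant: one 0/1 judge label per retrieved chunk."""
    if not chunk_is_relevant:
        return 0.0
    return sum(chunk_is_relevant) / len(chunk_is_relevant)

def context_recall(gold_fact_found):
    """Fraction of gold-answer facts present in the retrieved context.
    gold_fact_found: one 0/1 judge label per fact in the gold answer."""
    if not gold_fact_found:
        return 0.0
    return sum(gold_fact_found) / len(gold_fact_found)
```

The two metrics fail in opposite directions: low precision means the retriever returns junk alongside the right chunks; low recall means the right chunks were never returned at all.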

Eval harness

The software scaffolding that runs a benchmark - the code that generates prompts, calls the model API, parses responses, and computes metrics. Harness differences can produce significantly different scores on the same benchmark with the same model. For agent benchmarks, the harness includes the environment (repository, web browser, virtual machine) the agent operates in.

Faithfulness

deep dive ->

A RAG evaluation metric measuring whether the generated answer only makes claims that are supported by the retrieved context. Computed by breaking the answer into individual claims and checking each against the context chunks. Low faithfulness indicates hallucination - the model asserted facts not present in the retrieved information.
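The claim-checking procedure above can be sketched as a ratio over per-claim verdicts. Here `supported` is a placeholder for the judge-LLM call that decides whether one claim is backed by the context:

```python
def faithfulness(claims, context, supported):
    """Fraction of answer claims supported by the retrieved context.
    claims: list of atomic claims extracted from the answer.
    supported(claim, context) -> bool stands in for a judge-LLM call."""
    if not claims:
        return 1.0  # an answer with no claims asserts nothing unsupported
    return sum(supported(c, context) for c in claims) / len(claims)
```

In a real pipeline the claim extraction step is itself an LLM call; the quality of that decomposition matters as much as the per-claim verdicts.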

Few-shot

An evaluation setting where the model receives a small number of example input-output pairs before the test question. 'Few-shot' is typically 3-5 examples. Few-shot evaluation allows the model to infer the expected output format from examples. Scores are higher than 0-shot and not directly comparable to 0-shot scores.

G-Eval

A framework for LLM-as-judge evaluation published by Liu et al. (2023). G-Eval uses GPT-4 with explicit evaluation criteria and chain-of-thought reasoning to score model outputs. It showed Pearson correlation of 0.80-0.90 with human judgments on summarisation tasks, significantly outperforming traditional metrics like ROUGE.
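Judge-human agreement of the kind G-Eval reports is measured with Pearson correlation between the two score series. A minimal implementation, for illustration:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between judge scores (xs) and human scores (ys)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)
```

A value of 0.80-0.90 means the judge's ranking of outputs tracks human ranking closely, though it says nothing about agreement on absolute score values.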

Gold answer

deep dive ->

The human-verified ground-truth answer for a test example in a custom eval dataset. Serves as the reference point for evaluation metrics. A golden dataset is a collection of (input, gold answer) pairs. Creating high-quality gold answers is the most important and most often skipped step in building a reliable custom eval.

HumanEval

deep dive ->

A code generation benchmark of 164 Python programming problems created by OpenAI (2021). Each problem presents a function signature and docstring; the model must complete the function body. The primary metric is pass@1. HumanEval is saturated as of 2026 (frontier models score 96-98%). Use LiveCodeBench for current comparisons.

Inter-annotator agreement

deep dive ->

A measure of how consistently different human raters assign the same label or score to the same item. Commonly measured by Cohen's kappa (categorical) or Krippendorff's alpha (ordinal). An IAA of 0.7+ kappa is generally acceptable for NLP evaluation. Low IAA (below 0.5) indicates an ambiguous rubric that needs revision before collecting more ratings.
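For two raters and categorical labels, Cohen's kappa corrects raw agreement for the agreement expected by chance. A self-contained implementation:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters' categorical labels over the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items where both raters agree.
    p_o = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    # Chance agreement: sum over labels of the product of marginal rates.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    p_e = sum((counts_a[l] / n) * (counts_b[l] / n)
              for l in set(labels_a) | set(labels_b))
    if p_e == 1.0:
        return 1.0
    return (p_o - p_e) / (1 - p_e)
```

Kappa of 0 means the raters agree no more often than chance; 1 means perfect agreement. This is why raw percent agreement overstates reliability when one label dominates.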

LLM-as-judge

deep dive ->

Using a strong language model to evaluate the outputs of another model or agent. The judge reads outputs and assigns quality scores on a defined rubric. It is the backbone of most automated eval frameworks (Ragas, DeepEval, Braintrust, Langfuse). Directionally reliable for open-ended tasks with explicit rubrics; unreliable for factual correctness tasks where the judge shares the evaluated model's knowledge gaps.
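The judge call itself is just a prompt containing the rubric, the input, and the output to score. A sketch of the prompt assembly; the format is illustrative, not any framework's exact template:

```python
def judge_prompt(rubric: str, task_input: str, output: str) -> str:
    """Assemble an LLM-as-judge prompt from an explicit rubric,
    the original input, and the output to evaluate."""
    return (
        "You are an impartial evaluator.\n\n"
        f"Rubric:\n{rubric}\n\n"
        f"Input:\n{task_input}\n\n"
        f"Output to evaluate:\n{output}\n\n"
        "Return a score from 1 to 5 and a one-sentence justification."
    )
```

Making the rubric explicit in the prompt, rather than asking for a bare quality score, is what moves the judge from unreliable to directionally reliable.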

MMLU

Massive Multitask Language Understanding, created by Hendrycks et al. (2020). 15,908 multiple-choice questions across 57 subjects. Was the defining benchmark of 2020-2023. Saturated as of 2024 - frontier models score 92-94% and the benchmark cannot discriminate between them. Also the most contaminated major benchmark. Use MMLU-Pro for 2026 comparisons.

MMLU-Pro

deep dive ->

An enhanced version of MMLU with 10 answer choices (vs 4) and mandatory chain-of-thought prompting. Created by Wang et al. (2024). Harder, less contaminated, and better at discriminating between frontier models than MMLU. As of April 2026, frontier models score 79-86%, providing meaningful discrimination. The recommended knowledge benchmark for 2026 model comparisons.

MMMU

Massive Multitask Multimodal Understanding, created by Yue et al. (2024). The multimodal equivalent of MMLU: college-exam questions that require interpreting images alongside text, across 30 subjects. 11,550 questions covering diagrams, charts, scientific figures, and photographs. The standard benchmark for multimodal model evaluation in 2026.

N-shot

The number of example input-output pairs provided before the test question. 0-shot means no examples. 5-shot means 5 examples. N-shot settings significantly affect benchmark scores. A 5-shot score is typically 5-15% higher than a 0-shot score on the same benchmark. Always specify N-shot when reporting or comparing benchmark scores.
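Harnesses build n-shot prompts by prepending worked examples to the test question. A minimal sketch; the Q/A template here is illustrative, and real harnesses vary in their exact formatting:

```python
def build_prompt(examples, question):
    """Assemble an n-shot prompt: n worked (question, answer) examples,
    then the test question with an empty answer slot."""
    parts = [f"Q: {q}\nA: {a}" for q, a in examples]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)
```

Passing an empty examples list yields the 0-shot form of the same prompt, which is why the two settings are comparable in structure but not in score.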

Online eval

deep dive ->

Evaluation of model outputs in production, on live user traffic, as opposed to offline eval on a static golden dataset. Online eval runs continuously on a sampled fraction of production calls (typically 1-5%) and detects distribution drift, model provider updates, and quality regressions that offline eval misses. Both offline and online eval are necessary for production AI systems.
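Sampling the production traffic fraction is often done by hashing a stable request identifier rather than calling a random number generator, so a re-run of the eval pipeline scores the same requests. A sketch under that assumption:

```python
import hashlib

def should_eval(request_id: str, sample_rate: float = 0.02) -> bool:
    """Deterministically select ~sample_rate of production calls for
    online eval. Hashing the request id makes the decision reproducible:
    the same request is always in (or out of) the eval sample."""
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 10_000
    return bucket < int(sample_rate * 10_000)
```

The deterministic variant also makes it easy to widen the sample later (raising the rate keeps all previously sampled requests in the sample).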

Pairwise comparison

deep dive ->

An evaluation method where a judge (human or LLM) compares two outputs for the same input and picks the better one. More robust than single-point scoring because judges are better at comparing than assigning absolute scores. Used by Chatbot Arena (human pairwise) and many LLM-as-judge frameworks. Main drawback: quadratic cost - comparing N outputs requires N*(N-1)/2 judge calls.
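The quadratic cost is easy to see in a round-robin tournament over N candidate outputs. Here `judge` is a placeholder for the human or LLM comparison call:

```python
from itertools import combinations

def round_robin_wins(outputs, judge):
    """Judge every unordered pair of outputs once and count wins.
    judge(a, b) -> bool stands in for the pairwise comparison call;
    True means the first argument was preferred."""
    wins = [0] * len(outputs)
    pairs = list(combinations(range(len(outputs)), 2))
    for i, j in pairs:
        if judge(outputs[i], outputs[j]):
            wins[i] += 1
        else:
            wins[j] += 1
    return wins, len(pairs)  # len(pairs) == N*(N-1)//2 judge calls
```

For 4 outputs this costs 6 calls; for 20 outputs, 190. Tournament or Swiss-style pairings reduce the call count at the cost of a noisier ranking.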

pass@k

A metric for code generation benchmarks measuring the probability that at least one of k generated solutions passes all tests. pass@1 is the standard metric for production use cases (one attempt). pass@10 and pass@100 are used in research to understand model capability under multiple sampling. Higher k values significantly inflate apparent performance and are less relevant for deployment scenarios.
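In practice pass@k is estimated from n sampled solutions (n >= k) of which c pass, using the unbiased combinatorial estimator introduced with HumanEval (Chen et al., 2021):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    draws (without replacement) from n samples is among the c that pass.
    pass@k = 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # too few failures to fill k draws: success guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)
```

The estimator is preferred over simply running k generations per problem because sampling a larger n and computing the expectation has much lower variance for the same API budget.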

Ragas

An open-source framework (Es et al., 2023) for evaluating RAG pipelines using LLM-as-judge. Ragas computes four canonical metrics: faithfulness, answer relevancy, context precision, and context recall. It is pip-installable and works with any LLM or embedding model. Most production RAG evaluation tools (DeepEval, Arize Phoenix, Langfuse) implement Ragas-style metrics.

Regression detection

deep dive ->

The practice of tracking evaluation metrics over time to detect when a model, prompt, or system change causes quality to drop below a baseline. Requires: (1) a versioned eval dataset, (2) metric storage across runs, and (3) alert thresholds. Regression detection is the long-term value of eval pipelines - a single eval run is useful, but a time series of eval runs is the foundation of reliable AI product development.
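The alert-threshold step reduces to comparing each run's metric against the baseline minus a tolerance. A minimal sketch; the run ids, scores, and tolerance value are illustrative:

```python
def detect_regressions(history, baseline, tolerance=0.03):
    """Return the run ids whose metric fell more than `tolerance`
    below the baseline. history: list of (run_id, score) pairs."""
    floor = baseline - tolerance
    return [run_id for run_id, score in history if score < floor]
```

The tolerance exists because eval metrics are noisy run to run; it should be set from the observed variance of repeated runs on an unchanged system, not chosen arbitrarily.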

Saturation

deep dive ->

A benchmark is saturated when frontier models all score above approximately 90%, making it unable to discriminate between them. Saturated benchmarks in 2026: MMLU (92-94% frontier), HumanEval (96-98% frontier), MBPP (97%+ frontier), HellaSwag (95%+ frontier), WinoGrande (90%+ frontier). Saturated benchmarks are still useful for historical comparisons with older models.

SOTA

State of the Art. The highest-performing result on a benchmark at a given point in time, by any published model or system. SOTA is not a fixed property of a benchmark - it changes as new models are published and evaluated. Every SOTA claim requires a capture date: 'GPT-5 achieves SOTA on SWE-bench Verified at 74.5% (captured April 2026).'

SWE-bench Verified

deep dive ->

A benchmark of 500 human-verified real GitHub issues from 12 Python repositories, published by Yang et al. (2023) and verified by OpenAI (August 2024). The agent must produce a patch that makes failing tests pass without breaking existing tests. The canonical benchmark for coding agents as of 2026. Current SOTA: 74.5% (Claude 4.5 Opus, April 2026).

Test-set leakage

deep dive ->

When benchmark test questions appear in a model's pre-training data (also called contamination). Test-set leakage inflates benchmark scores because the model may recall memorised examples rather than reasoning through them. Leakage is hard to detect after pre-training and is a structural risk for any public benchmark whose questions are available on the internet.

Zero-shot

An evaluation setting where the model receives no example input-output pairs before the test question. The model must respond based on its training knowledge and any instructions in the system prompt. Zero-shot scores are lower than few-shot scores on the same benchmark. Zero-shot evaluation is closer to real deployment conditions for most use cases.