RAG Benchmarks: The 2026 Selection Guide
There is no single "RAG benchmark". Real RAG evaluation needs a benchmark for retrieval, a benchmark for embedding quality, and a benchmark for end-to-end behaviour. BEIR, MTEB, and RAGAS cover those three stages in turn; HotPotQA and MultiHopRAG add multi-hop reasoning. Quote two or three together, and expect the configuration of your pipeline to matter more than any single number.
RAG evaluation is several problems, not one
A RAG (retrieval-augmented generation) pipeline has three logical stages: encoding (turn text into embeddings), retrieval (find relevant context for a query), and generation (produce an answer grounded in the retrieved context). Each stage has its own failure modes, its own quality metrics, and its own benchmarks. There is no single benchmark that captures all three stages, and any claim of "the best RAG benchmark" usually means the speaker is focused on one stage.
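To make the stage boundaries concrete, here is a minimal Python sketch of the three stages; `embed`, `vector_store`, and `llm` are hypothetical stand-ins rather than any particular library's API, and each stage maps to a different benchmark family.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class RAGPipeline:
    # Hypothetical components, not a real library's interface:
    embed: Callable[[str], list[float]]   # stage 1: encoding   (benchmark: MTEB)
    vector_store: object                  # stage 2: retrieval  (benchmark: BEIR)
    llm: Callable[[str], str]             # stage 3: generation (benchmark: RAGAS)

    def answer(self, question: str, k: int = 5) -> str:
        query_vec = self.embed(question)                     # encode the query
        passages = self.vector_store.search(query_vec, k=k)  # find context
        context = "\n\n".join(passages)
        prompt = f"Answer using only this context:\n{context}\n\nQ: {question}"
        return self.llm(prompt)                              # grounded generation
```

A failure in any one stage caps the whole pipeline: a perfect generator cannot recover from retrieval that returns the wrong passages.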
The honest 2026 approach to RAG evaluation is to quote two or three benchmarks covering different stages: BEIR for retrieval quality, MTEB for embedding quality, and either RAGAS or a custom golden dataset for end-to-end behaviour. Multi-hop reasoning gets its own benchmark (MultiHopRAG or HotPotQA) when it is part of the workload. See our RAG evaluation deep dive for the methodology details.
The headline trade-off across RAG benchmarks is precision-versus-realism. Retrieval-only benchmarks (BEIR, MS-MARCO) have programmatic success functions and are highly reproducible. End-to-end benchmarks (RAGAS, custom golden datasets) are closer to production behaviour but rely on LLM-as-judge scoring with non-trivial noise. Quote the precision benchmarks for component selection; quote the end-to-end benchmarks for system-level claims.
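"Programmatic success function" here means the score is a deterministic computation over ranked results rather than a judge model's opinion. BEIR's headline NDCG@10 is the canonical example; a simplified per-query version (normalising over the retrieved list only, where a full evaluation would take the ideal ranking from the qrels) looks like this:

```python
import math

def ndcg_at_k(relevances: list[float], k: int = 10) -> float:
    """NDCG@k for one query. `relevances` holds the graded relevance of
    each retrieved document, in the rank order the system returned them.
    Simplified: the ideal ranking is taken from the retrieved list itself."""
    def dcg(rels):
        return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(rels))
    ideal = dcg(sorted(relevances, reverse=True)[:k])
    return dcg(relevances[:k]) / ideal if ideal > 0 else 0.0

# A retriever that puts the only relevant document at rank 3:
print(ndcg_at_k([0, 0, 1, 0, 0]))  # 0.5, since 1/log2(3 + 1) = 0.5
```

Two runs of this function on the same ranking always agree, which is exactly the property LLM-as-judge metrics give up.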
Benchmark-by-benchmark comparison
The full picture of RAG benchmarks in 2026 spans pure retrieval benchmarks, embedding benchmarks, end-to-end RAG benchmarks, hallucination-focused benchmarks, and multi-hop QA benchmarks. The summary below lays out what each measures, the headline metric, the main strength and weakness, and the recommendation.

- BEIR: retrieval quality across 18 zero-shot datasets. Headline metric: NDCG@10. Programmatic and highly reproducible, but retrieval-only. Recommended for comparing retrievers and rerankers.
- MS-MARCO: passage retrieval over web-search queries. Headline metric: MRR@10. Programmatic and reproducible, but narrower in domain than BEIR. Recommended as a secondary retrieval check.
- MTEB: overall embedding quality across 56 datasets and 8 task types. Headline metric: mean score across tasks. Broad coverage of everything embeddings are used for, though the average can hide task-level gaps. Recommended for embedding model selection.
- RAGAS: end-to-end pipeline behaviour through four LLM-as-judge metrics. Closest to production behaviour, but carries 2-5 points of judge noise between runs. Recommended for relative comparisons and production monitoring.
- RAGTruth: hallucination, with graded annotations at sentence, claim, and factoid level. Directly targets unsupported claims, but measures nothing else. Recommended when hallucination prevention is the primary goal.
- HotPotQA: multi-hop question answering. Headline metric: answer EM/F1. Foundational and widely reported, but predates modern RAG pipelines. Recommended for multi-hop reasoning baselines.
- MultiHopRAG: multi-hop reasoning over retrieved context. Built for the RAG-plus-multi-hop intersection, but newer with fewer published scores. Recommended for production multi-hop RAG evaluation.
Use-case-by-use-case selection guide
The right benchmark depends on the question being asked. Picking an embedding model is a different question from monitoring a production RAG pipeline; both are different from evaluating multi-hop reasoning. The mapping below pairs common RAG-engineering questions with the benchmarks that best answer them.

- Picking an embedding model: MTEB, with BEIR quoted alongside.
- Comparing retrievers or rerankers: BEIR, with MS-MARCO as a secondary check.
- Monitoring a production RAG pipeline: RAGAS, or a custom golden dataset.
- Making claims about hallucination rate: RAGTruth, plus FACTS Grounding for system-level claims.
- Evaluating multi-hop reasoning: MultiHopRAG first; HotPotQA for comparability with older results.
BEIR and MTEB: when to use which
BEIR and MTEB both cover retrieval quality but differ in scope and intent. BEIR (Benchmarking IR) is specifically a retrieval benchmark: 18 datasets, all retrieval, with NDCG@10 as the headline metric. It is the right benchmark when the question is "how good is this model at finding relevant documents". MTEB (Massive Text Embedding Benchmark) is broader: 56 datasets across 8 task types including retrieval but also classification, clustering, semantic similarity, and reranking. It is the right benchmark when the question is "how good is this embedding model overall".
In practice, MTEB has become the more cited benchmark for embedding model selection because production RAG pipelines often use embeddings for multiple purposes (retrieval, reranking, semantic search, deduplication), and MTEB's breadth captures the full utility. BEIR remains the right benchmark for pure retrieval comparisons and is often used alongside MTEB. Many embedding model cards now report both numbers; we recommend quoting both when comparing embedding models.
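Because MTEB surfaces BEIR's retrieval datasets as tasks, one evaluation run can produce both kinds of numbers for a candidate model. The sketch below follows the usage pattern from the MTEB README at the time of writing; the API has shifted across versions, so treat it as illustrative rather than definitive. SciFact is one of the BEIR datasets available through MTEB.

```python
from mteb import MTEB
from sentence_transformers import SentenceTransformer

# Any model with an `encode` method works; all-MiniLM-L6-v2 is just a
# small, widely available example.
model = SentenceTransformer("all-MiniLM-L6-v2")

# SciFact is a BEIR retrieval dataset, so its NDCG@10 is a BEIR-style
# number; add classification or clustering tasks for the broader MTEB view.
evaluation = MTEB(tasks=["SciFact"])
results = evaluation.run(model, output_folder="results/all-MiniLM-L6-v2")
print(results)
```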
RAGAS and end-to-end RAG evaluation
RAGAS evaluates end-to-end RAG pipelines through four LLM-as-judge metrics: faithfulness (does the generated answer reflect the retrieved context), answer relevance (does the answer address the question), context recall (did retrieval find the relevant context), and context precision (is the retrieved context actually useful). The four metrics together give a more complete picture than any single one; a high faithfulness score with low context recall, for example, means the system is being honest about its limited information rather than hallucinating.
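In code, a RAGAS evaluation takes records of question, generated answer, retrieved contexts, and (for context recall) a ground-truth answer. The sketch below follows the classic RAGAS API; later releases restructured the entry points, so check the documentation for your installed version.

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy, context_precision, context_recall, faithfulness,
)

# One record per pipeline run; `ground_truth` is only needed for
# context_recall, the other three metrics work without it.
data = Dataset.from_dict({
    "question":     ["When did the EU AI Act enter into force?"],
    "answer":       ["It entered into force on 1 August 2024."],
    "contexts":     [["The AI Act entered into force on 1 August 2024."]],
    "ground_truth": ["1 August 2024."],
})

scores = evaluate(data, metrics=[
    faithfulness, answer_relevancy, context_recall, context_precision,
])
print(scores)  # per-metric averages over the dataset
```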
The LLM-as-judge nature of RAGAS introduces noise that pure retrieval benchmarks avoid. Two RAGAS runs on the same data can produce different scores by 2-5 points depending on the judge model and prompt. The framework is best used for relative comparison (which configuration is better) rather than absolute scoring (this is the right number). For production monitoring, RAGAS provides a structured way to track pipeline health over time; for component selection, the more reproducible BEIR or MTEB numbers are usually better.
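One pragmatic response to that noise is to repeat the evaluation and only trust a difference between configurations when it exceeds the run-to-run wobble. A sketch, where `run_ragas_eval` is a hypothetical wrapper around the evaluation above, not a RAGAS API:

```python
import statistics

def compare_configs(run_ragas_eval, config_a, config_b, runs: int = 5):
    """Repeat a noisy LLM-as-judge evaluation and report whether the gap
    between two configurations clears the observed spread of the scores."""
    a = [run_ragas_eval(config_a) for _ in range(runs)]
    b = [run_ragas_eval(config_b) for _ in range(runs)]
    gap = statistics.mean(b) - statistics.mean(a)
    spread = statistics.stdev(a) + statistics.stdev(b)
    return gap, abs(gap) > spread  # (difference, is it trustworthy?)
```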
Hallucination-focused benchmarks
Hallucination is a distinct concern from retrieval quality and end-to-end faithfulness. RAGTruth is one of several benchmarks specifically designed to measure hallucination: how often a generated response contains claims that are not supported by the retrieved context. The benchmark provides graded annotations across multiple granularities (sentence-level, claim-level, factoid-level) and is the most directly applicable benchmark for evaluating systems where hallucination prevention is the primary goal.
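The granularity matters because claim-level and response-level rates can diverge sharply on the same outputs. The sketch below uses a hypothetical annotation format in the spirit of RAGTruth; the dataset's actual schema and label taxonomy differ, so see the dataset card.

```python
# Hypothetical claim-level annotations: each generated response is split
# into claims, each labelled as supported by the retrieved context or not.
responses = [
    {"claims": [{"supported": True}, {"supported": False}]},
    {"claims": [{"supported": True}, {"supported": True}]},
]

def claim_hallucination_rate(responses) -> float:
    claims = [c for r in responses for c in r["claims"]]
    return sum(not c["supported"] for c in claims) / len(claims)

def response_hallucination_rate(responses) -> float:
    # Stricter: a response counts as hallucinated if ANY claim is unsupported.
    bad = sum(any(not c["supported"] for c in r["claims"]) for r in responses)
    return bad / len(responses)

print(claim_hallucination_rate(responses))     # 0.25
print(response_hallucination_rate(responses))  # 0.5
```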
Adjacent benchmarks include FACTS Grounding (Google's factuality grounding evaluation) and FaithBench. None has fully displaced RAGAS' faithfulness metric for production monitoring, but RAGTruth and FACTS Grounding are useful for system-level claims about hallucination rate. Quote them when hallucination rate is a primary product metric.
Multi-hop reasoning
Multi-hop reasoning is a separate axis of RAG capability. A pipeline that can retrieve and synthesise information from a single document is different from one that can chain reasoning across multiple documents to answer a question. HotPotQA is the foundational benchmark; MultiHopRAG is the newer benchmark designed specifically for the RAG-plus-multi-hop intersection. Both are useful; MultiHopRAG is more directly applicable to production RAG evaluation.
The multi-hop case is where DSPy and similar prompt-optimisation frameworks tend to add the most value. The prompt structures for retrieve-then-reason multi-hop pipelines are non-obvious to hand-tune, and programmatic optimisation reliably finds better configurations. See our DSPy reference for the detail; the headline is that on multi-hop benchmarks like HotPotQA and MultiHopRAG, DSPy compilation lifts scores by 10-20 points over hand-prompted baselines.
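For a sense of what those pipelines look like, here is a minimal retrieve-then-reason module following the multi-hop pattern in DSPy's HotPotQA tutorials. DSPy's API has evolved across versions, so names like `dspy.Retrieve` should be treated as version-dependent, and retriever/LM configuration is omitted.

```python
import dspy

class MultiHopQA(dspy.Module):
    """Two-hop question answering: each hop writes a fresh search query
    from the evidence gathered so far, then a final step answers."""

    def __init__(self, num_hops: int = 2, k: int = 3):
        super().__init__()
        self.num_hops = num_hops
        self.retrieve = dspy.Retrieve(k=k)
        self.gen_query = dspy.ChainOfThought("context, question -> search_query")
        self.gen_answer = dspy.ChainOfThought("context, question -> answer")

    def forward(self, question):
        context = []
        for _ in range(self.num_hops):
            # Each hop writes a new search query from what has been found so far.
            query = self.gen_query(context=context, question=question).search_query
            context += self.retrieve(query).passages
        return self.gen_answer(context=context, question=question)
```

Compiling a module like this with one of DSPy's optimisers against a small dev set is what produces the lifts over hand-prompting described above: the optimiser rewrites the query-generation and answer prompts rather than the module structure.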
Sources
- [1] Thakur, N. et al. (2021). BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models. arXiv:2104.08663.
- [2] Muennighoff, N. et al. (2022). MTEB: Massive Text Embedding Benchmark. arXiv:2210.07316.
- [3] RAGAS framework documentation. docs.ragas.io. Accessed May 2026.
- [4] Yang, Z. et al. (2018). HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering. arXiv:1809.09600.
- [5] Tang, Y. et al. (2024). MultiHop-RAG: Benchmarking Retrieval-Augmented Generation for Multi-Hop Queries. arXiv:2401.15391.