RAG Benchmarks: The 2026 Selection Guide
There is no single "RAG benchmark". Real RAG evaluation needs a benchmark for retrieval, a benchmark for embedding quality, and a benchmark for end-to-end behaviour. BEIR, MTEB, and RAGAS cover those three stages in turn; HotPotQA and MultiHopRAG add multi-hop reasoning. Quote two or three together, and expect the configuration of your pipeline to matter more than any single number.
RAG evaluation is several problems, not one
A RAG (retrieval-augmented generation) pipeline has three logical stages: encoding (turn text into embeddings), retrieval (find relevant context for a query), and generation (produce an answer grounded in the retrieved context). Each stage has its own failure modes, its own quality metrics, and its own benchmarks. There is no single benchmark that captures all three stages, and any claim of "the best RAG benchmark" usually means the speaker is focused on one stage.
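To make the stage boundaries concrete, here is a minimal Python sketch of the three stages; `embed`, `vector_store`, and `llm` are hypothetical stand-ins rather than any particular library's API, and each stage maps to a different benchmark family.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class RAGPipeline:
    # Hypothetical components, not a real library's interface:
    embed: Callable[[str], list[float]]   # stage 1: encoding   (benchmark: MTEB)
    vector_store: object                  # stage 2: retrieval  (benchmark: BEIR)
    llm: Callable[[str], str]             # stage 3: generation (benchmark: RAGAS)

    def answer(self, question: str, k: int = 5) -> str:
        query_vec = self.embed(question)                     # encode the query
        passages = self.vector_store.search(query_vec, k=k)  # find context
        context = "\n\n".join(passages)
        prompt = f"Answer using only this context:\n{context}\n\nQ: {question}"
        return self.llm(prompt)                              # grounded generation
```

A failure in any one stage caps the whole pipeline: a perfect generator cannot recover from retrieval that returns the wrong passages.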
The honest 2026 approach to RAG evaluation is to quote two or three benchmarks covering different stages: BEIR for retrieval quality, MTEB for embedding quality, and either RAGAS or a custom golden dataset for end-to-end behaviour. Multi-hop reasoning gets its own benchmark (MultiHopRAG or HotPotQA) when it is part of the workload. See our RAG evaluation deep dive for the methodology details.
The headline trade-off across RAG benchmarks is precision-versus-realism. Retrieval-only benchmarks (BEIR, MS-MARCO) have programmatic success functions and are highly reproducible. End-to-end benchmarks (RAGAS, custom golden datasets) are closer to production behaviour but rely on LLM-as-judge scoring with non-trivial noise. Quote the precision benchmarks for component selection; quote the end-to-end benchmarks for system-level claims.
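"Programmatic success function" here means the score is a deterministic computation over ranked results rather than a judge model's opinion. BEIR's headline NDCG@10 is the canonical example; a simplified per-query version (normalising over the retrieved list only, where a full evaluation would take the ideal ranking from the qrels) looks like this:

```python
import math

def ndcg_at_k(relevances: list[float], k: int = 10) -> float:
    """NDCG@k for one query. `relevances` holds the graded relevance of
    each retrieved document, in the rank order the system returned them.
    Simplified: the ideal ranking is taken from the retrieved list itself."""
    def dcg(rels):
        return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(rels))
    ideal = dcg(sorted(relevances, reverse=True)[:k])
    return dcg(relevances[:k]) / ideal if ideal > 0 else 0.0

# A retriever that puts the only relevant document at rank 3:
print(ndcg_at_k([0, 0, 1, 0, 0]))  # 0.5, since 1/log2(3 + 1) = 0.5
```

Two runs of this function on the same ranking always agree, which is exactly the property LLM-as-judge metrics give up.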
Benchmark-by-benchmark comparison
The full picture of RAG benchmarks in 2026 spans pure retrieval benchmarks, embedding benchmarks, end-to-end RAG benchmarks, hallucination-focused benchmarks, and multi-hop QA benchmarks. The summary below lays out what each measures, the headline metric, the main strength and weakness, and the recommendation.

- BEIR: retrieval quality across 18 zero-shot datasets. Headline metric: NDCG@10. Programmatic and highly reproducible, but retrieval-only. Recommended for comparing retrievers and rerankers.
- MS-MARCO: passage retrieval over web-search queries. Headline metric: MRR@10. Programmatic and reproducible, but narrower in domain than BEIR. Recommended as a secondary retrieval check.
- MTEB: overall embedding quality across 56 datasets and 8 task types. Headline metric: mean score across tasks. Broad coverage of everything embeddings are used for, though the average can hide task-level gaps. Recommended for embedding model selection.
- RAGAS: end-to-end pipeline behaviour through four LLM-as-judge metrics. Closest to production behaviour, but carries 2-5 points of judge noise between runs. Recommended for relative comparisons and production monitoring.
- RAGTruth: hallucination, with graded annotations at sentence, claim, and factoid level. Directly targets unsupported claims, but measures nothing else. Recommended when hallucination prevention is the primary goal.
- HotPotQA: multi-hop question answering. Headline metric: answer EM/F1. Foundational and widely reported, but predates modern RAG pipelines. Recommended for multi-hop reasoning baselines.
- MultiHopRAG: multi-hop reasoning over retrieved context. Built for the RAG-plus-multi-hop intersection, but newer with fewer published scores. Recommended for production multi-hop RAG evaluation.
Use-case-by-use-case selection guide
The right benchmark depends on the question being asked. Picking an embedding model is a different question from monitoring a production RAG pipeline; both are different from evaluating multi-hop reasoning. The mapping below pairs common RAG-engineering questions with the benchmarks that best answer them.

- Picking an embedding model: MTEB, with BEIR quoted alongside.
- Comparing retrievers or rerankers: BEIR, with MS-MARCO as a secondary check.
- Monitoring a production RAG pipeline: RAGAS, or a custom golden dataset.
- Making claims about hallucination rate: RAGTruth, plus FACTS Grounding for system-level claims.
- Evaluating multi-hop reasoning: MultiHopRAG first; HotPotQA for comparability with older results.
BEIR and MTEB: when to use which
BEIR and MTEB both cover retrieval quality but differ in scope and intent. BEIR (Benchmarking IR) is specifically a retrieval benchmark: 18 datasets, all retrieval, with NDCG@10 as the headline metric. It is the right benchmark when the question is "how good is this model at finding relevant documents". MTEB (Massive Text Embedding Benchmark) is broader: 56 datasets across 8 task types including retrieval but also classification, clustering, semantic similarity, and reranking. It is the right benchmark when the question is "how good is this embedding model overall".
In practice, MTEB has become the more cited benchmark for embedding model selection because production RAG pipelines often use embeddings for multiple purposes (retrieval, reranking, semantic search, deduplication), and MTEB's breadth captures the full utility. BEIR remains the right benchmark for pure retrieval comparisons and is often used alongside MTEB. Many embedding model cards now report both numbers; we recommend quoting both when comparing embedding models.
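Because MTEB surfaces BEIR's retrieval datasets as tasks, one evaluation run can produce both kinds of numbers for a candidate model. The sketch below follows the usage pattern from the MTEB README at the time of writing; the API has shifted across versions, so treat it as illustrative rather than definitive. SciFact is one of the BEIR datasets available through MTEB.

```python
from mteb import MTEB
from sentence_transformers import SentenceTransformer

# Any model with an `encode` method works; all-MiniLM-L6-v2 is just a
# small, widely available example.
model = SentenceTransformer("all-MiniLM-L6-v2")

# SciFact is a BEIR retrieval dataset, so its NDCG@10 is a BEIR-style
# number; add classification or clustering tasks for the broader MTEB view.
evaluation = MTEB(tasks=["SciFact"])
results = evaluation.run(model, output_folder="results/all-MiniLM-L6-v2")
print(results)
```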
RAGAS and end-to-end RAG evaluation
RAGAS evaluates end-to-end RAG pipelines through four LLM-as-judge metrics: faithfulness (does the generated answer reflect the retrieved context), answer relevance (does the answer address the question), context recall (did retrieval find the relevant context), and context precision (is the retrieved context actually useful). The four metrics together give a more complete picture than any single one; a high faithfulness score with low context recall, for example, means the system is being honest about its limited information rather than hallucinating.
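In code, a RAGAS evaluation takes records of question, generated answer, retrieved contexts, and (for context recall) a ground-truth answer. The sketch below follows the classic RAGAS API; later releases restructured the entry points, so check the documentation for your installed version.

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy, context_precision, context_recall, faithfulness,
)

# One record per pipeline run; `ground_truth` is only needed for
# context_recall, the other three metrics work without it.
data = Dataset.from_dict({
    "question":     ["When did the EU AI Act enter into force?"],
    "answer":       ["It entered into force on 1 August 2024."],
    "contexts":     [["The AI Act entered into force on 1 August 2024."]],
    "ground_truth": ["1 August 2024."],
})

scores = evaluate(data, metrics=[
    faithfulness, answer_relevancy, context_recall, context_precision,
])
print(scores)  # per-metric averages over the dataset
```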
The LLM-as-judge nature of RAGAS introduces noise that pure retrieval benchmarks avoid. Two RAGAS runs on the same data can produce different scores by 2-5 points depending on the judge model and prompt. The framework is best used for relative comparison (which configuration is better) rather than absolute scoring (this is the right number). For production monitoring, RAGAS provides a structured way to track pipeline health over time; for component selection, the more reproducible BEIR or MTEB numbers are usually better.
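One pragmatic response to that noise is to repeat the evaluation and only trust a difference between configurations when it exceeds the run-to-run wobble. A sketch, where `run_ragas_eval` is a hypothetical wrapper around the evaluation above, not a RAGAS API:

```python
import statistics

def compare_configs(run_ragas_eval, config_a, config_b, runs: int = 5):
    """Repeat a noisy LLM-as-judge evaluation and report whether the gap
    between two configurations clears the observed spread of the scores."""
    a = [run_ragas_eval(config_a) for _ in range(runs)]
    b = [run_ragas_eval(config_b) for _ in range(runs)]
    gap = statistics.mean(b) - statistics.mean(a)
    spread = statistics.stdev(a) + statistics.stdev(b)
    return gap, abs(gap) > spread  # (difference, is it trustworthy?)
```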
Hallucination-focused benchmarks
Hallucination is a distinct concern from retrieval quality and end-to-end faithfulness. RAGTruth is one of several benchmarks specifically designed to measure hallucination: how often a generated response contains claims that are not supported by the retrieved context. The benchmark provides graded annotations across multiple granularities (sentence-level, claim-level, factoid-level) and is the most directly applicable benchmark for evaluating systems where hallucination prevention is the primary goal.
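The granularity matters because claim-level and response-level rates can diverge sharply on the same outputs. The sketch below uses a hypothetical annotation format in the spirit of RAGTruth; the dataset's actual schema and label taxonomy differ, so see the dataset card.

```python
# Hypothetical claim-level annotations: each generated response is split
# into claims, each labelled as supported by the retrieved context or not.
responses = [
    {"claims": [{"supported": True}, {"supported": False}]},
    {"claims": [{"supported": True}, {"supported": True}]},
]

def claim_hallucination_rate(responses) -> float:
    claims = [c for r in responses for c in r["claims"]]
    return sum(not c["supported"] for c in claims) / len(claims)

def response_hallucination_rate(responses) -> float:
    # Stricter: a response counts as hallucinated if ANY claim is unsupported.
    bad = sum(any(not c["supported"] for c in r["claims"]) for r in responses)
    return bad / len(responses)

print(claim_hallucination_rate(responses))     # 0.25
print(response_hallucination_rate(responses))  # 0.5
```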
Adjacent benchmarks include FACTS Grounding (Google's factuality grounding evaluation) and FaithBench. None has fully displaced RAGAS' faithfulness metric for production monitoring, but RAGTruth and FACTS Grounding are useful for system-level claims about hallucination rate. Quote them when hallucination rate is a primary product metric.
Multi-hop reasoning
Multi-hop reasoning is a separate axis of RAG capability. A pipeline that can retrieve and synthesise information from a single document is different from one that can chain reasoning across multiple documents to answer a question. HotPotQA is the foundational benchmark; MultiHopRAG is the newer benchmark designed specifically for the RAG-plus-multi-hop intersection. Both are useful; MultiHopRAG is more directly applicable to production RAG evaluation.
The multi-hop case is where DSPy and similar prompt-optimisation frameworks tend to add the most value. The prompt structures for retrieve-then-reason multi-hop pipelines are non-obvious to hand-tune, and programmatic optimisation reliably finds better configurations. See our DSPy reference for the detail; the headline is that on multi-hop benchmarks like HotPotQA and MultiHopRAG, DSPy compilation lifts scores by 10-20 points over hand-prompted baselines.
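For a sense of what those pipelines look like, here is a minimal retrieve-then-reason module following the multi-hop pattern in DSPy's HotPotQA tutorials. DSPy's API has evolved across versions, so names like `dspy.Retrieve` should be treated as version-dependent, and retriever/LM configuration is omitted.

```python
import dspy

class MultiHopQA(dspy.Module):
    """Two-hop question answering: each hop writes a fresh search query
    from the evidence gathered so far, then a final step answers."""

    def __init__(self, num_hops: int = 2, k: int = 3):
        super().__init__()
        self.num_hops = num_hops
        self.retrieve = dspy.Retrieve(k=k)
        self.gen_query = dspy.ChainOfThought("context, question -> search_query")
        self.gen_answer = dspy.ChainOfThought("context, question -> answer")

    def forward(self, question):
        context = []
        for _ in range(self.num_hops):
            # Each hop writes a new search query from what has been found so far.
            query = self.gen_query(context=context, question=question).search_query
            context += self.retrieve(query).passages
        return self.gen_answer(context=context, question=question)
```

Compiling a module like this with one of DSPy's optimisers against a small dev set is what produces the lifts over hand-prompting described above: the optimiser rewrites the query-generation and answer prompts rather than the module structure.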
Sources
- [1] Thakur, N. et al. (2021). BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models. arXiv:2104.08663.
- [2] Muennighoff, N. et al. (2022). MTEB: Massive Text Embedding Benchmark. arXiv:2210.07316.
- [3] RAGAS framework documentation. docs.ragas.io. Accessed May 2026.
- [4] Yang, Z. et al. (2018). HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering. arXiv:1809.09600.
- [5] Tang, Y. et al. (2024). MultiHop-RAG: Benchmarking Retrieval-Augmented Generation for Multi-Hop Queries. arXiv:2401.15391.