RAG Evaluation 2026 - Faithfulness, Relevancy, and Context Metrics
RAG (Retrieval-Augmented Generation) systems have two places where things go wrong: retrieval (did you fetch the right context?) and generation (did the answer use the context correctly?). Evaluating a RAG system requires measuring both. This is a tool-agnostic guide to the four canonical metrics, how to build a golden dataset, and which tools compute these metrics in 2026.
What RAG Eval Is
Retrieval Evaluation
Did the retriever fetch the right context? Metrics: context precision (of what was retrieved, how much was relevant?) and context recall (of what was needed, how much was retrieved?). Poor retrieval upstream causes generation failures downstream.
Generation Evaluation
Did the LLM use the context correctly? Metrics: faithfulness (did the answer only make claims supported by context?) and answer relevancy (did the answer actually address the question?). Good retrieval cannot save a bad generation step.
The Four Canonical Metrics (Ragas Framework)
Faithfulness
Range: 0 - 1. “Does the answer only make claims that are supported by the retrieved context?”
The judge LLM breaks the answer into individual claims, then checks each claim against the retrieved context chunks. Faithfulness = (number of supported claims) / (total number of claims). A claim is 'supported' if it can be directly inferred from the context.
Failure mode: the model asserts facts that are not in the retrieved context, i.e. a hallucination. Particularly dangerous in legal, medical, and financial applications.
Target: > 0.85 for most production applications; > 0.95 for high-stakes domains.
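The scoring arithmetic can be sketched in a few lines. Claim extraction and the per-claim verdicts come from the judge LLM; this hypothetical helper only aggregates them:

```python
def faithfulness_score(claim_verdicts: list[bool]) -> float:
    """Faithfulness = supported claims / total claims.

    `claim_verdicts` holds one judge verdict per claim extracted from
    the answer: True if the claim is supported by the retrieved
    context, False otherwise.
    """
    if not claim_verdicts:
        return 0.0
    return sum(claim_verdicts) / len(claim_verdicts)

# An answer decomposed into 4 claims, 3 of them supported by context:
score = faithfulness_score([True, True, True, False])  # 0.75
```

Note that an answer with fewer, well-grounded claims can outscore a longer answer that mixes supported and unsupported statements, which is exactly the incentive you want.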
Answer Relevancy
Range: 0 - 1. “Does the answer actually address the question that was asked?”
The judge LLM generates several plausible questions that the answer would address, then computes the semantic similarity between those generated questions and the original question. High similarity means the answer was on-topic.
Failure mode: the model produces a technically accurate answer to a different or broader question than the one asked. Common when the retriever returns context about a related topic.
Target: > 0.80. Lower scores often indicate retrieval issues (wrong context retrieved) rather than generation issues.
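A minimal sketch of the similarity step, using a toy bag-of-words embedding as a stand-in for a real embedding model (the generated questions would come from the judge LLM; all function names here are illustrative):

```python
import math
from collections import Counter

def bow_embed(text: str) -> Counter:
    # Toy bag-of-words vector; a real system would use an embedding model.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def answer_relevancy(original_q: str, generated_qs: list[str]) -> float:
    # Mean similarity between the original question and the questions
    # the judge LLM generated from the answer.
    q_vec = bow_embed(original_q)
    sims = [cosine(q_vec, bow_embed(g)) for g in generated_qs]
    return sum(sims) / len(sims) if sims else 0.0
```

With this sketch, an answer whose implied questions closely match the original scores near 1, while an off-topic answer scores near 0.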
Context Precision
Range: 0 - 1. “Of the chunks retrieved, how many were actually relevant to answering the question?”
Each retrieved chunk is evaluated for relevance to the question. Context Precision = (relevant chunks retrieved) / (total chunks retrieved). Computed with LLM-as-judge or embedding similarity against the ground-truth answer.
Failure mode: the retriever returns too many chunks that are topically adjacent but not directly useful. This inflates context length and cost, and confuses the generation step.
Target: > 0.75. Low precision suggests the retriever needs tuning: chunk size, embedding model, or similarity threshold.
Context Recall
Range: 0 - 1. “Of the information needed to answer the question, how much was actually retrieved?”
The ground-truth answer is broken into claims. Each claim is checked against retrieved chunks - is this information present? Context Recall = (ground-truth claims supported by retrieved context) / (total ground-truth claims). Requires a ground-truth answer.
Failure mode: the retriever misses key context needed to answer the question. The generation step may hallucinate to fill the gap, or refuse to answer.
Target: > 0.80. Low recall often means chunk size is too small, top-k is too low, or the embedding model does not match your domain.
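The two retrieval-side formulas above are plain ratios over judge verdicts; a minimal sketch (note that production frameworks such as Ragas may additionally weight precision by chunk rank):

```python
def context_precision(chunk_relevant: list[bool]) -> float:
    # Of the chunks retrieved, the fraction the judge marked relevant.
    if not chunk_relevant:
        return 0.0
    return sum(chunk_relevant) / len(chunk_relevant)

def context_recall(claim_covered: list[bool]) -> float:
    # Of the ground-truth claims, the fraction found in retrieved context.
    if not claim_covered:
        return 0.0
    return sum(claim_covered) / len(claim_covered)

# 5 chunks retrieved, 3 relevant; 4 ground-truth claims, 3 covered:
precision = context_precision([True, True, True, False, False])  # 0.6
recall = context_recall([True, True, True, False])               # 0.75
```

The two verdict lists come from different inputs: precision judges each retrieved chunk against the question, while recall judges each ground-truth claim against the retrieved chunks, which is why recall requires a ground-truth answer and precision does not.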
Note: all four Ragas metrics use LLM-as-judge under the hood. See /llm-as-judge for the methodology and bias considerations.
Building a Golden Dataset
A golden dataset is a set of (question, expected answer, expected source chunks) triples where the expected answer and sources are human-verified ground truth. It is the step most teams skip, and the single best investment in eval quality.
Building a golden dataset requires human effort; there is no automated shortcut that produces reliable ground truth. The process:
1. Select representative questions from your application's real traffic or anticipated query distribution.
2. For each question, identify the source chunks in your knowledge base that contain the information needed to answer it.
3. Write the ground-truth answer based only on those source chunks.
4. Label edge cases: questions where the knowledge base does not contain the answer, and the correct behavior is to say so.
Minimum viable golden dataset: 30 examples covering common queries, edge cases, and at least 5 cases where the correct answer is “I don't know.” 100 examples give you statistically reliable metric scores. Budget 15-30 minutes of human time per example for the initial creation; much faster for ongoing additions.
Version control your golden dataset alongside your code. Track which version of the dataset each eval run used. A change in the dataset between runs makes the metric comparison meaningless.
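A golden-dataset entry can be as simple as a small record type. This is an illustrative schema (field names are not from any particular framework), including the explicit unanswerable case:

```python
from dataclasses import dataclass

@dataclass
class GoldenExample:
    question: str
    expected_answer: str             # written only from the source chunks
    expected_source_ids: list[str]   # chunk IDs in the knowledge base
    answerable: bool = True          # False: correct behavior is "I don't know"

golden = [
    GoldenExample(
        question="What is the refund window?",
        expected_answer="Refunds are accepted within 30 days of purchase.",
        expected_source_ids=["policy.md#refunds"],
    ),
    GoldenExample(
        question="Do we ship to the Moon?",
        expected_answer="The knowledge base does not cover this; say so.",
        expected_source_ids=[],
        answerable=False,
    ),
]
```

Serialized as JSON or YAML, a list like this checks into version control cleanly, which makes diffing dataset changes between eval runs trivial.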
Tools That Compute These Metrics
| Tool | OSS | RAG Metrics Included | Notes |
|---|---|---|---|
| Ragas | Yes | All 4 canonical + BLEU, ROUGE, AnswerSimilarity | The original. Pip-installable, works with any LLM or embedding model. |
| DeepEval | Yes | GEval, Faithfulness, Contextual Precision/Recall | pytest-style API, CI-friendly. Confident AI cloud adds dashboard. |
| Arize Phoenix | Yes | Faithfulness, Relevancy, Hallucination detection | OSS + Arize AI cloud. Strong production tracing alongside evals. |
| Langfuse | Yes | Custom metrics + LLM-as-judge templates | Self-host or cloud. Flexible eval definition, strong tracing. |
| LangSmith | No | LLM-as-judge templates, custom evaluators | LangChain-native. Best for teams already on LangChain. |
| Braintrust | No | LLM-as-judge, custom scorers, BLEU, ROUGE | Strong CI integration. Does not natively compute all 4 Ragas metrics. |
Some links to eval tools are affiliate links; rankings are not influenced. Full comparison at /tools-compared.
Beyond the Four Metrics
The four Ragas metrics cover retrieval and generation quality on individual responses. For production RAG systems, additional dimensions matter:
- Answer similarity: How similar is the generated answer to a reference answer? Useful when there is a known ground truth. Measured via embedding cosine similarity or BLEU/ROUGE.
- Context relevance ranking: Beyond binary relevant/not-relevant, how does the relevance of retrieved chunks rank? A retriever that ranks the most relevant chunk first is better than one that buries it at position 5.
- End-to-end task success: For agentic RAG (the agent takes actions based on retrieved information), did the downstream task succeed? Faithfulness is necessary but not sufficient for a research agent that must also take correct actions.
- Latency: RAG adds retrieval latency on top of generation latency. P95 latency under load is the most relevant metric for user experience.
- Cost per query: Longer contexts increase inference cost. Context precision directly impacts cost - irrelevant chunks make every query more expensive.
Frequently Asked Questions
What is a good faithfulness score?
Do I need all four Ragas metrics?
How many test cases do I need for RAG evaluation?
What is the difference between context precision and context recall?
Is LLM-as-judge reliable for RAG eval?
Sources
- [1] Ragas framework - ragas.io
- [2] Es et al., RAGAS: Automated Evaluation of RAG - arxiv.org/abs/2309.15217 - 2023
- [3] DeepEval documentation - docs.confident-ai.com
- [4] Arize Phoenix RAG tracing docs - docs.arize.com/phoenix