Building a Custom LLM Eval Pipeline in 2026 - End-to-End Guide
Public benchmarks measure general capability. Your agent has a specific job - answer questions about your documentation, route customer support tickets, generate SQL from natural language. You need to measure that specific job, not MMLU. This is a tool-agnostic guide to building an eval pipeline that actually tells you whether your system is improving.
The 5-Stage Eval Lifecycle
Define
What does success look like? This is the step most teams rush through. Before touching any tooling, write down 5 example inputs, the exact output you would accept, and the output you would reject. If you cannot write these examples in 30 minutes, your success definition is too vague to measure.
- What is the task? (Be specific: 'summarise a 2,000-word support transcript in under 100 words, in plain English, covering the resolution.')
- What are the success criteria? (Faithfulness to source, relevancy to the question, format compliance, length constraints.)
- What are the failure modes? (Hallucination, irrelevance, wrong format, too long, too short.)
- What is the ground truth? (Known-correct answers, expected source documents for RAG, expected function calls for agents.)
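A success definition can live as structured data from day one, and at least the mechanical failure modes can be checked in code. A minimal sketch (the field names and the length check are illustrative, not a standard schema):

```python
# One entry in a success-definition file: the task, an input,
# an output you would accept, and one you would reject.
success_definition = {
    "task": "Summarise a support transcript in under 100 words, "
            "in plain English, covering the resolution.",
    "accept": "The customer reported a duplicate charge; the agent "
              "issued a refund and confirmed it by email.",
    "reject": "The customer was unhappy.",  # too vague: no resolution
    "failure_modes": ["hallucination", "missing resolution", "too long"],
}

def violates_length(summary: str, max_words: int = 100) -> bool:
    """Mechanical check for one failure mode: the length constraint."""
    return len(summary.split()) > max_words
```

Checks like this cost nothing per run, so they belong in the metric set even when LLM-as-judge scoring is also used.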
Dataset
Build a golden test set. A golden test set is a collection of (input, expected_output) or (input, ground_truth_answer, expected_source_chunks) pairs that a human has verified. This is the most important and most often skipped step.
- Minimum viable: 30 examples. Statistically reliable: 100 examples. Production-grade: 300-500 examples.
- Include common cases (80% of real traffic patterns), edge cases (adversarial inputs, ambiguous queries), and negative cases (inputs where the correct answer is 'I don't know').
- Source from real production traffic where possible - synthetic examples miss the distribution of real user queries.
- Version control the dataset in git alongside your code. A dataset change is as significant as a model change.
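A golden test set stored as JSONL keeps each example on one line, which makes git diffs readable when examples are added or corrected. A sketch with illustrative keys (adapt them to your task):

```python
import json

# One verified example per line; the "category" field supports
# slicing results by common / edge / negative cases later.
golden_examples = [
    {"id": "common-001", "category": "common",
     "input": "How do I reset my password?",
     "expected_output": "Use the 'Forgot password' link on the login page."},
    {"id": "negative-001", "category": "negative",
     "input": "What is the CEO's salary?",
     "expected_output": "I don't know."},
]

def save_golden(path, examples):
    with open(path, "w") as f:
        for ex in examples:
            f.write(json.dumps(ex) + "\n")

def load_golden(path):
    with open(path) as f:
        return [json.loads(line) for line in f]
```

Because the file lives in git next to the code, every eval run can record the exact dataset commit it ran against.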
Metrics
Choose how you will score each example. The right metric depends on the task. Pick 1-3 metrics that directly reflect your success criteria - avoid metric proliferation.
- Exact match: for structured outputs (JSON, SQL, function calls). The gold standard - binary, reproducible, no LLM-as-judge cost.
- Semantic similarity: for open-ended text where the exact wording doesn't matter. Embedding cosine similarity vs a reference answer.
- LLM-as-judge: for quality dimensions that are hard to compute (faithfulness, relevancy, tone). Flexible, but adds cost and latency per score, and it is not reliable for factual correctness.
- Functional / task success: for agents - did the downstream action succeed? (Email sent, ticket created, search returned results.) Binary, ground-truth.
- Latency and cost: often neglected. Track p50 and p95 latency, and cost per call. A model that is 10% more accurate but 5x more expensive may not be the right choice.
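The two computed metrics above are small functions, not a framework. A sketch of exact match for JSON outputs and cosine similarity over embedding vectors (the embeddings themselves come from whatever model you already use):

```python
import json
import math

def exact_match_json(output: str, expected: dict) -> bool:
    """Exact match for structured output: parse and compare, so key
    order and whitespace differences do not count as failures."""
    try:
        return json.loads(output) == expected
    except json.JSONDecodeError:
        return False  # unparseable output is a failure, not a crash

def cosine_similarity(a, b) -> float:
    """Semantic similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm
```

Parsing before comparing is the important design choice: comparing raw strings would penalise harmless formatting differences.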
Run
Execute the dataset against every model, prompt, or agent version you want to compare. Run in parallel where possible. Log every result - input, output, score, latency, token count, cost.
- Run with temperature 0 (or your production temperature if it is non-zero) for reproducibility.
- Log everything: input, full output, each metric score, latency, token count (prompt + completion), estimated cost, model version, prompt version, dataset version.
- Use a retry policy: LLM API calls fail. 3 retries with exponential backoff is standard.
- Separate the run step from the scoring step: save raw outputs first, then score. Lets you re-score with a different judge without re-running inference.
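The retry policy above is a dozen lines of generic code. A sketch with exponential backoff and jitter (`fn` is any flaky callable, such as a model API call):

```python
import random
import time

def call_with_retry(fn, *args, retries=3, base_delay=1.0, **kwargs):
    """Retry a flaky call with exponential backoff plus jitter.
    The 3-retry default matches the guidance above."""
    for attempt in range(retries + 1):
        try:
            return fn(*args, **kwargs)
        except Exception:
            if attempt == retries:
                raise  # out of retries: surface the real error
            # Backoff doubles each attempt; jitter avoids thundering herds.
            time.sleep(base_delay * (2 ** attempt) * (0.5 + random.random()))
```

Wrapping every model call this way keeps a single transient API error from invalidating a 150-call eval run.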
Aggregate
Summarise results into headline metrics and track over time. The output should be a comparison table per model/prompt/version, not a spreadsheet of individual example scores.
- Headline metrics: pass rate (% examples above threshold), mean score per metric, cost per correct answer, p50/p95 latency.
- Slice by example category: common queries vs edge cases, topic area, user segment if data is available.
- Track over time: regression detection is the long-term value of evals. An eval run that is not stored and compared against previous runs is wasted.
- Set decision thresholds: below X faithfulness, do not ship. Above Y latency, investigate. Document these thresholds in your runbook.
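Pass rate and a ship gate together turn a pile of per-example scores into a yes/no decision. A sketch (the 0.9 gate is a placeholder for whatever threshold your runbook sets):

```python
def aggregate(results, metric, threshold):
    """Headline numbers for one metric: mean score and pass rate
    (fraction of examples at or above the threshold)."""
    scores = [r[metric] for r in results]
    return {
        "mean": sum(scores) / len(scores),
        "pass_rate": sum(s >= threshold for s in scores) / len(scores),
    }

def ship_gate(summary, min_pass_rate=0.9):
    """Decision threshold from the runbook: below it, do not ship."""
    return summary["pass_rate"] >= min_pass_rate
```

Running this per category (common vs edge vs negative) gives the sliced view; running it per version and storing the output gives the regression history.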
Tooling Decision Tree
Do you want to self-host and avoid vendor lock-in?
Langfuse (full-featured OSS) or Arize Phoenix (OSS with strong tracing) or DeepEval (OSS, pytest-style).
Do you want the best developer experience and CI pipeline integration?
Braintrust. First-class CI integration, clean SDK, strong dataset versioning.
Are you already using LangChain in your application?
LangSmith. Zero-friction integration with LangChain. Not ideal for non-LangChain stacks.
Do you need strong production monitoring alongside offline evals?
Arize AI (cloud, strong for production) or Langfuse (self-host, strong for both). LangSmith also covers production traces.
Do you have fewer than 100 test examples and no production traffic yet?
A Python script and a CSV is fine. You do not need tooling yet. Spend the time on the golden dataset instead.
Common Mistakes
Tiny datasets (<20 examples)
At fewer than 20 examples, metric variance between runs exceeds signal. You cannot reliably detect a 5% improvement with 20 examples.
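The 20-example claim follows from the standard error of a proportion: an observed pass rate p over n examples has standard error sqrt(p(1-p)/n). A quick check:

```python
import math

def pass_rate_stderr(p: float, n: int) -> float:
    """Standard error of an observed pass rate p over n examples."""
    return math.sqrt(p * (1 - p) / n)

# At n=20 and p=0.8 the standard error is roughly 9 percentage points,
# so a 5-point improvement is well inside run-to-run noise.
# At n=100 it falls to ~4 points; only around 300-500 examples does
# it shrink toward the 2-point range, matching the sizes above.
```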
Running once and calling it done
An eval run without regression tracking over time is wasted. The value of evals compounds when you compare each run against previous runs.
No cost-per-correct tracking
A model that costs 3x more but is correct 10% more often is usually not worth it. Track cost per correct answer alongside accuracy.
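The arithmetic is one division, and it is worth writing down because the result is often counterintuitive. A sketch with hypothetical per-call prices:

```python
def cost_per_correct(cost_per_call: float, accuracy: float) -> float:
    """Effective cost of one correct answer."""
    return cost_per_call / accuracy

# Hypothetical numbers: a 3x price jump for a 10-point accuracy gain.
cheap = cost_per_correct(0.01, 0.80)   # ~$0.0125 per correct answer
pricey = cost_per_correct(0.03, 0.90)  # ~$0.0333 per correct answer
```

On this metric the pricier model is still more than twice as expensive per correct answer, despite being more accurate, which is exactly the trade-off the guidance above warns about.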
Trusting LLM-as-judge without sanity checks
Spot-check 10-20% of LLM-as-judge scores against human ratings before trusting the metric. Calibrate once per new judge model.
Not versioning the dataset
A dataset change invalidates historical comparisons. If you cannot say 'this run used dataset v1.4', you cannot track trends reliably.
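Alongside a human-readable tag like 'v1.4', a content hash of the dataset file makes the version check automatic. A sketch:

```python
import hashlib

def dataset_version(path: str) -> str:
    """Short content hash of the dataset file, logged with every eval
    run next to the git tag, so two runs are comparable only if the
    hashes match."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()[:12]
```

Any edit to the file, even fixing a typo in one expected output, changes the hash, which is the point: it forces the comparison question to be asked.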
Optimising for the benchmark rather than the task
An eval should measure your actual use case, not a convenient proxy. If you optimise for your eval metric without checking real user outcomes, you may be gaming your own benchmark.
Example: Evaluating a Summarisation Agent
A concrete walkthrough: a customer support transcript summarisation agent, 50 golden examples, faithfulness + relevancy + length ratio metrics, comparing GPT-5 vs Claude 4.5 Opus vs Gemini 2.5 Pro.
```python
# Summarisation agent eval - example pseudocode
# Dataset: 50 golden (id, source, expected_points) examples
# Models: GPT-5, Claude 4.5 Opus, Gemini 2.5 Pro
# Metrics: faithfulness, relevancy, length_ratio
import json

MODELS = ["gpt-5", "claude-4-5-opus", "gemini-2-5-pro"]

# Double braces escape the literal JSON so str.format only fills
# {source}, {summary}, and {expected_points}.
JUDGE_PROMPT = """
Rate this summary on faithfulness (1-5) and relevancy (1-5).
Source document: {source}
Summary to evaluate: {summary}
Expected key points: {expected_points}
Return JSON: {{"faithfulness": X, "relevancy": X}}
"""

results = []
for example in golden_dataset:  # 50 examples
    for model in MODELS:
        response = call_model(model, example["source"])
        # Functional metric - no LLM needed
        length_ratio = len(response.split()) / len(example["source"].split())
        # LLM-as-judge metric
        scores = json.loads(
            call_judge("gpt-5", JUDGE_PROMPT.format(
                source=example["source"],
                summary=response,
                expected_points=example["expected_points"],
            ))
        )
        results.append({
            "model": model,
            "example_id": example["id"],
            "faithfulness": scores["faithfulness"],
            "relevancy": scores["relevancy"],
            "length_ratio": length_ratio,
        })

# Aggregate per model
def avg(rows, key):
    return sum(r[key] for r in rows) / len(rows)

for model in MODELS:
    model_results = [r for r in results if r["model"] == model]
    print(f"{model}: faithfulness={avg(model_results, 'faithfulness'):.2f}, "
          f"relevancy={avg(model_results, 'relevancy'):.2f}, "
          f"length_ratio={avg(model_results, 'length_ratio'):.2f}")
```

Pseudocode only - `call_model`, `call_judge`, and `golden_dataset` are placeholders whose implementation depends on your chosen tooling and model SDKs. Running inference for 50 examples across 3 models = 150 API calls. Scoring with GPT-5 as judge adds another 150 calls. Estimated cost: $5-20 depending on output length and model pricing.
For API cost estimates: see claudeapipricing.com and geminipricing.com.