Building a Custom LLM Eval Pipeline in 2026 - End-to-End Guide
Public benchmarks measure general capability. Your agent has a specific job - answer questions about your documentation, route customer support tickets, generate SQL from natural language. You need to measure that specific job, not MMLU. This is a tool-agnostic guide to building an eval pipeline that actually tells you whether your system is improving.
The 5-Stage Eval Lifecycle
Define
What does success look like? This is the step most teams rush through. Before touching any tooling, write down 5 example inputs, the exact output you would accept, and the output you would reject. If you cannot write these examples in 30 minutes, your success definition is too vague to measure.
- What is the task? (Be specific: 'summarise a 2,000-word support transcript in under 100 words, in plain English, covering the resolution.')
- What are the success criteria? (Faithfulness to source, relevancy to the question, format compliance, length constraints.)
- What are the failure modes? (Hallucination, irrelevance, wrong format, too long, too short.)
- What is the ground truth? (Known-correct answers, expected source documents for RAG, expected function calls for agents.)
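A success definition can live as structured data from day one, and at least the mechanical failure modes can be checked in code. A minimal sketch (the field names and the length check are illustrative, not a standard schema):

```python
# One entry in a success-definition file: the task, an input,
# an output you would accept, and one you would reject.
success_definition = {
    "task": "Summarise a support transcript in under 100 words, "
            "in plain English, covering the resolution.",
    "accept": "The customer reported a duplicate charge; the agent "
              "issued a refund and confirmed it by email.",
    "reject": "The customer was unhappy.",  # too vague: no resolution
    "failure_modes": ["hallucination", "missing resolution", "too long"],
}

def violates_length(summary: str, max_words: int = 100) -> bool:
    """Mechanical check for one failure mode: the length constraint."""
    return len(summary.split()) > max_words
```

Checks like this cost nothing per run, so they belong in the metric set even when LLM-as-judge scoring is also used.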
Dataset
Build a golden test set. A golden test set is a collection of (input, expected_output) or (input, ground_truth_answer, expected_source_chunks) pairs that a human has verified. This is the most important and most often skipped step.
- Minimum viable: 30 examples. Statistically reliable: 100 examples. Production-grade: 300-500 examples.
- Include common cases (80% of real traffic patterns), edge cases (adversarial inputs, ambiguous queries), and negative cases (inputs where the correct answer is 'I don't know').
- Source from real production traffic where possible - synthetic examples miss the distribution of real user queries.
- Version control the dataset in git alongside your code. A dataset change is as significant as a model change.
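A golden test set stored as JSONL keeps each example on one line, which makes git diffs readable when examples are added or corrected. A sketch with illustrative keys (adapt them to your task):

```python
import json

# One verified example per line; the "category" field supports
# slicing results by common / edge / negative cases later.
golden_examples = [
    {"id": "common-001", "category": "common",
     "input": "How do I reset my password?",
     "expected_output": "Use the 'Forgot password' link on the login page."},
    {"id": "negative-001", "category": "negative",
     "input": "What is the CEO's salary?",
     "expected_output": "I don't know."},
]

def save_golden(path, examples):
    with open(path, "w") as f:
        for ex in examples:
            f.write(json.dumps(ex) + "\n")

def load_golden(path):
    with open(path) as f:
        return [json.loads(line) for line in f]
```

Because the file lives in git next to the code, every eval run can record the exact dataset commit it ran against.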
Metrics
Choose how you will score each example. The right metric depends on the task. Pick 1-3 metrics that directly reflect your success criteria - avoid metric proliferation.
- Exact match: for structured outputs (JSON, SQL, function calls). The gold standard - binary, reproducible, no LLM-as-judge cost.
- Semantic similarity: for open-ended text where the exact wording doesn't matter. Embedding cosine similarity vs a reference answer.
- LLM-as-judge: for quality dimensions that are hard to compute (faithfulness, relevancy, tone). Flexible, but adds cost and latency per score, and it is not reliable for factual correctness.
- Functional / task success: for agents - did the downstream action succeed? (Email sent, ticket created, search returned results.) Binary, ground-truth.
- Latency and cost: often neglected. Track p50 and p95 latency, and cost per call. A model that is 10% more accurate but 5x more expensive may not be the right choice.
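The two computed metrics above are small functions, not a framework. A sketch of exact match for JSON outputs and cosine similarity over embedding vectors (the embeddings themselves come from whatever model you already use):

```python
import json
import math

def exact_match_json(output: str, expected: dict) -> bool:
    """Exact match for structured output: parse and compare, so key
    order and whitespace differences do not count as failures."""
    try:
        return json.loads(output) == expected
    except json.JSONDecodeError:
        return False  # unparseable output is a failure, not a crash

def cosine_similarity(a, b) -> float:
    """Semantic similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm
```

Parsing before comparing is the important design choice: comparing raw strings would penalise harmless formatting differences.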
Run
Execute the dataset against every model, prompt, or agent version you want to compare. Run in parallel where possible. Log every result - input, output, score, latency, token count, cost.
- Run with temperature 0 (or your production temperature if it is non-zero) for reproducibility.
- Log everything: input, full output, each metric score, latency, token count (prompt + completion), estimated cost, model version, prompt version, dataset version.
- Use a retry policy: LLM API calls fail. 3 retries with exponential backoff is standard.
- Separate the run step from the scoring step: save raw outputs first, then score. Lets you re-score with a different judge without re-running inference.
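The retry policy above is a dozen lines of generic code. A sketch with exponential backoff and jitter (`fn` is any flaky callable, such as a model API call):

```python
import random
import time

def call_with_retry(fn, *args, retries=3, base_delay=1.0, **kwargs):
    """Retry a flaky call with exponential backoff plus jitter.
    The 3-retry default matches the guidance above."""
    for attempt in range(retries + 1):
        try:
            return fn(*args, **kwargs)
        except Exception:
            if attempt == retries:
                raise  # out of retries: surface the real error
            # Backoff doubles each attempt; jitter avoids thundering herds.
            time.sleep(base_delay * (2 ** attempt) * (0.5 + random.random()))
```

Wrapping every model call this way keeps a single transient API error from invalidating a 150-call eval run.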
Aggregate
Summarise results into headline metrics and track over time. The output should be a comparison table per model/prompt/version, not a spreadsheet of individual example scores.
- Headline metrics: pass rate (% examples above threshold), mean score per metric, cost per correct answer, p50/p95 latency.
- Slice by example category: common queries vs edge cases, topic area, user segment if data is available.
- Track over time: regression detection is the long-term value of evals. An eval run that is not stored and compared against previous runs is wasted.
- Set decision thresholds: below X faithfulness, do not ship. Above Y latency, investigate. Document these thresholds in your runbook.
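Pass rate and a ship gate together turn a pile of per-example scores into a yes/no decision. A sketch (the 0.9 gate is a placeholder for whatever threshold your runbook sets):

```python
def aggregate(results, metric, threshold):
    """Headline numbers for one metric: mean score and pass rate
    (fraction of examples at or above the threshold)."""
    scores = [r[metric] for r in results]
    return {
        "mean": sum(scores) / len(scores),
        "pass_rate": sum(s >= threshold for s in scores) / len(scores),
    }

def ship_gate(summary, min_pass_rate=0.9):
    """Decision threshold from the runbook: below it, do not ship."""
    return summary["pass_rate"] >= min_pass_rate
```

Running this per category (common vs edge vs negative) gives the sliced view; running it per version and storing the output gives the regression history.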
Tooling Decision Tree
Do you want to self-host and avoid vendor lock-in?
Langfuse (full-featured OSS) or Arize Phoenix (OSS with strong tracing) or DeepEval (OSS, pytest-style).
Do you want the best developer experience and CI pipeline integration?
Braintrust. First-class CI integration, clean SDK, strong dataset versioning.
Are you already using LangChain in your application?
LangSmith. Zero-friction integration with LangChain. Not ideal for non-LangChain stacks.
Do you need strong production monitoring alongside offline evals?
Arize AI (cloud, strong for production) or Langfuse (self-host, strong for both). LangSmith also covers production traces.
Do you have fewer than 100 test examples and no production traffic yet?
A Python script and a CSV is fine. You do not need tooling yet. Spend the time on the golden dataset instead.
Common Mistakes
Tiny datasets (<20 examples)
At fewer than 20 examples, metric variance between runs exceeds signal. You cannot reliably detect a 5% improvement with 20 examples.
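The 20-example claim follows from the standard error of a proportion: an observed pass rate p over n examples has standard error sqrt(p(1-p)/n). A quick check:

```python
import math

def pass_rate_stderr(p: float, n: int) -> float:
    """Standard error of an observed pass rate p over n examples."""
    return math.sqrt(p * (1 - p) / n)

# At n=20 and p=0.8 the standard error is roughly 9 percentage points,
# so a 5-point improvement is well inside run-to-run noise.
# At n=100 it falls to ~4 points; only around 300-500 examples does
# it shrink toward the 2-point range, matching the sizes above.
```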
Running once and calling it done
An eval run without regression tracking over time is wasted. The value of evals compounds when you compare each run against previous runs.
No cost-per-correct tracking
A model that costs 3x more but is correct 10% more often is usually not worth it. Track cost per correct answer alongside accuracy.
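The arithmetic is one division, and it is worth writing down because the result is often counterintuitive. A sketch with hypothetical per-call prices:

```python
def cost_per_correct(cost_per_call: float, accuracy: float) -> float:
    """Effective cost of one correct answer."""
    return cost_per_call / accuracy

# Hypothetical numbers: a 3x price jump for a 10-point accuracy gain.
cheap = cost_per_correct(0.01, 0.80)   # ~$0.0125 per correct answer
pricey = cost_per_correct(0.03, 0.90)  # ~$0.0333 per correct answer
```

On this metric the pricier model is still more than twice as expensive per correct answer, despite being more accurate, which is exactly the trade-off the guidance above warns about.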
Trusting LLM-as-judge without sanity checks
Spot-check 10-20% of LLM-as-judge scores against human ratings before trusting the metric. Calibrate once per new judge model.
Not versioning the dataset
A dataset change invalidates historical comparisons. If you cannot say 'this run used dataset v1.4', you cannot track trends reliably.
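Alongside a human-readable tag like 'v1.4', a content hash of the dataset file makes the version check automatic. A sketch:

```python
import hashlib

def dataset_version(path: str) -> str:
    """Short content hash of the dataset file, logged with every eval
    run next to the git tag, so two runs are comparable only if the
    hashes match."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()[:12]
```

Any edit to the file, even fixing a typo in one expected output, changes the hash, which is the point: it forces the comparison question to be asked.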
Optimising for the benchmark rather than the task
An eval should measure your actual use case, not a convenient proxy. If you optimise for your eval metric without checking real user outcomes, you may be gaming your own benchmark.
Example: Evaluating a Summarisation Agent
A concrete walkthrough: a customer support transcript summarisation agent, 50 golden examples, faithfulness + relevancy + length ratio metrics, comparing GPT-5 vs Claude 4.5 Opus vs Gemini 2.5 Pro.
```python
# Summarisation agent eval - example pseudocode
# Dataset: 50 golden (id, source, expected_points) examples
# Models: GPT-5, Claude 4.5 Opus, Gemini 2.5 Pro
# Metrics: faithfulness, relevancy, length_ratio
import json

MODELS = ["gpt-5", "claude-4-5-opus", "gemini-2-5-pro"]

# Double braces escape the literal JSON so str.format only fills
# {source}, {summary}, and {expected_points}.
JUDGE_PROMPT = """
Rate this summary on faithfulness (1-5) and relevancy (1-5).
Source document: {source}
Summary to evaluate: {summary}
Expected key points: {expected_points}
Return JSON: {{"faithfulness": X, "relevancy": X}}
"""

results = []
for example in golden_dataset:  # 50 examples
    for model in MODELS:
        response = call_model(model, example["source"])
        # Functional metric - no LLM needed
        length_ratio = len(response.split()) / len(example["source"].split())
        # LLM-as-judge metric
        scores = json.loads(
            call_judge("gpt-5", JUDGE_PROMPT.format(
                source=example["source"],
                summary=response,
                expected_points=example["expected_points"],
            ))
        )
        results.append({
            "model": model,
            "example_id": example["id"],
            "faithfulness": scores["faithfulness"],
            "relevancy": scores["relevancy"],
            "length_ratio": length_ratio,
        })

# Aggregate per model
def avg(rows, key):
    return sum(r[key] for r in rows) / len(rows)

for model in MODELS:
    model_results = [r for r in results if r["model"] == model]
    print(f"{model}: faithfulness={avg(model_results, 'faithfulness'):.2f}, "
          f"relevancy={avg(model_results, 'relevancy'):.2f}, "
          f"length_ratio={avg(model_results, 'length_ratio'):.2f}")
```

Pseudocode only - `call_model`, `call_judge`, and `golden_dataset` are placeholders whose implementation depends on your chosen tooling and model SDKs. Running inference for 50 examples across 3 models = 150 API calls. Scoring with GPT-5 as judge adds another 150 calls. Estimated cost: $5-20 depending on output length and model pricing.
For API cost estimates: see claudeapipricing.com and geminipricing.com.