LLM-as-Judge 2026 - Methodology, Prompts, Bias, When to Use It
LLM-as-judge is the practice of using a strong language model to evaluate the outputs of another model or agent. It is the backbone of most modern automated evaluation frameworks - Ragas, DeepEval, Braintrust, Langfuse, and LangSmith all use LLM-as-judge under the hood for open-ended quality metrics. Understanding how it works, where it fails, and how to use it reliably is essential for building trustworthy eval pipelines.
Two Canonical Methods
Single-Point Scoring
The judge reads one output and assigns a score (1-5 or 0-1) on one or more rubric dimensions. Simpler, faster, scales linearly with the number of outputs.
Use when: You have many outputs to score, a clear rubric, and need fast results. Best for: faithfulness, relevance, format compliance.
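The scoring loop itself is simple. A minimal sketch, assuming a `judge` callable that takes a prompt string and returns the model's raw text reply (a hypothetical interface, not any specific SDK; `score_output` and the stub are illustrative names):

```python
import json

def score_output(judge, prompt_template, question, response):
    """Score a single output against a rubric.

    `judge` is any callable mapping a prompt string to the model's
    raw text reply (hypothetical interface, not a specific SDK).
    """
    prompt = prompt_template.format(question=question, response=response)
    return json.loads(judge(prompt))

# Stub judge for illustration; a real judge would call a model API.
stub = lambda prompt: '{"relevance": 4, "accuracy": 5}'
template = "Question: {question}\nResponse: {response}\nScore as JSON."
scores = score_output(stub, template, "What is RAG?", "Retrieval-augmented generation ...")
print(scores["accuracy"])  # 5
```

Because each output is scored independently, the loop parallelises trivially and cost grows linearly with the number of outputs.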
Pairwise Comparison
The judge reads two outputs for the same input and picks the better one. More robust, harder to game, but quadratic cost - comparing N outputs requires N*(N-1)/2 pairs.
Use when: Choosing between model versions, prompt variants, or fine-tuning directions. Best for: release decisions, A/B comparisons.
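The quadratic cost is easy to see by enumerating the pairs directly; the steps above can be sketched as:

```python
from itertools import combinations

# Four candidate outputs -> 4 * 3 / 2 = 6 unordered pairs to judge.
outputs = ["out_1", "out_2", "out_3", "out_4"]
pairs = list(combinations(outputs, 2))
print(len(pairs))  # 6
```

For large N, practitioners often judge a sampled subset of pairs, or compare every candidate against a single baseline, to keep cost linear.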
G-Eval - The Canonical Methodology
G-Eval (Liu et al., 2023) was one of the first rigorous treatments of LLM-as-judge, combining structured rubrics with chain-of-thought reasoning. The key finding: a GPT-4 judge prompted to reason step by step correlates substantially better with human judgments on NLG evaluation tasks than traditional automated metrics like ROUGE, which correlate only weakly (roughly 0.2-0.4).
The G-Eval methodology: (1) Write an explicit task description - what is the model supposed to do? (2) Define evaluation criteria with step-by-step evaluation instructions - not just “rate quality” but “first check whether the answer is grounded in the provided context, then check whether it addresses the question”. (3) Ask the judge to reason through each criterion before scoring. (4) Parse the structured output.
The CoT step is the most important element. A judge that reasons before scoring outperforms a judge that scores directly. Temperature 0 (greedy decoding) is standard for reproducibility. All major eval frameworks implement G-Eval-style evaluation under the hood.
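The four G-Eval steps can be sketched in a few lines. Everything here is illustrative, not any framework's API: the prompt wording, the `g_eval_score` name, and the `judge` callable (which would wrap a model call at temperature 0) are all assumptions.

```python
import json
import re

GEVAL_STYLE_PROMPT = """Task: {task}

Evaluation steps:
1. Check whether the answer is grounded in the provided context.
2. Check whether it addresses the question.

Reason through each step, then output JSON: {{"score": <1-5>}}.

Context: {context}
Question: {question}
Answer: {answer}"""

def g_eval_score(judge, task, context, question, answer):
    raw = judge(GEVAL_STYLE_PROMPT.format(
        task=task, context=context, question=question, answer=answer))
    # The judge reasons first, so tolerate CoT text before the JSON.
    return json.loads(re.search(r"\{.*\}", raw, re.DOTALL).group())["score"]

# Stub judge showing the expected reply shape: reasoning, then JSON.
stub = lambda prompt: 'Step 1: grounded. Step 2: on topic. {"score": 4}'
print(g_eval_score(stub, "Answer from context", "ctx", "q?", "a"))  # 4
```

Note the parsing step: because the judge emits reasoning before the verdict, the parser must extract the JSON rather than assume the whole reply is JSON.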
Known Bias Vectors
Position bias
Severity: High. In pairwise comparisons, judges tend to favor the first option presented. Studies show 10-15% preference for the first response even when outputs are swapped. The verdict “A is better than B” is more common than “B is better than A” for the same pair in different orderings.
Mitigation: Randomise presentation order for each pair. Better: score both orderings and average. Flag and discard pairs where the judge gives contradictory verdicts when order is swapped.
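The score-both-orderings mitigation is mechanical enough to sketch. This assumes a hypothetical `judge(question, first, second)` callable returning "A" (first shown wins), "B", or "tie"; the function and stub names are illustrative.

```python
def debiased_verdict(judge, question, resp_a, resp_b):
    """Judge the pair in both presentation orders and keep only
    order-stable verdicts. Returns "A", "B", "tie", or None to flag
    a contradictory (order-dependent) verdict for discard/escalation.
    """
    v1 = judge(question, resp_a, resp_b)            # A shown first
    v2 = judge(question, resp_b, resp_a)            # B shown first
    v2 = {"A": "B", "B": "A", "tie": "tie"}[v2]     # map back to A/B labels
    return v1 if v1 == v2 else None

# A judge with pure position bias always picks the first slot,
# so swapping the order exposes it and the pair is flagged.
biased = lambda q, first, second: "A"
print(debiased_verdict(biased, "q?", "resp1", "resp2"))  # None
```

This doubles judging cost per pair, but the flagged-contradiction signal is itself useful: a high contradiction rate means the two responses are close enough that the pairwise verdict carries little information.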
Verbosity bias
Severity: High. Judges consistently prefer longer responses, even when a shorter response is more correct or more appropriate. A 500-word answer that is 70% correct is often rated higher than a 100-word answer that is 100% correct.
Mitigation: Include explicit length guidance in the rubric: 'A shorter correct answer is better than a longer answer with unnecessary content.' Add a specific criterion for conciseness when brevity matters.
Self-preference bias
Severity: Medium. A judge model prefers outputs from its own model family. GPT-4 judges tend to rate GPT-4 outputs higher; Claude judges tend to rate Claude outputs higher, even when controlled for actual quality.
Mitigation: Use a judge from a different model family when comparing across model families. Cross-validate with multiple judges. Use human evaluation for final release decisions.
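Cross-validation with multiple judges reduces to a majority vote. A minimal sketch: each element of `judges` is a hypothetical callable (standing in for, say, a GPT judge, a Claude judge, and a Gemini judge) returning a verdict string.

```python
from collections import Counter

def cross_family_verdict(judges, prompt):
    """Majority vote across judges from different model families.

    Each judge is a hypothetical callable mapping a prompt to a
    verdict string. Returns "no-consensus" without a strict majority.
    """
    verdicts = [judge(prompt) for judge in judges]
    winner, count = Counter(verdicts).most_common(1)[0]
    return winner if count * 2 > len(judges) else "no-consensus"

# Stub judges for illustration; two of three agree on "A".
judges = [lambda p: "A", lambda p: "A", lambda p: "B"]
print(cross_family_verdict(judges, "compare A and B"))  # A
```

"No-consensus" cases are good candidates for human review, which also keeps the human workload focused where the judges disagree.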
Rubric drift
Severity: Medium. The same rubric is interpreted differently across runs, especially at temperature > 0. A score of 4 in one batch may correspond to a score of 3 in another batch.
Mitigation: Use temperature 0 for reproducibility. Anchor the scale with explicit examples in the prompt: 'A score of 5 means... A score of 3 means... A score of 1 means...' Run a calibration set across batches.
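One way to operationalise the calibration-set check: re-score a fixed set of anchored examples in every batch and measure the gap against the reference scores. The function name and the ~0.5 threshold below are illustrative heuristics, not from any paper.

```python
def calibration_drift(anchor_scores, batch_scores):
    """Mean absolute gap between this batch's scores on a fixed
    calibration set and the anchored reference scores. On a 1-5
    scale, a gap well above ~0.5 (heuristic threshold) suggests
    the rubric is being read differently in this run.
    """
    gaps = [abs(a - b) for a, b in zip(anchor_scores, batch_scores)]
    return sum(gaps) / len(gaps)

# Anchored references vs. this batch's re-scores of the same examples.
print(calibration_drift([5, 3, 4, 2], [4, 3, 3, 2]))  # 0.5
```

If drift exceeds the threshold, re-run the batch or re-anchor the rubric before trusting the scores.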
When LLM-as-Judge Works
Works well
- Summarisation quality (does this summary capture the key points?)
- Tone and style matching (does this match the specified writing style?)
- Instruction following (did the response follow all specified constraints?)
- RAG faithfulness (are claims grounded in provided context?)
- Response relevance (does this address the question?)
- Format compliance (is the output in the specified format?)
Does not work well
- Factual correctness on topics the judge model does not know well
- Evaluating outputs from a model stronger than the judge
- High-stakes decisions (medical diagnosis, legal advice, financial) where any bias is unacceptable
- Tasks requiring domain expertise the judge lacks
- Creative tasks where quality is highly subjective and personal-preference-dependent
Template Prompts
Released under CC0 / public domain. Copy-paste freely.
Single-Point Scoring Template
You are an expert evaluator assessing the quality of an AI assistant's response.
Task: {task_description}
Question: {question}
Response to evaluate: {response}
Evaluate the response on the following criteria:
1. Relevance (1-5): Does the response directly address the question?
2. Accuracy (1-5): Are the claims factually correct based on your knowledge?
3. Completeness (1-5): Does the response cover the key aspects of the question?
4. Clarity (1-5): Is the response clearly written and easy to understand?
Think step by step through each criterion before assigning scores.
Output your scores in JSON format: {"relevance": X, "accuracy": X, "completeness": X, "clarity": X, "overall": X}
Pairwise Comparison Template
You are an expert evaluator comparing two AI assistant responses.
Task: {task_description}
Question: {question}
Response A: {response_a}
Response B: {response_b}
Which response is better, and why? Consider:
- Factual accuracy
- Completeness and relevance
- Clarity and conciseness
- Helpfulness for the user's actual need
Think step by step through your comparison before giving your verdict.
Output: {"winner": "A" | "B" | "tie", "reasoning": "...", "confidence": "high" | "medium" | "low"}
Sources
- [1] Liu et al., G-Eval - arxiv.org/abs/2303.16634 - 2023
- [2] Wang et al., Large Language Models are not Fair Evaluators - arxiv.org/abs/2305.17926 - 2023
- [3] Zheng et al., Judging LLM-as-a-Judge with MT-Bench - arxiv.org/abs/2306.05685 - 2023