Independent reference. No vendor affiliation. Scores cited with source and capture date. Affiliate disclosure.
Abstract
What: Holistic evaluation framework with 7 metrics across 30+ scenarios; published methodology and disclosure standard.
Who: Stanford Center for Research on Foundation Models (CRFM), Percy Liang group, 2022 launch.
2026 Use: Reference methodology and disclosure baseline; multiple specialised lineages (Safety, MedHELM, LegalHELM).
Project: crfm.stanford.edu/helm
Section V.vi Tools | Last verified April 2026

Stanford HELM: 7 Metrics, 30 Scenarios, Why Coverage Beats Headline Score

The framework that made the case for holistic evaluation. Still the published gold standard for disclosure.

I

The Seven Metrics

| Metric | What it measures |
| --- | --- |
| Accuracy | Headline correctness against the gold answer. |
| Calibration | Do the model's stated probabilities match how often it is actually correct? |
| Robustness | Does accuracy hold under typos, paraphrase, and perturbation? |
| Fairness | Does accuracy hold across demographic subgroups? |
| Bias | Does the model amplify stereotypes when given the chance? |
| Toxicity | Does the model emit toxic outputs in response to neutral prompts? |
| Efficiency | Inference cost in tokens, latency, and dollars. |

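
Of the seven metrics, calibration is the least intuitive, so a minimal sketch may help. The code below computes expected calibration error (ECE) with equal-width bins, one common way to quantify calibration. The function name, the bin count and the toy data are illustrative assumptions, not HELM's own implementation.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Weighted average gap between mean confidence and accuracy per bin."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("need at least one prediction")

    # Group predictions into equal-width confidence bins.
    bins: list[list[tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(mean_conf - accuracy)
    return ece


# Hypothetical toy data: both models answer 3 of 4 questions correctly.
outcomes = [True, True, True, False]
overconfident = expected_calibration_error([0.99] * 4, outcomes)
well_calibrated = expected_calibration_error([0.75] * 4, outcomes)
```

Both toy models have identical accuracy, yet the one that claims 99% confidence on every answer ends up with a much larger calibration error than the one that claims 75%. An accuracy-only report cannot show that difference.
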
II

Why the Holistic Frame Matters

A model with 92% accuracy, high toxicity, and slow inference is not interchangeable with a model with 90% accuracy, low toxicity, and fast inference. Reporting only accuracy makes them look interchangeable. HELM's design forces the trade-offs into view, which is one reason procurement teams evaluating vendors have started asking for HELM-style reporting in RFPs.
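
A small sketch makes the point concrete. The two accuracy figures come from the paragraph above. The toxicity rates and latencies are hypothetical numbers chosen for illustration, and `Report` and `dominates` are assumed helpers, not part of HELM.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Report:
    name: str
    accuracy: float       # higher is better
    toxicity_rate: float  # lower is better (hypothetical value)
    latency_s: float      # lower is better (hypothetical value)


def dominates(a: Report, b: Report) -> bool:
    """True if a is at least as good as b on every metric and strictly better on one."""
    at_least_as_good = (
        a.accuracy >= b.accuracy
        and a.toxicity_rate <= b.toxicity_rate
        and a.latency_s <= b.latency_s
    )
    strictly_better = (
        a.accuracy > b.accuracy
        or a.toxicity_rate < b.toxicity_rate
        or a.latency_s < b.latency_s
    )
    return at_least_as_good and strictly_better


model_a = Report("A", accuracy=0.92, toxicity_rate=0.08, latency_s=2.4)
model_b = Report("B", accuracy=0.90, toxicity_rate=0.01, latency_s=0.6)

# Accuracy alone ranks A first.
accuracy_winner = max([model_a, model_b], key=lambda r: r.accuracy).name

# With all three columns visible, neither model dominates the other,
# so the choice depends on the deployment rather than on a single rank.
has_clear_winner = dominates(model_a, model_b) or dominates(model_b, model_a)
```
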

III

The Specialised Lineages

HELM Safety adds safety-specific scenarios (jailbreak resistance, harmful behaviour refusal). MedHELM covers MedQA, MultiMedQA, and clinical scenarios under the same disclosure standard. LegalHELM applies the framework to LegalBench and related tasks. HELM AIR-Bench (released 2024) is a safety benchmark organised around risk categories drawn from government regulations and company policies. Each lineage uses the same methodology with domain-specific scenarios bolted on.

Reader Questions
Q.01 What is HELM?
HELM (Holistic Evaluation of Language Models) is the Stanford CRFM evaluation framework released in November 2022 and expanded continuously since. The framework evaluates each model under test on 7 metrics (accuracy, calibration, robustness, fairness, bias, toxicity, efficiency) across 30+ scenarios spanning question answering, summarisation, sentiment, toxicity detection, code generation, and information retrieval. The result is a published table rather than a single number.
Q.02 Why a 7-metric matrix instead of a leaderboard?
The HELM team argued that a single accuracy number conceals trade-offs that matter for deployment. A model with high accuracy and high toxicity is worse than a slightly less accurate model with low toxicity for a consumer product. A model with high accuracy and high efficiency is worth more than a more accurate but 10x slower model for a high-throughput API. The 7-metric framing forces these trade-offs into view.
Q.03 Is HELM still maintained?
Yes. HELM has expanded into several specialised lineages: HELM Classic (the original capability matrix), HELM Instruct (for instruction-tuned models), HELM Safety (the safety-focused expansion), and MedHELM, LegalHELM, and AIR-Bench (domain- and risk-specific spin-offs). All are public and updated regularly with new model submissions.
Q.04 How does HELM differ from MMLU or Open LLM Leaderboard?
MMLU is a single benchmark with a single number. The Hugging Face Open LLM Leaderboard runs several benchmarks and reports a few numbers. HELM runs more benchmarks under a tighter methodology (a fixed prompt template per scenario, multiple seeds, a disclosed harness) and reports more dimensions. HELM numbers are usually 1 to 3 points lower than Open LLM Leaderboard numbers for the same model, a gap that reflects the tighter methodology.
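
The "fixed template, multiple seeds" idea can be sketched as follows. This is an illustration under stated assumptions, not HELM's harness: `toy_model`, the template text, the question set and the seed list are all hypothetical stand-ins, and a real run would call an actual model.

```python
import random
import statistics
from typing import Callable

# One template per scenario, held fixed for every model under test (hypothetical text).
PROMPT_TEMPLATE = "Question: {question}\nAnswer:"

# Hypothetical four-item scenario with gold answers.
ANSWERS = {"2 + 2": "4", "3 + 5": "8", "7 - 4": "3", "6 / 2": "3"}


def evaluate(
    model: Callable[[str, random.Random], str],
    items: list[tuple[str, str]],
    seeds: list[int],
) -> tuple[float, float]:
    """Return the mean and standard deviation of accuracy across seeds."""
    scores = []
    for seed in seeds:
        rng = random.Random(seed)  # the seed drives any sampling inside the model call
        hits = sum(
            model(PROMPT_TEMPLATE.format(question=q), rng).strip() == gold
            for q, gold in items
        )
        scores.append(hits / len(items))
    return statistics.mean(scores), statistics.pstdev(scores)


def toy_model(prompt: str, rng: random.Random) -> str:
    """Stand-in for a real model: answers correctly roughly 80% of the time."""
    question = prompt.removeprefix("Question: ").removesuffix("\nAnswer:")
    return ANSWERS[question] if rng.random() < 0.8 else "unknown"


mean_acc, spread = evaluate(toy_model, list(ANSWERS.items()), seeds=[0, 1, 2])
```

Reporting `spread` next to `mean_acc` shows how much of a gap between two models could be seed noise, which a single-run number hides.
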
Q.05 Why does HELM matter for benchmark literacy?
HELM is the published baseline for what a defensible eval methodology looks like. Reading the HELM paper (Liang et al. 2022) is the fastest way to understand why reporting standards matter. The HELM disclosure checklist (prompt template, decoding, sample count, scoring harness, multiple seeds) is the de facto standard that the rest of the field is slowly catching up to.
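
As a rough sketch, the five checklist items can be captured in a small record that travels with every reported score. The class name, field names and example values are assumptions for illustration; HELM does not prescribe this structure.

```python
from dataclasses import dataclass, fields


@dataclass
class EvalDisclosure:
    prompt_template: str = ""
    decoding: str = ""            # e.g. temperature and max tokens
    sample_count: int = 0
    scoring_harness: str = ""
    seeds: tuple[int, ...] = ()

    def missing(self) -> list[str]:
        """Names of checklist items that were left empty."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


# A hypothetical, partially disclosed result.
partial = EvalDisclosure(prompt_template="Question: {q}\nAnswer:", sample_count=1000)
gaps = partial.missing()
```

A score whose `missing()` list is non-empty cannot be reproduced from the report alone, and that is the gap the checklist is meant to close.
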

Sources

  [1] HELM project: crfm.stanford.edu/helm
  [2] Liang et al., HELM paper (2022): arxiv.org/abs/2211.09110
  [3] HELM Safety extension: crfm.stanford.edu/helm/safety
From the editor

Benchmarking Agents Review is published by Digital Signet, an independent firm that builds and ships AI agents in production for mid-market companies. If you are evaluating, designing, or productionising LLM agents and want a working second opinion, get in touch.
