Abstract

What: Holistic evaluation framework with 7 metrics across 30+ scenarios; published methodology and disclosure standard.

Who: Stanford Center for Research on Foundation Models (CRFM), Percy Liang group, 2022 launch.

2026 Use: Reference methodology and disclosure baseline; multiple specialised lineages (Safety, MedHELM, LegalHELM).

Project: crfm.stanford.edu/helm

Section V.vi Tools | Last verified April 2026

Stanford HELM: 7 Metrics, 30 Scenarios, Why Coverage Beats Headline Score

The framework that made the case for holistic evaluation. Still the published gold standard for disclosure.

The Seven Metrics

Metric | What it measures
Accuracy | Headline correctness against the gold answer.
Calibration | Do the model's stated probabilities match how often it is actually correct?
Robustness | Does accuracy hold under typos, paraphrase, and perturbation?
Fairness | Does accuracy hold across demographic subgroups?
Bias | Does the model amplify stereotypes when given the chance?
Toxicity | Does the model emit toxic outputs in response to neutral prompts?
Efficiency | Inference cost in tokens, latency, and dollars.
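Calibration is the metric in this list that is easiest to misread, so a small worked sketch may help. Expected calibration error (ECE) is one common way to score it: group predictions by stated confidence, then compare average confidence with observed accuracy in each group. The bin count, function name, and sample data below are illustrative assumptions, not HELM's own code.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width confidence bins (illustrative sketch)."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 lands in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        # Weight each bin's confidence/accuracy gap by its share of predictions.
        ece += (len(members) / total) * abs(avg_conf - accuracy)
    return ece


# Hypothetical data: the model claims 90% confidence but is right 3 times in 4.
overconfident = expected_calibration_error(
    [0.9, 0.9, 0.9, 0.9], [True, True, True, False]
)
print(round(overconfident, 3))
```

A model can score well on accuracy and still have a large ECE, which is why the two are reported as separate columns.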

Why the Holistic Frame Matters

A model with 92% accuracy, high toxicity, and slow inference is not interchangeable with a model with 90% accuracy, low toxicity, and fast inference, yet reporting only accuracy makes them look interchangeable. HELM's design forces the trade-offs into view, which is one reason procurement teams evaluating model vendors have started asking for HELM-style reporting in RFPs.
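One way to make "not interchangeable" concrete is a dominance check: a model only clearly beats another if it is at least as good on every reported metric. The sketch below uses three of the seven metrics and invented numbers; the `ModelReport` class and its field names are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelReport:
    name: str
    accuracy: float   # higher is better
    toxicity: float   # lower is better
    latency_s: float  # lower is better


def dominates(a, b):
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    at_least_as_good = (
        a.accuracy >= b.accuracy
        and a.toxicity <= b.toxicity
        and a.latency_s <= b.latency_s
    )
    strictly_better = (
        a.accuracy > b.accuracy
        or a.toxicity < b.toxicity
        or a.latency_s < b.latency_s
    )
    return at_least_as_good and strictly_better


# Hypothetical numbers echoing the 92% vs 90% example.
model_a = ModelReport("model-a", accuracy=0.92, toxicity=0.08, latency_s=2.4)
model_b = ModelReport("model-b", accuracy=0.90, toxicity=0.01, latency_s=0.6)

print(dominates(model_a, model_b), dominates(model_b, model_a))
```

Neither model dominates the other here, so the choice depends on the deployment, and an accuracy-only ranking hides that.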


The Specialised Lineages

HELM Safety adds safety-specific scenarios (jailbreak resistance, harmful behaviour refusal). MedHELM covers MedQA, MultiMedQA, and clinical scenarios under the same disclosure standard. LegalHELM applies the framework to LegalBench and related tasks. HELM AIR-Bench (released 2024) covers AI risk categories derived from government regulations and company policies. Each lineage uses the same methodology with domain-specific scenarios added.


Reader Questions

Q.01 What is HELM?

HELM (Holistic Evaluation of Language Models) is the Stanford CRFM evaluation framework released in November 2022 and expanded continuously since. The framework evaluates each model under test on 7 metrics (accuracy, calibration, robustness, fairness, bias, toxicity, efficiency) across 30+ scenarios spanning question answering, summarisation, sentiment, toxicity detection, code generation, and information retrieval. The result is a published table rather than a single number.

Q.02 Why a 7-metric matrix instead of a leaderboard?

The HELM team argued that a single accuracy number conceals trade-offs that matter for deployment. For a consumer product, a model with high accuracy and high toxicity is worse than a slightly less accurate model with low toxicity. For a high-throughput API, an accurate and efficient model is worth more than a slightly more accurate model that is 10x slower. The 7-metric framing forces these trade-offs into view.

Q.03 Is HELM still maintained?

Yes. HELM has expanded into several specialised lineages: HELM Classic (the original capability matrix), HELM Instruct (for instruction-tuned models), HELM Safety (the safety-focused expansion), the domain-specific spin-offs MedHELM and LegalHELM, and HELM AIR-Bench (AI risk categories). All are public and updated regularly with new model submissions.

Q.04 How does HELM differ from MMLU or Open LLM Leaderboard?

MMLU is a single benchmark with a single number. The Hugging Face Open LLM Leaderboard runs several benchmarks and reports a few numbers. HELM runs more benchmarks under a tighter methodology (one fixed prompt template per scenario, multiple seeds, disclosed harness) and reports more dimensions. HELM numbers are usually 1 to 3 points lower than Open LLM Leaderboard numbers on the same model, a gap that reflects the tighter methodology.
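The "multiple seeds" part of that methodology is simple to illustrate: instead of quoting one run, report the mean and spread across seeds under the same prompt template. The function name and the scores below are hypothetical, and this is a sketch of the reporting idea rather than the HELM harness itself.

```python
import statistics


def summarise_runs(scores_by_seed):
    """Return (mean, sample standard deviation) of per-seed scores."""
    scores = list(scores_by_seed.values())
    if len(scores) < 2:
        raise ValueError("need at least two seeds to estimate spread")
    return statistics.mean(scores), statistics.stdev(scores)


# Hypothetical accuracies for one model on one scenario, same prompt template.
runs = {0: 0.71, 1: 0.69, 2: 0.73}
mean, spread = summarise_runs(runs)
print(f"accuracy = {mean:.3f} +/- {spread:.3f} over {len(runs)} seeds")
```

With a spread of about two points across seeds, a gap of 1 to 3 points between two harnesses is small enough that single-run comparisons should be read with care.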

Q.05 Why does HELM matter for benchmark literacy?

HELM is the published baseline for what a defensible eval methodology looks like. Reading the HELM paper (Liang et al. 2022) is the fastest way to understand why reporting standards matter. The HELM disclosure checklist (prompt template, decoding, sample count, scoring harness, multiple seeds) is the de facto standard that the rest of the field is slowly catching up to.
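The checklist items named above map naturally onto a small record that an eval report can be checked against. The sketch below is one possible encoding; the `EvalDisclosure` class, its field names, and the sample values are assumptions made for illustration and are not a HELM schema.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class EvalDisclosure:
    # One field per checklist item; None means "not disclosed".
    prompt_template: Optional[str] = None
    decoding: Optional[dict] = None
    sample_count: Optional[int] = None
    scoring_harness: Optional[str] = None
    seeds: Optional[list] = None


def missing_items(disclosure):
    """Names of checklist items that are absent or empty, in declaration order."""
    return [f.name for f in fields(disclosure) if not getattr(disclosure, f.name)]


# Hypothetical report that discloses the prompt and decoding but not the rest.
report = EvalDisclosure(
    prompt_template="Question: {question}\nAnswer:",
    decoding={"temperature": 0.0, "max_tokens": 64},
    sample_count=1000,
)
print(missing_items(report))
```

A result that cannot fill in every field is not necessarily wrong, but it cannot be reproduced or compared on equal terms.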

Sources

[1] HELM project: crfm.stanford.edu/helm
[2] Liang et al., HELM paper (2022): arxiv.org/abs/2211.09110
[3] HELM Safety extension: crfm.stanford.edu/helm/safety