Stanford HELM: 7 Metrics, 42 Scenarios, Why Coverage Beats Headline Score
The framework that made the case for holistic evaluation. Still the published gold standard for disclosure.
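The Seven Metrics
HELM measures seven metrics for each core scenario wherever they can be computed: accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency.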
Why the Holistic Frame Matters
A model with 92% accuracy, high toxicity, and slow inference is not interchangeable with a model with 90% accuracy, low toxicity, and fast inference. Reporting only accuracy hides that difference and makes the two look near-interchangeable. HELM's design forces the trade-offs into view, which is why procurement teams have started asking for HELM-style reporting in RFPs.
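The trade-off can be made concrete with a small sketch. The accuracy figures are the ones from the paragraph above; the toxicity and latency values, the `Report` class, and the helper names are hypothetical and chosen only for illustration. They are not HELM output and not part of the HELM codebase. The sketch contrasts a headline comparison (accuracy only) with a dominance check across all reported metrics.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Report:
    """One model's row in a multi-metric report (illustrative fields only)."""
    name: str
    accuracy: float   # higher is better (values taken from the text)
    toxicity: float   # fraction of toxic outputs, lower is better (hypothetical)
    latency_s: float  # seconds per request, lower is better (hypothetical)


def headline_winner(a: Report, b: Report) -> str:
    """Pick a winner the way a single-number leaderboard would."""
    return a.name if a.accuracy >= b.accuracy else b.name


def dominates(a: Report, b: Report) -> bool:
    """True if a is no worse than b on every metric and better on at least one."""
    no_worse = (
        a.accuracy >= b.accuracy
        and a.toxicity <= b.toxicity
        and a.latency_s <= b.latency_s
    )
    better_somewhere = (
        a.accuracy > b.accuracy
        or a.toxicity < b.toxicity
        or a.latency_s < b.latency_s
    )
    return no_worse and better_somewhere


model_a = Report("model_a", accuracy=0.92, toxicity=0.08, latency_s=2.4)
model_b = Report("model_b", accuracy=0.90, toxicity=0.01, latency_s=0.6)

print("Headline winner:", headline_winner(model_a, model_b))
print("model_a dominates model_b:", dominates(model_a, model_b))
print("model_b dominates model_a:", dominates(model_b, model_a))
```

Under these assumed numbers the headline view names `model_a`, while neither model dominates the other once toxicity and latency are included. The choice then depends on the buyer's priorities, which is the information a single score discards.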
The Specialised Lineages
HELM Safety adds safety-specific scenarios (jailbreak resistance, refusal of harmful requests). MedHELM covers MedQA, MultiMedQA, and clinical scenarios under the same disclosure standard. LegalHELM applies the framework to LegalBench and related tasks. HELM AIR-Bench (released 2024) evaluates safety against AI risk categories drawn from government regulations and company policies. Each lineage keeps the core methodology and adds domain-specific scenarios.
Q.01 What is HELM?
Q.02 Why a 7-metric matrix instead of a leaderboard?
Q.03 Is HELM still maintained?
Q.04 How does HELM differ from MMLU or Open LLM Leaderboard?
Q.05 Why does HELM matter for benchmark literacy?
Sources
- [1] HELM project: crfm.stanford.edu/helm
- [2] Liang et al., HELM paper (2022): arxiv.org/abs/2211.09110
- [3] HELM Safety extension: crfm.stanford.edu/helm/safety