Abstract

What: Three safety eval suites compared (HarmBench, ICML 2024; JailbreakBench, NeurIPS 2024; AILuminate, MLCommons 2024).

Who: Academic-industry consortia plus MLCommons.

2026 Use: Pre-release safety testing and post-deployment monitoring.

HarmBench: harmbench.org

Section IV.vi Methodology | Last verified April 2026

Safety Eval Suites 2026: HarmBench, JailbreakBench, AILuminate Compared

Three suites, three audiences, three failure-mode focuses.

Side-by-Side

Suite          | Prompts                  | Metric                  | Audience
---------------|--------------------------|-------------------------|-----------------------
HarmBench      | 400 across 7 categories  | ASR per attack method   | Red-team researchers
JailbreakBench | 100 curated behaviours   | Defence ASR leaderboard | Defence-method authors
AILuminate     | 24,000 across 12 hazards | 5-grade rating          | Enterprise risk teams

Reading ASR

Attack Success Rate is the standard metric for HarmBench and JailbreakBench. Lower ASR means the model resisted more attacks. ASR is sensitive to the panel of attacks evaluated: a defended model can show low ASR against the published attack set and high ASR against a newer attack not in the panel. Always specify the attack panel when reporting ASR.
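As a minimal sketch of the calculation, assume judged results are available as (attack, behaviour, success) records. The record layout, the function name `asr_by_attack`, the attack names, and the sample data are all illustrative; none of them come from HarmBench or JailbreakBench tooling. Reporting one ASR per attack keeps the panel explicit and shows how a model can look strong on the published panel and weak on a newer attack.

```python
from collections import defaultdict

def asr_by_attack(records):
    """Return {attack_name: ASR}, where ASR = successful attempts / all attempts.

    `records` is an iterable of (attack_name, behaviour_id, success) tuples.
    This layout is an assumption made for the example.
    """
    attempts = defaultdict(int)
    successes = defaultdict(int)
    for attack, _behaviour, success in records:
        attempts[attack] += 1
        if success:
            successes[attack] += 1
    return {attack: successes[attack] / attempts[attack] for attack in attempts}

# Made-up judged results for one defended model (not real benchmark data).
records = [
    ("published_attack_a", "b1", False), ("published_attack_a", "b2", False),
    ("published_attack_a", "b3", True), ("published_attack_a", "b4", False),
    ("published_attack_b", "b1", False), ("published_attack_b", "b2", True),
    ("published_attack_b", "b3", True), ("published_attack_b", "b4", False),
    ("newer_attack", "b1", True), ("newer_attack", "b2", True),
    ("newer_attack", "b3", True), ("newer_attack", "b4", False),
]

panel_asr = asr_by_attack(records)
for attack in sorted(panel_asr):
    # Name the attack next to every figure so the panel is never implicit.
    print(f"{attack}: ASR = {panel_asr[attack]:.0%}")
```

In this toy data the two published attacks give 25% and 50%, while the newer attack outside the panel gives 75%.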


AILuminate's Five-Grade Approach

MLCommons designed AILuminate to bridge the gap between safety research and procurement decisions. A single number (ASR 12%) is hard for a procurement reviewer to interpret. A five-grade label per category (Excellent in Hate Speech, Good in Physical Harm, Fair in CSAM-adjacent) is easier to consume, easier to compare across vendors, and harder to game by tuning on a specific test set. The trade-off is loss of resolution: AILuminate cannot distinguish between two models that both rate Excellent.
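The loss of resolution can be illustrated with a small sketch. The cut-offs below are hypothetical values chosen for the example and are not MLCommons' published grading rules; only the five grade names are taken from AILuminate. The unsafe-response rates are also made up.

```python
# Hypothetical upper bounds on the fraction of unsafe responses.
# These are NOT AILuminate's actual thresholds.
HYPOTHETICAL_CUTOFFS = [
    (0.001, "Excellent"),
    (0.01, "Very Good"),
    (0.05, "Good"),
    (0.15, "Fair"),
]

def grade(unsafe_rate):
    """Map an unsafe-response rate in [0, 1] to one of five grade labels."""
    for upper_bound, label in HYPOTHETICAL_CUTOFFS:
        if unsafe_rate < upper_bound:
            return label
    return "Poor"

model_a = 0.0002  # made-up rate
model_b = 0.0009  # made-up rate, 4.5 times higher than model_a

# Both rates fall in the same bucket, so the grade hides the difference.
print(grade(model_a), grade(model_b))
```

Any bucketing scheme has this property: differences inside a bucket disappear, whatever the real cut-offs are.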


Reader Questions

Q.01 What is HarmBench?

HarmBench (Mazeika et al., ICML 2024) is a standardised evaluation framework for automated red-teaming. It includes 400 harmful behaviours across 7 categories (cybercrime, chemical and biological, illegal activity, harassment, misinformation, copyright, and general harm) and a panel of 18 red-teaming methods. The headline metric is Attack Success Rate (ASR), the fraction of harmful behaviours that an attack method elicits from the target model.

Q.02 What is JailbreakBench?

JailbreakBench (Chao et al., NeurIPS 2024) is a benchmark of 100 harmful behaviours with curated jailbreak prompts from the literature. It serves as a leaderboard for defence methods (lower ASR is better) and as a publicly maintained record of which prompts work against which target models. Unlike HarmBench, JailbreakBench focuses on prompt-only attacks (no fine-tuning, no model-internal access).

Q.03 What is AILuminate?

AILuminate (MLCommons, December 2024) is an industry-grade safety benchmark with 12 hazard categories and 24,000 evaluation prompts. It is the safety counterpart to MLPerf in the MLCommons ecosystem. The output is a five-grade rating (Excellent, Very Good, Good, Fair, Poor) per category, not a single ASR percentage, which makes it more accessible to non-safety-specialist readers but harder to compare directly with HarmBench.

Q.04 Which should I use?

Pick HarmBench when the question is 'how well does this defence method hold up against the strongest published attacks?'. Pick JailbreakBench when the question is 'does this prompt-only jailbreak still work in 2026 against current frontier models?'. Pick AILuminate when the question is 'how do I report an easily read safety grade to a non-technical stakeholder?'. The three are complements, not substitutes.

Q.05 Are safety eval scores trustworthy?

Safety eval scores are more trustworthy than capability scores in 2026, partly because the safety eval community is small and its methodology is published. Three caveats remain: (a) ASR is the wrong metric when the underlying behaviour rate is also low; (b) judge models used for free-text harm detection carry their own biases; (c) prompt-only attacks underestimate the threat from fine-tuning attacks and model-weight access. Always report several metrics and several attack types.
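Caveat (b) can be checked on a hand-labelled sample before trusting a judge model's ASR. The sketch below assumes paired boolean labels (human and judge) for the same responses, with True meaning "harmful". The function name `judge_error_rates` and the sample labels are illustrative and not taken from any of the three suites.

```python
def judge_error_rates(human_labels, judge_labels):
    """Compare judge verdicts with human verdicts (True = response is harmful).

    Returns the agreement rate, the false-positive rate (judge flags a response
    the human marked harmless) and the false-negative rate (judge misses a
    response the human marked harmful).
    """
    if len(human_labels) != len(judge_labels) or not human_labels:
        raise ValueError("label lists must be non-empty and of equal length")
    pairs = list(zip(human_labels, judge_labels))
    harmful = sum(1 for h, _ in pairs if h)
    harmless = len(pairs) - harmful
    agree = sum(1 for h, j in pairs if h == j)
    false_pos = sum(1 for h, j in pairs if not h and j)
    false_neg = sum(1 for h, j in pairs if h and not j)
    return {
        "agreement": agree / len(pairs),
        "false_positive_rate": false_pos / harmless if harmless else 0.0,
        "false_negative_rate": false_neg / harmful if harmful else 0.0,
    }

# Made-up labels for ten responses (not real benchmark data).
human = [True, True, True, True, False, False, False, False, False, False]
judge = [True, True, True, False, False, False, False, False, True, True]

stats = judge_error_rates(human, judge)
human_asr = sum(human) / len(human)
judge_asr = sum(judge) / len(judge)
print(stats)
print(f"ASR by human labels: {human_asr:.0%}, ASR by judge labels: {judge_asr:.0%}")
```

In this toy sample the judge overstates ASR (50% against 40% by human labels) because its false positives outnumber its misses. That kind of gap is worth reporting alongside the headline figure.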

Sources

[1] HarmBench (Mazeika et al. 2024): arxiv.org/abs/2402.04249
[2] JailbreakBench (Chao et al. 2024): arxiv.org/abs/2404.01318
[3] AILuminate (MLCommons 2024): mlcommons.org/benchmarks/ai-safety