Independent reference. No vendor affiliation. Scores cited with source and capture date. Affiliate disclosure.
Abstract
What: Three safety eval suites compared (HarmBench, ICML 2024; JailbreakBench, NeurIPS 2024; AILuminate, MLCommons 2024).
Who: Academic-industry consortia plus MLCommons.
2026 Use: Pre-release safety testing and post-deployment monitoring.
HarmBench: harmbench.org
Section IV.vi Methodology | Last verified April 2026

Safety Eval Suites 2026: HarmBench, JailbreakBench, AILuminate Compared

Three suites, three audiences, three failure-mode focuses.

I

Side-by-Side

| Suite | Prompts | Metric | Audience |
|---|---|---|---|
| HarmBench | 400 across 7 categories | ASR per attack method | Red-team researchers |
| JailbreakBench | 100 curated behaviours | Defence ASR leaderboard | Defence-method authors |
| AILuminate | 24,000 across 12 hazards | 5-grade rating | Enterprise risk teams |

II

Reading ASR

Attack Success Rate is the standard metric for HarmBench and JailbreakBench. Lower ASR means the model resisted more attacks. ASR is sensitive to the panel of attacks evaluated: a defended model can show low ASR against the published attack set and high ASR against a newer attack not in the panel. Always specify the attack panel when reporting ASR.
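The panel sensitivity described above can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical list of judged results; the outcomes are invented, `new_attack` is a placeholder name, and `asr_by_attack` is not part of either benchmark's tooling.

```python
from collections import defaultdict

# Hypothetical judged results: (attack_method, behaviour_id, judged_harmful).
# Outcomes are invented for illustration; they are not real benchmark data.
RESULTS = [
    ("GCG", "b1", False), ("GCG", "b2", False),
    ("GCG", "b3", True), ("GCG", "b4", False),
    ("PAIR", "b1", False), ("PAIR", "b2", True),
    ("PAIR", "b3", False), ("PAIR", "b4", False),
    ("new_attack", "b1", True), ("new_attack", "b2", True),
    ("new_attack", "b3", True), ("new_attack", "b4", False),
]


def asr_by_attack(results, panel):
    """Return {attack: ASR}, counting only attacks named in `panel`."""
    successes = defaultdict(int)
    totals = defaultdict(int)
    for attack, _behaviour, harmful in results:
        if attack in panel:
            totals[attack] += 1
            successes[attack] += int(harmful)
    return {attack: successes[attack] / totals[attack] for attack in totals}


# Same model, same judged outputs, two different attack panels.
published = asr_by_attack(RESULTS, panel={"GCG", "PAIR"})
extended = asr_by_attack(RESULTS, panel={"GCG", "PAIR", "new_attack"})

print(published)               # {'GCG': 0.25, 'PAIR': 0.25}
print(max(extended.values()))  # 0.75
```

The worst-case ASR moves from 0.25 to 0.75 without any change to the model, which is why a reported ASR is only meaningful alongside the list of attacks it was measured against.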

III

AILuminate's Five-Grade Approach

MLCommons designed AILuminate to bridge the gap between safety research and procurement decisions. A single number (ASR 12%) is hard for a procurement reviewer to interpret. A five-grade label per category (Excellent in Hate Speech, Good in Physical Harm, Fair in CSAM-adjacent) is easier to consume, easier to compare across vendors, and harder to game by tuning on a specific test set. The trade-off is loss of resolution: AILuminate cannot distinguish between two models that both rate Excellent.
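The loss of resolution can be made concrete with a toy grading function. The sketch below assumes grades are assigned from a per-category violation rate using fixed cut-offs; the cut-offs, the `grade` function, and the model numbers are all hypothetical and are not MLCommons' actual grading rules.

```python
# Hypothetical upper bounds on per-category violation rate
# (fraction of responses judged unsafe). Invented for illustration;
# these are NOT the thresholds AILuminate uses.
GRADE_CUTOFFS = [
    (0.001, "Excellent"),
    (0.01, "Very Good"),
    (0.05, "Good"),
    (0.15, "Fair"),
]


def grade(violation_rate):
    """Map a violation rate to a five-grade label using the toy cut-offs."""
    for upper_bound, label in GRADE_CUTOFFS:
        if violation_rate <= upper_bound:
            return label
    return "Poor"


# Invented violation rates for two imaginary models.
model_a = {"Hate Speech": 0.0002, "Physical Harm": 0.03}
model_b = {"Hate Speech": 0.0009, "Physical Harm": 0.12}

for category in model_a:
    print(category, grade(model_a[category]), grade(model_b[category]))
# Hate Speech Excellent Excellent
# Physical Harm Good Fair
```

In this toy setup model B produces more than four times as many Hate Speech violations as model A, yet both are labelled Excellent. The label is easier to read than a raw rate, but the difference inside a grade band is invisible.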

Reader Questions
Q.01 What is HarmBench?
HarmBench (Mazeika et al., ICML 2024) is a standardised evaluation framework for automated red-teaming. It includes 400 harmful behaviours across 7 categories (cybercrime, chemical and biological weapons, illegal activities, harassment, misinformation, copyright, and general harm) and a panel of 18 red-teaming methods. The headline metric is Attack Success Rate (ASR), the fraction of harmful behaviours that an attack method elicits from the target model.
Q.02 What is JailbreakBench?
JailbreakBench (Chao et al., 2024) is a benchmark of 100 harmful behaviours with curated jailbreak prompts from the literature. It serves as a leaderboard for defence methods (lower ASR is better) and as a publicly maintained record of which prompts work against which target models. Unlike HarmBench, JailbreakBench focuses on prompt-only attacks (no fine-tuning, no model-internal access).
Q.03 What is AILuminate?
AILuminate (MLCommons, December 2024) is an industry-grade safety benchmark with 12 hazard categories and 24,000 evaluation prompts. It is the safety counterpart to MLPerf in the MLCommons ecosystem. The output is a five-grade rating (Excellent, Very Good, Good, Fair, Poor) per category, not a single ASR percentage, which makes it more accessible to non-safety-specialist readers but harder to compare directly with HarmBench.
Q.04 Which should I use?
Pick HarmBench when the question is "how well does this defence hold up against the strongest published attacks?" Pick JailbreakBench when the question is "does this prompt-only jailbreak still work in 2026 against current frontier models?" Pick AILuminate when the question is "how do I report a single safety rating to a non-technical stakeholder?" The three are complements, not substitutes.
Q.05 Are safety eval scores trustworthy?
More trustworthy than capability scores in 2026, partly because the safety eval community is small and methodology is published. The remaining caveats are: (a) ASR is the wrong metric when the underlying behaviour rate is also low; (b) judge models used for free-text harm detection carry their own biases; (c) prompt-only attack panels understate the threat from fine-tuning attacks and from attackers with model-weight access. Always report several metrics and several attack types.

Sources

  1. HarmBench (Mazeika et al. 2024): arxiv.org/abs/2402.04249
  2. JailbreakBench (Chao et al. 2024): arxiv.org/abs/2404.01318
  3. AILuminate (MLCommons 2024): mlcommons.org/benchmarks/ai-safety
From the editor

Benchmarking Agents Review is published by Digital Signet, an independent firm that builds and ships AI agents in production for mid-market companies. If you are evaluating, designing, or productionising LLM agents and want a working second opinion, get in touch.
