Independent reference. No vendor affiliation. Scores cited with source and capture date. Affiliate disclosure.
Abstract
What: 162 legal reasoning tasks tagged against the IRAC framework, contributed by practising lawyers and law-school researchers.
Who: Guha, Nyarko, Ho, Ré, Chilton, et al. (Stanford CRFM, NeurIPS 2023 Datasets and Benchmarks track).
2026 Tier: Frontier above 86 macro-F1.
Project: hazyresearch.stanford.edu/legalbench
Section I.vii Industry Domain | Last verified April 2026

LegalBench: 162 Legal Reasoning Tasks, GPT-4 Macro-F1 at 77.0

The benchmark that maps legal reasoning to LLM evaluation under the IRAC frame.

I

Construction

LegalBench was assembled through a 2022 to 2023 call for task contributions from practising lawyers, law professors, and researchers. The result is 162 tasks organised under six categories: Issue-Spotting, Rule-Recall, Rule-Application, Rule-Conclusion, Interpretation, and Rhetorical Understanding. The first four correspond to steps in IRAC analysis; Interpretation and Rhetorical Understanding cover skills that sit alongside it.

Task formats vary by reasoning type. Some tasks are binary classification (is this clause an arbitration clause?). Others are multi-class (which area of law does this case touch?). A handful are free-text generation, scored by exact match or by human graders. The headline metric is unweighted macro-F1 across all 162 tasks, so every task counts equally regardless of how many examples it contains.
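The aggregation described above can be sketched in a few lines. This is a minimal illustration and not the official LegalBench evaluation code: it assumes each classification task is scored as the mean of per-class F1, and that the suite score is the plain average of those task scores. The function names and the two toy tasks are hypothetical.

```python
def class_f1(y_true, y_pred, label):
    """F1 for one class, treating `label` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def task_macro_f1(y_true, y_pred):
    """Mean of per-class F1 over the classes present in the gold labels."""
    labels = sorted(set(y_true))
    return sum(class_f1(y_true, y_pred, lab) for lab in labels) / len(labels)


def suite_macro_f1(results):
    """Unweighted mean of task scores; `results` maps task -> (gold, predicted)."""
    scores = [task_macro_f1(gold, pred) for gold, pred in results.values()]
    return sum(scores) / len(scores)


# Toy data, invented for illustration only (not LegalBench tasks or outputs).
toy_results = {
    "toy_clause_detection": (["yes", "no", "yes", "no"], ["yes", "no", "no", "no"]),
    "toy_area_of_law": (["tort", "contract", "tort"], ["tort", "contract", "tort"]),
}
toy_suite_score = suite_macro_f1(toy_results)
```

Because the average is unweighted, the three-example toy task moves the suite score exactly as much as the four-example one. That is the property that keeps large tasks from dominating the headline number.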

II

SOTA Progression

| Date | Tier / Score | Note |
| --- | --- | --- |
| Aug 2023 | GPT-4 at 77.0 macro-F1 | Original Guha et al. paper baseline. |
| Mar 2024 | Claude 3 Opus at 78.4 | Anthropic-reported with chain-of-thought. |
| Sep 2024 | GPT-4o at 80.1 | Mainstream model card. |
| Apr 2026 | Frontier above 86 macro-F1 | Captured from public reports; task-level variance remains high. |

III

Where Models Underperform

The original paper found that Rule-Application (mapping a given set of facts to a stated legal rule) was the hardest category. Frontier 2026 models retain this pattern: Issue-Spotting and Rule-Recall sit in the low 90s, Rule-Application sits in the high 70s, and Rule-Conclusion also trails the recall-oriented categories. This matches the intuition that pattern recognition is easier than legal judgement.
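A per-category breakdown like the one above comes from grouping per-task scores by their category tag and averaging within each group. The sketch below assumes you already have one score per task and a task-to-category mapping. The task names, category assignments, and scores are placeholders invented for illustration, not LegalBench results.

```python
from collections import defaultdict
from statistics import mean


def category_means(task_scores, task_category):
    """Average per-task scores within each category tag."""
    buckets = defaultdict(list)
    for task, score in task_scores.items():
        buckets[task_category[task]].append(score)
    return {category: mean(scores) for category, scores in buckets.items()}


def weakest_first(means_by_category):
    """Sort categories from lowest to highest mean score."""
    return sorted(means_by_category.items(), key=lambda item: item[1])


# Placeholder inputs (hypothetical tasks and made-up scores).
toy_category = {
    "task_a": "issue-spotting",
    "task_b": "issue-spotting",
    "task_c": "rule-application",
    "task_d": "rule-application",
}
toy_scores = {"task_a": 0.9, "task_b": 0.8, "task_c": 0.7, "task_d": 0.5}

toy_ranking = weakest_first(category_means(toy_scores, toy_category))
```

Reading the ranking from the top shows where failures cluster. It also makes plain that a single headline average can hide a weak category behind several strong ones.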

Reader Questions
Q.01 What is LegalBench?
LegalBench is a collaboratively built benchmark for evaluating LLM legal reasoning. It collects 162 tasks contributed by legal practitioners and researchers, organised under the IRAC (Issue, Rule, Application, Conclusion) framework that mirrors how lawyers actually reason. Tasks span case classification, doctrine application, statute interpretation, and contract clause detection.
Q.02 What was the headline number?
The original Guha et al. paper (Stanford CRFM, 2023) reported GPT-4 at 77.0 macro-F1 across the 162-task suite. Mid-tier models (Claude 2, PaLM 2) trailed in the high 60s. Open-weight legal fine-tunes (SaulLM-7B) reached around 70 macro-F1 on the suite. In 2026 frontier models score in the high 80s, and task-level variance remains significant.
Q.03 Why use IRAC as the structuring framework?
IRAC (Issue, Rule, Application, Conclusion) is the standard legal-analysis pattern taught in U.S. law schools and recognised in most common-law jurisdictions. Tagging each task with the IRAC stage it requires lets the benchmark show where model failures cluster. The original paper found that models do well on Issue-Spotting and Rule-Recall but worse on Rule-Application (mapping facts to a rule).
Q.04 Is LegalBench US-centric?
Mostly. The bulk of tasks derive from U.S. statutes, U.S. case law, and U.S. contract conventions. Some tasks (CUAD-derived contract analysis) depend on the language of the document rather than the jurisdiction of the law, so they generalise better. For UK, EU, or Commonwealth legal reasoning, LegalBench results are likely to overestimate capability and should be supplemented with jurisdiction-specific evaluation.
Q.05 Should I rely on LegalBench scores for production legal tools?
No, not alone. LegalBench measures benchmark-task performance; production legal work requires confidentiality controls, citation accuracy, hallucination detection, and human-in-the-loop review. A high LegalBench score is a necessary but not sufficient signal. The Stanford CRFM authors are explicit about this in the paper's limitations section.

Sources

[1] Guha et al. (2023): arxiv.org/abs/2308.11462
[2] LegalBench project: hazyresearch.stanford.edu/legalbench
[3] LegalBench repository: github.com/HazyResearch/legalbench

From the editor

Benchmarking Agents Review is published by Digital Signet, an independent firm that builds and ships AI agents in production for mid-market companies. If you are evaluating, designing, or productionising LLM agents and want a working second opinion, get in touch.
