LegalBench: 162 Legal Reasoning Tasks, GPT-4 Macro-F1 at 77.0
The benchmark that maps legal reasoning to LLM evaluation under the IRAC frame.
Construction
LegalBench was assembled through a 2022 to 2023 call for task contributions from practising lawyers, law professors, and researchers. The result is 162 tasks organised under six categories: Issue-Spotting, Rule-Recall, Rule-Application, Rule-Conclusion, Interpretation, and Rhetorical Analysis. The first four correspond to steps in IRAC (Issue, Rule, Application, Conclusion) analysis; Interpretation and Rhetorical Analysis extend the frame to reading legal text and reasoning about legal argument.
Task formats vary by reasoning type. Some tasks are binary classification (is this clause an arbitration clause?). Others are multi-class (which area of law does this case touch?). A handful are free-text generation evaluated by exact match or by judges. The headline metric is unweighted macro-F1 across all 162 tasks.
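One plausible reading of the headline metric, sketched in Python: compute macro-F1 within each task (averaging F1 over that task's labels), then take the plain mean over tasks so that every task counts equally regardless of its size. The function names, the toy task names, and the choice to average over gold labels only are assumptions for illustration, not the LegalBench evaluation code.

```python
def f1_for_label(gold, pred, label):
    """F1 for a single label, treating it as the positive class."""
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g == label and p == label)
    fp = sum(1 for g, p in pairs if g != label and p == label)
    fn = sum(1 for g, p in pairs if g == label and p != label)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def task_macro_f1(gold, pred):
    """Unweighted mean of per-label F1 over labels seen in the gold answers
    (an assumption; other conventions also include predicted-only labels)."""
    labels = sorted(set(gold))
    return sum(f1_for_label(gold, pred, lab) for lab in labels) / len(labels)


def benchmark_score(results):
    """results maps task name -> (gold labels, predicted labels).
    Returns (unweighted mean over tasks, per-task scores)."""
    per_task = {name: task_macro_f1(g, p) for name, (g, p) in results.items()}
    return sum(per_task.values()) / len(per_task), per_task


# Toy data with hypothetical task names, not real LegalBench outputs.
toy_results = {
    "toy_arbitration_clause": (["Yes", "No", "Yes", "No"], ["Yes", "No", "No", "No"]),
    "toy_area_of_law": (["tort", "contract"], ["tort", "contract"]),
}
overall, per_task = benchmark_score(toy_results)
print(round(overall, 3), {k: round(v, 3) for k, v in per_task.items()})
```

Because the outer average is unweighted, a task with a few dozen examples moves the headline number as much as one with thousands, which is worth keeping in mind when comparing models on a small number of points.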
SOTA Progression
Where Models Underperform
The original paper found Rule-Application (mapping a given set of facts to a stated legal rule) was the hardest category. Frontier 2026 models retain this pattern: Issue-Spotting and Rule-Recall sit in the low 90s, Rule-Application sits in the high 70s, and Rule-Conclusion lags further. This matches the intuition that pattern recognition is easier than legal judgement.
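A per-category breakdown of this kind can be produced by grouping per-task scores by category and averaging within each group. The sketch below assumes a hypothetical mapping from task name to category and uses toy scores; it does not reproduce any reported results.

```python
from collections import defaultdict


def category_means(task_scores, task_category):
    """Average per-task scores within each category.

    task_scores:   task name -> score (e.g. a per-task F1 in [0, 1])
    task_category: task name -> category name (hypothetical mapping)
    """
    buckets = defaultdict(list)
    for task, score in task_scores.items():
        buckets[task_category[task]].append(score)
    return {cat: sum(vals) / len(vals) for cat, vals in buckets.items()}


# Toy values for illustration only.
toy_scores = {"task_a": 0.9, "task_b": 0.8, "task_c": 0.7}
toy_categories = {
    "task_a": "Issue-Spotting",
    "task_b": "Issue-Spotting",
    "task_c": "Rule-Application",
}
for category, mean in sorted(category_means(toy_scores, toy_categories).items()):
    print(f"{category}: {mean:.2f}")
```

Reporting the category means alongside the overall number makes it easier to see when a high headline score hides a weak category.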
Q.01 What is LegalBench?
Q.02 What was the headline number?
Q.03 Why use IRAC as the structuring framework?
Q.04 Is LegalBench US-centric?
Q.05 Should I rely on LegalBench scores for production legal tools?
Sources
- [1] Guha et al. (2023): arxiv.org/abs/2308.11462
- [2] LegalBench project: hazyresearch.stanford.edu/legalbench
- [3] LegalBench repository: github.com/HazyResearch/legalbench