BIG-Bench Hard: 23 Tasks, 6,511 Examples, Frontier Above 90%
The reasoning subset that made chain-of-thought prompting mandatory. Still discriminates everything below the frontier.
The Construction
BIG-Bench Hard (BBH) is the canonical hard subset of BIG-Bench, which has 204 tasks in total. Suzgun and collaborators at Google Research and Stanford selected 23 tasks on which no previously reported language-model evaluation had beaten the average human-rater score as of mid-2022. The 23 tasks total 6,511 problems and cover logical deduction, multi-step arithmetic, tracking shuffled objects, following navigation instructions, web-of-lies inference, salient translation error detection, and more than a dozen other probes.
The official protocol is 3-shot prompting with canonical CoT exemplars released by the Suzgun group. Most published numbers use this exact protocol, which is one reason BBH numbers are unusually comparable across papers and model cards.
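As a rough illustration of what a 3-shot CoT prompt looks like in practice, the sketch below assembles one from three worked exemplars and an open test question. The `Exemplar` class, the `build_cot_prompt` function, and the exact "Q:/A:" layout are assumptions made for this example; the canonical exemplar files in the BBH release are the authoritative format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Exemplar:
    """One worked example (hypothetical container, not part of the BBH release)."""
    question: str
    rationale: str  # worked reasoning shown to the model
    answer: str     # final answer string, e.g. "(A)" or "True"


def build_cot_prompt(task_description: str, exemplars: list[Exemplar], question: str) -> str:
    """Assemble a few-shot chain-of-thought prompt that ends at the open test question."""
    if len(exemplars) != 3:
        raise ValueError("the protocol described above uses exactly 3 exemplars")
    parts = [task_description.strip()]
    for ex in exemplars:
        # Each exemplar shows the question, the reasoning, and a fixed answer phrase.
        parts.append(
            f"Q: {ex.question}\n"
            f"A: Let's think step by step.\n"
            f"{ex.rationale} So the answer is {ex.answer}."
        )
    # The test question is left open so the model continues with its own reasoning.
    parts.append(f"Q: {question}\nA: Let's think step by step.")
    return "\n\n".join(parts)
```

Keeping the exemplars and the answer phrase fixed across runs is what makes scores from different papers line up: the only thing that varies is the model.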
SOTA Progression 2022 to 2026
Which Tasks Still Have Headroom
Even in 2026, three BBH tasks remain hard for frontier models: tracking_shuffled_objects (7-object variant), web_of_lies (large depth), and dyck_languages (deep nesting). Each requires tracking combinatorial state across a long CoT chain, which strains attention. On the remaining 20 tasks, frontier scores are at ceiling and the differences between models are within noise.
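To make the state-tracking burden concrete, here is a minimal sketch of the bookkeeping that two of these tasks demand. A program does it trivially with a stack or a dictionary; a model has to carry the same state through every step of its written reasoning. The function names and input formats are assumptions for illustration and do not mirror the benchmark's actual data format.

```python
PAIRS = {"(": ")", "[": "]", "{": "}", "<": ">"}
CLOSERS = set(PAIRS.values())


def complete_dyck(prefix: str) -> str:
    """Return the closing brackets that balance a valid bracket prefix."""
    stack = []
    for ch in prefix:
        if ch in PAIRS:
            stack.append(ch)  # remember every bracket that is still open
        elif ch in CLOSERS:
            if not stack or PAIRS[stack.pop()] != ch:
                raise ValueError(f"unbalanced closer {ch!r}")
        # any other character (such as whitespace) is ignored
    # Close the open brackets innermost first.
    return "".join(PAIRS[ch] for ch in reversed(stack))


def track_swaps(initial: dict[str, str], swaps: list[tuple[str, str]]) -> dict[str, str]:
    """Apply pairwise swaps between holders and return who holds what at the end."""
    state = dict(initial)  # copy so the caller's mapping is untouched
    for a, b in swaps:
        state[a], state[b] = state[b], state[a]
    return state
```

For example, `complete_dyck("[ { (")` returns `")}]"`. The stack depth grows with nesting, and the swap state grows with the number of objects, which is why the deepest and largest variants are the ones that still separate models.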
Why BBH Was Important
The 2022 paper established chain-of-thought as the default prompting style for reasoning tasks. The 12.4 point lift from direct to CoT prompting on PaLM 540B became one of the most cited pieces of evidence that prompting matters as much as architecture. Modern reasoning-tuned models (o1, o3, Claude thinking modes) take this insight further: the reasoning chain is generated by the model itself rather than elicited by exemplars in the prompt.
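A lift like this is the difference between two exact-match accuracies, one per prompting style. The sketch below shows one way to score CoT output, assuming each completion ends with a phrase of the form "the answer is X." The helper names are hypothetical, and real evaluation harnesses may normalise answers differently.

```python
import re
from typing import Optional

# Assumption: the final answer follows the last occurrence of this phrase.
MARKER = re.compile(r"the answer is", re.IGNORECASE)


def extract_answer(completion: str) -> Optional[str]:
    """Pull the final answer out of a chain-of-thought completion, or None if absent."""
    matches = list(MARKER.finditer(completion))
    if not matches:
        return None
    tail = completion[matches[-1].end():].strip()
    if not tail:
        return None
    first_line = tail.splitlines()[0]
    # Drop the sentence-ending period but keep inner characters such as "(B)".
    return first_line.strip().rstrip(".").strip() or None


def exact_match_accuracy(answers: list[Optional[str]], targets: list[str]) -> float:
    """Percentage of answers that match their target string exactly."""
    if not targets or len(answers) != len(targets):
        raise ValueError("answers and targets must be non-empty and the same length")
    correct = sum(a == t for a, t in zip(answers, targets))
    return 100.0 * correct / len(targets)


def cot_lift(direct_acc: float, cot_acc: float) -> float:
    """Percentage-point gain of CoT prompting over direct prompting."""
    return cot_acc - direct_acc
```

Under direct prompting the completion is the answer itself, so it can be stripped and passed to `exact_match_accuracy` without extraction. A completion with no answer phrase counts as wrong, which is one reason CoT scores depend on the model following the exemplar format.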
Q.01 What is BIG-Bench Hard?
Q.02 Is BBH still useful in 2026?
Q.03 What is the difference between BBH and BIG-Bench Lite?
Q.04 Why does chain-of-thought help BBH so much?
Q.05 Are BBH scores comparable across papers?
Sources
- [1] Suzgun et al. (2022): arxiv.org/abs/2210.09261
- [2] BIG-Bench repository: github.com/google/BIG-bench
- [3] Gemini technical report (BBH 83.6): arxiv.org/abs/2312.11805
- [4] Claude 3.5 Sonnet model card: anthropic.com/news/claude-3-5-sonnet