Abstract

What: 23 hardest tasks from BIG-Bench, 6,511 examples total, where 2022 LLMs trailed humans.

Who: Suzgun, Scales, Schaerli et al. (Google Research, 2022).

2026 Tier: Frontier above 95%, useful for mid-tier and open-weight discrimination.

Section I.v Reasoning | Last verified April 2026

BIG-Bench Hard: 23 Tasks, 6,511 Examples, Frontier Above 90%

The reasoning subset that made chain-of-thought prompting mandatory. Still discriminates everything below the frontier.

The Construction

BIG-Bench Hard is the canonical hard subset of the 204-task BIG-Bench suite. Suzgun and collaborators at Google Research selected 23 tasks where the best PaLM 540B run trailed the average human-rater score by a meaningful margin in mid-2022. The 23 tasks total 6,511 problems and cover logical deduction, multi-step arithmetic, tracking shuffled objects, navigation in a grid, web-of-lies inference, salient translation error detection, and 17 other reasoning probes.

The official protocol is 3-shot prompting with canonical CoT exemplars released by the Suzgun group. Most published numbers use this exact protocol, which is one reason BBH numbers are unusually comparable across papers and model cards.
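A minimal sketch of how a 3-shot CoT prompt of this kind can be assembled. The `Exemplar` class, the `build_cot_prompt` helper, the "Let's think step by step." cue, and the demo exemplars are all illustrative assumptions; real evaluations should load the exemplars released with the paper rather than write their own.

```python
from dataclasses import dataclass


@dataclass
class Exemplar:
    """One worked example: question, chain-of-thought rationale, final answer."""
    question: str
    rationale: str
    answer: str


def build_cot_prompt(task_description: str, exemplars: list, question: str) -> str:
    """Assemble a few-shot CoT prompt: description, worked exemplars, target question."""
    if len(exemplars) != 3:
        raise ValueError("the protocol described above uses exactly 3 exemplars")
    blocks = [task_description.strip()]
    for ex in exemplars:
        blocks.append(
            f"Q: {ex.question}\n"
            f"A: Let's think step by step.\n"
            f"{ex.rationale} So the answer is {ex.answer}."
        )
    # End on the same cue so the model continues with its own reasoning chain.
    blocks.append(f"Q: {question}\nA: Let's think step by step.")
    return "\n\n".join(blocks)


# Hypothetical exemplars, written for this sketch only (not the released ones).
demo_exemplars = [
    Exemplar("not ( True ) and ( True ) is", "not True is False. False and True is False.", "False"),
    Exemplar("True or ( False ) is", "True or False is True.", "True"),
    Exemplar("not not True is", "not True is False. not False is True.", "True"),
]

prompt = build_cot_prompt(
    "Evaluate the result of a random Boolean expression.",
    demo_exemplars,
    "False or not ( True ) is",
)
print(prompt)
```

Keeping the exemplars and the closing cue fixed across models is what makes scores comparable; changing either one changes the measurement.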

SOTA Progression 2022 to 2026

| Date | Tier / Score | Note |
| --- | --- | --- |
| Oct 2022 | PaLM 540B at 65.7% (direct), 78.1% (CoT) | Original Suzgun et al. paper baseline. |
| Mar 2023 | GPT-4 at 83.1% (3-shot CoT) | First frontier model to clear the 80s. |
| Dec 2023 | Gemini Ultra at 83.6% | Reported in the Gemini technical report. |
| Aug 2024 | Claude 3.5 Sonnet at 93.1% | Anthropic model card. |
| Apr 2025 | Frontier above 95% with CoT and majority vote | Mostly noise at this point; per-task variance dominates. |
| May 2026 | Used for mid-tier discrimination only | Frontier comparison has moved to GPQA-Diamond and HLE. |
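The "mostly noise" note for scores above 95% can be made concrete with a rough sampling-error estimate. The sketch below uses the binomial standard error under a normal approximation, which is itself only approximate this close to the ceiling. The per-task size is taken as the simple average 6,511 / 23, an assumption made here for illustration; actual task sizes may differ.

```python
import math

TOTAL_EXAMPLES = 6511  # total stated in the text
NUM_TASKS = 23         # task count stated in the text


def accuracy_standard_error(accuracy: float, n_examples: int) -> float:
    """Binomial standard error of an accuracy estimate, in 0-1 units."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be in [0, 1]")
    if n_examples <= 0:
        raise ValueError("n_examples must be positive")
    return math.sqrt(accuracy * (1.0 - accuracy) / n_examples)


def ci95_half_width(accuracy: float, n_examples: int) -> float:
    """Half-width of an approximate 95% confidence interval (normal approximation)."""
    return 1.96 * accuracy_standard_error(accuracy, n_examples)


# Assumed average task size; an illustration, not the real per-task counts.
avg_task_size = TOTAL_EXAMPLES // NUM_TASKS

per_task = ci95_half_width(0.95, avg_task_size)
full_suite = ci95_half_width(0.95, TOTAL_EXAMPLES)
print(f"per task:   +/- {100 * per_task:.1f} points")
print(f"full suite: +/- {100 * full_suite:.1f} points")
```

Under these assumptions a single task at 95% accuracy carries an uncertainty of roughly 2.5 points, while the pooled suite is near half a point. Two frontier models one point apart on a given task are therefore hard to separate.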


Which Tasks Still Have Headroom

Even in 2026, three BBH tasks remain hard for frontier models: tracking_shuffled_objects (7-object variant), web_of_lies (large depth), and dyck_languages (deep nesting). These tasks have combinatorial state that strains attention across long CoT chains. The remaining 20 tasks are at ceiling or noise.
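To illustrate the kind of state these tasks demand, here is a small sketch of a Dyck-style bracket completion. It shows the task type only and is not the benchmark's generator or scorer; the bracket set and the `complete_dyck` name are assumptions. A program carries the state in an explicit stack, whereas a model has to carry the same stack through the text of its reasoning chain, and the stack grows with nesting depth.

```python
# Assumed bracket alphabet for this illustration.
PAIRS = {"(": ")", "[": "]", "{": "}", "<": ">"}
CLOSERS = set(PAIRS.values())


def complete_dyck(prefix: str) -> str:
    """Return the closing brackets needed to balance a valid bracket prefix."""
    stack = []  # currently open brackets, innermost last
    for ch in prefix:
        if ch in PAIRS:
            stack.append(ch)
        elif ch in CLOSERS:
            # A closer must match the innermost open bracket.
            if not stack or PAIRS[stack.pop()] != ch:
                raise ValueError(f"unbalanced closer {ch!r} in prefix")
        # Any other character (such as a space) is ignored.
    # Close the remaining brackets from the innermost outward.
    return "".join(PAIRS[opener] for opener in reversed(stack))


print(complete_dyck("[ { < ( )"))  # prints ">}]"
```

Every open bracket is one more item that must be remembered and later emitted in exactly reverse order, which is why a single slip deep in a long chain yields a wrong final answer.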

Why BBH Was Important

The 2022 paper established chain-of-thought as the default prompting style for reasoning tasks. The 12.4-point lift from direct to CoT prompting on PaLM 540B (65.7% to 78.1%) became some of the most cited evidence that prompting matters as much as architecture. Modern reasoning-tuned models (o1, o3, Claude thinking modes) generalise this insight: reasoning chains live inside the model, not the prompt.


Reader Questions

Q.01 What is BIG-Bench Hard?

BBH is a 23-task subset of the 204-task BIG-Bench suite, hand-picked by Suzgun et al. (Google Research, 2022) as the tasks where then-current LLMs underperformed the average human rater. The subset spans logical deduction, multi-step arithmetic, tracking shuffled objects, navigation, and other reasoning probes, totalling 6,511 problems.

Q.02 Is BBH still useful in 2026?

It is still useful for mid-tier and open-weight comparisons. Frontier models score in the high 80s to mid 90s with single-pass chain-of-thought and above 95% once majority voting is added, leaving little headroom. For frontier comparisons, GPQA-Diamond, ARC-AGI-2, and Humanity's Last Exam discriminate better.

Q.03 What is the difference between BBH and BIG-Bench Lite?

BIG-Bench Lite is a 24-task balanced subset designed for cheap evaluation across the full BIG-Bench coverage map. BBH is selected for difficulty: the 23 tasks where 2022 models trailed humans. The two subsets have only partial overlap.

Q.04 Why does chain-of-thought help BBH so much?

The Suzgun et al. paper showed CoT prompting raised PaLM 540B from 65.7% to 78.1% on BBH average, a 12.4-point gain. Many BBH tasks require explicit intermediate reasoning (tracking 7 shuffled objects, multi-step deduction), which direct answer prediction cannot recover. The gap between direct and CoT prompting on BBH is roughly 13 points for frontier models in 2026.

Q.05 Are BBH scores comparable across papers?

Mostly. BBH has a fixed prompt template (3-shot, official CoT exemplars) that most papers respect. The risk is that papers that swap in different few-shot exemplars or a different decoding strategy (greedy, nucleus sampling, or self-consistency) produce non-comparable numbers. Always check the methodology footnote.
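To show why decoding choice changes what a score means, here is a minimal sketch of self-consistency: sample several CoT completions, extract each final answer, and report the majority. The "the answer is ..." pattern and the helper names are assumptions for illustration, not any paper's official scoring code.

```python
import re
from collections import Counter
from typing import Optional

# Assumed answer format: the completion ends with "... the answer is X."
ANSWER_PATTERN = re.compile(r"the answer is\s*(.+?)\.?\s*$", re.IGNORECASE)


def extract_answer(completion: str) -> Optional[str]:
    """Pull the final answer out of one CoT completion, or None if absent."""
    match = ANSWER_PATTERN.search(completion.strip())
    return match.group(1).strip() if match else None


def majority_vote(completions: list) -> Optional[str]:
    """Return the most common extracted answer across sampled completions."""
    answers = [a for a in map(extract_answer, completions) if a is not None]
    if not answers:
        return None
    # Ties resolve to the answer that appeared first among the samples.
    return Counter(answers).most_common(1)[0][0]


samples = [
    "Alice swaps with Bob, then Bob swaps with Claire. So the answer is (B).",
    "Alice keeps her ball throughout. So the answer is (A).",
    "After both swaps Claire holds the red ball. So the answer is (B).",
    "I am not sure how the swaps resolve.",
]
print(majority_vote(samples))  # prints "(B)"
```

A greedy run scores one chain per question, while a voted run scores the consensus of many, so the two numbers answer different questions even with an identical prompt.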

Sources

[1] Suzgun et al. (2022): arxiv.org/abs/2210.09261
[2] BIG-Bench repository: github.com/google/BIG-bench
[3] Gemini technical report (BBH 83.6): arxiv.org/abs/2312.11805
[4] Claude 3.5 Sonnet model card: anthropic.com/news/claude-3-5-sonnet