Independent reference. No vendor affiliation. Scores cited with source and capture date. Affiliate disclosure.
Abstract
What · 466 general-assistant questions across three difficulty levels; humans at 92%, frontier agents at 70-75%.
Who · Mialon et al., Meta AI, 2023 (arXiv:2311.12983)
2026 tier · Frontier overall 70-75 percent; Level 3 still around 45 percent.
Leaderboard · huggingface.co/spaces/gaia-benchmark
Section II.iii · Agent Benchmarks · Last verified April 2026

GAIA: General AI Assistants, 466 Questions, Three Levels

The benchmark that puts a number on everyday assistant capability. 466 deliberately mundane questions that are easy for humans and hard for AI: find a fact in a PDF, identify a band, count items in a table. Private test answers, level-graded scoring, the cleanest contamination defence in the agent-benchmark family.

01

What GAIA measures

GAIA, released by Mialon et al. at Meta AI in November 2023, was designed around a simple observation: AI assistants were starting to do well on engineering and reasoning benchmarks but still failed at the kind of mundane research task a competent human assistant performs every day. The benchmark consists of 466 questions phrased as ordinary user requests, with answers that a human assistant can produce given some web access, the ability to read attached files, and basic arithmetic.

The questions span what the authors call General AI Assistant capability: multi-step research, cross-referencing facts across sources, reading documents, interpreting images, and combining results to produce a single concise answer. A typical Level 1 question might be "What is the file size of the third image on the Wikipedia article for Tower Bridge?". A typical Level 3 question chains five or six such lookups together with intermediate computation and constraint satisfaction. The benchmark deliberately avoids questions that test pure knowledge recall, which differentiates it from MMLU and GPQA. GAIA tests the workflow, not the encyclopaedia.

The most important methodological choice is that the test-set answers are held privately on the Hugging Face leaderboard. Submissions return only a score; the ground-truth answers have never been published in plaintext. This single decision makes GAIA the cleanest contamination defence in the agent-benchmark family, with the possible exception of LiveCodeBench's rolling-cutoff design. Even when frontier models are trained on web crawls that include discussion threads about GAIA, the underlying answer set is not available for memorisation.

02

The three difficulty levels

GAIA tasks are labelled at one of three difficulty levels. The level reflects how many steps a competent human assistant needs to reach the answer, calibrated against pilot annotations. Level distribution is intentional: roughly a third of tasks at each level, weighted slightly toward Level 2 in the test set.

Level · Tasks · What it tests
Level 1 · 165 · Fewer than 5 steps. Single tool, basic reasoning. Humans: ~95% accuracy in around a minute. Frontier agents: ~85% by May 2026.
Level 2 · 184 · 5 to 10 steps. Multi-tool composition; web plus file plus light computation. Humans: ~90% in around 5 minutes. Frontier agents: ~70% by May 2026.
Level 3 · 117 · 10+ steps. Multi-source research with synthesis and verification. Humans: ~85% in 30+ minutes. Frontier agents: ~45% by May 2026 (the headroom).

The level-stratified scoring is the most useful comparison frame. Two models with the same overall score but very different per-level distributions have meaningfully different real-world utility. A model that scores 80 percent on Level 1 and 30 percent on Level 3 is a strong everyday assistant with a research-task ceiling. A model that scores 60 percent on every level is more genuinely capable across the difficulty range, even if its overall number is lower.
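The arithmetic behind that comparison is easy to sketch. The helper below combines per-level accuracies into an overall score using the task counts from the table above (165/184/117); the two score profiles are hypothetical illustrations, not leaderboard entries.

```python
# Weighted overall GAIA score from per-level accuracies.
# Task counts per level are taken from the table above.
LEVEL_COUNTS = {1: 165, 2: 184, 3: 117}

def overall_score(per_level: dict[int, float]) -> float:
    """Combine per-level accuracies into an overall accuracy,
    weighted by the number of tasks at each level."""
    total = sum(LEVEL_COUNTS.values())
    return sum(per_level[lvl] * n for lvl, n in LEVEL_COUNTS.items()) / total

# Hypothetical profiles: steep drop-off vs. flat across levels.
everyday = {1: 0.80, 2: 0.65, 3: 0.30}
flat     = {1: 0.60, 2: 0.60, 3: 0.60}

print(f"everyday: {overall_score(everyday):.3f}")  # ~0.615
print(f"flat:     {overall_score(flat):.3f}")      # 0.600
```

The two profiles land within two points of each other overall, which is exactly why the per-level breakdown, not the headline number, is the comparison to quote.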

03

Scoring and the private test set

GAIA uses exact-match scoring with a small set of accepted-answer variants for each question. The expected answer is typically a short string: a number, a name, a date, a filename. The exactness is deliberate; longer free-form answers would require LLM-as-judge scoring, which would weaken the benchmark's reproducibility.
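A grader in this style can be sketched in a few lines. The normalisation rules below (case-folding, whitespace collapsing, trailing-punctuation stripping) are our illustrative assumptions, not GAIA's published grading code:

```python
import string

def normalise(answer: str) -> str:
    """Canonicalise an answer for exact-match comparison:
    lower-case, collapse whitespace, strip trailing punctuation.
    These rules are illustrative, not GAIA's official grader."""
    s = " ".join(answer.strip().lower().split())
    return s.rstrip(string.punctuation)

def exact_match(prediction: str, accepted: list[str]) -> bool:
    """True if the prediction matches any accepted-answer variant."""
    pred = normalise(prediction)
    return any(pred == normalise(a) for a in accepted)

print(exact_match("Tower Bridge.", ["tower bridge"]))     # True
print(exact_match("the Tower Bridge", ["tower bridge"]))  # False: no article-stripping
```

The second call illustrates the brittleness discussed later: a defensible answer that differs by an article or a featured-artist credit scores zero unless the variant list anticipates it.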

The benchmark is split into a fully released dev set (165 tasks) and a private test set (301 tasks). Both are graded on the leaderboard, but only test-set scores count for benchmark comparisons. Many papers report only test-set scores; some report both, which we recommend, because dev-set numbers help validate that the agent harness is configured correctly. Test-set scores are the canonical number.

Because test answers are private, GAIA scoring has the cleanest data hygiene of any major agent benchmark. The risk that a model has memorised the answers from training data is structurally low, not merely empirically low. This contrasts sharply with MMLU and HumanEval where test items have been demonstrated to appear in Common Crawl; see our contamination explainer for the wider issue.

04

Tools and harness

Most GAIA tasks require web browsing. A substantial fraction require reading attached files: PDFs, CSVs, audio transcripts, images, or zip archives. A subset require image understanding, which gives multimodal models a clear edge. The benchmark does not enforce a specific tool inventory; agents choose how to obtain the information they need, and harness design is consequential.

A capable GAIA harness in 2026 typically includes: a web search tool, a page-fetch tool with content extraction, a PDF/CSV/audio reader, an image-understanding tool (for the multimodal subset), a code execution tool for computation, and a planner that decomposes the question into sub-queries. Removing any of the file-reading tools costs a meaningful number of points; removing the planner costs more on Level 3 than on Level 1. Per-tool ablation is a regular feature of GAIA papers.
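The tool-inventory idea reduces to a dispatch table plus a plan executor. The sketch below uses stub tools and a hand-supplied plan; the tool names (web_search, read_file) and the registry shape are hypothetical stand-ins for a real harness's search API, browser, and file readers.

```python
from typing import Callable

# Hypothetical tool registry: in a real harness each entry wraps
# a search API, a browser, a PDF/CSV parser, a sandboxed interpreter, etc.
TOOLS: dict[str, Callable[[str], str]] = {}

def tool(name: str):
    """Register a function as a named tool the agent can call."""
    def register(fn: Callable[[str], str]) -> Callable[[str], str]:
        TOOLS[name] = fn
        return fn
    return register

@tool("web_search")
def web_search(query: str) -> str:
    return f"[stub] search results for: {query}"

@tool("read_file")
def read_file(path: str) -> str:
    return f"[stub] contents of: {path}"

def run_plan(steps: list[tuple[str, str]]) -> list[str]:
    """Execute a planner-produced list of (tool, argument) steps.
    A real planner would derive these from the question; here the
    plan is supplied directly to keep the sketch self-contained."""
    return [TOOLS[name](arg) for name, arg in steps]

# A hypothetical two-step plan for a Level 1-style question.
results = run_plan([
    ("web_search", "Tower Bridge Wikipedia third image"),
    ("read_file", "attachment.pdf"),
])
print(results[0])
```

The per-tool ablations mentioned above amount to deleting one entry from TOOLS and re-running; the Level 3 sensitivity to the planner shows up as run_plan receiving shorter, worse plans.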

The harness sensitivity also means that comparing two models on GAIA is only meaningful if they use the same harness. A frontier model in a minimal harness can be beaten by a mid-tier model in a strong agentic scaffold. This is true for all agent benchmarks, but GAIA is unusually transparent: the leaderboard accepts code submissions that document the harness exactly, and reproducibility is the norm rather than the exception.

05

SOTA progression 2023 to 2026

GAIA scores have climbed steadily but unevenly across levels. Level 1 saturates faster than Level 3; the human-to-AI gap on Level 3 was still 40 points in May 2026, the smallest gap GAIA has ever shown but still significant. This is a fast-moving area: results change quarter to quarter, and citing a 2024 number in 2026 is misleading.

Date · Tier · Note
Nov 2023 · Initial paper, GPT-4 at 15% overall · Mialon et al. release the benchmark. Humans at 92%. The frontier gap is enormous.
Mar 2024 · First agentic submissions in the 20s · AutoGPT-style scaffolds with tool use start to rank.
Aug 2024 · Frontier scaffolds reach 40% · First half of the human gap closed within nine months of launch.
Feb 2025 · Top submissions at 55-60% · Multi-agent scaffolds and stronger search dominate.
Sep 2025 · Top submissions break 65% · First credible 70%-class submissions on Level 1.
May 2026 · Frontier 70-75% overall · Public-leaderboard frontier. Vendor-internal scores are sometimes higher.

06

Strengths and limitations

GAIA's main strengths are: privacy of test answers (cleanest contamination defence in the agent-benchmark family), level-stratified scoring (separates everyday assistant capability from deeper research capability), realistic task framing (questions sound like real user requests, not synthetic puzzles), and harness reproducibility (the leaderboard accepts code submissions).

The main limitations: the 466-task corpus is modest, so noise in any individual category is real; the questions are English-only, so multilingual capability is not measured; the tasks are static, so an agent trained on web data including older GAIA discussions could in principle have weak indirect contamination even on the private set; and exact-match scoring is brittle for borderline-correct answers (a question that asks for "the artist of the third track on album X" could have multiple defensible answers if the album has featured artists, and the scoring may not credit a defensible alternative).

The Level 3 ceiling is also a feature of the benchmark as much as of the models. Some Level 3 questions require chains of inference that even strong human annotators disagree about, which is one reason human accuracy is 85 percent (not 100) at that level. We recommend reading Level 3 scores as "capability under a multi-step assistant workflow" rather than as a pure measure of intelligence.

07

When to use GAIA in 2026

GAIA is the right headline benchmark for consumer-facing assistant capability. If your product is "answer the user's question correctly, going to the web and reading files as needed", GAIA captures that workflow more closely than SWE-bench, WebArena, OSWorld, or Tau-Bench. Use GAIA Level 1 plus Level 2 as the day-one capability bar; use GAIA Level 3 as the moving target your roadmap is chasing.

For engineering agents prefer SWE-bench Verified; for browser agents prefer WebArena or OSWorld; for tool-use dialogue prefer Tau-Bench; and for academic knowledge prefer GPQA-Diamond. GAIA is the assistant-shaped slice of the benchmark family, and the roughly 45 percent frontier on Level 3 means it has real headroom left for years to come.

Editor's verdict · GAIA is the cleanest agent benchmark for everyday assistant capability. Private test answers, level-stratified scoring, and open harnesses make it the most trustworthy public number for assistant-shaped workflows in 2026. Quote it as the consumer-facing companion to SWE-bench Verified.
Reader Questions
Q.01 · What is GAIA?
GAIA is the General AI Assistants benchmark, introduced by Meta AI in November 2023. It contains 466 questions designed to be easy for humans (92 percent human accuracy) but hard for AI assistants. Each question is annotated with a difficulty level and requires multi-step reasoning, often involving web browsing, file reading, image interpretation, or basic computation. The questions span everyday assistant tasks: finding a fact in a PDF, computing a quantity from a webpage table, identifying an object in an image, or following a recipe-like instruction.
Q.02 · What are the three GAIA difficulty levels?
Level 1 tasks require fewer than five steps and basic tool use; humans complete them in roughly a minute. Level 2 tasks require five to ten steps and combine multiple tools; humans take around five minutes. Level 3 tasks are multi-step research-style problems that can take a competent human 30 minutes or more. Frontier model scores degrade sharply with level: a model that solves 80 percent of Level 1 might solve 40 percent of Level 2 and 10 to 20 percent of Level 3.
Q.03 · Why does GAIA matter compared to other agent benchmarks?
GAIA was the first agent benchmark to deliberately target everyday assistant tasks rather than enterprise workflows. The questions are written like the kind of thing a real user would ask a competent assistant: 'How many albums did the band on the third floor of the Hilton in Vienna release in 2008?' Tasks require web browsing, file interpretation, and reasoning, but the goal is concrete and the answer fits on a line. This matches how consumers actually use assistants, which makes GAIA a useful complement to engineering-flavoured benchmarks like SWE-bench Verified.
Q.04 · How is the GAIA test set kept fresh?
The Hugging Face GAIA leaderboard at huggingface.co/spaces/gaia-benchmark/leaderboard holds the test answers private. Models submit predictions and receive a score without seeing ground truth. This is the most robust contamination defence in any agent benchmark in 2026, because the test set has never been published in plaintext form. The public dev set is fully released and has been used in training of recent models, so dev-set scores are inflated relative to test-set scores.
Q.05 · What kinds of tools do GAIA tasks require?
Most tasks require web browsing. Many require reading attached files: PDFs, CSVs, audio, images, or zip archives. Some require basic computation (sum, mean, range). A subset require image understanding, which gives multimodal models a clear advantage on those tasks. The benchmark does not enforce a tool inventory; agents choose how to obtain the information. This is why GAIA scores depend heavily on the harness (which tools, how prompted) as well as the underlying model.
Q.06 · What is the current frontier on GAIA?
As of May 2026 the leaderboard frontier sits around 70 to 75 percent on the test set for the strongest agentic scaffolds, with a sharp drop-off by level (around 85 percent Level 1, 70 percent Level 2, 45 percent Level 3). Open-source scaffolds and academic teams trade the top spot regularly. Vendor-published numbers (e.g. OpenAI's deep-research-style products) sometimes exceed leaderboard scores but use proprietary scaffolds with extended tool access that is not directly comparable to public submissions.

Sources

  1. Mialon, G. et al. (2023). GAIA: A Benchmark for General AI Assistants. arXiv:2311.12983.
  2. GAIA leaderboard on Hugging Face Spaces, accessed May 2026. huggingface.co/spaces/gaia-benchmark/leaderboard.
  3. GAIA dataset card. huggingface.co/datasets/gaia-benchmark/GAIA.
From the editor

Benchmarking Agents Review is published by Digital Signet, an independent firm that builds and ships AI agents in production for mid-market companies. If you are evaluating, designing, or productionising LLM agents and want a working second opinion, get in touch.
