GAIA: General AI Assistants, 466 Questions, Three Levels
The benchmark that puts a number on everyday assistant capability. 466 deliberately mundane questions that are easy for humans and hard for AI: find a fact in a PDF, identify a band, count items in a table. Private test answers, level-graded scoring, the cleanest contamination defence in the agent-benchmark family.
What GAIA measures
GAIA, released by Mialon et al. at Meta AI in November 2023, was designed around a simple observation: AI assistants were starting to do well on engineering and reasoning benchmarks but still failed at the kind of mundane research task a competent human assistant performs every day. The benchmark consists of 466 questions phrased as ordinary user requests, with answers that a human assistant can produce given some web access, the ability to read attached files, and basic arithmetic.
The questions span what the authors call General AI Assistant capability: multi-step research, cross-referencing facts across sources, reading documents, interpreting images, and combining results to produce a single concise answer. A typical Level 1 question might be "What is the file size of the third image on the Wikipedia article for Tower Bridge?" A typical Level 3 question chains five or six such lookups together with intermediate computation and constraint satisfaction. The benchmark deliberately avoids questions that test pure knowledge recall, which differentiates it from MMLU and GPQA. GAIA tests the workflow, not the encyclopaedia.
The most important methodological choice is that the test-set answers are held privately on the Hugging Face leaderboard. Submissions return only a score; the ground-truth answers have never been published in plaintext. This single decision gives GAIA the cleanest contamination defence in the agent-benchmark family, with the possible exception of LiveCodeBench's rolling-cutoff design. Even when frontier models are trained on web crawls that include discussion threads about GAIA, the underlying answer set is not available for memorisation.
The three difficulty levels
GAIA tasks are labelled at one of three difficulty levels, calibrated against pilot annotations of how many steps and tools a competent human assistant needs to reach the answer. Level 1 tasks need at most one tool and roughly five steps; Level 2 tasks chain more steps, typically five to ten, across multiple tools; Level 3 tasks require near-arbitrary sequences of actions. The distribution is intentional: Level 2 accounts for roughly half of the test set, with Level 1 next and Level 3 the smallest share.
The level-stratified scoring is the most useful comparison frame. Two models with the same overall score but very different per-level distributions have meaningfully different real-world utility. A model that scores 80 percent on Level 1 and 30 percent on Level 3 is a strong everyday assistant with a research-task ceiling. A model that scores 60 percent on every level is more genuinely capable across the difficulty range, even if its overall number is lower.
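To make the arithmetic concrete, here is a minimal Python sketch of that comparison. The per-level model scores are invented for illustration, and the level counts are the approximate test-set sizes from the dataset card; treat both as assumptions.

```python
# Illustrative only: invented per-level scores, approximate test-set
# level counts (Level 1: 93, Level 2: 159, Level 3: 49 of 301 tasks).
LEVEL_COUNTS = {1: 93, 2: 159, 3: 49}

def overall_score(per_level: dict[int, float]) -> float:
    """Weight per-level accuracy by the number of tasks at each level."""
    total = sum(LEVEL_COUNTS.values())
    return sum(per_level[lvl] * n for lvl, n in LEVEL_COUNTS.items()) / total

model_a = {1: 0.80, 2: 0.55, 3: 0.30}  # strong everyday assistant, research ceiling
model_b = {1: 0.60, 2: 0.60, 3: 0.60}  # uniform across the difficulty range

print(f"A: {overall_score(model_a):.1%}  B: {overall_score(model_b):.1%}")
# A: 58.7%  B: 60.0% -- similar headline numbers, very different profiles
```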
Scoring and the private test set
GAIA uses exact-match scoring with a small set of accepted-answer variants for each question. The expected answer is typically a short string: a number, a name, a date, a filename. The exactness is deliberate; longer free-form answers would require LLM-as-judge scoring, which would weaken the benchmark's reproducibility.
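The canonical scorer ships with the leaderboard code; the sketch below is a simplified approximation of that normalise-then-match logic (lowercasing, trimming, stripping number formatting), not the official implementation.

```python
import re

def normalise(ans: str) -> str:
    """Simplified GAIA-style normalisation: lowercase, collapse whitespace,
    strip currency/percent symbols and thousands separators from numbers."""
    ans = ans.strip().lower()
    ans = re.sub(r"[$%]", "", ans)
    ans = re.sub(r"(?<=\d),(?=\d)", "", ans)  # "1,234" -> "1234"
    return re.sub(r"\s+", " ", ans).strip()

def exact_match(prediction: str, accepted: list[str]) -> bool:
    """Score 1 if the normalised prediction equals any accepted variant."""
    pred = normalise(prediction)
    return any(pred == normalise(a) for a in accepted)

assert exact_match("$1,234", ["1234"])
assert not exact_match("roughly 1234", ["1234"])  # exactness is deliberate
```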
The 466 tasks are split into a fully released dev (validation) set of 165 tasks and a test set of 301 tasks whose answers are private. Both are graded on the leaderboard, but only test-set scores count for benchmark comparisons. Many papers report only test-set scores; some report both, which we recommend, because dev-set numbers help validate that the agent harness is configured correctly. Either way, the test-set score is the canonical number.
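Loading the two splits takes a few lines with the datasets library; GAIA is gated, so accept the terms on the dataset card and authenticate with huggingface-cli login first. Config and field names below follow the dataset card.

```python
from datasets import load_dataset

# "2023_all" covers all three levels; per-level configs also exist.
# Recent datasets versions may need trust_remote_code=True, since GAIA
# ships a loading script.
gaia = load_dataset("gaia-benchmark/GAIA", "2023_all")

dev = gaia["validation"]  # 165 tasks, ground-truth answers included
test = gaia["test"]       # 301 tasks, answers withheld (leaderboard-only)

row = dev[0]
print(row["Question"], row["Level"], row["Final answer"])
```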
Because test answers are private, GAIA scoring has the cleanest data hygiene of any major agent benchmark. The risk that a model has memorised the answers from training data is structurally low, not merely empirically low. This contrasts sharply with MMLU and HumanEval where test items have been demonstrated to appear in Common Crawl; see our contamination explainer for the wider issue.
Tools and harness
Most GAIA tasks require web browsing. A substantial fraction requires reading attached files: PDFs, CSVs, audio transcripts, images, or zip archives. A subset requires image understanding, which gives multimodal models a clear edge. The benchmark does not enforce a specific tool inventory; agents choose how to obtain the information they need, and harness design is consequential.
A capable GAIA harness in 2026 typically includes: a web search tool, a page-fetch tool with content extraction, a PDF/CSV/audio reader, an image-understanding tool (for the multimodal subset), a code execution tool for computation, and a planner that decomposes the question into sub-queries. Removing any of the file-reading tools costs a meaningful number of points; removing the planner costs more on Level 3 than on Level 1. Per-tool ablation is a regular feature of GAIA papers.
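As a rough sketch, that inventory reduces to a small registry the planner can dispatch against. Every name and handler below is hypothetical scaffolding, not a reference implementation; each callable would wrap whatever search API, fetcher, parser, or sandbox your stack actually uses, and a per-tool ablation is just this list with one entry removed.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Tool:
    name: str
    description: str            # surfaced to the model when it plans
    run: Callable[[str], str]   # hypothetical handler: input -> observation

def make_toolbox(search, fetch, read_file, describe_image, run_python):
    """Assemble the typical GAIA tool inventory; drop an entry to ablate it."""
    tools = [
        Tool("web_search", "Search the web; returns titles and snippets.", search),
        Tool("fetch_page", "Fetch a URL and extract the readable text.", fetch),
        Tool("read_file", "Read an attached PDF/CSV/audio file as text.", read_file),
        Tool("describe_image", "Describe an attached image.", describe_image),
        Tool("run_python", "Execute Python for intermediate computation.", run_python),
    ]
    return {t.name: t for t in tools}
```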
The harness sensitivity also means that comparing two models on GAIA is only meaningful if they use the same harness. A frontier model in a minimal harness can be beaten by a mid-tier model in a strong agentic scaffold. This is true for all agent benchmarks, but GAIA is unusually transparent: the leaderboard accepts code submissions that document the harness exactly, and reproducibility is the norm rather than the exception.
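For the answers file itself, the leaderboard expects JSONL with one record per task. The field names below follow the submission instructions on the leaderboard space at the time of writing; verify them against the space before submitting.

```python
import json

def write_submission(path: str, results: list[dict]) -> None:
    """Write a GAIA leaderboard answers file, one JSON object per line."""
    with open(path, "w") as f:
        for r in results:
            f.write(json.dumps({
                "task_id": r["task_id"],
                "model_answer": r["answer"],            # the short final string
                "reasoning_trace": r.get("trace", ""),  # how the agent got there
            }) + "\n")
```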
SOTA progression 2023 to 2026
GAIA scores have climbed steadily but unevenly across levels. Level 1 saturates faster than Level 3; the human-to-AI gap on Level 3 was still 40 points in May 2026, the smallest gap GAIA has ever shown but still significant. This is a fast-moving corner of the field: results change quarter to quarter, and citing a 2024 number in 2026 is misleading.
Strengths and limitations
GAIA's main strengths:
- Private test answers: the cleanest contamination defence in the agent-benchmark family.
- Level-stratified scoring: separates everyday assistant capability from deeper research capability.
- Realistic task framing: questions sound like real user requests, not synthetic puzzles.
- Harness reproducibility: the leaderboard accepts code submissions.
The main limitations:
- The 466-task corpus is modest, so noise in any individual category is real.
- The questions are English-only, so multilingual capability is not measured.
- The tasks are static, so an agent trained on web data that includes older GAIA discussions could in principle carry weak indirect contamination even on the private set.
- Exact-match scoring is brittle for borderline-correct answers: a question that asks for "the artist of the third track on album X" could have multiple defensible answers if the album has featured artists, and the scoring may not credit a defensible alternative.
The Level 3 ceiling is also a feature of the benchmark as much as of the models. Some Level 3 questions require chains of inference that even strong human annotators disagree about, which is one reason human accuracy at that level is 85 percent rather than 100. We recommend reading Level 3 scores as "capability under a multi-step assistant workflow" rather than as a pure measure of intelligence.
When to use GAIA in 2026
GAIA is the right headline benchmark for consumer-facing assistant capability. If your product is "answer the user's question correctly, going to the web and reading files as needed", GAIA captures that workflow more closely than SWE-bench, WebArena, OSWorld, or Tau-Bench. Use GAIA Level 1 plus Level 2 as the day-one capability bar; use GAIA Level 3 as the moving target your roadmap is chasing.
For engineering agents prefer SWE-bench Verified; for browser agents prefer WebArena or OSWorld; for tool-use dialogue prefer Tau-Bench; and for academic knowledge prefer GPQA-Diamond. GAIA is the assistant-shaped slice of the benchmark family, and the 40-point human gap that remains on Level 3 means it has real headroom left for years to come.
Sources
- [1] Mialon, G. et al. (2023). GAIA: A Benchmark for General AI Assistants. arXiv:2311.12983.
- [2] GAIA leaderboard on Hugging Face Spaces, accessed May 2026. huggingface.co/spaces/gaia-benchmark/leaderboard.
- [3] GAIA dataset card. huggingface.co/datasets/gaia-benchmark/GAIA.