Independent reference. No vendor affiliation. Scores cited with source and capture date.
Abstract
What: 466 general-assistant questions across 3 difficulty levels; humans at 92% on the test set.
Who: Mialon et al., Meta AI, 2023 (arXiv:2311.12983)
Scoring: Exact-match against a private test set; level-stratified (L1/L2/L3).
Leaderboard: huggingface.co/spaces/gaia-benchmark
Section II.iii · Agent Benchmarks | Reviewed 2026

GAIA: General AI Assistants, 466 Questions, Three Levels

The benchmark that puts a number on everyday assistant capability. 466 deliberately mundane questions that are easy for humans and hard for AI: find a fact in a PDF, identify a band, count items in a table. Private test answers, level-graded scoring, the cleanest contamination defence in the agent-benchmark family.

01

What GAIA measures

GAIA, released by Mialon et al. at Meta AI in November 2023, was designed around a simple observation: AI assistants were starting to do well on engineering and reasoning benchmarks but still failed at the kind of mundane research task a competent human assistant performs every day. The benchmark consists of 466 questions phrased as ordinary user requests, with answers that a human assistant can produce given some web access, the ability to read attached files, and basic arithmetic.

The questions span what the authors call General AI Assistant capability: multi-step research, cross-referencing facts across sources, reading documents, interpreting images, and combining results to produce a single concise answer. A typical Level 1 question might be "What is the file size of the third image on the Wikipedia article for Tower Bridge?". A typical Level 3 question chains five or six such lookups together with intermediate computation and constraint satisfaction. The benchmark deliberately avoids questions that test pure knowledge recall, which differentiates it from MMLU and GPQA. GAIA tests the workflow, not the encyclopaedia.

The most important methodological choice is that the test-set answers are kept private behind the Hugging Face leaderboard. Submissions return only a score; the ground-truth answers have never been published in plaintext. This single decision gives GAIA the cleanest contamination defence in the agent-benchmark family, with the possible exception of LiveCodeBench's rolling-cutoff design. Even when frontier models are trained on web crawls that include discussion threads about GAIA, the underlying answer set is not available for memorisation.

02

The three difficulty levels

GAIA tasks are labelled at one of three difficulty levels. The level reflects how many steps a competent human assistant needs to reach the answer, calibrated against pilot annotations. The level distribution is intentional but uneven: Level 2 holds roughly half of the 466 tasks, Level 1 just under a third, and Level 3 the remainder.

| Level | Tasks | What it tests |
| --- | --- | --- |
| Level 1 | 146 | Fewer than 5 steps. Single tool, basic reasoning. Humans: ~95% accuracy in around a minute. The easiest level; every agent scores highest here. |
| Level 2 | 245 | 5 to 10 steps. Multi-tool composition; web plus file plus light computation. Humans: ~90% in around 5 minutes. |
| Level 3 | 75 | 10+ steps. Multi-source research with synthesis and verification. Humans: ~85% in 30+ minutes. This is where the headroom lives. |

The level-stratified scoring is the most useful comparison frame. Two models with the same overall score but very different per-level distributions have meaningfully different real-world utility. A model that scores 80 percent on Level 1 and 30 percent on Level 3 is a strong everyday assistant with a research-task ceiling. A model that scores 60 percent on every level is more evenly capable across the difficulty range, even if its overall number is lower.
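
The arithmetic behind that comparison can be sketched in a few lines of Python. The overall score is treated here as the task-count-weighted mean of the per-level accuracies. The level counts, both score profiles, and the function name `overall_score` are illustrative assumptions, not leaderboard data; the text above gives no Level 2 figure for the first model, so the 60 used here is invented.

```python
def overall_score(level_scores, level_counts):
    """Task-count-weighted mean of per-level accuracies (dicts keyed by level)."""
    total = sum(level_counts.values())
    weighted = sum(level_scores[lvl] * level_counts[lvl] for lvl in level_counts)
    return weighted / total


# Illustrative task counts per level (assumed for the example only).
LEVEL_COUNTS = {1: 100, 2: 150, 3: 50}

# Hypothetical per-level accuracies in percent.
steep = {1: 80.0, 2: 60.0, 3: 30.0}  # strong on easy tasks, weak on Level 3
flat = {1: 60.0, 2: 60.0, 3: 60.0}   # the same accuracy at every level

print(f"steep profile overall: {overall_score(steep, LEVEL_COUNTS):.1f}")
print(f"flat profile overall:  {overall_score(flat, LEVEL_COUNTS):.1f}")
```

With these made-up inputs the steep profile comes out slightly ahead overall (about 61.7 versus 60.0) while solving half as many Level 3 tasks, which is why a single overall figure hides the difference that matters.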

03

Scoring and the private test set

GAIA uses exact-match scoring with a small set of accepted-answer variants for each question. The expected answer is typically a short string: a number, a name, a date, a filename. The exactness is deliberate; longer free-form answers would require LLM-as-judge scoring, which would weaken the benchmark's reproducibility.
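
To make the idea concrete, here is a minimal sketch of exact-match scoring against a list of accepted variants. It is not the official GAIA scorer: the normalisation rules (trim, lowercase, collapse whitespace, drop a trailing full stop) and the names `normalize` and `is_correct` are our assumptions for illustration.

```python
import re


def normalize(answer: str) -> str:
    """Trim, lowercase, collapse runs of whitespace, drop trailing full stops."""
    text = answer.strip().lower()
    text = re.sub(r"\s+", " ", text)
    return text.rstrip(".")


def is_correct(prediction: str, accepted: list[str]) -> bool:
    """True if the normalised prediction equals any normalised accepted variant."""
    target = normalize(prediction)
    return any(target == normalize(variant) for variant in accepted)


# Formatting differences are forgiven; extra words are not.
print(is_correct("  Tower   Bridge. ", ["Tower Bridge"]))  # True
print(is_correct("The answer is 42", ["42", "forty-two"]))  # False
```

The second call shows why agents are usually prompted to return the bare answer: anything beyond the expected short string fails the match, however correct the content.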

The benchmark is divided into a fully released dev set (165 tasks) and a test set (301 tasks) whose answers are private. Both are graded on the leaderboard, but only test-set scores count for benchmark comparisons and are the canonical number. Many papers report only test-set scores; some report both, which we recommend because dev-set numbers help validate that the agent harness is configured correctly.

Because test answers are private, GAIA scoring has the cleanest data hygiene of any major agent benchmark. The risk that a model has memorised the answers from training data is structurally low, not merely empirically low. This contrasts sharply with MMLU and HumanEval where test items have been demonstrated to appear in Common Crawl; see our contamination explainer for the wider issue.

04

Tools and harness

Most GAIA tasks require web browsing. A substantial fraction require reading attached files: PDFs, CSVs, audio transcripts, images, or zip archives. A subset require image understanding, which gives multimodal models a clear edge. The benchmark does not enforce a specific tool inventory; agents choose how to obtain the information they need, and harness design is consequential.

A capable GAIA harness in 2026 typically includes: a web search tool, a page-fetch tool with content extraction, a PDF/CSV/audio reader, an image-understanding tool (for the multimodal subset), a code execution tool for computation, and a planner that decomposes the question into sub-queries. Removing any of the file-reading tools costs a meaningful number of points; removing the planner costs more on Level 3 than on Level 1. Per-tool ablation is a regular feature of GAIA papers.
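
A per-tool ablation of the kind described above can be sketched as follows. The tool identifiers mirror the inventory in the text but are hypothetical names, and `evaluate` is a placeholder callback the reader would supply (for example, a function that runs the agent on the dev set and returns a score); none of this is taken from a specific GAIA submission.

```python
from typing import Callable, Dict, FrozenSet

# Hypothetical identifiers for the harness components listed in the text.
FULL_HARNESS: FrozenSet[str] = frozenset({
    "web_search",
    "page_fetch",
    "file_reader",
    "image_understanding",
    "code_execution",
    "planner",
})


def single_tool_ablations(tools: FrozenSet[str]) -> Dict[str, FrozenSet[str]]:
    """Map each tool to the harness that remains when only that tool is removed."""
    return {tool: tools - {tool} for tool in sorted(tools)}


def ablation_deltas(
    tools: FrozenSet[str],
    evaluate: Callable[[FrozenSet[str]], float],
) -> Dict[str, float]:
    """Score lost (in points) when each tool is removed, relative to the full harness."""
    baseline = evaluate(tools)
    return {
        tool: baseline - evaluate(subset)
        for tool, subset in single_tool_ablations(tools).items()
    }
```

Running `ablation_deltas` once per difficulty level, rather than once overall, is what would surface the pattern noted above, where the planner matters more on Level 3 than on Level 1.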

The harness sensitivity also means that comparing two models on GAIA is only meaningful if they use the same harness. A frontier model in a minimal harness can be beaten by a mid-tier model in a strong agentic scaffold. This is true for all agent benchmarks, but GAIA is unusually transparent: the leaderboard accepts code submissions that document the harness exactly, and reproducibility is the norm rather than the exception.

05

Reading the scores

We do not reprint a per-model GAIA score table. A responsibly-sourced one cannot be published here: the official Hugging Face leaderboard is JavaScript-rendered and moves continuously, it lags the current frontier generation, and vendor-published numbers use proprietary scaffolds with extended tool access that is not comparable to public submissions. A GAIA score is only meaningful with its harness attached, so a bare grid of numbers misleads more than it informs.

For the live ranking, read the official GAIA leaderboard and the harness disclosure for each submission together. To pick the right benchmark for a use case rather than chase a headline number, start from the homepage task picker. When you do quote a GAIA score, quote the level breakdown and the scaffold, not just one overall figure.

06

Strengths and limitations

GAIA's main strengths are: privacy of test answers (cleanest contamination defence in the agent-benchmark family), level-stratified scoring (separates everyday assistant capability from deeper research capability), realistic task framing (questions sound like real user requests, not synthetic puzzles), and harness reproducibility (the leaderboard accepts code submissions).

The main limitations: the 466-task corpus is modest, so per-level and per-category scores rest on small samples and are noisy; the questions are English-only, so multilingual capability is not measured; the tasks are static, so an agent trained on web data that includes older GAIA discussions could in principle carry weak indirect contamination even on the private set; and exact-match scoring is brittle for borderline-correct answers. A question that asks for "the artist of the third track on album X" could have multiple defensible answers if the album has featured artists, and the scoring may not credit a defensible alternative.
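
The size of the sampling noise can be estimated with a back-of-envelope calculation. The sketch below uses the normal approximation to a binomial proportion and assumes each task is an independent pass/fail trial, which is a simplification. The 301 figure is the test-set size quoted earlier; the 50-task subset is a hypothetical category size chosen for illustration.

```python
import math


def accuracy_margin(accuracy: float, n_tasks: int, z: float = 1.96) -> float:
    """Half-width of a normal-approximation interval (z=1.96 gives ~95%)."""
    return z * math.sqrt(accuracy * (1.0 - accuracy) / n_tasks)


# Margin at 50% accuracy, where the binomial variance is largest.
full_test = accuracy_margin(0.5, 301)  # whole private test set
small_slice = accuracy_margin(0.5, 50)  # hypothetical 50-task category

print(f"301 tasks: +/- {100 * full_test:.1f} points")
print(f" 50 tasks: +/- {100 * small_slice:.1f} points")
```

Under these assumptions the full test set carries a margin of roughly 5.6 points and a 50-task slice roughly 13.9 points, so small gaps between two submissions, especially within a single level or category, should not be read as real differences.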

The Level 3 ceiling is also a feature of the benchmark as much as of the models. Some Level 3 questions require chains of inferences that even strong human annotators disagree about, which is one reason human accuracy is roughly 85 percent (not 100) at that level. We recommend reading Level 3 scores as "capability under multi-step assistant workflow" rather than as a pure measure of intelligence.

07

When to use GAIA in 2026

GAIA is the right headline benchmark for consumer-facing assistant capability. If your product is "answer the user's question correctly, going to the web and reading files as needed", GAIA captures that workflow more closely than SWE-bench, WebArena, OSWorld, or Tau-Bench. Use GAIA Level 1 plus Level 2 as the day-one capability bar; use GAIA Level 3 as the moving target your roadmap is chasing.

For engineering agents prefer SWE-bench Verified; for browser agents prefer WebArena or OSWorld; for tool-use dialogue prefer Tau-Bench; and for academic knowledge prefer GPQA-Diamond. GAIA is the assistant-shaped slice of the benchmark family, and the gap to the human baseline on Level 3 means it has real headroom left for years to come.

Editor's verdict: GAIA is the cleanest agent benchmark for everyday assistant capability. Private test answers, level-stratified scoring, and open harnesses make it the most trustworthy public number for assistant-shaped workflows in 2026. Quote it as the consumer-facing companion to SWE-bench Verified.
Reader Questions
Q.01 What is GAIA?
GAIA is the General AI Assistants benchmark, introduced by Meta AI in November 2023. It contains 466 questions designed to be easy for humans (92 percent human accuracy) but hard for AI assistants. Each question is annotated with a difficulty level and requires multi-step reasoning, often involving web browsing, file reading, image interpretation, or basic computation. The questions span everyday assistant tasks: finding a fact in a PDF, computing a quantity from a webpage table, identifying an object in an image, or following a recipe-like instruction.
Q.02 What are the three GAIA difficulty levels?
Level 1 tasks require fewer than five steps and basic tool use; humans complete them in roughly a minute. Level 2 tasks require five to ten steps and combine multiple tools; humans take around five minutes. Level 3 tasks are multi-step research-style problems that can take a competent human 30 minutes or more. Scores degrade sharply with level: every agent solves far fewer Level 3 tasks than Level 1, and the level breakdown is more informative than any single overall number.
Q.03 Why does GAIA matter compared to other agent benchmarks?
GAIA was the first agent benchmark to deliberately target everyday assistant tasks rather than enterprise workflows. The questions are written like the kind of thing a real user would ask a competent assistant: 'How many albums did the band on the third floor of the Hilton in Vienna release in 2008?' Tasks require web browsing, file interpretation, and reasoning, but the goal is concrete and the answer fits on a line. This matches how consumers actually use assistants, which makes GAIA a useful complement to engineering-flavoured benchmarks like SWE-bench Verified.
Q.04 How is the GAIA test set protected from contamination?
The Hugging Face GAIA leaderboard at huggingface.co/spaces/gaia-benchmark/leaderboard keeps the test answers private. Models submit predictions and receive a score without seeing ground truth. This is the most robust contamination defence in any agent benchmark in 2026, because the test answers have never been published in plaintext form. The public dev set is fully released and has been used in the training of recent models, so dev-set scores are inflated relative to test-set scores.
Q.05 What kinds of tools do GAIA tasks require?
Most tasks require web browsing. Many require reading attached files: PDFs, CSVs, audio, images, or zip archives. Some require basic computation (sum, mean, range). A subset require image understanding, which gives multimodal models a clear advantage on those tasks. The benchmark does not enforce a tool inventory; agents choose how to obtain the information. This is why GAIA scores depend heavily on the harness (which tools, how prompted) as well as the underlying model.
Q.06 Where can I see current GAIA scores?
The live ranking is the Hugging Face GAIA leaderboard at huggingface.co/spaces/gaia-benchmark/leaderboard. We do not reprint a per-model score table here: that board is JavaScript-rendered and moves continuously, vendor-published numbers (e.g. deep-research-style products) use proprietary scaffolds with extended tool access that is not comparable to public submissions, and a GAIA score is only meaningful with its harness attached. Read the live board and the harness disclosure together; quote the scaffold alongside the model.

Sources

[1] Mialon, G. et al. (2023). GAIA: A Benchmark for General AI Assistants. arXiv:2311.12983.
[2] GAIA leaderboard on Hugging Face Spaces. huggingface.co/spaces/gaia-benchmark/leaderboard.
[3] GAIA dataset card. huggingface.co/datasets/gaia-benchmark/GAIA.
From the editor

Benchmarking Agents Review is published by Digital Signet, an independent firm that builds and ships AI agents in production for mid-market companies. If you are evaluating, designing, or productionising LLM agents and want a working second opinion, get in touch.
