Independent reference. No vendor affiliation. Scores cited with source and capture date.
Abstract
What: Cross-file code completion benchmark, roughly 27,000 examples from real GitHub Python and Java repositories.
Who: Liu, Xu, McAuley (UC San Diego, ICLR 2024).
2026 Tier: Frontier above 52% on contamination-filtered RepoBench-P.
Paper: arxiv.org/abs/2306.03091
Section I.ix Industry Domain | Last verified April 2026

RepoBench: 27,000 Multi-File Code Completion Tasks, GPT-4 at 36.4%

Single-file benchmarks like HumanEval do not measure what your IDE assistant actually does.

I

Construction

RepoBench is built from real public GitHub repositories in Python and Java. The dataset was filtered for compilable, well-tested projects. Each example places the model cursor at a specific line in a file, with the rest of that file and the rest of the repo available as candidate context. The task is to emit the next line or block at the cursor.
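
A minimal sketch of how one such example could be represented, assuming a repository held in memory as a mapping from path to source text. The class, field, and function names here are hypothetical and are not the dataset's actual schema; for brevity the sketch keeps only the code above the cursor as in-file context.

```python
from dataclasses import dataclass, field


@dataclass
class CompletionExample:
    """One RepoBench-style example (hypothetical field names)."""
    file_path: str        # file containing the cursor
    in_file_prefix: str   # code above the cursor in that file
    target: str           # the line the model is asked to emit
    cross_file_snippets: list[str] = field(default_factory=list)  # candidate context


def make_example(file_path: str, source: str, cursor_line: int,
                 other_files: dict[str, str]) -> CompletionExample:
    """Split `source` at a 0-based cursor line; other files become candidates."""
    lines = source.splitlines()
    if not 0 <= cursor_line < len(lines):
        raise ValueError("cursor_line is outside the file")
    return CompletionExample(
        file_path=file_path,
        in_file_prefix="\n".join(lines[:cursor_line]),
        target=lines[cursor_line],
        # Every file except the one being edited is a cross-file candidate.
        cross_file_snippets=[text for path, text in other_files.items()
                             if path != file_path],
    )
```

The split between `in_file_prefix` and `cross_file_snippets` mirrors the two kinds of context the benchmark distinguishes: what is already in the open file, and what has to be found elsewhere in the repository.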

Three tasks: RepoBench-R (retrieval only: pick the right cross-file snippets), RepoBench-C (completion only: the right context is given), and RepoBench-P (pipeline: retrieval and completion end to end). The pipeline task is the headline because it most closely matches how IDE autocomplete is actually deployed.
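
The three tasks can be sketched as two functions and their composition. This is an illustration under stated assumptions, not the benchmark's harness: retrieval here is plain Jaccard overlap on identifier tokens, the `complete` argument is a stand-in for whatever model is being evaluated, and every function name is hypothetical.

```python
import re
from typing import Callable


def tokens(code: str) -> set[str]:
    """Identifier-like tokens in a piece of code."""
    return set(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", code))


def jaccard(a: str, b: str) -> float:
    """Token-set overlap between two code strings, in [0, 1]."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def retrieve(query: str, candidates: list[str], k: int = 2) -> list[str]:
    """Retrieval task: rank cross-file snippets against the code near the cursor."""
    return sorted(candidates, key=lambda c: jaccard(query, c), reverse=True)[:k]


def build_prompt(snippets: list[str], in_file_prefix: str) -> str:
    """Completion input: retrieved snippets first, then the file up to the cursor."""
    context = "\n".join(f"# cross-file context\n{s}" for s in snippets)
    return f"{context}\n{in_file_prefix}"


def pipeline(in_file_prefix: str, candidates: list[str],
             complete: Callable[[str], str], k: int = 2) -> str:
    """Pipeline task: retrieve, then complete, end to end."""
    # Assumption: the last three lines before the cursor serve as the query.
    query = "\n".join(in_file_prefix.splitlines()[-3:])
    return complete(build_prompt(retrieve(query, candidates, k), in_file_prefix))
```

Scoring `retrieve` alone corresponds to the retrieval task, scoring `complete` on a prompt built from known-good snippets corresponds to the completion task, and scoring `pipeline` corresponds to the end-to-end task.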

II

SOTA Progression

Date | Tier / Score | Note
Jun 2023 | GPT-4 at 36.4% Edit Similarity (RepoBench-P Python) | Original Liu et al. paper, ICLR 2024.
Mar 2024 | Claude 3 Opus at 38.7% | Anthropic-reported on the public split.
Sep 2024 | Frontier above 45% with retrieval-augmented completion | RAG-style retrieval becomes default in coding agents.
Apr 2026 | Frontier above 52% on contamination-filtered RepoBench-P | Captured from public reports.
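
The scores above are Edit Similarity percentages. As a rough guide to what that number means, here is a sketch that assumes one common definition: 100 times one minus the Levenshtein distance divided by the longer string's length. The official evaluation script may normalise differently, so treat this as an approximation and not as the paper's exact metric.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning `a` into `b`."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def edit_similarity(prediction: str, reference: str) -> float:
    """Similarity in [0, 100]; 100 means the two lines match exactly."""
    prediction, reference = prediction.strip(), reference.strip()
    longest = max(len(prediction), len(reference))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(prediction, reference) / longest)
```

Under this definition a prediction that gets most of a line right still earns partial credit, which is why Edit Similarity figures sit well above what an exact-match count on the same outputs would show.
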
III

When to Pick RepoBench

Pick RepoBench when the question is IDE-autocomplete quality at the cursor. Pick HumanEval or MBPP for single-function generation. Pick SWE-bench Verified for end-to-end patch generation against failing tests. Pick LiveCodeBench for problem-solving over fresh competitive problems. These four options measure four distinct coding capabilities, and a strong score on one does not imply a strong score on the others.

Reader Questions
Q.01 What does RepoBench test?
RepoBench tests three subtasks: Retrieval (given the cursor position, retrieve the right cross-file context), Code Completion (predict the next line or block given context), and Pipeline (chain retrieval and completion end to end). It draws from real GitHub repositories in Python and Java and explicitly requires cross-file reasoning, unlike HumanEval, which is single-file.
Q.02 What was the headline launch number?
Liu et al. (ICLR 2024) reported GPT-4 at 36.4% Edit Similarity score on the RepoBench-P (pipeline) Python split. UniXcoder and CodeBERT baselines trailed in the teens. The benchmark stresses cross-file context, which made it harder than HumanEval at the time and remains harder in 2026 even as models improve.
Q.03 How is RepoBench different from SWE-bench?
SWE-bench Verified tests patch generation for real GitHub issues with a failing test as ground truth. RepoBench tests autocomplete-style code completion at the cursor position, like a Copilot inline suggestion. Both touch multi-file repositories, but the task is fundamentally different: patch generation (SWE-bench) vs context-aware completion (RepoBench).
Q.04 Why does retrieval matter so much for RepoBench?
The completion task often requires referring to a class, function, or import defined in another file. A naive setup puts only the current file in the context window, so those definitions are invisible to the model. Strong RepoBench performance requires either long-context models that can fit the whole repo, or a retrieval step that pulls the right cross-file snippets. The original paper showed that retrieval quality bounds completion quality.
Q.05 Is RepoBench gameable through training-set overlap?
Partially. The dataset is built from public GitHub repos, which are also pre-training data for most code models. The authors address this with a contamination-filtered subset that excludes repositories likely to overlap with a model's training data, judged against its training cutoff. Numbers on the filtered subset are roughly 5 to 10 points lower than the full-set numbers and are the more meaningful comparison.
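
One simple way to approximate the contamination filter from Q.05 is a date check against the model's training cutoff. The sketch below assumes each repository record carries a creation date; the `created` key, the record layout, and the function name are hypothetical, and the authors' actual filtering rule may differ.

```python
from datetime import date


def filter_after_cutoff(repos: list[dict], training_cutoff: date) -> list[dict]:
    """Keep only repositories created strictly after the training cutoff."""
    # Assumption: a repo created after the cutoff cannot be in the training data.
    return [repo for repo in repos if repo["created"] > training_cutoff]


repos = [
    {"name": "old-project", "created": date(2021, 5, 1)},
    {"name": "new-project", "created": date(2024, 2, 10)},
]
kept = filter_after_cutoff(repos, training_cutoff=date(2023, 9, 1))
print([repo["name"] for repo in kept])  # ['new-project']
```

A date check is conservative and cheap, but it does not catch code copied into a new repository from an older one, so it is a lower bar than a true overlap analysis.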

Sources

  [1] Liu et al. (2023): arxiv.org/abs/2306.03091
  [2] RepoBench dataset on Hugging Face: huggingface.co/datasets/tianyang/repobench-p
  [3] RepoBench repository: github.com/Leolty/repobench
From the editor

Benchmarking Agents Review is published by Digital Signet, an independent firm that builds and ships AI agents in production for mid-market companies. If you are evaluating, designing, or productionising LLM agents and want a working second opinion, get in touch.