RepoBench: 27,000 Multi-File Code Completion Tasks, GPT-4 at 36.4%
Single-file benchmarks like HumanEval do not measure what your IDE assistant actually does.
Construction
RepoBench is built from real public GitHub repositories in Python and Java. The dataset was filtered for compilable, well-tested projects. Each example places the model's cursor at a specific line in a file, with the rest of that file and the rest of the repo available as candidate context. The task is to emit the next line or block at the cursor.
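To make the task shape concrete, here is a minimal sketch of one cursor-level example and a simple way to score a prediction against it. The class name `CompletionExample`, its field names, and the helpers `build_prompt` and `is_exact_match` are all hypothetical illustrations, not the dataset's actual schema or the authors' evaluation code. Exact match on the first predicted line is assumed here as one plausible scoring rule.

```python
from dataclasses import dataclass, field


@dataclass
class CompletionExample:
    """Hypothetical shape of one cursor-level task (not the real dataset schema)."""
    file_path: str
    in_file_prefix: str  # code in the current file above the cursor
    cross_file_snippets: list[str] = field(default_factory=list)  # candidate context from other files
    next_line: str = ""  # ground-truth line at the cursor


def build_prompt(example: CompletionExample, max_snippets: int = 2) -> str:
    """Put commented-out cross-file context ahead of the in-file prefix."""
    parts = []
    for snippet in example.cross_file_snippets[:max_snippets]:
        # Comment the snippet out so the prompt remains valid Python.
        parts.append("\n".join("# " + line for line in snippet.splitlines()))
    parts.append(f"# Path: {example.file_path}")
    parts.append(example.in_file_prefix)
    return "\n".join(parts)


def is_exact_match(prediction: str, example: CompletionExample) -> bool:
    """Compare the first predicted line with the target, ignoring outer whitespace."""
    stripped = prediction.strip()
    first_line = stripped.splitlines()[0] if stripped else ""
    return first_line.strip() == example.next_line.strip()


example = CompletionExample(
    file_path="app/main.py",
    in_file_prefix="from app.utils import add\n\ntotal =",
    cross_file_snippets=["def add(a, b):\n    return a + b"],
    next_line="total = add(1, 2)",
)
prompt = build_prompt(example)
```

The prompt here carries both kinds of context the benchmark describes: the file above the cursor and a snippet from elsewhere in the repository.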
RepoBench has three splits: RepoBench-R (retrieval only: pick the right cross-file snippets), RepoBench-C (completion only: assume the right context is given), and RepoBench-P (pipeline: do both end-to-end). The pipeline split is the headline because it mirrors how IDE autocomplete is actually deployed.
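The relationship between the three splits can be sketched as three functions, where the pipeline is the composition of the other two. This is an illustration under stated assumptions, not the benchmark's implementation. The retriever uses Jaccard overlap of identifier tokens as a stand-in for whatever retriever is being evaluated, the `model` argument is any hypothetical callable that maps a prompt string to a completion string, and the three-line query window is an arbitrary choice.

```python
import re
from typing import Callable


def tokens(code: str) -> set[str]:
    """Identifier-like tokens in a piece of code."""
    return set(re.findall(r"[A-Za-z_]\w*", code))


def jaccard(a: str, b: str) -> float:
    """Token-set overlap between two code strings, in [0, 1]."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def retrieve(query: str, candidates: list[str], k: int = 1) -> list[str]:
    """R-style step: rank cross-file candidates by similarity to the query."""
    ranked = sorted(candidates, key=lambda c: jaccard(query, c), reverse=True)
    return ranked[:k]


def complete(context: list[str], prefix: str, model: Callable[[str], str]) -> str:
    """C-style step: the context is given; only the completion is tested."""
    return model("\n".join(context + [prefix]))


def pipeline(prefix: str, candidates: list[str],
             model: Callable[[str], str], k: int = 1) -> str:
    """P-style step: retrieve first, then complete with whatever was retrieved."""
    query = "\n".join(prefix.splitlines()[-3:])  # assumed: last 3 lines form the query
    return complete(retrieve(query, candidates, k), prefix, model)


candidates = [
    "class Logger:\n    def log(self, msg): ...",
    "def add(a, b):\n    return a + b",
]
top = retrieve("total = add(x, y)", candidates)
```

Because `pipeline` feeds the retriever's output straight into the completion step, a retrieval miss propagates into the completion score, which is what separates the pipeline split from the completion-only split.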
SOTA Progression
When to Pick RepoBench
Pick RepoBench when the question is IDE-autocomplete quality at the cursor. Pick HumanEval or MBPP for single-function generation. Pick SWE-bench Verified for end-to-end patch generation against failing tests. Pick LiveCodeBench for problem-solving over fresh competitive problems. The four benchmarks measure four distinct coding capabilities, and a strong score on one does not imply a strong score on the others.
Q.01 What does RepoBench test?
Q.02 What was the headline launch number?
Q.03 How is RepoBench different from SWE-bench?
Q.04 Why does retrieval matter so much for RepoBench?
Q.05 Is RepoBench gameable through training-set overlap?
Sources
- [1] Liu et al. (2023): arxiv.org/abs/2306.03091
- [2] RepoBench dataset on Hugging Face: huggingface.co/datasets/tianyang/repobench-p
- [3] RepoBench repository: github.com/Leolty/repobench