Independent reference. No vendor affiliation. Scores cited with source and capture date. Affiliate disclosure.
Abstract
What: Competitive-programming code-gen benchmark with rolling-cutoff anti-contamination design.
Who: Jain et al., UC Berkeley + MIT, 2024 (arXiv:2403.07974).
2026 Tier: Frontier 72-78% pass@1; hard tier still 45-55% with real headroom.
Leaderboard: livecodebench.github.io
Section I.v · Code Benchmarks · Last verified April 2026

LiveCodeBench: Code Generation Without Contamination

The benchmark that fixed code-generation evaluation. Rolling cutoff design, hidden test suites, three difficulty tiers, sourced from LeetCode, AtCoder, and Codeforces. The structural answer to the HumanEval contamination problem: only problems released after a model's training cutoff are scored, so memorisation is impossible by construction.

01

Why LiveCodeBench exists

HumanEval, MBPP, and APPS were the standard code-generation benchmarks from 2021 to 2024. By mid-2024 all three were saturated and all three had documented training-data contamination. HumanEval problems echo LeetCode solutions that appear in pre-training corpora; MBPP's tasks were released as a benchmark in 2021 and have been discussed extensively on the web ever since. A high score on these benchmarks does not cleanly separate "can generate code from spec" from "has memorised the test set".

LiveCodeBench, introduced by Jain et al. in March 2024, addresses this with a structural design choice rather than a content filter. Each problem in LiveCodeBench has a release date. When evaluating a model, only problems released after that model's training cutoff are scored. The benchmark refreshes automatically as new competitive-programming problems are released, which means contamination is impossible by construction rather than mitigated by heuristics.

This is the same insight GAIA applied to general assistant tasks (private test set, never released in plaintext) and the same insight Humanity's Last Exam applied to frontier knowledge (curated by active researchers, rotated periodically). The shift from filter-based to design-based contamination resistance is one of the clearest methodology improvements in the 2024-2026 benchmark generation. See our contamination explainer for the wider pattern.

02

Sources and problem characteristics

LiveCodeBench draws problems from three competitive-programming sites: LeetCode, AtCoder, and Codeforces. Each platform contributes a different style of problem, and the mix is intentional. LeetCode problems are widely known and pattern-heavy (data-structure and dynamic-programming staples). AtCoder problems are mathematically flavoured, rewarding observation and clean reasoning. Codeforces problems span a wider difficulty range and include more problems that hinge on algorithmic tricks.

Each problem comes with a release date, official difficulty rating, time limit, memory limit, and a hidden test suite. The hidden test suite is essential: a solution that merely reproduces the visible examples, or a half-remembered one, will still fail it. The official difficulty ratings from the source platforms are bucketed into LiveCodeBench's own three tiers (easy, medium, hard) to enable consistent per-difficulty reporting across platforms.
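To make that metadata concrete, here is a minimal sketch of a problem record in Python. The field names and types are ours for illustration, not the schema used in the LiveCodeBench repository.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class Problem:
    """Illustrative problem record; field names are ours, not LiveCodeBench's actual schema."""
    problem_id: str
    platform: str                  # "leetcode" | "atcoder" | "codeforces"
    release_date: date             # drives the rolling-cutoff filter
    difficulty: str                # bucketed into "easy" | "medium" | "hard"
    time_limit_s: float
    memory_limit_mb: int
    statement: str
    hidden_tests: list[tuple[str, str]] = field(default_factory=list)  # (stdin, expected stdout)
```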

Problem topics include: array and string manipulation, hash maps, sliding windows, sorting, two pointers, binary search, dynamic programming, graph algorithms (BFS, DFS, Dijkstra, MST), trees, recursion, bit manipulation, mathematics, geometry, and game theory. The distribution is biased toward DP and graph algorithms because these dominate competitive programming generally; teams have noted that under-represented topics like geometry give a slightly noisier score signal because of small sample size.

03

The rolling-cutoff design

The rolling cutoff works as follows. Each model documents its training cutoff (typically published on the model card). When evaluating that model on LiveCodeBench, the benchmark filters to problems released strictly after that cutoff date. If GPT-4o has a cutoff of April 2024, it is scored on problems released from May 2024 onwards; if Claude 3.5 Sonnet has a cutoff of October 2024, it is scored on problems from November 2024 onwards.

The mechanical effect is that two models with different cutoffs are scored on different problem sets, which complicates direct head-to-head comparison. The standard practice is therefore to also report a fixed-window score: all models scored on problems released between the same two dates, typically the most recent six months. Both numbers are useful: the cutoff-respecting score is the cleanest contamination signal; the fixed-window score is the cleanest head-to-head signal.
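A minimal sketch of both filters, assuming problem records carry a release_date field as in the sketch above; the function names are illustrative, not part of the official harness.

```python
from datetime import date

def cutoff_respecting(problems, model_cutoff: date):
    """Contamination-safe set: only problems released strictly after the model's training cutoff."""
    return [p for p in problems if p.release_date > model_cutoff]

def fixed_window(problems, start: date, end: date):
    """Head-to-head set: the same dated slice for every model, e.g. the most recent six months."""
    return [p for p in problems if start <= p.release_date <= end]

# Example: a model with an April 2024 cutoff vs. a shared Nov 2025 - Apr 2026 window.
# eval_set_a = cutoff_respecting(problems, date(2024, 4, 30))
# eval_set_b = fixed_window(problems, date(2025, 11, 1), date(2026, 4, 30))
```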

The trade-off worth understanding is sample size. As problems are added monthly to LiveCodeBench, a six-month fixed window contains a different set every month. A score reported in May 2026 against the "Nov 2025 to Apr 2026" window is on a different question set than a score reported in November 2025 against "May 2025 to Oct 2025". Trending a single model's scores over time requires either a consistent fixed window or careful interpretation of the moving baseline.

04

Scoring and metrics

The primary metric is pass@1: the fraction of problems where the model's single generated solution passes all hidden tests within the time and memory limits. pass@5 and pass@10 are also reported but pass@1 is the canonical comparison number; these higher-k metrics inflate apparent capability because real deployments rarely retry many times.
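For reference, pass@k is conventionally computed with the unbiased estimator introduced alongside HumanEval (Chen et al., 2021), where n solutions are sampled per problem and c of them pass. The sketch below shows that convention; it is not a claim about LiveCodeBench's exact harness code.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples drawn per problem, c of them correct (assumes n >= k)."""
    if n - c < k:   # fewer incorrect samples than k draws: at least one correct sample is guaranteed
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# pass@1 with a single sample per problem reduces to the plain pass rate:
# pass_at_k(n=1, c=1, k=1) == 1.0, pass_at_k(n=1, c=0, k=1) == 0.0
```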

Pass@1 scoring runs the model's generated code against the hidden test suite. Solutions that produce correct output for all test cases score 1; solutions that produce wrong output, time out, exceed memory, or fail to compile score 0. There is no partial credit. The benchmark uses a sandboxed execution environment to prevent malicious code (a real concern with arbitrary LLM-generated code being executed); the standard sandbox is documented in the LiveCodeBench repository.
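A stripped-down sketch of that all-or-nothing judgement for stdin/stdout-style problems follows; a real harness adds process isolation and memory limits, which this illustration omits.

```python
import subprocess
import sys

def passes_all_tests(solution_path: str, hidden_tests, time_limit_s: float) -> bool:
    """Binary judgement: every hidden test must produce the expected output within the time limit."""
    for stdin_data, expected in hidden_tests:
        try:
            result = subprocess.run(
                [sys.executable, solution_path],
                input=stdin_data,
                capture_output=True,
                text=True,
                timeout=time_limit_s,
            )
        except subprocess.TimeoutExpired:
            return False                      # time-limit exceeded
        if result.returncode != 0:
            return False                      # runtime error or crash
        if result.stdout.strip() != expected.strip():
            return False                      # wrong answer
    return True                               # no partial credit: all tests or nothing
```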

LiveCodeBench also reports per-difficulty breakdowns, which are essential for comparison. A frontier model that scores 75 percent overall typically scores 90 on easy, 75 on medium, and 50 on hard. Reporting only the overall number masks meaningful capability differences; the better practice is to report all three tiers plus the overall.
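Computing the breakdown is straightforward once per-problem results are tagged with their tier; a small illustrative sketch:

```python
from collections import defaultdict

def per_difficulty_pass_rates(results):
    """results: iterable of (difficulty, passed) pairs, e.g. ("hard", True)."""
    totals, passes = defaultdict(int), defaultdict(int)
    for difficulty, passed in results:
        totals[difficulty] += 1
        passes[difficulty] += int(passed)
    rates = {d: passes[d] / totals[d] for d in totals}
    rates["overall"] = sum(passes.values()) / max(sum(totals.values()), 1)
    return rates
```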

05

SOTA progression Mar 2024 to May 2026

LiveCodeBench scores have climbed steadily since launch, with the most progress in the medium difficulty tier. Easy is nearing saturation (90 percent); hard still has substantial headroom (45-55 percent). The benchmark is a good signal of code-generation capability over the 2024-2026 window.

Date | Milestone | Note
Mar 2024 | Launch, GPT-4 baseline at 28.7% pass@1 | Jain et al. paper introducing the benchmark; rolling cutoff design.
Oct 2024 | Frontier closed-source at 50% pass@1 | Strong code-specific models close the gap.
Apr 2025 | Frontier at 65% pass@1 | Open-weight code models (DeepSeek-Coder, Qwen2.5-Coder) post competitive numbers.
Nov 2025 | Frontier crosses 70% pass@1 | Strong agentic scaffolds with test-execution feedback emerge.
May 2026 | Frontier 72-78% pass@1 | Top tier closes; per-difficulty: easy ~90%, medium ~75%, hard ~50%.
06

Three difficulty tiers

Tier | What it tests, and where the 2026 frontier sits
Easy | Standard competitive-programming patterns: array manipulation, string parsing, basic data structures. Frontier saturation around 90 percent in May 2026; remains useful for floor-testing weaker models.
Medium | Dynamic programming, graph algorithms, geometry. The most discriminating difficulty in 2026; frontier scores 70-80 percent and the gap between top models is largest here.
Hard | Complex DP, advanced data structures, tricky observations. Frontier scores 45-55 percent; humans at top-tier competitive level score 60-70 percent. Real headroom remains.

The hard tier is the most informative for comparing capable models in 2026. The medium tier is the best general-purpose comparison signal. The easy tier is useful for floor-testing smaller models and for sanity-checking that an evaluation pipeline is configured correctly; if a frontier model scores below 85 percent on easy, the pipeline is misconfigured (e.g. answer extraction is failing).
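That floor check is worth automating before reading the other tiers; a small sketch, with the 85 percent threshold taken from the paragraph above and intended only for frontier-class runs:

```python
def check_easy_tier_floor(rates: dict[str, float], floor: float = 0.85) -> None:
    """Flag a likely harness bug (e.g. failed answer extraction) in a frontier-class evaluation."""
    easy = rates.get("easy", 0.0)
    if easy < floor:
        raise RuntimeError(
            f"Easy-tier pass@1 is {easy:.2f}, below the {floor:.2f} floor; "
            "check prompt formatting and answer extraction before trusting medium/hard scores."
        )
```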

07

Strengths, limits, and when to use LiveCodeBench

Strengths: structural contamination resistance, hidden test suites, per-difficulty reporting, automatic dataset refresh, well-instrumented sandboxed execution. The benchmark is the right replacement for HumanEval and MBPP for frontier code-generation comparisons in 2026.

Limits: competitive-programming patterns dominate the corpus, so a strong LiveCodeBench score predicts competitive-programming-style code generation more than general software engineering capability. Multi-file code is not tested. Long-context understanding is not tested. Debugging existing code is not tested. For these wider capabilities, prefer SWE-bench Verified; LiveCodeBench is the right complement, not the only number to quote.

Use LiveCodeBench when you want a fresh, contamination-resistant number on a model's code-from-spec generation capability. Quote pass@1, per-difficulty, with the evaluation window dates. Pair with SWE-bench Verified for real-engineering capability and with Terminal-Bench for shell-living agent capability. See our coding-agent benchmark comparison for the full landscape.

Editor's verdict: LiveCodeBench is the right replacement for HumanEval in 2026. Quote pass@1 with the evaluation window dates and per-difficulty breakdown. The benchmark is honest, fresh, and well-instrumented. The remaining gap to SWE-bench Verified reflects competitive-programming versus real-engineering, not benchmark quality.
Reader Questions
Q.01 What problem does LiveCodeBench solve?
LiveCodeBench solves the contamination problem with code-generation benchmarks. HumanEval, MBPP, and APPS all suffer from training-data overlap: their problems were released years before current models were trained, so a model's high score may reflect memorisation of the test set rather than genuine code-generation capability. LiveCodeBench addresses this with a rolling-cutoff design: when evaluating a model, only problems released after that model's training cutoff are scored. As new problems are released, the benchmark refreshes automatically.
Q.02 Where do LiveCodeBench problems come from?
LiveCodeBench collects problems from competitive-programming sites: LeetCode, AtCoder, and Codeforces. Each problem has a release date, time limit, memory limit, and a hidden test suite. The benchmark's website at livecodebench.github.io tracks problems over time and exposes a filterable interface where you can specify a date range to evaluate a model against.
Q.03 How does the rolling cutoff work in practice?
Each model has a known training cutoff (e.g. April 2024 for GPT-4o, October 2024 for Claude 3.5 Sonnet). LiveCodeBench scores the model only on problems released after that cutoff. For GPT-4o, problems before April 2024 are excluded; for Claude 3.5 Sonnet, problems before October 2024 are excluded. This ensures the model cannot have seen the problems during training. The trade-off is that two models with different cutoffs are scored on slightly different question sets, which makes direct head-to-head comparison less clean than on a fixed-set benchmark.
Q.04 What metrics does LiveCodeBench report?
The primary metric is pass@1: the fraction of problems for which the single generated solution passes all hidden tests within the time and memory limits. pass@5 and pass@10 are also reported but pass@1 is the canonical comparison number. LiveCodeBench also reports per-difficulty breakdowns (easy, medium, hard) which are more informative than the overall number; frontier models typically saturate easy and post lower scores on hard.
Q.05 Is LiveCodeBench gameable?
Less than other code benchmarks. The hidden test suite means a memorised solution must actually run correctly; partial memorisation is not enough. The rolling cutoff means problems are fresh relative to model training. The main residual risk is that competitive-programming problems share patterns: a model trained on millions of LeetCode-style problems has seen the genre even if it has not seen the specific problem. LiveCodeBench measures pattern competence as much as raw generation ability, but this is honest as long as the benchmark is read as 'competitive-programming-style code generation' rather than 'general software engineering'.
Q.06 What is the difference between LiveCodeBench and SWE-bench Verified?
LiveCodeBench tests competitive-programming-style single-function generation: read a problem, write code that passes tests, optimise for correctness and complexity. SWE-bench Verified tests real software engineering: navigate a multi-file repository, understand an issue, write a patch, ensure existing tests still pass. The two benchmarks measure largely disjoint capabilities. A model that scores 75 percent on LiveCodeBench might score 50 percent on SWE-bench Verified; the inverse is also seen. Quote both for serious coding-agent comparisons.
Related: HumanEval and MBPP · SWE-bench Verified · Coding-Agent Benchmarks Compared · Terminal-Bench · Benchmark Contamination · Full Benchmark Reference · What Benchmarks Miss

Sources

[1] Jain, N. et al. (2024). LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code. arXiv:2403.07974.
[2] LiveCodeBench project site and leaderboard. livecodebench.github.io. Accessed May 2026.
[3] LiveCodeBench repository. github.com/LiveCodeBench/LiveCodeBench.
From the editor

Benchmarking Agents Review is published by Digital Signet, an independent firm that builds and ships AI agents in production for mid-market companies. If you are evaluating, designing, or productionising LLM agents and want a working second opinion, get in touch.
