LiveCodeBench: Code Generation Without Contamination
The benchmark that fixed code-generation evaluation. Rolling cutoff design, hidden test suites, three difficulty tiers, sourced from LeetCode, AtCoder, and Codeforces. The structural answer to the HumanEval contamination problem: only problems released after a model's training cutoff are scored, so memorisation is impossible by construction.
Why LiveCodeBench exists
HumanEval, MBPP, and APPS were the standard code-generation benchmarks from 2021 to 2024. By mid-2024 all three were saturated and all three had documented training-data contamination. HumanEval problems echo LeetCode solutions that appear in pre-training corpora; MBPP's tasks were released as a benchmark in 2021 and have been discussed extensively on the web ever since. A high score on these benchmarks does not cleanly separate "can generate code from spec" from "has memorised the test set".
LiveCodeBench, introduced by Jain et al. in March 2024, addresses this with a structural design choice rather than a content filter. Each problem in LiveCodeBench has a release date. When evaluating a model, only problems released after that model's training cutoff are scored. The benchmark refreshes automatically as new competitive-programming problems are released, which means contamination is impossible by construction rather than mitigated by heuristics.
This is the same insight GAIA applied to general assistant tasks (private test set, never released in plaintext) and the same insight Humanity's Last Exam applied to frontier knowledge (curated by active researchers, rotated periodically). The shift from filter-based to design-based contamination resistance is one of the clearest methodology improvements in the 2024-2026 benchmark generation. See our contamination explainer for the wider pattern.
Sources and problem characteristics
LiveCodeBench draws problems from three competitive-programming sites: LeetCode, AtCoder, and Codeforces. Each platform contributes a different style of problem, and the mix is intentional. LeetCode problems are widely-known and pattern-heavy (data-structure and dynamic-programming staples). AtCoder problems are mathematically flavoured with a focus on observation and clean reasoning. Codeforces problems span a wider difficulty range and include more algorithmic-trickery style problems.
Each problem comes with a release date, an official difficulty rating, a time limit, a memory limit, and a hidden test suite. The hidden test suite is essential: a solution that only reproduces the published example cases, or a half-remembered pattern that happens to cover them, will still fail. The official difficulty ratings from the source platforms are bucketed into LiveCodeBench's own three tiers (easy, medium, hard) to enable consistent per-difficulty reporting across platforms.
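To make the metadata concrete, here is a minimal sketch in Python of a problem record and a tier-bucketing rule. The field names and the per-platform rating thresholds are illustrative assumptions, not the benchmark's actual schema or cut points.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Problem:
    # Illustrative fields only; the real LiveCodeBench schema may differ.
    problem_id: str
    platform: str          # "leetcode" | "atcoder" | "codeforces"
    release_date: date
    platform_rating: int   # the source site's own difficulty rating
    time_limit_s: float
    memory_limit_mb: int
    hidden_tests: list     # (stdin, expected_stdout) pairs, never published

def tier(problem: Problem) -> str:
    """Bucket a platform-specific rating into easy/medium/hard.
    The cut points below are placeholders, not official thresholds."""
    cut_points = {
        "codeforces": (1200, 1800),
        "atcoder": (1000, 1600),
        "leetcode": (1, 2),  # LeetCode already labels problems 1/2/3
    }
    lo, hi = cut_points[problem.platform]
    if problem.platform_rating <= lo:
        return "easy"
    if problem.platform_rating <= hi:
        return "medium"
    return "hard"
```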
Problem topics include: array and string manipulation, hash maps, sliding windows, sorting, two pointers, binary search, dynamic programming, graph algorithms (BFS, DFS, Dijkstra, MST), trees, recursion, bit manipulation, mathematics, geometry, and game theory. The distribution is biased toward DP and graph algorithms because these dominate competitive programming generally; teams have noted that under-represented topics like geometry give a slightly noisier score signal because of small sample size.
The rolling-cutoff design
The rolling cutoff works as follows. Each model documents its training cutoff (typically published on the model card). When evaluating that model on LiveCodeBench, the benchmark filters to problems released strictly after that cutoff date. A model with an April 2024 cutoff is scored on the May 2024 onwards problem set; a model with an October 2024 cutoff is scored on November 2024 onwards problems.
The mechanical effect is that two models with different cutoffs are scored on different problem sets, which complicates direct head-to-head comparison. The standard practice is therefore to also report a fixed-window score: all models scored on problems released between the same two dates (typically the most recent six months). Both numbers are useful: the cutoff-respecting score is the cleanest contamination signal; the fixed-window score is the cleanest head-to-head signal.
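A minimal sketch of both filters, assuming problem records shaped like the earlier sketch; the function names are ours, but the date arithmetic is the whole mechanism.

```python
from datetime import date

def after_cutoff(problems, training_cutoff: date):
    """Cutoff-respecting set: only problems released strictly after the
    model's documented training cutoff are eligible for scoring."""
    return [p for p in problems if p.release_date > training_cutoff]

def fixed_window(problems, start: date, end: date):
    """Fixed-window set: one shared slice of the timeline so models with
    different cutoffs can be compared head to head."""
    return [p for p in problems if start <= p.release_date <= end]

# Hypothetical usage: `all_problems` stands in for the full problem list.
all_problems: list = []
contamination_free = after_cutoff(all_problems, date(2024, 4, 30))
head_to_head = fixed_window(all_problems, date(2025, 11, 1), date(2026, 4, 30))
```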
The trade-off worth understanding is sample size. As problems are added monthly to LiveCodeBench, a six-month fixed window contains a different set every month. A score reported in May 2026 against the "Nov 2025 to Apr 2026" window is on a different question set than a score reported in November 2025 against "May 2025 to Oct 2025". Trending a single model's scores over time requires either a consistent fixed window or careful interpretation of the moving baseline.
Scoring and metrics
The primary metric is pass@1: the fraction of problems where the model's single generated solution passes all hidden tests within the time and memory limits. pass@5 and pass@10 are also reported, but pass@1 is the canonical comparison number; the higher-k metrics inflate apparent capability because real deployments rarely get many retries.
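When more than one sample is drawn per problem, pass@k is usually computed with the unbiased estimator popularised by the HumanEval/Codex paper; a sketch follows, and it is an assumption here that LiveCodeBench's pass@5 and pass@10 numbers follow the same convention.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k for a single problem, given n generated
    samples of which c passed every hidden test (Codex-paper convention)."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def benchmark_pass_at_k(results, k: int) -> float:
    """Average the per-problem estimates over the (filtered) problem set.
    `results` is a list of (n_samples, n_correct) pairs, one per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```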
Pass@1 scoring runs the model's generated code against the hidden test suite. Solutions that produce correct output for all test cases score 1; solutions that produce wrong output, time out, exceed memory, or fail to compile score 0. There is no partial credit. The benchmark uses a sandboxed execution environment to prevent malicious code (a real concern with arbitrary LLM-generated code being executed); the standard sandbox is documented in the LiveCodeBench repository.
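A deliberately simplified scoring loop, not the actual LiveCodeBench harness: one candidate is run against every hidden test under a time limit, and any failure scores zero. A real harness also enforces the memory limit and runs inside a proper sandbox (container, seccomp, or similar), which a bare subprocess call does not provide.

```python
import subprocess

def score_solution(source_path: str, hidden_tests, time_limit_s: float) -> int:
    """Return 1 only if the candidate passes every hidden test within the
    time limit; wrong output, a crash, or a timeout scores 0 (no partial credit)."""
    for stdin_text, expected_stdout in hidden_tests:
        try:
            run = subprocess.run(
                ["python", source_path],   # assumes Python solutions
                input=stdin_text,
                capture_output=True,
                text=True,
                timeout=time_limit_s,
            )
        except subprocess.TimeoutExpired:
            return 0                       # time limit exceeded
        if run.returncode != 0:
            return 0                       # runtime error
        if run.stdout.strip() != expected_stdout.strip():
            return 0                       # wrong answer
    return 1
```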
LiveCodeBench also reports per-difficulty breakdowns, which are essential for comparison. A frontier model that scores 75 percent overall typically scores 90 on easy, 75 on medium, and 50 on hard. Reporting only the overall number masks meaningful capability differences; the better practice is to report all three tiers plus the overall.
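A small aggregation sketch for that style of report; the tier labels are whatever the bucketing assigns, and the rest is bookkeeping.

```python
from collections import defaultdict

def per_tier_pass_at_1(rows):
    """`rows` is a list of (tier, passed) pairs, one per problem, where
    `passed` is 0 or 1. Returns per-tier pass@1 plus the overall number."""
    totals, passes = defaultdict(int), defaultdict(int)
    for tier_label, passed in rows:
        totals[tier_label] += 1
        passes[tier_label] += passed
    report = {t: passes[t] / totals[t] for t in totals}
    report["overall"] = sum(passes.values()) / sum(totals.values())
    return report
```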
SOTA progression Mar 2024 to May 2026
LiveCodeBench scores have climbed steadily since launch, with the most progress in the medium difficulty tier. Easy is nearing saturation (top models around 90 percent); hard still has substantial headroom (top models around 45-55 percent). The benchmark has remained a good signal of code-generation capability across the 2024-2026 window.
Three difficulty tiers
The hard tier is the most informative for comparing capable models in 2026. The medium tier is the best general-purpose comparison signal. The easy tier is useful for floor-testing smaller models and for sanity-checking that an evaluation pipeline is configured correctly; if a frontier model scores below 85 percent on easy, the pipeline is misconfigured (e.g. answer extraction is failing).
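That rule of thumb can be encoded as a cheap pipeline check; the 0.85 floor mirrors the figure above and is a heuristic, not an official threshold.

```python
def check_easy_floor(report: dict, floor: float = 0.85) -> None:
    """Flag an evaluation run when a supposedly frontier model lands
    implausibly low on the easy tier, which usually indicates a harness
    problem (e.g. answer extraction failing) rather than a capability gap."""
    easy = report.get("easy", 0.0)
    if easy < floor:
        raise RuntimeError(
            f"easy-tier pass@1 = {easy:.2f} is below {floor:.2f}; check the "
            "pipeline before trusting the medium and hard numbers."
        )
```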
Strengths, limits, and when to use LiveCodeBench
Strengths: structural contamination resistance, hidden test suites, per-difficulty reporting, automatic dataset refresh, well-instrumented sandboxed execution. The benchmark is the right replacement for HumanEval and MBPP for frontier code-generation comparisons in 2026.
Limits: competitive-programming patterns dominate the corpus, so a strong LiveCodeBench score predicts competitive-programming-style code generation more than general software engineering capability. Multi-file code is not tested. Long-context understanding is not tested. Debugging existing code is not tested. For these wider capabilities, prefer SWE-bench Verified; LiveCodeBench is the right complement, not the only number to quote.
Use LiveCodeBench when you want a fresh, contamination-resistant number on a model's code-from-spec generation capability. Quote pass@1, per-difficulty, with the evaluation window dates. Pair with SWE-bench Verified for real-engineering capability and with Terminal-Bench for shell-living agent capability. See our coding-agent benchmark comparison for the full landscape.
Frequently asked questions
- What problem does LiveCodeBench solve?
- Where do LiveCodeBench problems come from?
- How does the rolling cutoff work in practice?
- What metrics does LiveCodeBench report?
- Is LiveCodeBench gameable?
- What is the difference between LiveCodeBench and SWE-bench Verified?
Sources
- [1] Jain, N. et al. (2024). LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code. arXiv:2403.07974.
- [2] LiveCodeBench project site and leaderboard. livecodebench.github.io. Accessed May 2026.
- [3] LiveCodeBench repository. github.com/LiveCodeBench/LiveCodeBench.