Abstract

What: Unbiased estimator for the probability that at least one of k samples passes.

Who: Chen, Tworek, Jun, Yuan, Pinto, Kaplan, et al. (OpenAI, HumanEval paper 2021).

2026 Use: Standard reporting for HumanEval, MBPP, LiveCodeBench, SWE-bench (pass@1 only).

Section IV.iii Methodology|Last verified April 2026

pass@1 vs pass@k: Why HumanEval Numbers Are Often Misquoted

The metric you pick changes the headline more than the model does.

The Formal Definition

The OpenAI Codex paper (Chen et al., 2021) introduced the unbiased estimator for pass@k. For each task, the model produces n samples, of which c pass the tests, and the estimator is pass@k = 1 - C(n - c, k) / C(n, k), where C(a, b) is the binomial coefficient. The benchmark score is the mean of this quantity over tasks. This means you do not run exactly k samples to estimate pass@k. You run n >= k samples (typically 100 to 200; Chen et al. use n = 200 with k up to 100) and use the combinatorial identity to compute the probability that a randomly chosen subset of size k contains at least one correct sample.
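A minimal sketch of this estimator, using only Python's standard library. The function names `pass_at_k` and `benchmark_pass_at_k` are illustrative and not taken from the HumanEval repository, and the per-task counts in the usage lines are invented for the example.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased per-task estimate: 1 - C(n - c, k) / C(n, k)."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples failed, so every
    # size-k subset then contains a correct sample and the estimate is 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Mean of the per-task estimates; results holds one (n, c) pair per task."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical counts for three tasks, 200 samples each.
results = [(200, 0), (200, 20), (200, 200)]
print(benchmark_pass_at_k(results, k=1))   # equals the mean of c / n
print(benchmark_pass_at_k(results, k=10))
```

With k = 1 the estimator reduces to c / n per task, so the same function covers pass@1 when several samples are drawn.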

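The estimator matters because both shortcuts are worse. Running exactly k samples and checking whether any passed is unbiased but has high variance, since each task contributes only a 0-or-1 outcome. Plugging the empirical pass rate into 1 - (1 - c/n)^k is biased: it systematically underestimates pass@k, and the gap shrinks only as n grows large relative to k. Papers that quote pass@k from k samples directly are therefore reporting a noisy number rather than an inflated one, and scores produced by different estimators should not be compared as if they were the same quantity.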
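The comparison between estimators can be checked exactly rather than by simulation. The sketch below assumes a single task whose samples pass independently with probability p, so that c follows a Binomial(n, p) distribution; the values n = 20, k = 10, p = 0.1 and all function names are illustrative choices.

```python
from math import comb


def binom_pmf(n, c, p):
    """Probability of exactly c passes among n independent samples."""
    return comb(n, c) * p**c * (1 - p) ** (n - c)


def unbiased(n, c, k):
    """Combinatorial estimator: 1 - C(n - c, k) / C(n, k)."""
    return 1.0 - comb(n - c, k) / comb(n, k)


def plug_in(n, c, k):
    """Plug-in estimator using the empirical pass rate c / n."""
    return 1.0 - (1.0 - c / n) ** k


def mean_and_variance(estimator, n, k, p):
    """Exact mean and variance of an estimator over c ~ Binomial(n, p)."""
    values = [(binom_pmf(n, c, p), estimator(n, c, k)) for c in range(n + 1)]
    mean = sum(w * v for w, v in values)
    var = sum(w * (v - mean) ** 2 for w, v in values)
    return mean, var


n, k, p = 20, 10, 0.1               # illustrative values
truth = 1.0 - (1.0 - p) ** k        # true pass@k for this task
naive_var = truth * (1.0 - truth)   # k samples, any-pass indicator (Bernoulli)

unbiased_mean, unbiased_var = mean_and_variance(unbiased, n, k, p)
plug_in_mean, _ = mean_and_variance(plug_in, n, k, p)

print(f"true pass@{k}:          {truth:.4f}")
print(f"unbiased mean:         {unbiased_mean:.4f}")
print(f"plug-in mean:          {plug_in_mean:.4f}")
print(f"variance, k samples:   {naive_var:.4f}")
print(f"variance, n samples:   {unbiased_var:.4f}")
```

Under these assumptions the combinatorial estimator recovers the true value on average, the plug-in estimator falls below it, and drawing n > k samples reduces the per-task variance relative to the k-sample any-pass check.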

The Three Common Variants

| Metric | Definition | Use Case |
|--------|------------|----------|
| pass@1 | Probability a single sample passes. | Honest single-shot user experience. |
| pass@k | Probability at least 1 of k samples passes. | Best-of-k deployment with a verifier. |
| pass^k | Probability all k samples pass. | Consistency requirement, no retries. |


Why the Difference Moves Numbers

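Take a model with 60% per-sample success on every task of a benchmark, with samples drawn independently. The three metrics compute to: pass@1 = 0.60, pass@4 = 1 - 0.4^4 = 0.974, pass^4 = 0.6^4 = 0.130. The same model "scores" 60%, 97%, or 13% depending on which metric you pick. When per-task success rates vary, the aggregate figures differ from these closed forms, but the ordering pass^k <= pass@1 <= pass@k still holds. Choosing which metric to quote is therefore a reporting choice with substantial consequences.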
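The arithmetic above can be reproduced in a few lines. This is a sketch under the same simplifying assumption as the text (one per-sample success probability p, independent samples); the function names are illustrative.

```python
def at_least_one_of_k(p: float, k: int) -> float:
    """pass@k under independence: 1 - (1 - p)^k."""
    return 1.0 - (1.0 - p) ** k


def all_of_k(p: float, k: int) -> float:
    """pass^k under independence: p^k."""
    return p**k


p = 0.60
print(f"pass@1 = {at_least_one_of_k(p, 1):.3f}")
print(f"pass@4 = {at_least_one_of_k(p, 4):.3f}")
print(f"pass^4 = {all_of_k(p, 4):.3f}")
```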

Reading Rule

When you see a benchmark score, ask three things. First, which estimator (pass@1, pass@k, pass^k, best-of-k with verifier, majority vote)? Second, how many samples per task? Third, what decoding (greedy, temperature, nucleus, self-consistency)? Two scores are comparable only when all three match. This is also why model cards that omit decoding settings should be read sceptically.
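One way to make the rule mechanical is to record the three attributes alongside every score and refuse to compare scores when any of them differ. The `ScoreReport` type, its field names and the example values below are hypothetical, introduced only to illustrate the check.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoreReport:
    estimator: str         # e.g. "pass@1", "pass@10", "pass^4"
    samples_per_task: int  # n, the number of samples drawn per task
    decoding: str          # e.g. "greedy", "temperature=0.8"
    score: float


def comparable(a: ScoreReport, b: ScoreReport) -> bool:
    """True only when estimator, sample count and decoding all match."""
    return (a.estimator, a.samples_per_task, a.decoding) == (
        b.estimator,
        b.samples_per_task,
        b.decoding,
    )


# Hypothetical reports for two models.
model_a = ScoreReport("pass@1", 1, "greedy", 0.71)
model_b = ScoreReport("pass@10", 200, "temperature=0.8", 0.89)
print(comparable(model_a, model_b))  # the settings differ
```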


Reader Questions

Q.01 What is pass@k?

pass@k is the probability that at least one of k randomly drawn samples from a model would pass the test. The Chen et al. (2021) HumanEval paper gave an unbiased estimator computed from n samples, not k samples: pass@k = 1 - C(n - c, k) / C(n, k), where c is the number of correct samples out of n. Reporting pass@k from just k samples is still unbiased but has large variance, and plugging the empirical pass rate into 1 - (1 - c/n)^k underestimates the true value.

Q.02 What is pass@1?

pass@1 is the probability that a single sample passes the test. In practice it is estimated either by running the model once per task with greedy decoding and taking the fraction of tasks that pass, or by averaging the per-task pass rate c/n over several sampled completions. The two procedures can give different numbers, so the decoding setting should be stated alongside the score. pass@1 is the most honest single-number metric because it matches the user's experience of asking once and getting an answer.

Q.03 How is pass^k different from pass@k?

pass@k requires success on at least one of k samples. pass^k (used by Tau-Bench and several other agentic benchmarks) requires success on all k samples. pass^k is much harsher. For a task where the model has 70% per-sample success, pass@4 is roughly 99.2% and pass^4 is roughly 24%. The metric you pick changes the headline more than the model does.
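By symmetry with the pass@k estimator, pass^k can be estimated from n samples as C(c, k) / C(n, k), the fraction of size-k subsets in which every sample is correct. The Tau-Bench paper [3] describes an estimator of this form; treat the exact correspondence as an assumption and check the paper before relying on it. The function names and counts below are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """At least one of k correct: 1 - C(n - c, k) / C(n, k)."""
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """All k correct: C(c, k) / C(n, k)."""
    return comb(c, k) / comb(n, k)


# Hypothetical task: 8 trials, 4 of them successful.
n, c = 8, 4
for k in (1, 2, 4):
    print(k, round(pass_at_k(n, c, k), 3), round(pass_hat_k(n, c, k), 3))
```

At k = 1 both functions return c / n; as k grows, pass@k rises towards 1 while pass^k falls towards 0.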

Q.04 When is pass@10 or pass@100 a fair reporting?

It is fair when the deployment actually uses best-of-k sampling (an outer verifier picks the best of k tries) and the verifier's accuracy is genuinely high. In that case pass@k approximates the real-world success rate, and it is exact only if the verifier never rejects a correct sample or accepts an incorrect one. If there is no verifier and users see one sample, pass@k overstates capability and pass@1 is the honest number to publish.

Q.05 Why is HumanEval pass@1 sometimes quoted higher than pass@10?

It cannot be, mathematically, when both are computed from the same samples: for fixed n and c, the estimator never decreases as k grows. If you see HumanEval pass@1 quoted higher than pass@10 for the same model, the report is mixing decoding settings, for example pass@1 from greedy decoding with chain-of-thought and multi-sample voting against pass@10 from temperature sampling. These are different estimators on different sampling regimes. Always ask for the decoding temperature, sample count, and aggregation rule when comparing.
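The monotonicity claim is easy to check numerically. The sketch below evaluates the estimator for one hypothetical task (n = 200 samples, c = 37 correct) across k = 1 to 100 and confirms the curve never decreases; the counts and names are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator: 1 - C(n - c, k) / C(n, k)."""
    return 1.0 - comb(n - c, k) / comb(n, k)


n, c = 200, 37  # hypothetical counts for a single task
curve = [pass_at_k(n, c, k) for k in range(1, 101)]

# Each value is at least as large as the one before it.
is_monotone = all(a <= b for a, b in zip(curve, curve[1:]))
print(is_monotone)
print(round(curve[0], 3), round(curve[9], 3), round(curve[99], 3))
```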

Sources

[1] Chen et al. (2021): arxiv.org/abs/2107.03374
[2] HumanEval repository: github.com/openai/human-eval
[3] Tau-Bench paper (pass^k): arxiv.org/abs/2406.12045