Independent reference. No vendor affiliation. Scores cited with source and capture date. Affiliate disclosure.
Abstract
What · Pairwise human-preference leaderboard for LLMs; tens of millions of votes; Bradley-Terry-Luce Elo ratings
Who · LMArena team (formerly LMSYS), 2023 (arXiv:2403.04132)
2026 Tier · Top tier 1480-1520 Elo; top 5 within 50 points
Leaderboard · lmarena.ai
Section I.vi · Preference Benchmarks · Last verified April 2026

Chatbot Arena: The Human-Preference Leaderboard

The largest human-preference evaluation of LLMs ever assembled. Tens of millions of pairwise votes, Bradley-Terry-Luce Elo, open-ended prompts. A genuinely useful signal when read correctly: not a capability measure, a preference measure. The benchmark to quote when the question is which model users prefer, not which model is most capable.

01

What Chatbot Arena measures

Chatbot Arena, introduced by Chiang et al. at LMSYS in March 2024 (and operational since May 2023), aggregates pairwise human preference judgements between large language models. Users submit a prompt, receive responses from two anonymous models, and choose the better response (or vote "tie" or "both are bad"). The platform aggregates millions of such votes using a Bradley-Terry-Luce model to estimate per-model strength scores, which are reported as Elo-style numbers where higher is better and a 100-point gap corresponds to roughly 64 percent expected win-rate.
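
To make that mapping concrete, the expected win-rate implied by a rating gap follows the standard base-10, scale-400 logistic curve. A minimal Python sketch; nothing Arena-specific is assumed beyond that convention:

    def expected_win_rate(elo_gap: float) -> float:
        """Probability that the higher-rated model wins, under the Elo curve."""
        return 1.0 / (1.0 + 10 ** (-elo_gap / 400.0))

    print(expected_win_rate(100))  # ~0.64, the 100-point gap quoted above
    print(expected_win_rate(320))  # ~0.86, the 2023-to-2026 top-tier gap (section 05)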

The benchmark is unique in two ways. First, the test set is whatever real users submit. There is no fixed prompt distribution; the prompts that actually come in shape the score. This is both a strength (the score reflects real usage) and a weakness (the prompt distribution is biased toward English-speaking technical users and the kinds of questions they ask). Second, scale: the platform has accumulated tens of millions of votes, which is orders of magnitude more human-preference data than any single research lab has produced.

What Chatbot Arena does not measure: deep capability on specific tasks. The prompts are typically short conversational queries, not multi-step research tasks or real engineering work. A model that wins on Arena may lose on SWE-bench Verified, GAIA Level 3, or OSWorld because those benchmarks measure different things. Use Arena as the holistic preference signal; quote capability-specific benchmarks alongside it for serious comparisons.

02

Six leaderboard categories

LMArena exposes multiple leaderboard views, each filtered to a subset of prompts or with different scoring adjustments.

Category · What it shows
Overall · Aggregate Elo across all user-submitted prompts. The headline number most people quote. Reflects the full prompt distribution.
Style-controlled · Adjusts for response length and markdown formatting. Closer to a pure-content preference signal. Typically reshuffles the top 5 by a few positions.
Hard prompts · Filtered to prompts the LMArena team flags as challenging. The most discriminating category for frontier comparison.
Coding · Filtered to programming-related prompts. Useful sanity check, but quote LiveCodeBench or SWE-bench for capability claims.
Math · Filtered to math-related prompts. Useful sanity check; quote MATH or AIME for capability claims.
Creative writing · Filtered to creative-writing prompts. The category where style effects are largest; the style-controlled version is essential.

The overall and style-controlled boards are the headline numbers; the hard-prompts board is the most discriminating for frontier comparison; the per-topic boards (coding, math, creative writing) are sanity checks rather than capability tests. The honest practice is to look at all of them: a model that ranks first overall but third on hard prompts has a less robust lead than its headline number suggests.

03

Bradley-Terry-Luce scoring and confidence intervals

Chatbot Arena uses the Bradley-Terry-Luce (BTL) model to estimate per-model strength from pairwise outcomes. The BTL model assumes each model has a latent strength parameter, and the probability that model A beats model B is a sigmoid of the strength difference. The strengths are estimated by maximum likelihood over the entire vote history; the output is mapped to Elo-style numbers for readability.
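
In sketch form, the BTL maximum-likelihood fit is a logistic regression over the vote log: each battle becomes a row with +1 in one model's column and -1 in the other's, and the fitted coefficients are the latent strengths. The toy example below invents model names and vote counts; the production pipeline additionally handles ties, deduplication, and a different anchoring convention.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    MODELS = ["model-a", "model-b", "model-c"]   # invented for illustration

    def fit_bradley_terry(battles, models=MODELS, anchor=1000.0, seed=0):
        """battles: list of (winner, loser) pairs. Returns Elo-style scores."""
        idx = {m: i for i, m in enumerate(models)}
        X = np.zeros((len(battles), len(models)))
        for row, (winner, loser) in zip(X, battles):
            row[idx[winner]], row[idx[loser]] = 1.0, -1.0
        y = np.ones(len(battles))                  # rows are written winner-first,
        flip = np.random.default_rng(seed).random(len(battles)) < 0.5
        X[flip] *= -1.0                            # so flip half of them to give
        y[flip] = 0.0                              # the regression both labels
        clf = LogisticRegression(fit_intercept=False, C=1e6).fit(X, y)
        elo = 400.0 / np.log(10.0) * clf.coef_[0]  # natural-log scale -> Elo scale
        return elo + (anchor - elo.mean())         # centre the scale on the anchor

    battles = ([("model-a", "model-b")] * 60 + [("model-b", "model-a")] * 40
             + [("model-a", "model-c")] * 70 + [("model-c", "model-a")] * 30
             + [("model-b", "model-c")] * 55 + [("model-c", "model-b")] * 45)
    print(dict(zip(MODELS, fit_bradley_terry(battles).round())))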

The platform reports 95 percent confidence intervals alongside point estimates. For models with low vote volume (newly added, niche, or low-traffic) the intervals can be 30 Elo or more, which means a 25-point apparent rank difference may not be statistically meaningful. The vote-volume threshold for inclusion on the public leaderboard is currently around 100,000 votes; models below this threshold appear in a separate "new" section without firm ranks.
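
The intervals come from resampling, in the spirit of the published method: draw bootstrap replicates of the vote log, refit, and take percentiles of the fitted ratings. A toy version, reusing the fit_bradley_terry function and battles list sketched above (the real pipeline runs over far more data):

    def bootstrap_ci(battles, n_rounds=100, seed=0):
        """Per-model 95 percent interval endpoints from bootstrap refits."""
        rng = np.random.default_rng(seed)
        fits = []
        for r in range(n_rounds):
            draw = rng.integers(0, len(battles), size=len(battles))
            fits.append(fit_bradley_terry([battles[i] for i in draw], seed=r))
        return np.percentile(np.array(fits), [2.5, 97.5], axis=0)

    lo, hi = bootstrap_ci(battles)
    for m, a, b in zip(MODELS, lo, hi):
        print(f"{m}: [{a:.0f}, {b:.0f}]")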

Two practical consequences. First, small Elo gaps between adjacently ranked models are often within noise; treat the top-5 cluster as a tie unless the gap is 50 Elo or more. Second, recently added models take weeks or months of vote accumulation to settle to a stable ranking; early ranks for new entrants tend to be high (selection bias from early-adopter users) and drift down as broader users vote.

04

Style-controlled rankings

In November 2024 the LMArena team introduced style-controlled Elo, which attempts to remove the effect of formatting and verbosity from the headline number. The motivation is real: pairwise voting reliably rewards longer responses with bullet points and markdown formatting, even when content quality is similar. Some models had developed long-form, structured response styles partly because they helped on Arena, which arguably distorted the leaderboard relative to pure content-quality preference.

Style-controlled scoring fits a logistic regression that includes response length and markdown-element counts as covariates alongside the model identity, then computes the model coefficient as the style-adjusted strength. The result is a leaderboard where length-and-formatting bonuses are netted out. Top-5 rankings typically reshuffle by 1-3 positions between the standard and style-controlled boards; the gap is largest for verbose, well-formatted models that were previously over-credited.
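
Sketched against the BTL fit above (same imports), the adjustment amounts to appending style-feature differences to the design matrix and reading off the model coefficients once the style terms have absorbed the formatting effect. The covariates below (length and markdown-count differences) are illustrative; the exact LMArena feature set differs in detail.

    def fit_style_controlled(X_model, style_a, style_b, y):
        """X_model: (+1/-1) model-identity columns, one row per battle.
        style_a/style_b: per-side style features, e.g. response length and
        markdown-element counts, shape (n_battles, n_style_features)."""
        X = np.hstack([X_model, style_a - style_b])   # style enters as a difference
        clf = LogisticRegression(fit_intercept=False, C=1e6).fit(X, y)
        n_models = X_model.shape[1]
        strengths = 400.0 / np.log(10.0) * clf.coef_[0][:n_models]
        style_effects = clf.coef_[0][n_models:]       # what length/markdown paid
        return strengths, style_effects

The style_effects coefficients are informative in their own right: a large positive length coefficient quantifies how much verbosity alone was worth on the standard board.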

We treat the style-controlled board as the more honest content-preference signal. The standard board remains useful because real users do prefer well-formatted answers; both boards are worth quoting. When the goal is to compare pure reasoning or writing quality, prefer style-controlled; when the goal is to predict user satisfaction in deployment, the standard board is closer.

05

SOTA progression May 2023 to May 2026

Chatbot Arena Elo scores have climbed steadily for three years. The top tier moved from ~1180 at launch to ~1500 in May 2026, a 320-Elo gain corresponding to a roughly 86 percent expected win-rate for the 2026 leader against the May 2023 leader (1/(1 + 10^(-320/400)) ≈ 0.86). Open-weight models have closed the gap from ~150 Elo behind in 2023 to ~80 Elo behind in 2026.

Date · Tier · Note
May 2023 · Launch; GPT-4 at the top around 1180 Elo · Initial release at lmsys.org; Vicuna and Claude in the early competition.
Feb 2024 · Top tier around 1250 Elo · GPT-4-Turbo, Claude 2.1, Gemini Pro trade leadership.
Aug 2024 · Top tier 1320 Elo · Claude 3.5 Sonnet and GPT-4o lead; open-weight gap narrows.
Feb 2025 · Top tier 1400 Elo · Frontier closed-source models cluster within 30 Elo of each other.
Oct 2025 · Top tier 1450 Elo · Style-controlled board (introduced late 2024) drives small reshuffles in the top 5.
May 2026 · Top tier 1480-1520 Elo · Top 5 within 50 Elo of each other; rebrand to LMArena.ai.

06

Strengths, limits, and gaming concerns

Strengths: largest human-preference dataset in existence; methodology peer-reviewed (ICML 2024); open-source codebase; transparent confidence intervals; style-controlled board for cleaner content signal; per-topic filters; long history enables trend analysis. The board is the closest thing the field has to a holistic preference benchmark.

Limits: the prompt distribution skews toward English-speaking technical users; conversational prompt lengths favour short-form models that may not represent long-context capability; pairwise voting is sensitive to formatting and sycophancy; and the absence of ground truth means a confidently-wrong response can win against a hedged-but-correct one if the voter does not know the answer.

Gaming concerns: vendors have at various points been caught soliciting votes from their communities, A/B testing system prompts on Arena traffic, or using anonymous-mode votes to gather competitive intelligence. The LMArena team has progressively tightened anti-gaming measures: model anonymity is enforced until the vote is cast, vote-brigading is detected automatically, and minimum-vote thresholds prevent low-volume manipulation. The remaining gaming risk is moderate but not eliminated; treat 10-30 Elo gaps as noisier than the headline number suggests.

07

When to use Chatbot Arena in 2026

Chatbot Arena is the right benchmark to quote when the question is "which model do users prefer". It is the wrong benchmark to quote when the question is "which model is most capable at task X". The two are correlated but not identical. A serious 2026 model comparison should include Arena (preference) alongside MMLU-Pro (knowledge), GPQA-Diamond (reasoning), SWE-bench Verified (engineering), and at least one agent benchmark relevant to the deployment context.

Quote both the overall and style-controlled rankings. Quote the confidence intervals. Treat top-5 clusters within 50 Elo as essentially tied unless the per-topic boards show consistent ordering. And remember the headline: this is human preference, filtered through whatever users happen to ask. It is meaningful, but it is one signal among several.

Editor's verdict · Chatbot Arena is the canonical preference benchmark and the largest human-evaluation dataset in the field. Quote the style-controlled rating alongside the headline. Pair with task-specific benchmarks; do not use Arena alone for capability claims.
Reader Questions
Q.01 · What is Chatbot Arena?
Chatbot Arena (also called LMArena) is a public leaderboard for large language models maintained by the LMSYS research group, recently rebranded as LMArena.ai. Users submit prompts and receive responses from two anonymous models. They choose the better response (or 'tie' or 'both bad'). Votes are aggregated using a Bradley-Terry-Luce model to produce an Elo rating per model. The board has accumulated tens of millions of votes since launch in May 2023, making it the largest human-preference evaluation of LLMs in existence.
Q.02 · What does Chatbot Arena measure?
Pairwise human preference under open-ended chat conditions. The prompts are anything users want to ask: code questions, creative writing, factual lookups, jokes, philosophical debates, system prompts, jailbreak attempts. There is no fixed test set; the prompts are whatever real users submit. The score reflects which model produces the more preferred response across this open-ended distribution. It is a meaningful but specific signal: not capability per se, but capability filtered through whatever users happen to ask and value.
Q.03 · How is the Elo rating calculated?
Chatbot Arena uses a Bradley-Terry-Luce model to estimate model strength from pairwise outcomes. Each vote (model A beats model B, tie, or both bad) feeds a maximum-likelihood fit over the full vote history. The output is mapped to an Elo-style number where higher is better and a 100-point difference roughly corresponds to a 64 percent win-rate expectation. Confidence intervals are reported alongside point estimates; for models with low vote volume, the intervals can be 30 points or more, which means apparent rank differences may not be statistically meaningful.
Q.04 · Should I trust Chatbot Arena rankings?
Yes for what they measure: aggregate human preference across the kind of prompts real users submit. No for inferring specific capabilities: a model that wins on Arena prompts may lose on a specific benchmark (e.g. SWE-bench Verified, GAIA Level 3) because Arena prompts are short and conversational, while real-work benchmarks measure deeper task completion. Use Arena as a holistic preference signal; combine with task-specific benchmarks for capability claims.
Q.05 · What is style-controlled Elo?
In late 2024 LMArena introduced style-controlled rankings that try to remove the effect of formatting and verbosity from the headline number. Long, well-formatted responses tend to win pairwise votes even when content quality is similar; style-controlled Elo adjusts for response length and markdown formatting before computing the rating. Frontier model rankings shuffle by a small but meaningful margin in the style-controlled board. Both numbers are useful; the style-controlled board is closer to a pure-content preference signal.
Q.06 · Can vendors game Chatbot Arena?
The honest answer is partly. Three known gaming vectors. First, response style: long responses with bullet points and emoji tend to win. Second, sycophancy: responses that agree with the user's framing tend to win. Third, vote brigading: vendors have been caught soliciting votes from their communities, which the LMArena team has progressively cracked down on. The platform now has anti-gaming measures including enforced model anonymity until the vote is cast, automated brigade detection, and minimum-vote requirements for listing. The remaining gaming risk is moderate; the headline number is broadly trustworthy at the 50-100 Elo-point granularity, less so at the 10-30 point granularity.
Related: MMLU-Pro · GPQA and ARC-AGI · SWE-bench Verified · Reasoning Benchmarks Compared · Full Benchmark Reference · LLM-as-Judge Methodology · What Benchmarks Miss

Sources

[1] Chiang, W.-L. et al. (2024). Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference. arXiv:2403.04132.
[2] LMArena leaderboard. lmarena.ai. Accessed May 2026.
[3] Li, T. et al. (2024). From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline. arXiv:2406.11939. Hard-prompts board methodology.
From the editor

Benchmarking Agents Review is published by Digital Signet, an independent firm that builds and ships AI agents in production for mid-market companies. If you are evaluating, designing, or productionising LLM agents and want a working second opinion, get in touch.
