Independent reference. No vendor affiliation. Scores cited with source and capture date.
Abstract
What: 12,723 USMLE-style multiple-choice medical questions (English subset, the most cited).
Who: Jin, Pan, Oufattole, Weng, Fang, Szolovits (2020).
2026 Tier: Frontier above 93% on MedQA-USMLE; around 68% on the harder MedXpertQA.
Paper: arxiv.org/abs/2009.13081
Section I.viii Industry Domain | Last verified April 2026

MedQA: 12,723 USMLE Questions, Med-PaLM 2 at 86.5% Accuracy

The benchmark that put medical LLM reasoning on the map. The 2026 frontier has moved on.

I

Construction

The MedQA dataset draws multiple-choice questions from the USMLE step exam practice pools and equivalent Chinese and Taiwanese licensing exams. Each question presents a clinical vignette (patient history, exam findings, lab results) and asks for the most likely diagnosis, the next investigation, or the appropriate treatment. The questions are standardised, well-grounded in clinical practice, and authored by medical educators.

Accuracy is the headline metric, computed against the gold answer. There is no partial credit and no judge. The English (USMLE) subset has 12,723 questions; reported scores almost always use this subset unless explicitly noted.
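
The scoring rule described above can be sketched in a few lines. This is a minimal illustration, not the official evaluation harness: the function name `medqa_accuracy` and the toy answer lists are hypothetical, and it assumes model outputs have already been reduced to a single option letter per question.

```python
from typing import Sequence


def medqa_accuracy(predictions: Sequence[str], gold: Sequence[str]) -> float:
    """Exact-match accuracy over option letters: 1 point or 0, no partial credit."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("no questions to score")
    # Normalise whitespace and case so "b " and "B" count as the same option.
    correct = sum(
        p.strip().upper() == g.strip().upper() for p, g in zip(predictions, gold)
    )
    return correct / len(gold)


# Hypothetical toy data: four questions, gold option letters vs. model picks.
gold_answers = ["A", "C", "B", "D"]
model_answers = ["A", "C", "D", "D"]
print(f"{medqa_accuracy(model_answers, gold_answers):.1%}")  # 75.0%
```

Because the comparison is a plain string match against the gold letter, any step that extracts the letter from a longer model response sits outside this function and can itself change the reported score.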

II

SOTA Progression

Date | Tier / Score | Note
Sep 2020 | BioBERT baseline at 38.1% | Original Jin et al. paper, MedQA launch.
Dec 2022 | Flan-PaLM 540B at 67.6% | First broadly general model to clear passing-USMLE territory.
May 2023 | Med-PaLM 2 at 86.5% (MedQA-USMLE) | Google specialised model; cleared expert-passing threshold.
Oct 2024 | Frontier general models above 90% | GPT-4o and Claude 3.5 Sonnet without medical fine-tune.
Apr 2026 | Frontier above 93% MedQA-USMLE; MedXpertQA around 68% | MedQA saturating; harder medical benchmarks now used for frontier.
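
One way to see why a saturating benchmark stops separating frontier models is to compare the remaining headroom with the sampling noise on the score. The sketch below uses the standard binomial standard error for an accuracy estimate. The two scores are the approximate April 2026 figures from the table above; `N_QUESTIONS` is a hypothetical evaluation-set size chosen for illustration, and the function names are assumptions rather than part of any benchmark toolkit.

```python
import math


def accuracy_standard_error(accuracy: float, n_questions: int) -> float:
    """Binomial standard error of an accuracy estimate over n_questions items."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be in [0, 1]")
    if n_questions <= 0:
        raise ValueError("n_questions must be positive")
    return math.sqrt(accuracy * (1.0 - accuracy) / n_questions)


def remaining_headroom(accuracy: float) -> float:
    """Fraction of questions still answered incorrectly."""
    return 1.0 - accuracy


# Hypothetical evaluation-set size, not a property of either benchmark.
N_QUESTIONS = 1000

for name, acc in [("MedQA-USMLE", 0.93), ("MedXpertQA", 0.68)]:
    # 1.96 standard errors gives an approximate 95% margin on the score.
    margin = 1.96 * accuracy_standard_error(acc, N_QUESTIONS)
    print(
        f"{name}: headroom {remaining_headroom(acc):.0%}, "
        f"95% margin +/- {margin:.1%}"
    )
```

Under these assumptions the noise margin is a sizeable share of the few points still available on MedQA-USMLE, while MedXpertQA leaves far more room for a real gap between models to show up.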
III

YMYL Reading Note

MedQA accuracy is a measure of multiple-choice exam performance. It is not a measure of clinical safety, hallucination rate, citation accuracy, or fitness for medical advice. The Google Med-PaLM 2 paper is explicit on this point: the model required clinician review for free-text consumer answers, and the 86.5% number does not imply 86.5% safe medical advice. Any deployment of an LLM for medical purposes is bound by FDA, MHRA, or EMA regulation depending on jurisdiction. A benchmark score is not a regulatory clearance.

Reader Questions
Q.01 What is MedQA?
MedQA is a benchmark of free-form multiple-choice medical questions sourced from US Medical Licensing Examination (USMLE) Step 1, Step 2 CK, and Step 3 practice materials, plus Chinese and Taiwanese equivalents. The English subset (USMLE-only, often called MedQA-USMLE) has 12,723 questions and is the most cited. The benchmark tests factual recall, diagnostic reasoning, and treatment selection.
Q.02 What is MultiMedQA?
MultiMedQA is a Google-curated suite of 7 medical question-answering datasets combined for evaluating the Med-PaLM lineage. It includes MedQA, MedMCQA, PubMedQA, MMLU clinical-topics subsets, LiveQA, MedicationQA, and HealthSearchQA. The benchmark introduced 'consumer health' free-text answers alongside the multiple-choice tests, evaluated by clinicians on 12 dimensions.
Q.03 What was the Med-PaLM 2 headline?
The Singhal et al. paper (Google, 2023) reported Med-PaLM 2 at 86.5% accuracy on MedQA-USMLE, making it the first model to clear the 85% mark often cited as expert-level performance on USMLE-style questions. The same model also scored well on MedMCQA and PubMedQA, but the team reported that it still required clinician oversight for free-text consumer answers.
Q.04 Are 2026 frontier models reliable for medical work?
Benchmark scores have risen, but real-world deployment in medicine remains subject to regulatory oversight (FDA in the US, MHRA in the UK, EMA in the EU). A high MedQA score is not a clinical safety claim. Models can still hallucinate drug interactions, miss rare presentations, and fail under prompt-template variance. Reading a model card's MedQA number as 'safe to use' is exactly the misreading that the YMYL ranking guidelines exist to prevent.
Q.05 Is MedQA at risk of saturation?
MedQA-USMLE is approaching saturation: frontier 2026 models score above 93%. The harder MedXpertQA (released 2025) keeps frontier scores in the high 60s and is the more useful frontier-discrimination benchmark in medicine today. PubMedQA is also approaching saturation and is increasingly used as a literature-comprehension probe rather than a frontier signal.

Sources

  [1] Jin et al. (2020): arxiv.org/abs/2009.13081
  [2] Singhal et al., Med-PaLM 2 (2023): arxiv.org/abs/2305.09617
  [3] Singhal et al., MultiMedQA introduction (2022): arxiv.org/abs/2212.13138
  [4] MedXpertQA (2025): arxiv.org/abs/2501.18362

From the editor

Benchmarking Agents Review is published by Digital Signet, an independent firm that builds and ships AI agents in production for mid-market companies. If you are evaluating, designing, or productionising LLM agents and want a working second opinion, get in touch.