Abstract

What: 12,723 USMLE-style multiple-choice medical questions (the English subset, the most cited).

Who: Jin, Pan, Oufattole, Weng, Fang, Szolovits (2020).

2026 Tier: Frontier above 93% on MedQA-USMLE; MedXpertQA around 68% for harder discrimination.

Paper: arxiv.org/abs/2009.13081

Section I.viii Industry Domain | Last verified April 2026

MedQA: 12,723 USMLE Questions, Med-PaLM 2 at 86.5% Accuracy

The benchmark that put medical LLM reasoning on the map. The 2026 frontier has moved on.

Construction

The MedQA dataset draws multiple-choice questions from practice pools for the USMLE Step exams and from the equivalent Chinese and Taiwanese licensing exams. Each question presents a clinical vignette (patient history, exam findings, lab results) and asks for the most likely diagnosis, the next investigation, or the appropriate treatment. The questions are standardised, grounded in clinical practice, and authored by medical educators.

Accuracy is the headline metric, computed against the gold answer. There is no partial credit and no judge. The English (USMLE) subset has 12,723 questions; reported scores almost always use this subset unless explicitly noted.
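The scoring rule described above is simple enough to sketch in a few lines. This is a minimal illustration under stated assumptions, not the official evaluation harness: the function name `medqa_accuracy` and the convention of representing answers as option letters are hypothetical choices made for the example.

```python
from typing import Sequence


def medqa_accuracy(predictions: Sequence[str], gold: Sequence[str]) -> float:
    """Exact-match accuracy over option letters: 1 for a match, 0 otherwise."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("cannot score an empty question set")
    # Normalise case and whitespace so " b" and "B" count as the same option.
    correct = sum(
        p.strip().upper() == g.strip().upper() for p, g in zip(predictions, gold)
    )
    # No partial credit and no judge: the score is just the fraction matched.
    return correct / len(gold)


# Hypothetical toy data: four questions, three answered correctly.
print(medqa_accuracy(["A", "c", "B", "D"], ["A", "C", "B", "A"]))  # 0.75
```

Because each question either matches the gold letter or does not, the only free choices in a real harness are how the model's output is mapped to an option letter and which split is scored.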

SOTA Progression

Date | Tier / Score | Note
Sep 2020 | BioBERT baseline at 36.7% | Original Jin et al. paper, MedQA launch.
Dec 2022 | Flan-PaLM 540B at 67.6% | First broadly general model to clear passing-USMLE territory.
May 2023 | Med-PaLM 2 at 86.5% (MedQA-USMLE) | Google specialised model; cleared the expert-level threshold.
Oct 2024 | Frontier general models above 90% | GPT-4o and Claude 3.5 Sonnet without medical fine-tuning.
Apr 2026 | Frontier above 93% MedQA-USMLE; MedXpertQA around 68% | MedQA saturating; harder medical benchmarks now used for frontier discrimination.


YMYL Reading Note

MedQA accuracy is a measure of multiple-choice exam performance. It is not a measure of clinical safety, hallucination rate, citation accuracy, or fitness for medical advice. The Google Med-PaLM 2 paper is explicit on this point: the model required clinician review for free-text consumer answers, and the 86.5% figure does not mean that 86.5% of its medical advice is safe. Any deployment of an LLM for medical purposes is bound by FDA, MHRA, or EMA regulation depending on jurisdiction. A benchmark score is not a regulatory clearance.


Reader Questions

Q.01 What is MedQA?

MedQA is a benchmark of multiple-choice medical questions sourced from US Medical Licensing Examination (USMLE) Step 1, Step 2 CK, and Step 3 practice materials, plus Chinese and Taiwanese equivalents. The English subset (USMLE-only, often called MedQA-USMLE) has 12,723 questions and is the most cited. The benchmark tests factual recall, diagnostic reasoning, and treatment selection.

Q.02 What is MultiMedQA?

MultiMedQA is a Google-curated suite of 7 medical question-answering datasets combined for evaluating the Med-PaLM lineage. It includes MedQA, MedMCQA, PubMedQA, MMLU clinical-topics subsets, LiveQA, MedicationQA, and HealthSearchQA. The benchmark introduced 'consumer health' free-text answers alongside the multiple-choice tests, evaluated by clinicians on 12 dimensions.

Q.03 What was the Med-PaLM 2 headline?

The Singhal et al. paper (Google, 2023) reported Med-PaLM 2 at 86.5% accuracy on MedQA-USMLE, the first model to clear the 85% mark often cited as expert-level performance, well above the roughly 60% usually quoted as an approximate USMLE passing score. The same model scored well on MedMCQA and PubMedQA, but the team reported that it required clinician oversight for free-text consumer answers.

Q.04 Are 2026 frontier models reliable for medical work?

Benchmark scores have risen, but real-world deployment in medicine remains subject to regulatory oversight (FDA in the US, MHRA in the UK, EMA in the EU). A high MedQA score is not a clinical safety claim. Models can still hallucinate drug interactions, miss rare presentations, and fail under prompt-template variance. Reading a model card's MedQA number as 'safe to use' is exactly the misreading that the YMYL ranking guidelines exist to prevent.

Q.05 Is MedQA at risk of saturation?

MedQA-USMLE is approaching saturation: frontier 2026 models score above 93%. The harder MedXpertQA (released January 2025) keeps frontier scores in the high 60s and is the more useful frontier-discrimination benchmark in medicine today. PubMedQA is also approaching saturation and is increasingly used as a literature-comprehension probe rather than a frontier signal.
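One way to see why a saturating benchmark stops discriminating is to compare the remaining headroom with the sampling uncertainty of an accuracy score. The sketch below uses the standard Wilson score interval for a binomial proportion. The helper name `wilson_interval` and the evaluation-set size of 1,000 questions are assumptions chosen for illustration, not figures from the benchmark papers.

```python
import math
from typing import Tuple


def wilson_interval(accuracy: float, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must lie in [0, 1]")
    denom = 1 + z**2 / n
    centre = (accuracy + z**2 / (2 * n)) / denom
    half = z * math.sqrt(accuracy * (1 - accuracy) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


N_QUESTIONS = 1000  # hypothetical evaluation-set size, for illustration only

# Scores taken from the text: MedXpertQA around 68%, MedQA-USMLE above 93%.
for name, acc in (("MedXpertQA", 0.68), ("MedQA-USMLE", 0.93)):
    low, high = wilson_interval(acc, N_QUESTIONS)
    print(f"{name}: accuracy={acc:.2f} "
          f"interval=({low:.3f}, {high:.3f}) headroom={1 - acc:.2f}")
```

Under these assumptions the interval at 93% is a few points wide while only seven points of headroom remain, so two frontier models can differ by less than the noise. At 68% the headroom is far larger relative to the interval, which is the sense in which the harder benchmark still separates models.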

Sources

[1] Jin et al. (2020): arxiv.org/abs/2009.13081
[2] Singhal et al., Med-PaLM 2 (2023): arxiv.org/abs/2305.09617
[3] MultiMedQA introduction (Singhal et al. 2022): arxiv.org/abs/2212.13138
[4] MedXpertQA (2025): arxiv.org/abs/2501.18362