MedQA: 12,723 USMLE Questions, Med-PaLM 2 at 86.5% Accuracy
The benchmark that put medical LLM reasoning on the map. The 2026 frontier has moved on.
Construction
The MedQA dataset draws multiple-choice questions from the USMLE step exam practice pools and equivalent Chinese and Taiwanese licensing exams. Each question presents a clinical vignette (patient history, exam findings, lab results) and asks for the most likely diagnosis, the next investigation, or the appropriate treatment. The questions are standardised, well-grounded in clinical practice, and authored by medical educators.
Accuracy is the headline metric: the fraction of questions where the model's chosen option matches the gold answer. There is no partial credit and no judge model. The English (USMLE) subset has 12,723 questions; reported scores almost always refer to this subset, typically its held-out test split, unless noted otherwise.
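As a rough illustration of that scoring rule, here is a minimal Python sketch of exact-match accuracy over multiple-choice items. The `Item` schema, the field names, and the toy data are assumptions made for this example; they are not the official MedQA file format or any paper's evaluation harness.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    """One multiple-choice question (hypothetical schema, not the official one)."""
    question: str            # clinical vignette plus the question stem
    options: dict[str, str]  # option letter -> option text
    answer: str              # gold option letter


def accuracy(items: list[Item], predictions: list[str]) -> float:
    """Exact-match accuracy: a prediction scores 1 only if it equals the gold letter."""
    if len(items) != len(predictions):
        raise ValueError("items and predictions must have the same length")
    if not items:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(
        pred.strip().upper() == item.answer.strip().upper()
        for item, pred in zip(items, predictions)
    )
    return correct / len(items)


# Toy items invented for illustration only; they are not real MedQA questions.
toy_options = {"A": "option a", "B": "option b", "C": "option c", "D": "option d"}
toy_items = [
    Item("Toy vignette 1", toy_options, "A"),
    Item("Toy vignette 2", toy_options, "C"),
    Item("Toy vignette 3", toy_options, "D"),
    Item("Toy vignette 4", toy_options, "D"),
]
toy_predictions = ["A", "c", "B", "D"]  # third prediction is wrong

print(accuracy(toy_items, toy_predictions))  # 0.75
```

Because the comparison is on the option letter alone, a model that reasons correctly but names the wrong letter scores zero for that item, which is what "no partial credit" means in practice.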
SOTA Progression
YMYL Reading Note
MedQA accuracy measures multiple-choice exam performance. It does not measure clinical safety, hallucination rate, citation accuracy, or fitness for medical advice. The Google Med-PaLM 2 paper is explicit on this point: the model required clinician review for free-text consumer answers, and the 86.5% figure does not mean that 86.5% of its medical advice is safe. Any deployment of an LLM for medical purposes is bound by FDA, MHRA, or EMA regulation, depending on jurisdiction. A benchmark score is not a regulatory clearance.
Q.01 What is MedQA?
Q.02 What is MultiMedQA?
Q.03 What was the Med-PaLM 2 headline?
Q.04 Are 2026 frontier models reliable for medical work?
Q.05 Is MedQA at risk of saturation?
Sources
- [1] Jin et al. (2020): arxiv.org/abs/2009.13081
- [2] Singhal et al., Med-PaLM 2 (2023): arxiv.org/abs/2305.09617
- [3] Singhal et al., MultiMedQA introduction (2022): arxiv.org/abs/2212.13138
- [4] MedXpertQA (2025): arxiv.org/abs/2501.18362