GPT versus resident physicians — a benchmark based on official board scores.
Before large language models (LLMs) can be integrated into clinical care, they must be shown to perform at least as well as physicians. This study compared two publicly available GPT models with official physician scores on the Israeli board residency examinations in five core medical disciplines: internal medicine, general surgery, pediatrics, psychiatry, and obstetrics and gynecology (OB/GYN). GPT-4 performance was comparable to that of physicians taking the exam, whereas GPT-3.5 did not reach passing levels on any of the five exams.