Evaluation and mitigation of the limitations of large language models in clinical decision-making.
Researchers, clinicians, and other stakeholders are hopeful that integrating artificial intelligence and large language models (LLMs) into clinical practice can improve patient safety and reduce clinician burden. This study used 2,400 real patient cases to test several LLMs' ability to correctly diagnose common abdominal complaints. Each LLM performed significantly worse than physicians, did not follow treatment or diagnostic guidelines, could not interpret laboratory results, and often failed to follow instructions.