For more than sixty years, the test for whether a computer could think like a doctor has stayed roughly the same: hand it a tough medical case and see if it can figure out what’s wrong. Generations of software, from primitive programs in the 1950s to rule-based systems in the 1980s, tried and mostly fell short. Now, a landmark study published in Science finds that an advanced AI model has vaulted over that bar, outperforming hundreds of real physicians at diagnosing patients and recommending next steps across multiple experiments, including on messy, real-world cases pulled straight from a busy emergency room.
Led by researchers at Harvard Medical School, Beth Israel Deaconess Medical Center, Stanford University, and other institutions, the study tested a preview version of OpenAI’s o1-series model across six experiments designed to measure the kind of thinking doctors do every day: generating a list of possible diagnoses, choosing the right test to order, estimating how likely a disease is, and deciding on a treatment plan. Across all experiments, the AI generally met or beat the performance of human physicians, sometimes by wide margins.
None of that happened in a vacuum. Diagnostic errors are a persistent and costly problem in American medicine, and the pressure on emergency physicians to make fast, high-stakes decisions with limited information is well-documented. Against that backdrop, an AI model that can reliably keep pace with or exceed trained clinicians carries real weight.
Perhaps the most eye-opening result came from the emergency room portion of the study. Rather than relying only on polished textbook-style cases, the researchers pulled 76 real patient encounters from the Beth Israel Deaconess Medical Center emergency department in Boston and had the AI go head-to-head with two attending physicians. Crucially, the physicians grading the results were kept in the dark about which answers came from a human and which came from a machine. The blinding worked well: one evaluator correctly identified the source only 15% of the time, while the other managed just 3%.
Researchers created three snapshots of each patient’s journey through the ER: the initial intake notes, the evaluation by an emergency physician, and the information available at the time of hospital admission. At each checkpoint, the AI and the two attending physicians each produced a list of up to five possible diagnoses; two separate physicians then graded those lists.
At the earliest and most information-scarce point, the AI identified the correct or very close diagnosis about 67% of the time. The two attending physicians hit that mark roughly 55% and 50% of the time, respectively. As more information became available later, everyone improved, but the AI maintained its lead throughout. By the time of hospital admission, the AI was correct in about 82% of cases, compared with roughly 79% and 70% for the two physicians. Its biggest advantage appeared at the exact moment when getting the call right matters most: when a patient first walks through the door.
Much of the study was built around a set of famously difficult diagnostic puzzles published by the New England Journal of Medicine. These cases have been used to test diagnostic tools since the 1950s. After running 143 of them through the AI, the model included the correct diagnosis somewhere in its list of possibilities in about 78% of cases. Its very first guess was correct 52% of the time. When graders also counted diagnoses that were very close or would have been clinically helpful, not just exact matches, the accuracy climbed to nearly 98%.
In a head-to-head comparison on 70 of those cases, the newer o1-series model landed the exact or very close diagnosis in about 89% of cases, compared with roughly 73% for GPT-4 in a previous study.
Beyond naming diseases, the AI proved capable of selecting next steps. When asked to pick the next diagnostic test a doctor should order, it chose correctly in about 88% of cases. In another 11%, its suggestion was judged helpful even if not the exact test used in the actual case. Only about 1.5% of its recommendations were deemed unhelpful.