A new research paper published in Science suggests that artificial intelligence might soon become a vital asset in high-pressure medical environments. Led by a team from Harvard Medical School and Beth Israel Deaconess Medical Center, the study found that certain large language models can diagnose emergency room patients with higher accuracy than human physicians.
The Experiment: Man vs. Machine
The researchers analyzed 76 real-world cases from the Beth Israel Deaconess emergency department. They gave OpenAI’s o1 and GPT-4o models the same raw data available in the electronic medical records, without any pre-processing, and compared the models’ performance against that of two internal medicine attending physicians.
To ensure objectivity, the results were evaluated by a separate panel of doctors who did not know which diagnoses came from humans and which from the models. The o1 model emerged as the top performer, particularly during the initial “triage” phase, when patient information is most limited and the pressure to decide is highest.
By the Numbers
- OpenAI o1: correct or near-correct diagnosis in 67% of triage cases.
- Human Physician A: 55% of triage cases.
- Human Physician B: 50% of triage cases.
Crucial Context and Limitations
Despite the impressive statistics, the research team and outside experts emphasize that AI is not ready to take the lead in clinical settings.
- Specialty Mismatch: Critics, including emergency physician Kristen Panthagani, pointed out that the study compared AI to internal medicine doctors rather than ER specialists. In an ER setting, the primary goal is often ruling out life-threatening conditions rather than pinpointing a final diagnosis.
- Text-Only Data: The models were limited to text-based information. They cannot yet reason effectively over non-text inputs such as physical exam findings or visual cues.
- Accountability: Lead author Adam Rodman noted that there is currently no formal framework for AI accountability in healthcare.
While the study highlights an “urgent need” for real-world trials, the authors note that patients still overwhelmingly prefer humans to guide them through challenging, life-or-death treatment decisions.
