Dr. Adam Rodman, an internal medicine expert at Beth Israel Deaconess Medical Center in Boston, initially believed artificial intelligence (AI) chatbots like ChatGPT-4 would revolutionize medical diagnostics. However, a study he co-designed revealed surprising results: while ChatGPT-4 outperformed doctors in diagnostic accuracy, physicians struggled to fully harness the technology’s capabilities.
In the study, published in JAMA Network Open, ChatGPT achieved an average score of 90% when diagnosing complex medical cases. Doctors with access to ChatGPT scored 76%, only slightly better than those who didn’t use it, who averaged 74%. Dr. Rodman admitted his surprise, saying, “I was shocked.”
The Study Design
The research involved 50 doctors, including residents and attending physicians from major U.S. hospitals. Participants were tasked with diagnosing six medical cases based on real-life scenarios, with graders unaware of whether answers came from a doctor, a doctor using ChatGPT, or ChatGPT alone.
The cases, selected from a set of 105 used since the 1990s, were complex but not exceedingly rare. For example, one case involved a 76-year-old man experiencing severe back and leg pain following a coronary artery procedure. His symptoms and lab results pointed to cholesterol embolism, a diagnosis the chatbot correctly identified.
Doctors were asked to propose three possible diagnoses for each case, cite the evidence for and against each, name a final diagnosis, and recommend additional diagnostic steps. ChatGPT consistently excelled on this rubric, outscoring its human counterparts.
Why Doctors Struggled with ChatGPT
Despite the chatbot’s capabilities, doctors often ignored or underutilized it. Logs of the interactions revealed that physicians frequently dismissed ChatGPT’s suggestions when they contradicted their initial diagnoses. This pattern, often called diagnostic anchoring, is the tendency to stick with an early impression even as new evidence points elsewhere, and it is compounded by a broader human tendency to overestimate one’s own correctness.
“They didn’t listen to AI when it told them things they didn’t agree with,” Dr. Rodman observed.
Another issue was a limited understanding of how to use the chatbot effectively. Many doctors treated it like a basic search engine, asking narrowly focused questions such as, “What are possible diagnoses for eye pain?” Only a minority realized they could paste the entire case history into ChatGPT and ask for a comprehensive diagnostic assessment.
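For readers curious what “pasting the entire case” looks like outside the chat window, here is a minimal sketch using the OpenAI Python SDK. The model name, prompt wording, and abbreviated case text are illustrative assumptions, not the study’s actual protocol, which used the ChatGPT interface itself.

    # A minimal sketch, not the study's protocol: send a full case history to a
    # GPT-4-class model in one request instead of asking narrow, search-style questions.
    # The model name, prompt wording, and truncated case text are assumptions.
    from openai import OpenAI

    client = OpenAI()  # reads the OPENAI_API_KEY environment variable

    case_history = (
        "76-year-old man with severe back and leg pain after a coronary artery "
        "procedure; the full history, exam findings, and lab results would go here."
    )

    prompt = (
        "Here is a complete case history. Suggest three possible diagnoses, "
        "list the evidence for and against each, name a final diagnosis, and "
        "recommend next diagnostic steps.\n\n" + case_history
    )

    response = client.chat.completions.create(
        model="gpt-4",  # illustrative choice of a GPT-4-class model
        messages=[{"role": "user", "content": prompt}],
    )

    print(response.choices[0].message.content)

The point of the sketch is the prompt structure: the whole case and the full diagnostic task go into a single message, mirroring what the minority of study participants did inside ChatGPT itself.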
Lessons from History: AI in Medicine
Efforts to develop AI tools for medical diagnosis date back 70 years. One notable project, INTERNIST-1, emerged in the 1970s from the University of Pittsburgh. Designed by computer scientists and Dr. Jack Myers, a renowned diagnostician, the program cataloged over 500 diseases and 3,500 symptoms.
Despite its accuracy, INTERNIST-1 failed to gain widespread adoption due to its complexity and the time required to input data. Doctors also found it challenging to trust the system. As Dr. Rodman explained, “It’s not just that it has to be user-friendly, but doctors had to trust it.”
This mistrust of AI persisted into the 1990s, when various diagnostic programs were developed but none achieved widespread clinical use. The advent of large language models like ChatGPT, however, has shifted the conversation. Unlike those earlier systems, ChatGPT does not attempt to mimic human reasoning; it excels at processing and interpreting language, and that is what allows it to generate diagnostic insights.
AI as a “Doctor Extender”
Dr. Rodman and other researchers believe tools like ChatGPT can serve as “doctor extenders,” offering valuable second opinions and enhancing diagnostic accuracy. However, realizing this potential requires doctors to embrace the technology and integrate it into their workflows effectively.
Dr. Jonathan Chen, a physician and computer scientist at Stanford and co-author of the study, emphasized the importance of the chatbot’s user-friendly interface. “We can pop a whole case into the computer,” he noted. “Before a couple of years ago, computers did not understand language.”
Even so, many doctors failed to exploit that capability. As Dr. Chen observed, only a fraction of participants took advantage of ChatGPT’s ability to analyze an entire case history at once.
Overcoming Resistance and Building Trust
The study underscores the need for training and mindset shifts among doctors. Laura Zwaan, a diagnostic error expert at Erasmus Medical Center in Rotterdam, noted that overconfidence often prevents people from reconsidering their decisions, even when presented with compelling evidence.
By learning to trust AI’s diagnostic capabilities and recognizing its strengths, doctors can better integrate these tools into clinical practice. The study serves as a reminder that while AI holds immense promise, its success depends on human users’ willingness to adapt and collaborate with the technology.
For now, AI systems like ChatGPT offer a glimpse of what’s possible in medical diagnostics. To fully unlock their potential, the medical community must address knowledge gaps, build trust, and foster a more collaborative approach between humans and machines.