A new peer-reviewed study has raised concerns about the reliability of OpenAI’s health-focused chatbot, finding that it frequently underestimated the urgency of serious medical emergencies.
The research, published last week in Nature Medicine, evaluated the triage performance of ChatGPT Health — a specialized version of ChatGPT designed for health-related questions. Researchers found that the system “under-triaged” more than half of emergency scenarios, advising delayed care when immediate treatment was necessary.
Underestimating Critical Cases
In the study, researchers presented the chatbot with 60 real-world medical cases, each with 16 demographic variations altering factors such as race and gender. The variations were constructed to ensure that the clinical urgency remained identical across versions.
Three physicians independently reviewed the same cases and assessed their urgency using established medical guidelines.
The results showed that ChatGPT Health underestimated the urgency of 51.6% of the cases that doctors classified as emergencies. Instead of directing patients to the emergency room, the chatbot often recommended seeking medical care within 24 to 48 hours.
Among the misclassified cases were life-threatening conditions such as diabetic ketoacidosis and impending respiratory failure — both of which require immediate intervention.
“Any clinician would recognize these as emergencies,” said lead study author Dr. Ashwin Ramaswamy of The Mount Sinai Hospital in New York. He noted that in some instances, the chatbot appeared to wait for symptoms to become “undeniable” before advising emergency care.
By contrast, classic stroke symptoms were correctly identified as emergencies in every instance tested.
Over-Triage of Minor Conditions
The researchers also found that ChatGPT Health frequently erred in the opposite direction.
In 64.8% of non-urgent scenarios, the chatbot advised seeking medical care when home treatment would have been sufficient. For example, it recommended a doctor’s visit within 24 to 48 hours for a sore throat lasting three days — a situation typically managed with rest and over-the-counter remedies.
The inconsistent pattern of underestimating serious cases while overreacting to minor ones led researchers to describe the chatbot’s decision-making as “paradoxical.”
Inconsistent Crisis Referrals
The study also evaluated how the chatbot handled mental health crises.
When users express suicidal intent, ChatGPT systems are programmed to refer them to 988, the U.S. Suicide and Crisis Lifeline. However, researchers found that ChatGPT Health sometimes referred users to crisis resources unnecessarily — and in other instances failed to provide the referral when appropriate.
An OpenAI spokesperson said the company welcomes independent research but argued that the study does not reflect how ChatGPT Health is designed to be used. The system, the spokesperson said, is built around follow-up questions and interactive dialogue rather than one-time responses to isolated medical prompts.
OpenAI also emphasized that ChatGPT Health is not intended to diagnose or treat medical conditions and is currently available to a limited group of users while further safety improvements are underway.
Growing Reliance on AI for Health Questions
The findings come as AI tools become increasingly integrated into healthcare conversations. OpenAI reports that tens of millions of people worldwide use ChatGPT for health-related inquiries, with a significant share of questions submitted outside traditional clinic hours or from locations far from medical facilities.
Experts say accessibility may explain why patients are turning to AI.
“You can ask unlimited follow-up questions and upload documents,” Ramaswamy said. “People want not just answers, but a kind of digital medical partner.”
Still, clinicians caution that AI systems are not substitutes for professional medical judgment.
Dr. John Mafi, a primary care physician at UCLA Health who was not involved in the study, said tools that influence medical decisions should undergo rigorous controlled trials before widespread deployment.
“There’s a major difference between passing a medical exam and practicing medicine safely,” he said.
Researchers also warn that AI systems can reflect biases in user input or training data. Large language models may unintentionally reinforce misconceptions by agreeing with flawed assumptions presented by users.
Not a Replacement for Doctors
While some experts believe AI chatbots can assist with general health education or administrative guidance, they stress that patients should not rely on them during emergencies.
The study’s authors recommend that AI health tools be used alongside — not instead of — licensed medical professionals. As AI adoption accelerates, they say collaboration between healthcare providers and technology developers will be essential to ensure patient safety.
For now, researchers advise caution: when symptoms suggest a serious condition, human medical evaluation remains the safest course of action.