A new study published in BMJ Open finds that AI chatbots provide misleading or problematic health advice roughly half the time when people ask them medical questions. The study examined five chatbots: ChatGPT, Gemini, Grok, Meta AI, and DeepSeek. Each was asked 50 health and medical questions spanning cancer, vaccines, stem cells, nutrition, and athletic performance.
The researchers found a significant gap between how these tools perform on benchmarks and how they handle real-world patient symptoms.
“These findings highlight the difficulty of building AI systems that can genuinely support people in sensitive, high-stakes areas like health,” said Dr Rebecca Payne, a GP and the lead medical practitioner on the study (Nuffield Department of Primary Care Health Sciences, University of Oxford, and Bangor University). “Despite all the hype, AI just isn’t ready to take on the role of the physician. Patients need to be aware that asking a large language model about their symptoms can be dangerous: it can give wrong diagnoses and fail to recognize when urgent help is needed.”
According to the study, nearly 50 percent of answers to common health queries were flagged as problematic or inaccurate. Around 30 percent of responses lacked the full context necessary for safe medical decisions.
Some chatbots gave equal weight to unscientific treatments and standard care, effectively “legitimizing” unproven alternatives. Users often don’t provide the specific details the AI needs, and the chatbots rarely ask clarifying questions before giving advice.
“The disconnect between benchmark scores and real-world performance should be a wake-up call for AI developers and regulators,” said Adam Mahdi, senior author and associate professor. “Our recent work on construct validity in benchmarks shows that many evaluations fail to measure what they claim to measure, and this study demonstrates exactly why that matters. We cannot rely on standardized tests alone to determine whether these systems are safe for public use. Just as we require clinical trials for new medications, AI systems need rigorous testing with diverse, real users to understand their true capabilities in high-stakes settings like healthcare.”