AI Models in Healthcare: Experts Address Hallucinations and Risk
In a significant development for the integration of artificial intelligence in healthcare, a consortium of experts is calling for comprehensive rules to govern the use of machine learning in medical settings. The concern stems from the tendency of AI models to ‘hallucinate,’ which means they can confidently generate incorrect information. This issue has prompted researchers to assess the risks and formulate strategies to ensure patient safety while still leveraging the benefits of AI assistance.
These experts, representing academic institutions, healthcare organizations, and a major web search and advertising company, have collaborated to analyze medical hallucinations in mainstream foundation models. Their goal is to develop better guidelines for how medical professionals can safely use AI in their practice, and the research emphasizes the need to deploy harm-mitigation strategies. Hallucinations here refer to instances where AI models generate incorrect but domain-specific outputs that appear logically coherent, which makes them harder to detect without advanced medical expertise.
Their work is published in an academic paper titled “Medical Hallucinations in Foundation Models and Their Impact on Healthcare,” supported by a corresponding GitHub repository. The study proceeds from the premise that foundation models, which are massive neural networks trained on large datasets, offer substantial opportunities to improve healthcare. However, the potential for errors requires careful management.
Taxonomy of Medical Hallucinations
The researchers have created a taxonomy specific to medical hallucinations, distinguishing errors in medical contexts from AI-generated errors in less consequential settings. A key feature of medical hallucinations is that they are not random; they arise within specialized functions that directly affect patient care, such as diagnosis and treatment. Because these hallucinations often use medical terminology and present their conclusions in a seemingly logical way, healthcare providers can find it difficult to spot the errors without focused attention.
The paper’s taxonomy, presented as a pie chart, groups medical hallucinations into five categories: Factual Errors; Outdated References; Spurious Correlations; Fabricated Sources or Guidelines; and Incomplete Chains of Reasoning. The research also assessed how frequently these hallucinations occur. The team evaluated the clinical reasoning ability of several general-purpose large language models (LLMs), including o1, gemini-2.0-flash-exp, gpt-4o, gemini-1.5-flash, and Claude-3.5 sonnet, on specific diagnostic tasks: chronological ordering of events, lab data interpretation, and differential diagnosis generation. Each output was rated on a risk scale from 0 (No Risk) to 5 (Catastrophic), as sketched below.
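To make the taxonomy and rating scale concrete, here is a minimal Python sketch of how they might be represented in an evaluation pipeline. This is an illustration, not code from the paper’s repository; the class and field names are assumptions, and since the article names only the endpoints of the 0–5 risk scale (No Risk, Catastrophic), the intermediate labels are placeholders.

```python
from dataclasses import dataclass
from enum import Enum, IntEnum
from typing import Optional


class HallucinationType(Enum):
    """The five categories from the paper's taxonomy."""
    FACTUAL_ERROR = "factual error"
    OUTDATED_REFERENCE = "outdated reference"
    SPURIOUS_CORRELATION = "spurious correlation"
    FABRICATED_SOURCE = "fabricated source or guideline"
    INCOMPLETE_REASONING = "incomplete chain of reasoning"


class RiskScore(IntEnum):
    """Clinical risk scale. Only 0 and 5 are named in the article;
    the intermediate labels here are illustrative placeholders."""
    NO_RISK = 0
    MINIMAL = 1
    MILD = 2
    MODERATE = 3
    SEVERE = 4
    CATASTROPHIC = 5


@dataclass
class AnnotatedOutput:
    """A single model response labeled by a clinical reviewer."""
    model: str                                  # e.g. "gpt-4o" or "o1"
    task: str                                   # e.g. "chronological ordering"
    hallucination: Optional[HallucinationType]  # None if no hallucination found
    risk: RiskScore
```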
Findings and Implications
The results revealed that the diagnosis prediction task generally had the lowest hallucination rates, ranging from 0 to 22 percent across the different models. Tasks that required factual recall and temporal reasoning, such as chronological ordering and lab data interpretation, showed significantly higher hallucination frequencies. This runs counter to the common assumption that diagnostic tasks, which demand complex inference, would be where LLMs are least proficient.
The findings suggest that current LLMs may excel at pattern recognition and diagnostic inference, but remain weaker at extracting and synthesizing detailed factual information from clinical data, particularly when chronological and temporal relationships must be tracked.
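As a rough illustration of how such per-task rates can be tabulated, the following sketch aggregates reviewer annotations (using the AnnotatedOutput records from the earlier sketch) into a hallucination rate per model and task. It is a simplified assumption about the workflow, not the authors’ actual evaluation code.

```python
from collections import defaultdict


def hallucination_rates(outputs):
    """Return the fraction of annotated outputs containing a
    hallucination, grouped by (model, task)."""
    tallies = defaultdict(lambda: [0, 0])  # (model, task) -> [hallucinated, total]
    for out in outputs:
        key = (out.model, out.task)
        tallies[key][1] += 1
        if out.hallucination is not None:
            tallies[key][0] += 1
    return {key: hits / total for key, (hits, total) in tallies.items()}


# Hypothetical usage:
#   rates = hallucination_rates(reviewed_outputs)
#   rates[("o1", "differential diagnosis")]  # e.g. 0.04, i.e. a 4 percent rate
```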
Among the general-purpose models tested, Anthropic’s Claude-3.5 and OpenAI’s o1 showed the lowest hallucination rates, suggesting that higher-performing models hold real promise for diagnostic inference. Even so, errors still occur, so close monitoring and human oversight remain necessary for clinical tasks.
The study also included a survey of 75 medical practitioners to understand how AI tools are currently used in practice. Many respondents had already integrated these tools into their work: 40 used them daily, 9 several times a week, 13 a few times a month, and 13 rarely or never. Thirty respondents expressed high levels of trust in the output of AI models. Strikingly, despite this widespread adoption, 91.8 percent of the practitioners surveyed reported having encountered a medical hallucination in their clinical practice, and 84.7 percent judged that such hallucinations could affect patient health.
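For context, converting the reported usage counts into shares of the 75 respondents is straightforward; the quick calculation below assumes the counts quoted above.

```python
# Usage-frequency counts reported in the survey of 75 practitioners.
counts = {
    "daily": 40,
    "several times a week": 9,
    "a few times a month": 13,
    "rarely or never": 13,
}
total = sum(counts.values())  # 75 respondents
shares = {label: round(100 * n / total, 1) for label, n in counts.items()}
# -> {'daily': 53.3, 'several times a week': 12.0,
#     'a few times a month': 17.3, 'rarely or never': 17.3}
```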
Need for Regulation
The researchers conclude by emphasizing the urgent need for regulation and for clarity about legal liability when AI-related errors occur. As the technology continues to advance, the paper raises pointed questions: if an AI model provides misleading diagnostic information, who should be held liable? Should responsibility fall on the AI developer, the healthcare provider, or the institution? Given the demands of patient safety and accountability, the researchers call for regulatory frameworks to govern the use of AI in healthcare.