Are AI Healthcare Tools Living Up to the Hype?
Artificial intelligence is rapidly transforming healthcare, with algorithms now integrated into areas such as breast cancer screenings, clinical note-taking, and even virtual nurse applications. Companies tout these tools as solutions to improve efficiency and reduce the strain on healthcare professionals. However, some experts are raising questions about the reliability and effectiveness of these AI systems, suggesting that current testing methods may not accurately reflect real-world performance.
One of the primary concerns revolves around the use of large language models (LLMs) in healthcare. These models, trained on vast amounts of text data, are designed to generate human-like text and perform various tasks. But, according to critics, the assessments of LLM capabilities in the medical field often rely on tests like medical student exams, which may not be indicative of how these systems will perform in real-world clinical settings.
In a review of healthcare AI models, specifically LLMs, it was found that only 5% of studies used actual patient data. Furthermore, most studies assessed the models by asking questions that tested medical knowledge, rather than evaluating the LLMs’ capacity to perform tasks such as writing prescriptions, summarizing conversations, or engaging in doctor-patient dialogues.
Computer scientist Deborah Raji and her colleagues, writing in the New England Journal of Medicine AI, argue that these benchmarks fall short. The tests, according to Raji, cannot fully measure clinical abilities because they don’t account for the subtleties and complexities of real-world cases that require nuanced decision-making, and they are not flexible in what they measure. Moreover, she says, because the tests are based on the knowledge of physicians, they do not fairly represent the skills and knowledge of nurses or other medical staff.
“A lot of expectations and optimism people have for these systems were anchored to these medical exam test benchmarks,” says Raji, who studies AI auditing and evaluation at the University of California, Berkeley. “That optimism is now translating into deployments, with people trying to integrate these systems into the real world and throw them out there on real patients.”
She and her colleagues emphasize the need for new evaluations that test the performance of LLMs in complex and diverse clinical situations.
Current Benchmarks: Falling Short
In an interview, Raji discussed why the current benchmark tests are inadequate.
Raji: “These benchmarks are not indicative of the types of applications people are aspiring to, so the whole field should not obsess about them in the way they do and to the degree they do. This is not a new problem or specific to health care. This is something that exists throughout machine learning, where we put together these benchmarks and we want it to represent general intelligence or general competence at this particular domain that we care about. But we just have to be really careful about the claims we make around these datasets. The further the representation of these systems is from the situations in which they are actually deployed, the more difficult it is for us to understand the failure modes these systems hold. These systems are far from perfect.
Sometimes they fail on particular populations, and sometimes, because they misrepresent the tasks, they don’t capture the complexity of the task in a way that reveals certain failures in deployment. This sort of benchmark bias issue, where we make the choice to deploy these systems based on information that doesn’t represent the deployment situation, leads to a lot of hubris.”
Creating Better Evaluations
To address these shortcomings, Raji proposes several strategies for creating more accurate evaluations of AI models:
- Gathering Real-World Data: Interviewing healthcare professionals about their typical workflows and collecting real-life examples of how they interact with the models. This includes the different types of queries used and the responses generated.
- “Red Teaming”: Collaborating to form a group of people who can challenge a model through adversarial prompting to uncover weaknesses.
- Hospital Data: Analyzing anonymized patient data and hospital workflow patterns to develop future benchmarks.
Raji also stresses the importance of specialized benchmark testing. For example, testing focused on question-answering and knowledge recall will be fundamentally different from testing the model’s capacity to summarize doctors’ notes or answer questions based on uploaded data.
Policies and Frameworks for Evaluation
To create effective AI evaluations, Raji calls for:
- Research Investment: A commitment from researchers to create useful benchmarks and evaluation methods that better reflect how AI models are expected to perform.
- Transparency: Increased transparency at the institutional level. Hospitals and other healthcare facilities should openly share information about the AI products they use in their clinical practice, as well as information about their workflows.
- Vendor Transparency: Vendors should share information about their current evaluation methods and benchmarks to help identify gaps between current practices and more realistic approaches.
Advice for AI Model Users
Raji advises that the healthcare field focus on developing evaluations that more accurately reflect the intended use of these models. While medical exams are easily accessible data sets, they’re not fully representational of the models’ application. She encourages researchers to be more thoughtful in designing tests that reveal the models’ true value once they are deployed. “I would challenge the field to be a lot more thoughtful and to pay more attention to really constructing valid representations of what we hope the models do and our expectations for these models once they are deployed.”