OpenAI Launches HealthBench to Evaluate AI Healthcare Responses
OpenAI has launched HealthBench, a comprehensive dataset designed to test and improve AI healthcare responses. The dataset includes 5,000 health conversations and more than 57,000 criteria to assess the quality and safety of AI-generated health information.
HealthBench is considered a significant step forward in AI healthcare evaluation. “Our mission as OpenAI is to ensure AI is beneficial to humanity,” said a spokesperson. “HealthBench provides a worthwhile target for model improvements for months to come.”
The dataset was created with the help of 300 doctors from five countries who evaluated AI responses based on various criteria. Experts believe that while HealthBench is a valuable tool, it still requires further analysis to ensure its effectiveness.
“One part of what OpenAI has done is they’ve provided this in a scalable way from a really big, reputable brand that’s going to enable people to use this very easily,” said Dr. Raj Rawani, a healthcare AI researcher at MedStar Health. But models performed poorly in areas such as contextual awareness and completeness, experts said.
OpenAI tested its own models alongside those from Google, Meta, and xAI. The results showed that while some models performed well in certain areas, they struggled with more complex health-related queries.
The development of HealthBench highlights the ongoing challenges in creating reliable AI healthcare systems. “In sensitive contexts like healthcare, where we’re discussing life and death, that’s unacceptable,” said Girish Nadkarni, a healthcare AI expert at Mount Sinai. “It may hide errors shared by both model and grader.”
Others called for more reviews to ensure models work well in different countries and demographics. “HealthBench improves healthcare evaluation but still needs subgroup analysis and wider human review before it can support safety claims,” Nadkarni said.
More information is needed to ensure AI healthcare models are safe and effective across different populations.