Study: AI Search Results Often Incorrect, Sometimes Wildly So
It might not come as a surprise, but a new study published by the Columbia Journalism Review reveals that a significant portion of AI search results are inaccurate. Researchers at the Tow Center for Digital Journalism conducted the analysis, examining eight AI models, including OpenAI’s ChatGPT search and Google’s Gemini.
The results showed that these models provided incorrect answers to over 60 percent of the queries. Even the most accurate model tested, Perplexity from Perplexity AI, answered 37 percent of its questions incorrectly. At the bottom of the barrel, Elon Musk’s chatbot Grok 3 was shockingly inaccurate, failing 94 percent of the time.
The authors of the study warned that these generative search tools repackage information on their own, which can cut off traffic to the original sources. They added that the chatbots often obscure serious underlying issues with information quality.
The study highlights a known issue: large language models’ propensity to provide incorrect information. However, some tech companies are still trying to replace traditional web search with these models. Some have released tailored versions of their chatbots to do this, such as ChatGPT search. Google has even introduced an “AI Mode” that displays Gemini summaries instead of web links.
This latest study quantifies the flaws in this approach. Researchers selected ten random articles each from twenty publications, including The Wall Street Journal and TechCrunch, then fed excerpts to the chatbots and asked them to identify each article’s headline, publisher, publication date, and URL.
To keep the test straightforward, the researchers selected only excerpts that returned the original source within the first three results of a standard Google search.
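For illustration, here is a minimal sketch of what that attribution test could look like in code. The `Article` class, `ask_chatbot` helper, prompt wording, and the crude string-matching in `score_response` are all assumptions made for this sketch, not the Tow Center’s actual tooling.

```python
# Hypothetical sketch of the study's evaluation setup, not the researchers'
# actual code. Assumes a generic ask_chatbot(prompt) callable that wraps
# whichever AI search tool is being tested.

from dataclasses import dataclass

@dataclass
class Article:
    excerpt: str
    headline: str
    publisher: str
    pub_date: str   # e.g. "2024-11-05"
    url: str

PROMPT = (
    "Identify the article this excerpt comes from. "
    "Give its headline, publisher, publication date, and URL.\n\n{excerpt}"
)

def score_response(article: Article, response: str) -> dict:
    """Crudely check whether each attribution detail appears in the reply."""
    text = response.lower()
    return {
        "headline": article.headline.lower() in text,
        "publisher": article.publisher.lower() in text,
        "date": article.pub_date in response,
        "url": article.url in response,
    }

def run_test(articles: list[Article], ask_chatbot) -> float:
    """Return the fraction of excerpts the chatbot attributed fully correctly."""
    correct = 0
    for article in articles:
        reply = ask_chatbot(PROMPT.format(excerpt=article.excerpt))
        if all(score_response(article, reply).values()):
            correct += 1
    return correct / len(articles)
```

In the actual study, responses were graded by hand on a finer scale (fully correct, partially correct, or wrong); the automated matching above is only a stand-in to show the shape of the protocol.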
In addition to showing that the AI models were often wrong, the tests revealed other concerning issues. The models delivered this dubious information “with alarming confidence,” rarely qualifying their responses or declining to answer questions they couldn’t. Previous research has documented that AI models often “hallucinate” answers, or make them up, rather than admitting they don’t know. Microsoft’s Copilot was an exception, declining more questions than it answered, according to the researchers.
The AI search tools were also poor at citing their sources. ChatGPT search linked to the wrong source article nearly 40 percent of the time and didn’t provide a source at all in another 21 percent of cases. This is problematic for fact-checking, and also for publishers, who may not receive traffic from AI models that have scraped their content, putting the quality of the online information ecosystem at risk.