The Looming Crisis: AI and the Data Drought
As artificial intelligence models continue to evolve at a rapid pace, a critical challenge is emerging: the availability of high-quality training data. This “data drought” is forcing AI developers to explore alternative approaches, including the use of synthetic data.
In December 2024, Sundar Pichai, the CEO of Google, acknowledged this growing concern. He highlighted that the supply of readily available, high-quality training data is dwindling, signaling that future progress in AI development will become increasingly difficult. “I think the progress is going to get harder,” Pichai said at the New York Times’ annual DealBook Summit.
The Rise of Synthetic Data
With real-world datasets growing scarce, many AI researchers are turning to synthetic data. This isn’t a new concept: synthetic data has a history in statistics and machine learning dating back to the late 1960s. It relies on algorithms and simulations to generate artificial datasets that mimic real-world information.
Muriel Médard, a professor of software engineering at MIT, explained the concept in an interview at ETHDenver 2025. “Synthetic data has been around in statistics forever—it’s called bootstrapping. You start with actual data and think, ‘I want more but don’t want to pay for it. I’ll make it up based on what I have.’”
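To make the bootstrapping idea concrete, here is a minimal sketch in Python (the dataset values are invented for illustration): resampling a small observed dataset with replacement produces a larger synthetic one whose distribution mimics the original.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# A small "real" dataset we pretend was expensive to collect
# (the values are invented for illustration).
real_data = np.array([12.1, 9.8, 11.4, 10.2, 13.0, 8.7, 10.9, 12.5])

# Bootstrapping: draw new samples with replacement from the data
# we already have, yielding a larger synthetic dataset whose
# distribution mimics the original.
synthetic_data = rng.choice(real_data, size=1000, replace=True)

print(f"real mean:      {real_data.mean():.2f}")
print(f"synthetic mean: {synthetic_data.mean():.2f}")
```

By construction, the synthetic samples can only reflect information already present in the original data.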
Médard, who is also the co-founder of the decentralized memory infrastructure platform Optimum, elaborated that the main challenge in training AI models is not necessarily a lack of data but the difficulty of accessing what already exists.
“You either search for more or fake it with what you have. Accessing data—especially on-chain, where retrieval and updates are crucial—adds another layer of complexity,” she said.
AI developers face mounting privacy restrictions and limited access to real-world datasets, making synthetic data a more accessible alternative for training AI models. Nick Sanchez, Senior Solutions Architect at Druid AI, emphasized this point. “As privacy restrictions and general content policies are backed with more and more protection, utilizing synthetic data will become a necessity, both out of ease of access and fear of legal recourse,” he said.
“Currently, it’s not a perfect solution, as synthetic data can contain the same biases you would find in real-world data, but its role in handling consent, copyright, and privacy issues will only grow over time,” Sanchez added.
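Sanchez’s caveat about bias is straightforward to demonstrate with the same resampling approach: if the source data under-represents a group, synthetic data drawn from it inherits that skew. A minimal sketch, with invented numbers:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# An invented, skewed source dataset: group "A" is heavily
# over-represented relative to group "B".
real_labels = np.array(["A"] * 90 + ["B"] * 10)

# Bootstrapped synthetic data reproduces the same imbalance;
# resampling cannot add information the source never had.
synthetic_labels = rng.choice(real_labels, size=10_000, replace=True)

share_b = (synthetic_labels == "B").mean()
print(f"share of group B in synthetic data: {share_b:.1%}")  # roughly 10%
```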
Risks and Rewards
While synthetic data holds substantial promise, its use carries real risks, chief among them manipulation and misuse.
“Synthetic data itself might be used to insert false information into the training set, intentionally misleading the AI models,” Sanchez warned. “This is particularly concerning when applying it to sensitive applications like fraud detection, where bad actors could use the synthetic data to train models that overlook certain fraudulent patterns.”
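As a rough illustration of the attack Sanchez describes (the scenario, model, and numbers are all invented), consider a naive fraud detector that learns a score threshold from its training data. Injecting synthetic “legitimate” records that actually resemble fraud drags the threshold upward, so the trained model misses more real fraud:

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Invented setup: transactions carry a risk score; fraud tends to
# score high. A naive detector flags anything above the midpoint
# between the two classes' mean scores.
legit = rng.normal(0.2, 0.05, 500)
fraud = rng.normal(0.8, 0.10, 50)

def fit_threshold(legit_scores, fraud_scores):
    return (legit_scores.mean() + fraud_scores.mean()) / 2

clean_t = fit_threshold(legit, fraud)

# Poisoning: synthetic records labeled "legitimate" that actually
# look like fraud drag the learned threshold upward.
poison = rng.normal(0.8, 0.10, 500)
poisoned_t = fit_threshold(np.concatenate([legit, poison]), fraud)

print(f"fraud missed (clean model):    {(fraud < clean_t).mean():.0%}")
print(f"fraud missed (poisoned model): {(fraud < poisoned_t).mean():.0%}")
```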
Blockchain technology could help mitigate some of these risks. Médard emphasized the importance of making data tamper-proof rather than unchangeable.
“When updating data, you don’t do it willy-nilly—you change a bit and observe,” she said. “When people talk about immutability, they really mean durability, but the full framework matters.”
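One way to read that distinction is as tamper evidence: data can be updated by appending new records, but retroactive edits must be detectable. The hash chain at the core of blockchain ledgers provides exactly this property. The sketch below is a minimal illustration of the principle, not Optimum’s actual design:

```python
import hashlib
import json

def record_hash(record: dict, prev_hash: str) -> str:
    # Hash each record together with the previous entry's hash,
    # chaining entries so any retroactive edit is detectable.
    payload = json.dumps(record, sort_keys=True) + prev_hash
    return hashlib.sha256(payload.encode()).hexdigest()

def append(chain: list, record: dict) -> None:
    prev = chain[-1]["hash"] if chain else "genesis"
    chain.append({"record": record, "hash": record_hash(record, prev)})

def verify(chain: list) -> bool:
    prev = "genesis"
    for entry in chain:
        if entry["hash"] != record_hash(entry["record"], prev):
            return False  # the chain no longer matches: tampering
        prev = entry["hash"]
    return True

chain = []
append(chain, {"dataset": "v1", "rows": 1000})
append(chain, {"dataset": "v2", "rows": 1200})  # update by appending
print(verify(chain))  # True

chain[0]["record"]["rows"] = 999  # a silent retroactive edit
print(verify(chain))  # False: the edit is detectable
```

Updating by appending a new version, rather than rewriting history, is the “change a bit and observe” pattern Médard describes.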
As AI continues to evolve and the availability of data becomes a more critical factor, synthetic data is poised to play an increasingly important role. While its adoption requires careful consideration and attention to potential risks, it may prove to be an essential element in the ongoing development of artificial intelligence systems.