The AI Data Drought: A Coming Crisis?
As artificial intelligence models become more sophisticated, a critical question looms: What happens when the internet runs out of training material? The rapid consumption of freely available content by AI systems is creating a data scarcity crisis that could severely impede future development.
Concerns have also been raised about models being trained on the outputs of other AI systems. DeepSeek, a Chinese AI model whose responses often mirror those of ChatGPT, is frequently cited as an example. Observations like these have led some experts to conclude that the supply of readily available, high-quality training data may be nearing exhaustion.
Google CEO Sundar Pichai acknowledged this challenge in December, warning that the supply of usable, high-quality training data is rapidly depleting. “In the current generation of LLM models, roughly a few companies have converged at the top, but I think we’re all working on our next versions too,” Pichai said at the New York Times’ annual Dealbook Summit. “I think the progress is going to get harder.”
Synthetic Data: A Double-Edged Sword
With the supply of authentic, high-quality training data diminishing, AI researchers are increasingly turning to synthetic data generated by other AI systems. Although synthetic data, which utilizes algorithms and simulations to create artificial datasets, dates back to the late 1960s, its growing role in AI development is sparking new concerns, particularly with the increasing integration of AI systems into decentralized technologies.
Muriel Médard, a professor of software engineering at MIT, described synthetic data as a form of “bootstrapping” during an interview at ETH Denver 2025. “You start with actual data and think, ‘I want more but don’t want to pay for it. I’ll make it up based on what I have.’”
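The "bootstrapping" idea Médard describes can be sketched in a few lines: fit a simple statistical model to the real data you have, then sample new points from it. This is a minimal illustration only; the dataset, the `bootstrap_synthetic` helper, and the choice of a normal distribution are all assumptions for the example, not anyone's production pipeline.

```python
import random
import statistics

# A toy "real" dataset: a handful of measurements we could afford to collect.
real_data = [12.5, 14.1, 13.8, 15.2, 12.9, 14.7, 13.3, 15.0]

def bootstrap_synthetic(data, n_samples, seed=42):
    """Generate synthetic samples by fitting a normal distribution to the
    real data and drawing new points from it -- 'making more up based on
    what I have'. (Hypothetical helper for illustration.)"""
    rng = random.Random(seed)
    mu = statistics.mean(data)
    sigma = statistics.stdev(data)
    return [rng.gauss(mu, sigma) for _ in range(n_samples)]

# Turn 8 real points into 100 synthetic ones with similar statistics.
synthetic = bootstrap_synthetic(real_data, 100)
print(len(synthetic))
```

Note that the synthetic points inherit whatever biases the original sample had, which is exactly the caveat raised later in this article: the generator can only reproduce patterns present in its seed data.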
Médard, who is also the co-founder of the decentralized memory infrastructure platform Optimum, argues that the primary training challenge isn’t a lack of data, but rather its accessibility. “You either search for more or fake it with what you have,” she stated. “Accessing data—especially on-chain, where retrieval and updates are crucial—adds another layer of complexity.”
As AI developers face mounting privacy restrictions and shrinking access to real-world datasets, synthetic data is becoming an increasingly important tool for model training.
“As privacy restrictions and general content policies are backed with more and more protection, utilizing synthetic data will become a necessity, both out of ease of access and fear of legal recourse,” said Nick Sanchez, a Senior Solutions Architect at Druid AI. “Currently, it’s not a perfect solution, as synthetic data can contain the same biases you would find in real-world data, but its role in handling consent, copyright, and privacy issues will only grow over time,” he added.
Risks and Blockchain Solutions
The expanding use of synthetic data also brings concerns about manipulation and misuse. Sanchez warns, “Synthetic data itself might be used to insert false information into the training set, intentionally misleading the AI models… This is particularly concerning when applying it to sensitive applications like fraud detection, where bad actors could use the synthetic data to train models that overlook certain fraudulent patterns.”
Blockchain technology offers potential solutions for mitigating the risks of synthetic data, according to Médard. She emphasizes the importance of making data tamper-proof rather than unchangeable. “When updating data, you don’t do it willy-nilly—you change a bit and observe. When people talk about immutability, they really mean durability, but the full framework matters.”
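The distinction Médard draws between tamper-proof and unchangeable data can be illustrated with a hash chain, the basic mechanism blockchains use to make edits detectable. This is a generic sketch of that idea, not Optimum's design or any specific chain's implementation; the record names and `chain_hash` helper are invented for the example.

```python
import hashlib

def chain_hash(prev_hash, record):
    """Hash a record together with the previous hash, so that altering
    any earlier record changes every hash that follows it.
    (Hypothetical helper for illustration.)"""
    return hashlib.sha256((prev_hash + record).encode()).hexdigest()

# Build a chain over three data records.
records = ["sample-001", "sample-002", "sample-003"]
hashes = []
h = "0" * 64  # genesis value
for r in records:
    h = chain_hash(h, r)
    hashes.append(h)

# Data isn't unchangeable -- but an edit to an early record produces a
# different hash, so the change is durable evidence, not silent:
tampered = chain_hash("0" * 64, "sample-001-EDITED")
print(tampered != hashes[0])  # the mismatch exposes the edit
```

The point is the one Médard makes: you can still "change a bit and observe," but the full framework ensures the history of changes is visible rather than the data being frozen outright.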