Wikipedia Tests New Way to Keep AI Bots Away
The Wikimedia Foundation has partnered with Google-owned Kaggle to offer Wikipedia content in a “machine-readable format,” in an effort to discourage AI bots from scraping the site. The move is meant to reduce the significant share of Wikipedia’s bandwidth that AI bots consume.
AI bots strain Wikipedia’s infrastructure more than typical human readers do, as they tend to scrape even the most obscure corners of the site. The foundation reported that the bandwidth used for downloading multimedia has grown by 50% since January 2024, primarily because automated programs are downloading openly licensed images to feed AI models.
To tackle this issue, the foundation and Kaggle have made Wikipedia content available in English and French in a developer-friendly format. “Instead of scraping or parsing raw article text, Kaggle users can work directly with well-structured JSON representations of Wikipedia content,” the foundation explained. It added that the format is well suited to training models, building features, and testing NLP pipelines.
Kaggle describes the offering, currently in beta, as “immediately usable for modeling, benchmarking, alignment, fine-tuning, and exploratory analysis.” AI developers using the dataset will get access to “high-utility elements” such as article abstracts, short descriptions, infobox-style key-value data, image links, and clearly segmented article sections.
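As a rough illustration of what working with such structured records could look like, here is a minimal Python sketch. It assumes each article arrives as a single JSON object per line; the field names (name, abstract, infobox, sections) and the sample record are hypothetical stand-ins for the elements listed above, not the dataset’s documented schema.

```python
import json

# Hypothetical sample record. The field names are illustrative assumptions,
# not the dataset's documented schema.
SAMPLE_LINE = json.dumps({
    "name": "Example article",
    "description": "A short description",
    "abstract": "An abstract paragraph.",
    "image": {"content_url": "https://example.org/image.jpg"},
    "infobox": [{"name": "Founded", "value": "2001"}],
    "sections": [
        {"name": "History", "text": "Section text."},
        {"name": "Usage", "text": "More text."},
    ],
})


def summarize(line: str) -> dict:
    """Parse one JSON record and pull out the commonly used elements."""
    record = json.loads(line)
    return {
        "title": record.get("name", ""),
        "abstract": record.get("abstract", ""),
        # Flatten the assumed list of key-value pairs into a plain dict.
        "infobox": {
            item["name"]: item["value"] for item in record.get("infobox", [])
        },
        # Keep only the section headings; the full text stays in the record.
        "section_titles": [s["name"] for s in record.get("sections", [])],
    }


if __name__ == "__main__":
    print(summarize(SAMPLE_LINE))
```

Because the fields are read directly from parsed JSON, no HTML or wikitext parsing is needed, which is the convenience the foundation describes. Anyone adapting the sketch should check the real field names against the dataset’s own documentation.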
All the content is derived from Wikipedia and is available under two free content licenses: the Creative Commons Attribution-ShareAlike 4.0 license (CC BY-SA 4.0) and the GNU Free Documentation License (GFDL). Some content may be licensed under alternative terms or be in the public domain.
This collaborative approach differs from other organizations’ methods of dealing with AI bots. For instance, Reddit has implemented stricter controls to prevent bots from accessing its platform, while The New York Times has sued OpenAI over alleged unauthorized scraping of its articles to train AI models.
By providing a structured and accessible format for Wikipedia content, the Wikimedia Foundation and Kaggle aim to reduce the strain caused by AI bots and promote more collaborative and transparent AI development.