Wikipedia Offers AI Developers Article Data on Kaggle
The Wikimedia Foundation, the organization behind Wikipedia, has made its article data available on Kaggle, a Google-owned data science and machine learning community. The move aims to steer AI companies and large language model trainers away from scraping Wikipedia’s website.
By providing a structured dataset, Wikimedia gives developers access to Wikipedia content in a more convenient and usable format. The dataset includes abstracts, short descriptions, infobox key-value data, image links, and segmented article sections, and is available for both the English and French editions.

The dataset is designed to reduce the need for web scraping, the automated extraction of content from websites, which places load on Wikipedia’s servers well beyond normal human traffic. By providing clean, pre-parsed data, Wikimedia and Kaggle aim to head off this scraping behavior.
“Instead of scraping or parsing raw article text, Kaggle users can work directly with well-structured JSON representations of Wikipedia content,” Wikimedia said in the announcement. This makes the data ideal for training models, building features, and testing NLP pipelines.
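To illustrate what “well-structured JSON” means in practice, here is a minimal sketch of consuming one such record. The field names (`name`, `abstract`, `sections`, and so on) are assumptions for illustration only; the actual schema of the Kaggle dataset may differ.

```python
import json

# Hypothetical example record; field names are illustrative, not the
# dataset's actual schema.
record_json = """
{
  "name": "Python (programming language)",
  "abstract": "Python is a high-level, general-purpose programming language.",
  "description": "General-purpose programming language",
  "sections": [
    {"name": "History", "text": "Python was conceived in the late 1980s."}
  ]
}
"""

record = json.loads(record_json)

# Pull out the pieces an NLP pipeline might need -- no HTML parsing,
# no wikitext templates to strip.
abstract = record["abstract"]
section_titles = [s["name"] for s in record.get("sections", [])]

print(abstract)
print(section_titles)
```

The point of the structured format is exactly this: a few dictionary lookups replace the brittle HTML or wikitext parsing that scraping would require.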
The dataset is licensed under the Creative Commons Attribution-ShareAlike and GNU Free Documentation licenses, allowing for sharing, adapting, and remixing content. Kaggle is welcoming feedback and discussion about the dataset from the community.
This move is part of a larger effort to provide high-quality datasets for AI and machine learning. Kaggle hosts over 461,000 freely accessible datasets covering various topics, including health, finance, and social sciences. The addition of Wikipedia’s dataset will further enrich the available data for developers and researchers.