Breaking News in Technology & Business – Tech Geekwire

    Wikipedia Offers AI Developers Article Data on Kaggle to Reduce Automated Scraping

    By techgeekwire | April 25, 2025

    The Wikimedia Foundation, the organization behind Wikipedia, has made its article data available on Kaggle, a data science and machine learning community owned by Google LLC. This move aims to discourage AI companies and large language model trainers from scraping Wikipedia’s website.

    By providing a structured dataset, Wikimedia gives developers Wikipedia content in a machine-readable format that needs no further parsing. The dataset includes abstracts, short descriptions, infobox key-value data, image links, and segmented article sections. It covers the English and French language editions.

    The dataset is designed to reduce the need for web scraping, which can put a significant burden on Wikipedia’s web servers. Web scraping is an automated process that extracts content from websites, often causing additional load on web servers beyond normal human traffic. By providing clean, pre-parsed data, Wikimedia and Kaggle aim to short-circuit this scraping behavior.

    “Instead of scraping or parsing raw article text, Kaggle users can work directly with well-structured JSON representations of Wikipedia content,” Wikimedia said in the announcement. This makes the data ideal for training models, building features, and testing NLP pipelines.
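
As a rough illustration of what working with such records might look like, here is a minimal Python sketch. It assumes the data arrives as JSON Lines (one JSON object per line), which the announcement quoted above does not confirm, and the key names `name`, `abstract`, and `sections` are hypothetical stand-ins for whatever schema the published files actually use. The helper names `iter_articles` and `summarize` are likewise made up for this example.

```python
import io
import json
from typing import Iterable, Iterator


def iter_articles(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one article dict per non-empty line of JSON Lines input."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def summarize(article: dict) -> dict:
    """Extract a few fields; every key name here is a hypothetical placeholder."""
    return {
        "title": article.get("name", ""),
        "abstract": article.get("abstract", ""),
        "section_count": len(article.get("sections", [])),
    }


# A made-up record standing in for one row of the dataset.
sample = io.StringIO(
    json.dumps(
        {
            "name": "Example article",
            "abstract": "A short summary.",
            "sections": [{"name": "History"}, {"name": "Usage"}],
        }
    )
    + "\n"
)

summaries = [summarize(article) for article in iter_articles(sample)]
print(summaries)
```

With a real download, the in-memory `sample` would be replaced by an open file handle, and the keys adjusted to match the dataset's documented schema.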

    The dataset is licensed under the Creative Commons Attribution-ShareAlike (CC BY-SA) license and the GNU Free Documentation License (GFDL), which allow sharing, adapting, and remixing the content. Kaggle welcomes feedback and discussion about the dataset from the community.

    This move is part of a larger effort to provide high-quality datasets for AI and machine learning. Kaggle hosts over 461,000 freely accessible datasets covering various topics, including health, finance, and social sciences. The addition of Wikipedia’s dataset will further enrich the available data for developers and researchers.
