
Wikipedia scraper specifically designed to collect articles from Wikipedia. It processes the articles and creates a dataset that can be pushed to the Hugging Face Hub.


Elma-dev/wikiscrapper



WikiScraper ...

A tool to scrape and process Wikipedia articles for dataset creation

📝 Description

This project scrapes articles from Wikipedia, processes them, and builds a dataset that can be pushed to the Hugging Face Hub.

🚀 Features

  • Scrapes articles from Wikipedia
  • Removes Wiki markup using mwparserfromhell
  • Automatically pushes data to Hugging Face Hub
  • Processes articles in batches of 1000
  • Maintains article titles and content
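
The markup-removal and batching features above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the repository's actual code: `strip_markup`, `batched`, and `BATCH_SIZE` are hypothetical names, and the only third-party call assumed is `mwparserfromhell.parse(...).strip_code()`.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")

# Batch size taken from the feature list; the constant name is hypothetical.
BATCH_SIZE = 1000


def strip_markup(wikitext: str) -> str:
    """Return plain text with wiki markup (templates, links, tags) removed."""
    # Imported lazily so the batching helper works without the dependency.
    import mwparserfromhell

    return mwparserfromhell.parse(wikitext).strip_code().strip()


def batched(items: Iterable[T], size: int = BATCH_SIZE) -> Iterator[List[T]]:
    """Yield consecutive lists of at most `size` items from `items`."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch
```

Each batch would then hold records such as `{"title": title, "text": strip_markup(raw)}`, so that titles and cleaned content stay together.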

🛠️ Installation

To install and use the Wikipedia Scraper, follow these steps:

  1. Clone the repository:

    git clone https://github.com/Elma-dev/wikiscrapper.git
    cd wikiscrapper
  2. Create and activate a virtual environment (optional but recommended):

    python3 -m venv venv
    source venv/bin/activate  # On Windows use `venv\Scripts\activate`
  3. Install the required dependencies:

    pip install -r requirements.txt
  4. Set up environment variables: Create a .env file in the root directory of the project and add your Hugging Face repository and token:

    HF_DS_REPO="your_huggingface_repo"
    HF_WRITE_TOKEN="your_huggingface_token"
  5. Run the scraper:

    python wikiscrapper.py

This will start the scraping process and push the collected articles to your Hugging Face repository in batches of 1000 articles.
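
As a rough sketch of how the two environment variables might be consumed, the snippet below validates them and pushes one batch to the Hub. The function names `load_settings` and `push_batch` are hypothetical, and the use of `datasets.Dataset.from_list` with `push_to_hub` is an assumption about how the upload could be done, not a description of what `wikiscrapper.py` does. How successive batches are combined on the Hub is not described here, so only a single push is shown.

```python
import os
from typing import Dict, List, Mapping, Optional, Tuple

REQUIRED_VARS = ("HF_DS_REPO", "HF_WRITE_TOKEN")


def load_settings(env: Optional[Mapping[str, str]] = None) -> Tuple[str, str]:
    """Return (repo, token), raising if either variable is missing or empty."""
    # Assumption: the .env file has already been loaded into the process
    # environment, for example with python-dotenv's load_dotenv().
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return env["HF_DS_REPO"], env["HF_WRITE_TOKEN"]


def push_batch(batch: List[Dict[str, str]], repo: str, token: str) -> None:
    """Upload one batch of {"title", "text"} records as a dataset."""
    # Third-party dependency, imported lazily; requires network access.
    from datasets import Dataset

    Dataset.from_list(batch).push_to_hub(repo, token=token)
```

Failing early on a missing variable avoids scraping a full batch only to have the upload rejected.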
