This project is a Wikipedia scraper that collects articles, cleans them, and builds a dataset that can be pushed to the Hugging Face Hub.
- Scrapes articles from Wikipedia
- Removes wiki markup using `mwparserfromhell` (see the sketch after this list)
- Automatically pushes data to the Hugging Face Hub
- Processes articles in batches of 1000
- Maintains article titles and content
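
The markup removal relies on `mwparserfromhell`. As a minimal sketch of what that step might look like (the helper function below is illustrative and not taken from the project's code):

```python
import mwparserfromhell

def strip_wiki_markup(wikitext: str) -> str:
    """Parse raw wikitext and return plain text: templates and formatting are
    removed, and wiki links are reduced to their display text."""
    return mwparserfromhell.parse(wikitext).strip_code()

# Example:
# strip_wiki_markup("'''Python''' is a [[programming language]].")
# -> "Python is a programming language."
```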
To install and use the Wikipedia Scraper, follow these steps:
- Clone the repository:

  ```bash
  git clone https://github.com/yourusername/wikipedia-scraper.git
  cd wikipedia-scraper
  ```

- Create and activate a virtual environment (optional but recommended):

  ```bash
  python3 -m venv venv
  source venv/bin/activate  # On Windows use `venv\Scripts\activate`
  ```

- Install the required dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Set up environment variables: create a `.env` file in the root directory of the project and add your Hugging Face repository and token:

  ```
  HF_DS_REPO="your_huggingface_repo"
  HF_WRITE_TOKEN="your_huggingface_token"
  ```

- Run the scraper:

  ```bash
  python wikiscrapper.py
  ```
This starts the scraping process and pushes the collected articles to your Hugging Face repository in batches of 1000.
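
The actual upload logic lives in `wikiscrapper.py`; the following is only a rough sketch of the batching idea, assuming the credentials are read from `.env` with `python-dotenv` and the push uses the `datasets` library. The function names, record fields, and the `scrape_articles()` generator are illustrative, and how the project accumulates or appends batches may differ.

```python
import os

from datasets import Dataset
from dotenv import load_dotenv

BATCH_SIZE = 1000

def push_batch(batch: list[dict], repo: str, token: str) -> None:
    """Convert a batch of article records to a Dataset and push it to the Hub."""
    Dataset.from_list(batch).push_to_hub(repo, token=token)

def main() -> None:
    load_dotenv()  # reads HF_DS_REPO and HF_WRITE_TOKEN from .env
    repo = os.environ["HF_DS_REPO"]
    token = os.environ["HF_WRITE_TOKEN"]

    batch = []
    for article in scrape_articles():  # hypothetical generator yielding {"title", "text"} dicts
        batch.append(article)
        if len(batch) == BATCH_SIZE:
            push_batch(batch, repo, token)
            batch = []
    if batch:  # push any remaining articles
        push_batch(batch, repo, token)
```

Batching keeps memory use bounded and lets the upload happen incrementally instead of waiting for the full scrape to finish.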
