This project is a Wikipedia scraper that collects articles, cleans them, and builds a dataset that can be pushed to the Hugging Face Hub.
- Scrapes articles from Wikipedia
- Removes wiki markup using `mwparserfromhell` (see the sketch after this list)
- Automatically pushes data to the Hugging Face Hub
- Processes articles in batches of 1000
- Maintains article titles and content
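
The markup removal relies on `mwparserfromhell`. As a minimal sketch of what that step might look like (the helper function below is illustrative and not taken from the project's code):

```python
import mwparserfromhell

def strip_wiki_markup(wikitext: str) -> str:
    """Parse raw wikitext and return plain text: templates and formatting are
    removed, and wiki links are reduced to their display text."""
    return mwparserfromhell.parse(wikitext).strip_code()

# Example:
# strip_wiki_markup("'''Python''' is a [[programming language]].")
# -> "Python is a programming language."
```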
To install and use the Wikipedia Scraper, follow these steps:
- Clone the repository:

  ```bash
  git clone https://github.com/yourusername/wikipedia-scraper.git
  cd wikipedia-scraper
  ```

- Create and activate a virtual environment (optional but recommended):

  ```bash
  python3 -m venv venv
  source venv/bin/activate  # On Windows use `venv\Scripts\activate`
  ```

- Install the required dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Set up environment variables: create a `.env` file in the root directory of the project and add your Hugging Face repository and token:

  ```
  HF_DS_REPO="your_huggingface_repo"
  HF_WRITE_TOKEN="your_huggingface_token"
  ```

- Run the scraper:

  ```bash
  python wikiscrapper.py
  ```
This starts the scraping process and pushes the collected articles to your Hugging Face repository in batches of 1000.
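
The actual upload logic lives in `wikiscrapper.py`; the following is only a rough sketch of the batching idea, assuming the credentials are read from `.env` with `python-dotenv` and the push uses the `datasets` library. The function names, record fields, and the `scrape_articles()` generator are illustrative, and how the project accumulates or appends batches may differ.

```python
import os

from datasets import Dataset
from dotenv import load_dotenv

BATCH_SIZE = 1000

def push_batch(batch: list[dict], repo: str, token: str) -> None:
    """Convert a batch of article records to a Dataset and push it to the Hub."""
    Dataset.from_list(batch).push_to_hub(repo, token=token)

def main() -> None:
    load_dotenv()  # reads HF_DS_REPO and HF_WRITE_TOKEN from .env
    repo = os.environ["HF_DS_REPO"]
    token = os.environ["HF_WRITE_TOKEN"]

    batch = []
    for article in scrape_articles():  # hypothetical generator yielding {"title", "text"} dicts
        batch.append(article)
        if len(batch) == BATCH_SIZE:
            push_batch(batch, repo, token)
            batch = []
    if batch:  # push any remaining articles
        push_batch(batch, repo, token)
```

Batching keeps memory use bounded and lets the upload happen incrementally instead of waiting for the full scrape to finish.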
