Unified Dataset Processor


A comprehensive Python tool for processing Amharic-language corpora and splitting datasets for machine learning tasks. It combines corpus cleaning, deduplication, and dataset splitting into one unified interface.

Features

🧹 Corpus Processing (Amharic-focused)

  • Multi-format support: TXT, CSV, JSON, PDF, DOCX
  • Advanced text cleaning:
    • Removes URLs, emails, mentions
    • Filters emojis and decorative symbols
    • Eliminates long numeric sequences (6+ digits)
    • Preserves Amharic script and punctuation
  • Smart sentence filtering:
    • Word count filtering (6-25 words)
    • Numeric density filtering (rejects >40% digits)
    • Sentence boundary detection for Amharic
  • Global deduplication: Ensures unique sentences across all files
  • Export formats: TXT, CSV, Parquet

✂️ Dataset Splitting

  • Flexible splits: Train/Validation/Test with customizable ratios
  • Multiple format support: TXT, JSON, JSONL, CSV, Parquet
  • Reproducible: Seed-based randomization
  • Batch processing: Split multiple files at once
  • Preserves structure: Maintains original file format

🔄 Automated Workflow

  • Process AND Split mode: Automatically processes corpus and splits results
  • One-stop solution: From raw data to training-ready datasets

Installation

Prerequisites

  • Python 3.7 or higher
  • pip package manager

Required Dependencies

Create a requirements.txt file:

pandas>=1.3.0
pyarrow>=5.0.0

Optional Dependencies

For additional file format support, add these to your requirements.txt:

# For PDF support
PyPDF2>=2.0.0

# For DOCX support
python-docx>=0.8.11

Install Dependencies

pip install -r requirements.txt

Or install manually:

# Core dependencies
pip install pandas pyarrow

# Optional: PDF and DOCX support
pip install PyPDF2 python-docx
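
The optional packages only affect the formats they cover; the processor still runs without them. The sketch below shows one common way such imports are guarded so that a missing package simply disables that format. This is an illustration of the pattern, not the script's exact code:

# Illustrative sketch: optional-format support gated on import success
try:
    import PyPDF2            # PDF reading
    HAS_PDF = True
except ImportError:
    HAS_PDF = False

try:
    import docx              # python-docx installs the "docx" module
    HAS_DOCX = True
except ImportError:
    HAS_DOCX = False

def can_read(path):
    # Skip formats whose optional dependency is missing
    if path.endswith(".pdf"):
        return HAS_PDF
    if path.endswith(".docx"):
        return HAS_DOCX
    return True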

Usage

Quick Start

  1. Clone or download the repository
  2. Install dependencies
  3. Run the script:
python unified_dataset_processor.py

Mode 1: Process Amharic Corpus

Cleans and processes Amharic text files with advanced filtering:

Select mode:
1. Process Amharic Corpus (Clean, Deduplicate, Export)

Enter choice: 1
Enter directory path: /path/to/your/corpus

Output structure:

your_corpus/
├── Processed_Output/
│   ├── file1_clean.csv
│   ├── file2_clean.csv
│   └── Hugging_Face_Upload/
│       ├── dataset.txt
│       ├── dataset.csv
│       └── dataset.parquet
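
The files in Hugging_Face_Upload/ load directly with the Hugging Face datasets library. A minimal sketch, assuming the datasets package is installed and the path and repo name below are replaced with your own:

from datasets import load_dataset

# Load the exported Parquet file; it arrives as a single "train" split
ds = load_dataset(
    "parquet",
    data_files="your_corpus/Processed_Output/Hugging_Face_Upload/dataset.parquet",
)
print(ds)   # inspect columns and row count

# Optional: publish to the Hub (requires a prior `huggingface-cli login`)
ds.push_to_hub("your-username/your-amharic-dataset")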

Mode 2: Split Dataset

Splits existing datasets into train/validation/test sets:

Select mode:
2. Split Dataset (Train/Valid/Test)

Enter choice: 2
Enter the path to your corpus file or folder: /path/to/dataset.csv

Default split ratios: Train=80%, Valid=10%, Test=10%
Use custom ratios? (y/n, default=n): n
Random seed for reproducibility (default=42): 42

Output structure:

your_dataset_folder/
└── HF_upload/
    ├── train.csv
    ├── valid.csv
    └── test.csv
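
To use the splits with the Hugging Face datasets library, map each file to a named split. A minimal sketch with placeholder paths:

from datasets import load_dataset

splits = load_dataset(
    "csv",
    data_files={
        "train": "your_dataset_folder/HF_upload/train.csv",
        "validation": "your_dataset_folder/HF_upload/valid.csv",
        "test": "your_dataset_folder/HF_upload/test.csv",
    },
)
print(splits["train"][0])   # first training example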

Mode 3: Process AND Split (Automated)

Combines both operations automatically:

Select mode:
3. Process AND Split (Do both automatically)

Enter choice: 3
Enter directory path: /path/to/your/corpus

This will:

  1. Process all corpus files
  2. Clean and deduplicate text
  3. Export to multiple formats
  4. Automatically split the processed datasets
  5. Generate train/valid/test sets

Supported File Formats

Input Formats (Processing)

  • .txt - Plain text files
  • .csv - CSV files
  • .json - JSON files (with "messages" structure)
  • .pdf - PDF documents (requires PyPDF2)
  • .docx - Word documents (requires python-docx)

Input/Output Formats (Splitting)

  • .txt - Plain text (line-delimited)
  • .json - JSON arrays
  • .jsonl - JSON Lines
  • .csv - CSV with headers
  • .parquet - Parquet files

Text Cleaning Details

The processor applies the following transformations (an illustrative sketch follows the list):

  1. Number Removal: Removes sequences of 6+ digits
  2. URL/Email Cleaning: Strips all web links and email addresses
  3. Symbol Filtering: Removes emojis and decorative Unicode symbols
  4. Script Preservation: Keeps Amharic script (U+1200-U+137F) and Amharic punctuation (።፣፤፥)
  5. Sentence Filtering:
    • Minimum: 6 words
    • Maximum: 25 words
    • Rejects sentences with >40% numeric characters
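
The exact regular expressions live in unified_dataset_processor.py; the sketch below only approximates the same rules (URLs/emails/mentions, 6+ digit runs, the Ethiopic block, the 6-25 word window, and the 40% digit cap) and is not a copy of the script:

import re

ETHIOPIC = "\u1200-\u137F"   # Amharic (Ethiopic) script range kept by the cleaner

def clean_sentence(text):
    # Strip URLs, emails, and @mentions
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)
    text = re.sub(r"\S+@\S+\.\S+", " ", text)
    text = re.sub(r"@\w+", " ", text)
    # Remove long numeric sequences (6 or more digits)
    text = re.sub(r"\d{6,}", " ", text)
    # Keep Ethiopic letters, Ethiopic punctuation, digits, and whitespace
    text = re.sub(rf"[^{ETHIOPIC}\u1361-\u1368\d\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def keep_sentence(sentence):
    words = sentence.split()
    if not 6 <= len(words) <= 25:                    # word-count window
        return False
    digits = sum(ch.isdigit() for ch in sentence)
    return digits / max(len(sentence), 1) <= 0.40    # numeric-density cap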

Configuration

Custom Split Ratios

When prompted, you can specify custom ratios:

Training ratio (0-1): 0.7
Validation ratio (0-1): 0.15
Test ratio (0-1): 0.15

Note: Ratios must sum to 1.0

Random Seed

For reproducible splits, use the same seed value:

Random seed for reproducibility (default=42): 12345
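
Internally, a split of this kind amounts to a seeded shuffle followed by slicing at the chosen ratios. A minimal sketch of the idea, not the script's exact implementation:

import random

def split_rows(rows, train=0.8, valid=0.1, test=0.1, seed=42):
    # The three ratios must sum to 1.0, as the prompt enforces
    assert abs(train + valid + test - 1.0) < 1e-9
    rng = random.Random(seed)            # same seed -> same split every run
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_valid = int(len(shuffled) * valid)
    return (
        shuffled[:n_train],                       # train
        shuffled[n_train:n_train + n_valid],      # valid
        shuffled[n_train + n_valid:],             # test (takes the remainder)
    )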

Output Examples

Processed Corpus Output

dataset.txt:

የአማርኛ ቋንቋ በኢትዮጵያ ውስጥ በሚሊዮኖች የሚቆጠሩ ሰዎች የሚጠቀሙበት ቋንቋ ነው።
ትምህርት የሁሉም ልጆች መብት ነው።
...

dataset.csv:

text
"แ‹จแŠ แˆ›แˆญแŠ› แ‰‹แŠ•แ‰‹ แ‰ แŠขแ‰ตแ‹ฎแŒตแ‹ซ แ‹แˆตแŒฅ แ‰ แˆšแˆŠแ‹ฎแŠ–แ‰ฝ แ‹จแˆšแ‰†แŒ แˆฉ แˆฐแ‹Žแ‰ฝ แ‹จแˆšแŒ แ‰€แˆ™แ‰ แ‰ต แ‰‹แŠ•แ‰‹ แАแ‹แข"
"แ‰ตแˆแˆ…แˆญแ‰ต แ‹จแˆแˆ‰แˆ แˆแŒ†แ‰ฝ แˆ˜แ‰ฅแ‰ต แАแ‹แข"
...

Split Dataset Output

After splitting, you'll have three files with the specified ratios:

  • train.csv (80% of data)
  • valid.csv (10% of data)
  • test.csv (10% of data)

Requirements.txt

Minimal (required):

pandas>=1.3.0
pyarrow>=5.0.0

Full (with all features):

pandas>=1.3.0
pyarrow>=5.0.0
PyPDF2>=2.0.0
python-docx>=0.8.11

Troubleshooting

Parquet Export Issues

If you see: ⚠️ Could not save Parquet

pip install pyarrow

PDF Processing Issues

If PDF files aren't being read:

pip install PyPDF2

DOCX Processing Issues

If Word documents aren't being processed:

pip install python-docx

Memory Issues with Large Files

For very large datasets, consider the following (a chunked-reading sketch follows this list):

  1. Processing files individually instead of all at once
  2. Increasing available system memory
  3. Splitting large files before processing
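
For CSV inputs, one way to keep memory flat is to read in chunks with pandas so only a slice of the file is in memory at a time. A sketch, assuming a hypothetical big_corpus.csv with a text column; file names and the column name are placeholders:

import pandas as pd

seen = set()                                   # global deduplication across chunks
with open("subset_clean.txt", "w", encoding="utf-8") as out:
    # chunksize makes read_csv yield DataFrames of 50,000 rows at a time
    for chunk in pd.read_csv("big_corpus.csv", chunksize=50_000):
        for sentence in chunk["text"].dropna():
            if sentence not in seen:
                seen.add(sentence)
                out.write(str(sentence) + "\n")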

Project Structure

.
├── unified_dataset_processor.py    # Main script
├── requirements.txt                # Python dependencies
├── README.md                       # This file
└── your_data/                      # Your corpus directory
    ├── file1.txt
    ├── file2.pdf
    ├── file3.csv
    └── Processed_Output/           # Generated by script
        ├── file1_clean.csv
        ├── file2_clean.csv
        └── Hugging_Face_Upload/
            ├── dataset.txt
            ├── dataset.csv
            ├── dataset.parquet
            └── HF_upload/          # Generated after splitting
                ├── train.csv
                ├── valid.csv
                └── test.csv

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is open source and available under the Apache-2.0 License.

Acknowledgments

  • Built for Amharic NLP dataset preparation
  • Designed for Hugging Face dataset uploads
  • Optimized for machine learning workflows

Version History

  • v1.0.0 - Initial release with corpus processing and dataset splitting
  • Unified interface with automated workflow
  • Support for multiple file formats
  • Global deduplication
  • Parquet export support

Contact & Support

For issues, questions, or contributions, please open an issue on GitHub.


Author

Created by @AbabiyaWorku

Happy Dataset Processing! 🚀
