A comprehensive Python tool for processing and splitting Amharic-language datasets for machine learning tasks. It combines corpus cleaning, deduplication, and dataset splitting into one unified interface.
- Multi-format support: TXT, CSV, JSON, PDF, DOCX
- Advanced text cleaning:
  - Removes URLs, emails, and mentions
  - Filters emojis and decorative symbols
  - Eliminates long numeric sequences (6+ digits)
  - Preserves Amharic script and punctuation
- Smart sentence filtering:
  - Word-count filtering (6-25 words)
  - Numeric-density filtering (rejects sentences that are more than 40% digits)
  - Sentence-boundary detection for Amharic
- Global deduplication: Ensures unique sentences across all files (see the sketch after this feature list)
- Export formats: TXT, CSV, Parquet
- Flexible splits: Train/Validation/Test with customizable ratios
- Multiple format support: TXT, JSON, JSONL, CSV, Parquet
- Reproducible: Seed-based randomization
- Batch processing: Split multiple files at once
- Preserves structure: Maintains original file format
- Process AND Split mode: Automatically processes corpus and splits results
- One-stop solution: From raw data to training-ready datasets
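
Global deduplication boils down to one set of already-seen sentences shared across every input file. A minimal sketch of the idea (the helper below is illustrative, not the script's actual API):

```python
from pathlib import Path

def deduplicate_across_files(paths):
    """Keep each sentence only once, no matter which file it came from (sketch)."""
    seen = set()              # sentences already kept, shared across ALL files
    unique_sentences = []
    for path in paths:
        for line in Path(path).read_text(encoding="utf-8").splitlines():
            sentence = line.strip()
            if sentence and sentence not in seen:
                seen.add(sentence)
                unique_sentences.append(sentence)
    return unique_sentences
```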
- Python 3.7 or higher
- pip package manager
Create a requirements.txt file:

```
pandas>=1.3.0
pyarrow>=5.0.0
```

For additional file format support, add these to your requirements.txt:

```
# For PDF support
PyPDF2>=2.0.0

# For DOCX support
python-docx>=0.8.11
```

Install the dependencies:

```
pip install -r requirements.txt
```

Or install them manually:

```
# Core dependencies
pip install pandas pyarrow

# Optional: PDF and DOCX support
pip install PyPDF2 python-docx
```

- Clone or download the repository
- Install the dependencies
- Run the script:

```
python unified_dataset_processor.py
```

Mode 1 (Process Amharic Corpus) cleans and processes Amharic text files with advanced filtering:
```
Select mode:
1. Process Amharic Corpus (Clean, Deduplicate, Export)
Enter choice: 1
Enter directory path: /path/to/your/corpus
```
Output structure:

```
your_corpus/
└── Processed_Output/
    ├── file1_clean.csv
    ├── file2_clean.csv
    └── Hugging_Face_Upload/
        ├── dataset.txt
        ├── dataset.csv
        └── dataset.parquet
```
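
The three files under Hugging_Face_Upload can all be written from one pandas DataFrame with a single `text` column. A minimal sketch (the file names follow the structure above; the sentence list is a placeholder):

```python
from pathlib import Path
import pandas as pd

sentences = ["<cleaned sentence 1>", "<cleaned sentence 2>"]   # placeholder data
df = pd.DataFrame({"text": sentences})

out = Path("Processed_Output/Hugging_Face_Upload")
out.mkdir(parents=True, exist_ok=True)

(out / "dataset.txt").write_text("\n".join(sentences) + "\n", encoding="utf-8")
df.to_csv(out / "dataset.csv", index=False)
df.to_parquet(out / "dataset.parquet", index=False)   # requires pyarrow
```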
Mode 2 (Split Dataset) splits an existing dataset into train/validation/test sets:
```
Select mode:
2. Split Dataset (Train/Valid/Test)
Enter choice: 2
Enter the path to your corpus file or folder: /path/to/dataset.csv
Default split ratios: Train=80%, Valid=10%, Test=10%
Use custom ratios? (y/n, default=n): n
Random seed for reproducibility (default=42): 42
```
Output structure:

```
your_dataset_folder/
└── HF_upload/
    ├── train.csv
    ├── valid.csv
    └── test.csv
```
Mode 3 (Process AND Split) combines both operations automatically:
```
Select mode:
3. Process AND Split (Do both automatically)
Enter choice: 3
Enter directory path: /path/to/your/corpus
```
This will:
- Process all corpus files
- Clean and deduplicate text
- Export to multiple formats
- Automatically split the processed datasets
- Generate train/valid/test sets
Supported input formats:

- `.txt` - Plain text files
- `.csv` - CSV files
- `.json` - JSON files (with a "messages" structure)
- `.pdf` - PDF documents (requires PyPDF2)
- `.docx` - Word documents (requires python-docx)
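
How the optional formats might be read, assuming PyPDF2 handles PDF and python-docx handles DOCX (a minimal sketch, not the script's exact loader; CSV and JSON handling is omitted for brevity):

```python
from pathlib import Path

def load_text(path):
    """Return the raw text of a .txt, .pdf, or .docx file (sketch only)."""
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix == ".txt":
        return path.read_text(encoding="utf-8")
    if suffix == ".pdf":
        from PyPDF2 import PdfReader               # optional dependency
        reader = PdfReader(str(path))
        return "\n".join(page.extract_text() or "" for page in reader.pages)
    if suffix == ".docx":
        from docx import Document                  # python-docx, optional dependency
        return "\n".join(p.text for p in Document(str(path)).paragraphs)
    raise ValueError(f"Unsupported input format: {suffix}")
```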
Supported output formats:

- `.txt` - Plain text (line-delimited)
- `.json` - JSON arrays
- `.jsonl` - JSON Lines
- `.csv` - CSV with headers
- `.parquet` - Parquet files
The processor applies the following transformations (a regex sketch follows the list):

- Number removal: Removes sequences of 6+ digits
- URL/email cleaning: Strips all web links and email addresses
- Symbol filtering: Removes emojis and decorative Unicode symbols
- Script preservation: Keeps Amharic script (U+1200-U+137F) and Amharic punctuation (።, ፣, ፤, ፥)
- Sentence filtering:
  - Minimum: 6 words
  - Maximum: 25 words
  - Rejects sentences with more than 40% numeric characters
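
A minimal sketch of how these rules could be expressed with regular expressions; the patterns are illustrative approximations, not the script's exact ones:

```python
import re

URL_RE      = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE    = re.compile(r"\S+@\S+\.\S+")
MENTION_RE  = re.compile(r"@\w+")
LONG_NUM_RE = re.compile(r"\d{6,}")                       # 6+ consecutive digits
# Keep Ethiopic script and punctuation (U+1200-U+137F), digits, and whitespace.
NON_AMHARIC_RE = re.compile(r"[^\u1200-\u137F0-9\s]")

def clean_sentence(text):
    for pattern in (URL_RE, EMAIL_RE, MENTION_RE, LONG_NUM_RE, NON_AMHARIC_RE):
        text = pattern.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()

def keep_sentence(sentence, min_words=6, max_words=25, max_digit_ratio=0.4):
    words = sentence.split()
    if not (min_words <= len(words) <= max_words):
        return False                                          # word-count filter
    digits = sum(ch.isdigit() for ch in sentence)
    return digits / max(len(sentence), 1) <= max_digit_ratio  # numeric-density filter
```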
When prompted, you can specify custom ratios:
```
Training ratio (0-1): 0.7
Validation ratio (0-1): 0.15
Test ratio (0-1): 0.15
```
Note: Ratios must sum to 1.0
For reproducible splits, use the same seed value:
```
Random seed for reproducibility (default=42): 12345
```
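
Reproducibility comes from shuffling once with a fixed seed and then slicing. A minimal sketch of that idea (an illustrative function, not the script's actual implementation):

```python
import pandas as pd

def split_dataset(df, train=0.8, valid=0.1, test=0.1, seed=42):
    assert abs(train + valid + test - 1.0) < 1e-9, "Ratios must sum to 1.0"
    shuffled = df.sample(frac=1.0, random_state=seed).reset_index(drop=True)
    n_train = int(len(shuffled) * train)
    n_valid = int(len(shuffled) * valid)
    return (
        shuffled.iloc[:n_train],                      # train
        shuffled.iloc[n_train:n_train + n_valid],     # valid
        shuffled.iloc[n_train + n_valid:],            # test
    )

# The same seed and the same input always produce identical splits.
train_df, valid_df, test_df = split_dataset(pd.read_csv("dataset.csv"), seed=12345)
```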
Example of the processed output files:

dataset.txt (one cleaned sentence per line):

```
<cleaned Amharic sentence>
<cleaned Amharic sentence>
...
```

dataset.csv (a single text column):

```
text
"<cleaned Amharic sentence>"
"<cleaned Amharic sentence>"
...
```

After splitting, you'll have three files with the specified ratios:
- train.csv (80% of data)
- valid.csv (10% of data)
- test.csv (10% of data)
Minimal (required):

```
pandas>=1.3.0
pyarrow>=5.0.0
```

Full (with all features):

```
pandas>=1.3.0
pyarrow>=5.0.0
PyPDF2>=2.0.0
python-docx>=0.8.11
```

If you see "⚠️ Could not save Parquet":

```
pip install pyarrow
```

If PDF files aren't being read:

```
pip install PyPDF2
```

If Word documents aren't being processed:

```
pip install python-docx
```

For very large datasets, consider the following (a chunked-reading sketch follows this list):
- Processing files individually instead of all at once
- Increasing available system memory
- Splitting large files before processing
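
For large CSV inputs, one way to keep memory bounded is to stream the file in chunks instead of loading it whole. A minimal sketch (the file name, the `text` column, and the chunk size are assumptions):

```python
import pandas as pd

seen = set()
kept = []

# Read the CSV 50,000 rows at a time instead of all at once.
for chunk in pd.read_csv("big_corpus.csv", chunksize=50_000):
    for sentence in chunk["text"].dropna():
        sentence = sentence.strip()
        if sentence and sentence not in seen:   # deduplication still spans all chunks
            seen.add(sentence)
            kept.append(sentence)

pd.DataFrame({"text": kept}).to_csv("big_corpus_clean.csv", index=False)
```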
```
.
├── unified_dataset_processor.py   # Main script
├── requirements.txt               # Python dependencies
├── README.md                      # This file
└── your_data/                     # Your corpus directory
    ├── file1.txt
    ├── file2.pdf
    ├── file3.csv
    ├── Processed_Output/          # Generated by script
    │   ├── file1_clean.csv
    │   ├── file2_clean.csv
    │   └── Hugging_Face_Upload/
    │       ├── dataset.txt
    │       ├── dataset.csv
    │       └── dataset.parquet
    └── HF_upload/                 # Generated after splitting
        ├── train.csv
        ├── valid.csv
        └── test.csv
```
Contributions are welcome! Please feel free to submit a Pull Request.
This project is open source and available under the Apache-2.0 License.
- Built for Amharic NLP dataset preparation
- Designed for Hugging Face dataset uploads
- Optimized for machine learning workflows
- v1.0.0 - Initial release with corpus processing and dataset splitting
  - Unified interface with automated workflow
  - Support for multiple file formats
  - Global deduplication
  - Parquet export support
For issues, questions, or contributions, please open an issue on GitHub.
Created by @AbabiyaWorku
Happy Dataset Processing!
