A sentence-aligned multilingual corpus of South African government cabinet statements in 11 official languages
- About
- Dataset
- Quick Start
- Using the Dataset
- Development
- Pipeline Architecture
- Alignment Statistics
- Citation
- License
- Contributors
This repository contains sentence-aligned parallel text from South African government cabinet statements in 11 official languages. The data is scraped from gov.za/cabinet-statements, maintained by the Government Communication and Information System (GCIS).
- 🌐 724,694 aligned sentence pairs across 55 language combinations
- 🔗 Sentence-level alignment using LASER embeddings
- 📈 High-quality alignments with confidence scores (cosine similarity ≥ 0.65)
- 🎯 Ready-to-use splits (train/test/eval) for machine learning
- 🚀 Automated pipeline with GitHub Actions for continuous updates
- 🤗 Available on Hugging Face for easy integration
The sentence-aligned dataset is available on Hugging Face for easy integration with ML workflows:
```python
from datasets import load_dataset

# Load a specific language pair
dataset = load_dataset("dsfsi/govza-sa-cabinet-statements-sentence-aligned", "afr-eng")

# Access splits
train_data = dataset["train"]  # ~70% of data
test_data = dataset["test"]    # ~15% of data
eval_data = dataset["eval"]    # ~15% of data

# Iterate through examples
for example in train_data:
    print(f"Afrikaans: {example['afr']}")
    print(f"English: {example['eng']}")
    print(f"Alignment Score: {example['score']:.2f}")
```

Available configurations: 55 language pair combinations (e.g., afr-eng, xho-zul, eng-nso)
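The 55 configurations correspond to every unordered pair of the 11 languages (11 × 10 / 2 = 55). The sketch below enumerates candidate configuration names. It assumes each name is the two ISO 639-3 codes joined by a hyphen in alphabetical order; that ordering is inferred from the examples above and is not confirmed for every pair, so check the Hugging Face dataset page for the exact names.

```python
from itertools import combinations

# ISO 639-3 codes of the 11 languages that appear in this README.
LANGUAGES = ["afr", "eng", "nbl", "nso", "sot", "ssw", "tsn", "tso", "ven", "xho", "zul"]


def candidate_configs(languages):
    """Return one hyphen-joined name per unordered language pair.

    Assumption: the two codes are joined in alphabetical order (as in "afr-eng").
    """
    return [f"{a}-{b}" for a, b in combinations(sorted(languages), 2)]


configs = candidate_configs(LANGUAGES)
print(len(configs))  # number of unordered pairs
print(configs[:3])
```

Passing one of these names as the second argument to `load_dataset` should select that pair, provided the naming assumption holds.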
The raw multilingual cabinet statements are also available:
- Full dataset: `data/govza-cabinet-statements.json` (all languages combined)
- Per-language CSVs: `data/interim/govza-cabinet-statements-{lang}.csv`
- Aligned outputs: `data/sentence_align_output/` (JSONL format with scores)
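For working with the aligned JSONL outputs directly, a minimal reader might look like the sketch below. It assumes each non-empty line is a JSON object with a numeric `score` key, mirroring the `score` field in the Hugging Face examples; the actual field names in the repository files are not confirmed here, and `read_aligned_pairs` is a hypothetical helper, not part of the repository.

```python
import json
from pathlib import Path


def read_aligned_pairs(path, min_score=0.65):
    """Yield records from a JSONL file whose score is at least min_score.

    Assumption: each non-empty line is a JSON object with a numeric "score" key.
    Records without a score are treated as 0.0 and therefore skipped.
    """
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            if record.get("score", 0.0) >= min_score:
                yield record


# Hypothetical file name; list data/sentence_align_output/ to find the real ones.
# pairs = list(read_aligned_pairs("data/sentence_align_output/example.jsonl", min_score=0.8))
```

Because the function is a generator, large files can be filtered without loading them fully into memory.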
- For scraping: Python 3.8+ (any OS)
- For sentence alignment: Ubuntu 20.04 + Python 3.8 (required for fairseq)
Option 1: Automated Setup (Recommended)

```bash
./setup.sh
```

This interactive script will:
- ✅ Verify Python 3.8 installation
- ✅ Install system dependencies (Ubuntu)
- ✅ Install Python packages
- ✅ Fix LASER script line endings
- ✅ Set up development tools (optional)
Option 2: Using Makefile

```bash
make setup
```

Option 3: Manual Setup

```bash
# Install dependencies
pip install -r requirements.txt

# For Ubuntu: Install system packages
sudo apt-get update
sudo apt-get install build-essential cmake zip

# For development
pip install -r requirements-dev.txt
pre-commit install
```

```bash
# Scrape new cabinet statements
make scrape

# Run sentence alignment
make align

# View all commands
make help
```

```python
from datasets import load_dataset
from transformers import MarianMTModel, MarianTokenizer, Trainer

# Load Afrikaans-English data
dataset = load_dataset(
    "dsfsi/govza-sa-cabinet-statements-sentence-aligned",
    "afr-eng"
)

# Your training code here
# ...
```

```python
from datasets import load_dataset

# Load dataset
dataset = load_dataset("dsfsi/govza-sa-cabinet-statements-sentence-aligned", "afr-eng")

# Filter high-quality alignments (score ≥ 0.8)
high_quality = dataset["train"].filter(lambda x: x["score"] >= 0.8)
print(f"High-quality pairs: {len(high_quality)}")
```

See the Alignment Statistics section below for detailed statistics on alignment quality and coverage across language pairs.
```
gov-za-multilingual/
├── data/                           # Dataset files
│   ├── govza-cabinet-statements.json
│   ├── interim/                    # Per-language CSVs
│   ├── sentence_align_output/      # Aligned pairs (JSONL)
│   └── opt_aligned_out/            # Optimized alignments
├── src/
│   ├── gov_cab_statements_scrape/  # Web scraper
│   ├── sentence_alignment/         # LASER-based alignment
│   └── scripts/                    # Utility scripts
├── huggingface_dataset/            # HF dataset preparation
├── notebooks/                      # Jupyter notebooks
├── .github/workflows/              # Automated CI/CD
├── setup.sh                        # Automated setup script
├── Makefile                        # Common commands
└── README.md                       # This file
```
- Make changes to scraping or alignment code
- Run tests (when available): `pytest`
- Check code quality: `make lint` (runs flake8) and `pre-commit run --all-files` (runs all pre-commit hooks)
- Commit changes: the pre-commit hooks will automatically format code
We welcome contributions! Please:
- Fork the repository
- Create a feature branch: `git checkout -b feature/amazing-feature`
- Make your changes and commit: `git commit -m 'Add amazing feature'`
- Push to your fork: `git push origin feature/amazing-feature`
- Open a Pull Request
See also:
- CLAUDE.md - Architecture and development guide
- QOL_IMPROVEMENTS.md - Recent improvements
```mermaid
graph LR
    A[gov.za/cabinet-statements] -->|Scrape| B[Raw HTML]
    B -->|Parse| C[Multilingual Statements]
    C -->|Save| D[JSON + CSVs]
```
Scraper (src/gov_cab_statements_scrape/):
- Runs weekly via GitHub Actions (Fridays at 2pm UTC)
- Extracts statements in all 11 languages
- Outputs to `data/govza-cabinet-statements.json`
```mermaid
graph LR
    A[Multilingual Statements] -->|Tokenize| B[Sentences]
    B -->|LASER Embed| C[Embeddings]
    C -->|Cosine Similarity| D[Aligned Pairs]
    D -->|Filter ≥0.65| E[High-Quality Alignments]
```
Alignment Pipeline (src/sentence_alignment/):
- Tokenization: NLTK sentence tokenization
- Preprocessing: Language-specific cleaning (URLs, titles, etc.)
- Embedding: LASER (Language-Agnostic SEntence Representations)
- Alignment: Cosine similarity of sentence embeddings
- Filtering: Keep pairs with score ≥ 0.65
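The alignment and filtering steps can be illustrated with a small, dependency-free sketch. It is not the repository's implementation: toy 3-dimensional vectors and made-up sentences stand in for real LASER embeddings, the greedy "best target per source sentence" matching is an assumption, and the `align` and `cosine_similarity` helpers are hypothetical names.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def align(src_sents, tgt_sents, src_vecs, tgt_vecs, threshold=0.65):
    """Pair each source sentence with its most similar target sentence.

    Assumption: greedy best-match per source sentence; the real pipeline's
    matching strategy may differ. Pairs scoring below `threshold` are dropped.
    """
    pairs = []
    for sentence, vec in zip(src_sents, src_vecs):
        scores = [cosine_similarity(vec, other) for other in tgt_vecs]
        if not scores:
            continue
        best = max(range(len(scores)), key=scores.__getitem__)
        if scores[best] >= threshold:
            pairs.append({"src": sentence, "tgt": tgt_sents[best], "score": round(scores[best], 2)})
    return pairs


# Made-up toy data: real inputs would be LASER embeddings of tokenized sentences.
src_sents = ["Cabinet met on Wednesday.", "Unrelated sentence."]
tgt_sents = ["Die Kabinet het Woensdag vergader.", "Nog 'n sin."]
src_vecs = [[0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]
tgt_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

aligned = align(src_sents, tgt_sents, src_vecs, tgt_vecs)
for pair in aligned:
    print(pair)
```

In this toy input the second source sentence has no target above the 0.65 threshold, so only the first pair is kept.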
Two GitHub Actions workflows maintain the dataset:
- Weekly Scraping (`.github/workflows/update_cab_statements.yml`)
  - Runs every Friday at 14:00 UTC
  - Scrapes new statements
  - Commits data files
- Sentence Alignment (`.github/workflows/sentence_alignment_build.yml`)
  - Triggers after scraping completes
  - Aligns new statements across all language pairs
  - Uses Python 3.8 + pip 24.0 (fairseq compatibility)
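To estimate when the data will next refresh, the "every Friday at 14:00 UTC" schedule can be computed locally. This is only a sketch of the schedule as described above; the workflow's actual cron expression is not shown in this README (something like `0 14 * * 5` would match the description, but that is an assumption), and `next_scrape_run` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone

FRIDAY = 4  # datetime.weekday(): Monday is 0, so Friday is 4


def next_scrape_run(now):
    """Return the next Friday 14:00 UTC strictly after `now` (timezone-aware)."""
    now = now.astimezone(timezone.utc)
    candidate = now.replace(hour=14, minute=0, second=0, microsecond=0)
    candidate += timedelta(days=(FRIDAY - now.weekday()) % 7)
    if candidate <= now:
        candidate += timedelta(days=7)  # this week's slot has already passed
    return candidate


print(next_scrape_run(datetime.now(timezone.utc)).isoformat())
```

Scheduled GitHub Actions runs can start later than the nominal time, so treat the result as a lower bound.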
- Total aligned pairs: 724,694
- Language pairs: 55 combinations
- Average alignment score: 0.78 (median: 0.81)
- Quality threshold: Cosine similarity ≥ 0.65
| Source | Target | Aligned Pairs | Avg Score |
|---|---|---|---|
| nbl | ven | 18,984 | 0.75 |
| nso | ssw | 18,697 | 0.82 |
| zul | ssw | 18,563 | 0.84 |
| xho | ssw | 18,387 | 0.83 |
| xho | zul | 18,145 | 0.85 |
| xho | nso | 18,110 | 0.83 |
| xho | tso | 17,954 | 0.81 |
| ssw | tso | 17,880 | 0.82 |
| zul | tso | 17,789 | 0.82 |
| zul | nso | 17,630 | 0.83 |
| nso | tso | 17,617 | 0.81 |
| tsn | tso | 16,681 | 0.80 |
| xho | tsn | 16,571 | 0.81 |
| xho | eng | 16,537 | 0.84 |
| zul | tsn | 16,482 | 0.81 |
| tsn | ssw | 16,386 | 0.81 |
| nso | tsn | 16,179 | 0.80 |
| nbl | sot | 16,163 | 0.73 |
| zul | eng | 16,149 | 0.84 |
| tso | eng | 16,068 | 0.82 |
- Excellent (≥ 0.9): ~15% of pairs
- Good (0.8-0.89): ~45% of pairs
- Fair (0.7-0.79): ~30% of pairs
- Acceptable (0.65-0.69): ~10% of pairs
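The bands above can be applied to any collection of scores, for example the `score` column of a loaded split. A minimal sketch follows, assuming the band lower bounds listed above; the sample scores are made up for illustration, and `quality_band` and `band_shares` are hypothetical helpers.

```python
from collections import Counter

# Band lower bounds taken from the list above, highest first.
BANDS = [(0.9, "Excellent"), (0.8, "Good"), (0.7, "Fair"), (0.65, "Acceptable")]


def quality_band(score):
    """Map an alignment score to its quality band name."""
    for lower, name in BANDS:
        if score >= lower:
            return name
    return "Below threshold"


def band_shares(scores):
    """Return the fraction of scores that falls in each band that occurs."""
    scores = list(scores)
    if not scores:
        return {}
    counts = Counter(quality_band(s) for s in scores)
    return {name: count / len(scores) for name, count in counts.items()}


sample_scores = [0.93, 0.85, 0.81, 0.74, 0.66]  # made-up values for illustration
shares = band_shares(sample_scores)
print(shares)
```

With a loaded split, `band_shares(dataset["train"]["score"])` would give the per-band shares for that language pair, assuming the `score` field shown earlier.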
If you use this dataset in your research, please cite:
```bibtex
@inproceedings{lastrucci-etal-2023-preparing,
    title = "Preparing the Vuk{'}uzenzele and {ZA}-gov-multilingual {S}outh {A}frican multilingual corpora",
    author = "Lastrucci, Richard and Dzingirai, Isheanesu and Rajab, Jenalea and
              Madodonga, Andani and Shingange, Matimba and Njini, Daniel and Marivate, Vukosi",
    booktitle = "Proceedings of the Fourth workshop on Resources for African Indigenous Languages (RAIL 2023)",
    month = may,
    year = "2023",
    address = "Dubrovnik, Croatia",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.rail-1.3",
    pages = "18--25"
}
```

```bibtex
@dataset{marivate_vukosi_2023_7635168,
    author = {Marivate, Vukosi and Shingange, Matimba and Lastrucci, Richard and
              Dzingirai, Isheanesu and Rajab, Jenalea},
    title = {The South African Gov-ZA multilingual corpus},
    month = feb,
    year = 2023,
    publisher = {Zenodo},
    version = {1.0},
    doi = {10.5281/zenodo.7635168},
    url = {https://doi.org/10.5281/zenodo.7635168}
}
```

- Vukosi Marivate (@vukosi) - Project Lead
- Matimba Shingange - Data Collection & Processing
- Richard Lastrucci - Sentence Alignment Pipeline
- Isheanesu Joseph Dzingirai - Infrastructure & Automation
- Jenalea Rajab - Data Analysis & Validation
Data Science for Social Impact (DSFSI) Research Group, University of Pretoria, South Africa
This dataset contains machine-readable data extracted from online cabinet statements from the South African government (GCIS). While efforts were made to ensure accuracy and completeness, there may be errors or discrepancies. No warranties or guarantees are provided. Users should verify information before making decisions based on this data.
- 🤗 Hugging Face Dataset: dsfsi/govza-sa-cabinet-statements-sentence-aligned
- 📄 Paper: arXiv:2303.03750
- 🗂️ Zenodo Archive: DOI: 10.5281/zenodo.7635168
- 📊 Interactive Explorer: Datasette Lite
- 🌐 DSFSI: dsfsi.github.io
Made with ❤️ by the DSFSI Research Group