One of the largest high-quality Arabic text datasets for Large Language Model training and fine-tuning
This repository contains a comprehensive Arabic text dataset specifically curated for training and fine-tuning Large Language Models (LLMs). The dataset represents one of the largest collections of high-quality Arabic text available for machine learning research and commercial applications.
Collected by the RightNow AI team. RightNow AI is the first GPU-powered AI code editor, providing 180x more powerful AI assistance for your entire codebase.
| Metric | Value |
|---|---|
| Total Articles | 743,288 |
| Total Words | 244,153,780 |
| Total Sentences | 12,392,064 |
| Unique Words | 1,529,064 |
| Vocabulary Richness | 0.0063 |
| Average Words/Article | 328.5 |
| Average Sentences/Article | 16.7 |
| Average Words/Sentence | 19.7 |
| High Quality Articles | 185,351 (≥70% quality score) |
| Dataset Size | 8.7GB (JSONL) |
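The derived rows in the table above can be reproduced from the four raw counts. The sketch below assumes that "Vocabulary Richness" means unique words divided by total words (a type-token ratio); the README does not define the metric, so treat that as an assumption.

```python
# Raw counts copied from the statistics table.
total_articles = 743_288
total_words = 244_153_780
total_sentences = 12_392_064
unique_words = 1_529_064

# Assumption: vocabulary richness = unique words / total words.
vocabulary_richness = unique_words / total_words
words_per_article = total_words / total_articles
sentences_per_article = total_sentences / total_articles
words_per_sentence = total_words / total_sentences

print(f"Vocabulary richness:       {vocabulary_richness:.4f}")
print(f"Average words/article:     {words_per_article:.1f}")
print(f"Average sentences/article: {sentences_per_article:.1f}")
print(f"Average words/sentence:    {words_per_sentence:.1f}")
```

Under that assumption, the computed values round to the figures reported in the table.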
- Unprecedented Scale: 743K articles with 244M words - one of the largest Arabic datasets available
- Production-Ready Quality: Advanced cleaning pipeline removes artifacts, references, and templates
- LLM-Optimized Format: JSONL structure designed specifically for modern language model training
- Comprehensive Coverage: Diverse content spanning history, science, literature, politics, and more
- Linguistic Excellence: UTF-8 encoded with proper Arabic text normalization and validation
- Rich Metadata: Complete article information including titles, URLs, and processing timestamps
├── README.md # This file
├── arabic_wikipedia_cleaned.jsonl # Main dataset (8.7GB)
├── arabic_wikipedia_cleaned.txt # Human-readable format (8.5GB)
├── dataset_metadata.json # Dataset metadata
├── dataset/ # Individual cleaned files
│   ├── arabic_text_*.jsonl # 11,880 individual files
│   └── ...
└── analysis_reports/ # Comprehensive analysis
    ├── dataset_analysis_*.json # Detailed JSON analysis
    ├── dataset_report_*.txt # Human-readable report
    ├── dataset_summary_*.csv # Summary statistics
    └── dataset_documentation_*.md # Markdown documentation
| Topic | Articles | Percentage |
|---|---|---|
| General | 316,527 | 42.6% |
| History | 228,884 | 30.8% |
| Geography | 170,062 | 22.9% |
| Science | 118,536 | 15.9% |
| Religion | 104,378 | 14.0% |
| Politics | 87,366 | 11.8% |
| Arts | 78,915 | 10.6% |
| Literature | 76,566 | 10.3% |
| Sports | 71,171 | 9.6% |

Note: the topic percentages sum to more than 100%, which suggests that an article can be counted under more than one topic.
| Quality Level | Articles | Percentage |
|---|---|---|
| Excellent (≥80%) | 130,373 | 17.5% |
| Good (60-80%) | 306,526 | 41.2% |
| Fair (40-60%) | 117,976 | 15.9% |
| Filtered Out (<40%) | 188,413 | 25.3% |
Average Quality Score: 58.3%
Python (Recommended)
import json

# Load the main dataset (one JSON object per line)
articles = []
with open('arabic_wikipedia_cleaned.jsonl', 'r', encoding='utf-8') as f:
    for line in f:
        article = json.loads(line)
        articles.append(article)

print(f"Loaded {len(articles)} articles")

Hugging Face Datasets
from datasets import load_dataset
# Load from local files
dataset = load_dataset('json', data_files='arabic_wikipedia_cleaned.jsonl')

Each line in the JSONL file contains:
{
  "id": "unique_article_id",
  "title": "Article Title",
  "text": "Clean Arabic text content...",
  "url": "source_url",
  "hash": "content_hash",
  "metadata": {
    "language": "ar",
    "source": "Multiple Sources",
    "cleaned": true,
    "processing_date": "2025-01-23T01:00:00"
  }
}

This dataset is perfect for:
- Arabic LLM Training: GPT, BERT, T5, LLaMA models
- Fine-tuning: Domain-specific Arabic models
- Text Generation: Content generation systems
- NLP Research: Arabic language processing research
- Language Modeling: Statistical language models
- Transfer Learning: Pre-trained model adaptation
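
For the training and fine-tuning use cases above, loading all 8.7GB into a Python list may not fit in memory. A minimal sketch of a lazy reader follows; `iter_articles` and `iter_texts` are hypothetical helper names, not part of this repository, and the demo records are made up to keep the example self-contained.

```python
import json
import os
import tempfile
from typing import Iterator


def iter_articles(path: str) -> Iterator[dict]:
    """Yield one article dict per non-empty line of a JSONL file."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def iter_texts(path: str, min_chars: int = 1) -> Iterator[str]:
    """Yield only the "text" field, skipping records shorter than min_chars."""
    for article in iter_articles(path):
        text = article.get("text", "")
        if len(text) >= min_chars:
            yield text


# Demo on a tiny temporary file with made-up records.
sample_records = [
    {"id": "demo-1", "title": "مثال", "text": "نص عربي قصير"},
    {"id": "demo-2", "title": "مثال آخر", "text": ""},
]
with tempfile.NamedTemporaryFile(
    "w", suffix=".jsonl", encoding="utf-8", delete=False
) as tmp:
    for record in sample_records:
        tmp.write(json.dumps(record, ensure_ascii=False) + "\n")
    tmp.write("\n")  # blank line, skipped by the reader

try:
    texts = list(iter_texts(tmp.name))
finally:
    os.remove(tmp.name)

print(texts)
```

With the real dataset, the same generator could be pointed at `arabic_wikipedia_cleaned.jsonl` and consumed one record at a time, for example to feed a tokenizer.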
Our comprehensive processing pipeline ensures high-quality data:
- Source Collection: Aggregated from multiple high-quality Arabic sources
- Content Cleaning: Removed artifacts, references, templates
- Quality Filtering: Applied strict quality criteria (≥70% Arabic content)
- Length Filtering: Removed very short or overly long content
- Deduplication: Eliminated duplicate and near-duplicate content
- Validation: Comprehensive format and encoding validation
- Analysis: Detailed statistical analysis and quality assessment
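
The filtering steps above can be sketched roughly as follows. This is not the pipeline's actual code, which is not shown in this README. It assumes that "Arabic characters" means the basic Arabic Unicode block (U+0600 to U+06FF), that the ratio is taken over non-whitespace characters, and that duplicates are detected by hashing whitespace-normalized text. The word bounds are placeholders borrowed from the article-length range reported in this README, and near-duplicate detection is not covered.

```python
import hashlib
import re

# Assumption: "Arabic characters" = basic Arabic Unicode block only.
ARABIC_CHAR = re.compile(r"[\u0600-\u06FF]")


def arabic_ratio(text: str) -> float:
    """Share of non-whitespace characters that fall in the Arabic block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    arabic = sum(1 for c in chars if ARABIC_CHAR.match(c))
    return arabic / len(chars)


def content_hash(text: str) -> str:
    """SHA-256 of whitespace-normalized text (exact duplicates only)."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def filter_articles(articles, min_ratio=0.70, min_words=7, max_words=20_757):
    """Yield articles that pass length, Arabic-ratio, and duplicate checks."""
    seen = set()
    for article in articles:
        text = article.get("text", "")
        n_words = len(text.split())
        if not (min_words <= n_words <= max_words):
            continue  # length filtering
        if arabic_ratio(text) < min_ratio:
            continue  # quality filtering on Arabic content
        digest = content_hash(text)
        if digest in seen:
            continue  # deduplication
        seen.add(digest)
        yield article


# Made-up sample records for illustration.
arabic_text = "هذا نص عربي قصير يستخدم لاختبار عملية التصفية"
samples = [
    {"id": "a1", "text": arabic_text},
    {"id": "a2", "text": "  " + arabic_text.replace(" ", "  ")},  # duplicate
    {"id": "a3", "text": "This is an English sentence with eight words"},
    {"id": "a4", "text": "نص قصير"},  # too short
]
kept = list(filter_articles(samples))
print([a["id"] for a in kept])
```

Hashing normalized text keeps memory use proportional to the number of unique articles; catching near-duplicates would need a different technique such as MinHash, which this sketch does not attempt.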
- Article Length: 7 - 20,757 words (median: 106)
- Sentence Length: 1 - 3,131 words (average: 21)
- Word Length: 2 - 137 characters (average: 4.9)
- Arabic Content: ≥70% Arabic characters per article
- Encoding: UTF-8 with proper Arabic support
- Text Quality: Professional cleaning and normalization
- Format: JSONL (JSON Lines)
- Encoding: UTF-8
- Language: Arabic (ar)
- Size: 8.7GB (JSONL)
- Articles: 743,288 unique articles
- Vocabulary: 1.5M unique words
- Processing Date: January 2025
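
A quick way to check that a record matches the specifications above is sketched below. `validate_record` is a hypothetical helper, not part of this repository; it assumes every field shown in the record schema is required and that `processing_date` is an ISO 8601 timestamp, as in the schema example.

```python
from datetime import datetime

# Assumption: all top-level fields from the documented schema are required.
REQUIRED_FIELDS = ("id", "title", "text", "url", "hash", "metadata")


def validate_record(record: dict) -> list:
    """Return a list of human-readable problems; empty means the record passed."""
    problems = []
    for field in REQUIRED_FIELDS:
        if field not in record:
            problems.append(f"missing field: {field}")

    metadata = record.get("metadata")
    if not isinstance(metadata, dict):
        metadata = {}

    if metadata.get("language") != "ar":
        problems.append("metadata.language is not 'ar'")

    try:
        datetime.fromisoformat(metadata.get("processing_date"))
    except (TypeError, ValueError):
        problems.append("metadata.processing_date is not an ISO 8601 timestamp")

    return problems


# Placeholder record mirroring the schema example in this README.
example = {
    "id": "unique_article_id",
    "title": "Article Title",
    "text": "Clean Arabic text content...",
    "url": "source_url",
    "hash": "content_hash",
    "metadata": {
        "language": "ar",
        "source": "Multiple Sources",
        "cleaned": True,
        "processing_date": "2025-01-23T01:00:00",
    },
}
print(validate_record(example))
print(validate_record({"id": "only-an-id"}))
```

Returning a list of problems rather than raising on the first one makes it easier to summarize issues across many records.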
This dataset is released under the Apache License 2.0. See the LICENSE file for the full license text.
We welcome contributions to improve the dataset quality and coverage:
- Quality Issues: Report data quality problems
- Enhancement Suggestions: Propose improvements
- Additional Sources: Suggest new high-quality Arabic sources
- Processing Improvements: Contribute to cleaning algorithms
RightNow AI - The first GPU-powered AI code editor, providing 180x more powerful AI assistance
- Website: https://rightnowai.co/
- Documentation: https://docs.rightnowai.co/
- Discord: Join our community
- Twitter/X: @rightnowai_co
- Issues: GitHub Issues
This corpus was collected through the collaborative efforts of:
- RightNow AI - The first GPU-powered AI code editor, providing 180x more powerful AI assistance for your entire codebase
Special thanks to:
- The Arabic language community for content creation
- Open source contributors for processing tools
- The machine learning community for feedback and validation
- RightNow AI for their dedication to advancing Arabic NLP research
If you use this dataset in your research, please cite:
@dataset{rightnow_arabic_llm_corpus_2025,
  title={RightNow Arabic LLM Corpus},
  author={RightNow AI team},
  year={2025},
  publisher={GitHub},
  url={https://github.com/RightNow-AI/rightnow-arabic-llm-corpus},
  note={One of the largest high-quality Arabic text datasets for LLM training}
}

If this dataset helps your research or project, please star this repository!
Built for the Arabic AI community by RightNow AI
