A Python toolkit dedicated to the Iraqi Arabic dialect: tokenization, normalization, sentiment analysis, dialect detection, and more.
Features • Install • Quick Start • API • Arabic
Tools built for Modern Standard Arabic (MSA) perform poorly on the Iraqi dialect. Everyday Iraqi words such as "شلونك" (how are you), "اكو/ماكو" (there is / there is not), "هواية" (a lot), and "شكد" (how much) are typically missed by standard tokenizers and sentiment models.
This is the first toolkit purpose-built for Iraqi Arabic. It handles:
- Iraqi-specific stopwords (over 200 curated)
- Dialect-vs-MSA detection
- Normalization (Arabic numerals, punctuation, diacritics, kashida)
- Sentiment analysis trained on Iraqi text
- Word segmentation with prefix/suffix stripping (و، ال، ...)
- Text cleaning utilities for social media (Twitter/X, Facebook, WhatsApp dumps)
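As a rough illustration of the prefix/suffix stripping listed above, here is a minimal sketch. It is not the toolkit's implementation: the function name `sketch_segment`, the tiny affix lists, and the minimum stem length are all assumptions chosen for the example.

```python
# Hypothetical affix lists for illustration only; a real segmenter
# would need far larger, linguistically curated inventories.
PREFIXES = ("ال", "و", "ب")
SUFFIXES = ("هم", "ها", "ه")
MIN_STEM = 3  # assumed lower bound so short words are left intact


def sketch_segment(word: str) -> dict:
    """Greedily strip known prefixes, then at most one suffix."""
    prefixes = []
    stripped = True
    while stripped:
        stripped = False
        for p in PREFIXES:
            # Only strip if enough letters remain to form a stem.
            if word.startswith(p) and len(word) - len(p) >= MIN_STEM:
                prefixes.append(p)
                word = word[len(p):]
                stripped = True
                break
    suffix = ""
    for s in SUFFIXES:  # longer suffixes are listed first
        if word.endswith(s) and len(word) - len(s) >= MIN_STEM:
            suffix = s
            word = word[: -len(s)]
            break
    return {"prefix": "+".join(prefixes), "stem": word, "suffix": suffix}


result = sketch_segment("وبكتابهم")
```

Greedy stripping like this over-segments words whose first letter merely looks like a prefix, which is one reason a dialect-specific lexicon matters.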
| Module | Description |
|---|---|
| `tokenize` | Word & sentence tokenization aware of Iraqi conjunctions |
| `normalize` | Unify forms of letters (أ/إ/آ → ا), strip diacritics, normalize numerals |
| `dialect` | Detect Iraqi vs MSA vs other Arabic dialects |
| `sentiment` | Rule-based + model-based sentiment scoring |
| `segment` | Affix-aware morphological segmentation |
| `clean` | Remove URLs, mentions, emojis, repeated chars (شكككككرا → شكرا) |
| `stopwords` | Curated Iraqi + MSA stopword sets |
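The normalization steps named in the table can be sketched with the standard library alone. This is an illustrative approximation, not the code behind `normalize`: the helper name `sketch_normalize` and the exact rules (which letters to unify, collapsing runs of three or more letters) are assumptions.

```python
import re
import unicodedata

# Assumed mapping: fold alef variants carrying hamza or madda to bare alef.
_ALEF_FORMS = str.maketrans({"أ": "ا", "إ": "ا", "آ": "ا"})
_TATWEEL = "\u0640"  # kashida, used only to stretch words visually
# A letter (not digit, not underscore) repeated three or more times.
_ELONGATION = re.compile(r"([^\W\d_])\1{2,}")


def sketch_normalize(text: str) -> str:
    text = text.translate(_ALEF_FORMS)
    # Arabic diacritics are nonspacing marks (category Mn); note this
    # also removes nonspacing marks from any other script.
    text = "".join(
        ch for ch in text
        if unicodedata.category(ch) != "Mn" and ch != _TATWEEL
    )
    # Map any Unicode decimal digit (e.g. Arabic-Indic) to ASCII.
    text = "".join(
        str(unicodedata.decimal(ch)) if ch.isdecimal() else ch for ch in text
    )
    # Collapse elongated letters down to a single occurrence.
    return _ELONGATION.sub(r"\1", text)


print(sketch_normalize("شكككككرا"))  # → شكرا
print(sketch_normalize("١٢٣"))       # → 123
```

Digits are deliberately excluded from the elongation rule so that a number such as 1000 is not shortened.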
Install with pip:

    pip install iraqi-arabic-nlp

Or from source:

    git clone https://github.com/kasimmj/iraqi-arabic-nlp.git
    cd iraqi-arabic-nlp
    pip install -e .

Quick start:

    from iraqi_nlp import normalize, tokenize, sentiment, dialect

    text = "شلونك صديقي، اكو هواية اكل زين بهالمطعم!!!"

    # Normalize: unify letter forms, strip diacritics, fix elongations
    clean = normalize(text)
    # → "شلونك صديقي اكو هواية اكل زين بهالمطعم"

    # Tokenize with Iraqi-aware splitting
    tokens = tokenize(clean)
    # → ['شلونك', 'صديقي', 'اكو', 'هواية', 'اكل', 'زين', 'بهالمطعم']

    # Detect dialect
    print(dialect(text))
    # → {'dialect': 'iraqi', 'confidence': 0.94}

    # Sentiment
    print(sentiment(text))
    # → {'polarity': 'positive', 'score': 0.82}

`normalize` applies the full normalization pipeline:

    from iraqi_nlp import normalize
    normalize("الْأَخُ شُكْرِيّ")  # → "الاخ شكري"
    normalize("شكككككرا")  # → "شكرا"
    normalize("١٢٣")  # → "123"

`tokenize` splits text into words or sentences:

    from iraqi_nlp import tokenize
    tokenize("شلونك زلمة", mode='word')
    # → ['شلونك', 'زلمة']
    tokenize("هلا. شخبارك؟ زين والله.", mode='sentence')
    # → ['هلا.', 'شخبارك؟', 'زين والله.']

`dialect` classifies text as Iraqi, MSA, Gulf, Egyptian, or Levantine:

    from iraqi_nlp import dialect
    dialect("شكد ساعة الحين")
    # → {'dialect': 'iraqi', 'confidence': 0.97}
    dialect("ما هو الوقت الآن")
    # → {'dialect': 'msa', 'confidence': 0.91}

`sentiment` provides rule- and lexicon-based sentiment scoring for Iraqi text:

    from iraqi_nlp import sentiment
    sentiment("هواية زين هالشي ماكو مثله")
    # → {'polarity': 'positive', 'score': 0.88}
    sentiment("ماخوش، ندمت لما اشتريته")
    # → {'polarity': 'negative', 'score': -0.74}

`segment` strips affixes to get the stem:

    from iraqi_nlp import segment
    segment("وبكتابهم")
    # → {'prefix': 'و+ب', 'stem': 'كتاب', 'suffix': 'هم'}

Use cases:

- Sentiment monitoring of Iraqi social media campaigns
- Chatbots that understand "اكو/ماكو" instead of failing
- Search engines that match spelling variants such as "شلونك" ↔ "شلونچ"
- Academic research on Iraqi corpora
- EdTech apps localized for Iraqi students
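For the search use case, one simple approach is to index words under a folded key so that spelling variants land in the same bucket. The sketch below is only an assumption about how this could be done: `search_key` and `build_index` are hypothetical names, and the single rule folding چ to ك stands in for what would be a much larger variant table.

```python
import re

# Assumption for illustration: the only spelling difference folded here
# is the letter چ versus ك. Real variant rules would be broader.
_FOLD = str.maketrans({"چ": "ك"})
# A letter repeated three or more times (elongation for emphasis).
_ELONGATION = re.compile(r"([^\W\d_])\1{2,}")


def search_key(word: str) -> str:
    """Return a folded form used only for matching, never for display."""
    return _ELONGATION.sub(r"\1", word.translate(_FOLD))


def build_index(words):
    """Group surface forms under their folded key."""
    index = {}
    for w in words:
        index.setdefault(search_key(w), set()).add(w)
    return index


index = build_index(["شلونك", "شلونچ", "هواية"])
matches = index[search_key("شلونچ")]  # both spellings share one key
```

Keeping the original surface forms in the index means results can still be displayed exactly as written.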
The Iraqi dialect is rich and varies between Baghdad, Basra, Mosul, and other regions. Contributions are welcome:
- Adding regional vocabulary
- Annotating training data
- Filing dialect edge cases
- Adding more sub-dialects
See CONTRIBUTING.md.
If you use this in research, please cite:

    @software{kasim2026iraqi,
      author = {Mohammed, Kasim},
      title = {Iraqi Arabic NLP: A Toolkit for the Iraqi Dialect},
      year = {2026},
      url = {https://github.com/kasimmj/iraqi-arabic-nlp}
    }

MIT © 2026 Kasim Mohammed

The first NLP library dedicated to the Iraqi dialect

Conventional tools such as tokenizers and sentiment models fail on words like "شلونك", "اكو/ماكو", "هواية", and "شكد". This library targets the Iraqi dialect specifically.
- Text normalization (hamzas, diacritics, numerals)
- Dialect-aware word tokenization
- Iraqi sentiment analysis
- Dialect identification (Iraqi/Egyptian/Gulf/Levantine/MSA)
- Iraqi stopwords (200+ words)

    from iraqi_nlp import sentiment
    sentiment("هواية زين هالشي")
    # → {'polarity': 'positive', 'score': 0.88}

Built with ❤️ in Iraq 🇮🇶 by Kasim Mohammed