kasimmj/iraqi-arabic-nlp: 🇮🇶 The first NLP toolkit for the Iraqi Arabic dialect (tokenization, normalization, sentiment, and dialect detection).

A Python toolkit dedicated to the Iraqi Arabic dialect — tokenization, normalization, sentiment analysis, dialect detection, and more.

Features • Install • Quick Start • API • Arabic

✨ Why Iraqi Arabic NLP?

Tools built for Modern Standard Arabic (MSA) perform poorly on the Iraqi dialect. Common words such as "شلونك" (how are you), "اكو/ماكو" (there is / there isn't), "هواية" (a lot), and "شكد" (how much) are typically missed by standard tokenizers and sentiment models.

This is the first toolkit purpose-built for Iraqi Arabic. It handles:

✅ Iraqi-specific stopwords (over 200 curated)
✅ Dialect-vs-MSA detection
✅ Normalization (Arabic-Indic numerals, punctuation, diacritics, kashida)
✅ Sentiment analysis trained on Iraqi text
✅ Word segmentation with prefix/suffix stripping (و، ال، ـك، ـها...)
✅ Text cleaning utilities for social media (Twitter/X, Facebook, WhatsApp dumps)

🚀 Features

| Module | Description |
| --- | --- |
| `tokenize` | Word and sentence tokenization aware of Iraqi conjunctions |
| `normalize` | Unify letter forms (أ/إ/آ → ا), strip diacritics, normalize numerals |
| `dialect` | Detect Iraqi vs. MSA vs. other Arabic dialects |
| `sentiment` | Rule-based and model-based sentiment scoring |
| `segment` | Affix-aware morphological segmentation |
| `clean` | Remove URLs, mentions, emojis, and repeated characters (شكككككرا → شكرا) |
| `stopwords` | Curated Iraqi and MSA stopword sets |
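
To make the `clean` row more concrete, here is a minimal sketch of that kind of social-media cleanup using only the standard library. It is not the toolkit's implementation: the name `clean_text`, the approximate emoji ranges, and the rule of collapsing three or more repeated letters to one are all assumptions made for this example.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
# Approximate emoji coverage: a few common Unicode blocks, not exhaustive.
EMOJI_RE = re.compile(
    "[\U0001F300-\U0001FAFF\U00002600-\U000027BF\U0001F1E6-\U0001F1FF\uFE0F]"
)
# Three or more identical letters in a row (digits are left alone).
REPEAT_RE = re.compile(r"([^\W\d_])\1{2,}")


def clean_text(text: str) -> str:
    """Hypothetical cleanup: drop URLs, mentions, emoji; collapse elongations."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = EMOJI_RE.sub(" ", text)
    text = REPEAT_RE.sub(r"\1", text)
    # Squeeze the whitespace left behind by the removals.
    return " ".join(text.split())


print(clean_text("شكككككرا @ali https://example.com 😋"))  # → شكرا
```

Restricting the repeat rule to letters keeps numbers such as 1000 intact, at the cost of leaving repeated punctuation untouched.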

📦 Installation

pip install iraqi-arabic-nlp

Or from source:

git clone https://github.com/kasimmj/iraqi-arabic-nlp.git
cd iraqi-arabic-nlp
pip install -e .

⚡ Quick Start

from iraqi_nlp import normalize, tokenize, sentiment, dialect

text = "شلونك صديقي؟ اكو هواية اكل زين بهالمطعم!!! 😋"

# Normalize: unify letter forms, strip diacritics, fix elongations
clean = normalize(text)
# → "شلونك صديقي اكو هواية اكل زين بهالمطعم"

# Tokenize with Iraqi-aware splitting
tokens = tokenize(clean)
# → ['شلونك', 'صديقي', 'اكو', 'هواية', 'اكل', 'زين', 'بهالمطعم']

# Detect dialect
print(dialect(text))
# → {'dialect': 'iraqi', 'confidence': 0.94}

# Sentiment
print(sentiment(text))
# → {'polarity': 'positive', 'score': 0.82}

📖 API Reference

`normalize(text: str) -> str`

Applies the full normalization pipeline (letter unification, diacritic stripping, elongation collapsing, numeral conversion):

from iraqi_nlp import normalize

normalize("الْأَخُ شَكَرَكَ")        # → "الاخ شكرك"
normalize("شكككككرا")                # → "شكرا"
normalize("١٢٣")                     # → "123"
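
The three examples above can be reproduced with a few standard-library steps. The sketch below is an illustration under assumptions, not the library's code: `normalize_sketch`, the chosen diacritic range, and the order of the steps are hypothetical.

```python
import re

# Tanween, fatha, damma, kasra, shadda, sukun, plus superscript alef.
DIACRITICS_RE = re.compile("[\u064B-\u0652\u0670]")
KASHIDA = "\u0640"  # tatweel, the elongation stroke
# Hamza-carrying and madda alef forms folded to bare alef.
ALEF_FORMS = str.maketrans({"أ": "ا", "إ": "ا", "آ": "ا"})
# Arabic-Indic digits U+0660..U+0669 mapped to ASCII 0..9.
DIGITS = {0x0660 + i: str(i) for i in range(10)}
# Three or more identical letters in a row.
ELONGATION_RE = re.compile(r"([^\W\d_])\1{2,}")


def normalize_sketch(text: str) -> str:
    """Hypothetical normalization pipeline mirroring the README examples."""
    text = DIACRITICS_RE.sub("", text)
    text = text.replace(KASHIDA, "")
    text = text.translate(ALEF_FORMS).translate(DIGITS)
    return ELONGATION_RE.sub(r"\1", text)


print(normalize_sketch("١٢٣"))       # → 123
print(normalize_sketch("شكككككرا"))  # → شكرا
```

Stripping diacritics before collapsing elongations matters: a vowel mark between repeated letters would otherwise break the run.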

`tokenize(text: str, mode: str = 'word') -> list[str]`

from iraqi_nlp import tokenize

tokenize("شلونك زلمة", mode='word')
# → ['شلونك', 'زلمة']

tokenize("هلا. شخبارك؟ زين والله.", mode='sentence')
# → ['هلا.', 'شخبارك؟', 'زين والله.']
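
For readers curious how the two modes might differ, here is a small regex-based sketch. It only illustrates word versus sentence splitting and has none of the Iraqi-specific handling the toolkit describes; `tokenize_sketch` and both patterns are assumptions, and the word pattern expects text whose diacritics have already been stripped.

```python
import re

# Split on whitespace that follows sentence-ending punctuation,
# including the Arabic question mark.
SENTENCE_END_RE = re.compile(r"(?<=[.!?؟])\s+")
WORD_RE = re.compile(r"\w+")


def tokenize_sketch(text: str, mode: str = "word") -> list[str]:
    """Hypothetical tokenizer with 'word' and 'sentence' modes."""
    if mode == "sentence":
        return [s for s in SENTENCE_END_RE.split(text.strip()) if s]
    if mode == "word":
        return WORD_RE.findall(text)
    raise ValueError(f"unknown mode: {mode!r}")


print(tokenize_sketch("هلا. شخبارك؟ زين والله.", mode="sentence"))
# → ['هلا.', 'شخبارك؟', 'زين والله.']
```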

`dialect(text: str) -> dict`

Classifies text as Iraqi, MSA, Gulf, Egyptian, or Levantine.

from iraqi_nlp import dialect

dialect("شكد ساعة الحين")
# → {'dialect': 'iraqi', 'confidence': 0.97}

dialect("ما هو الوقت الآن")
# → {'dialect': 'msa', 'confidence': 0.91}
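
One simple way to approach this kind of classification is marker-word voting: count how many tokens belong to each dialect's list of distinctive words and pick the dialect with the most hits. The sketch below shows that idea only. The README does not say how `dialect` works internally, the marker lists are tiny hand-picked examples, and the `confidence` value here is just the winning share of hits, not a calibrated probability.

```python
from collections import Counter

# Hypothetical marker words for illustration; real lists would be much larger.
MARKERS = {
    "iraqi": {"شلونك", "اكو", "ماكو", "هواية", "شكد"},
    "msa": {"ماذا", "الآن", "لماذا"},
    "egyptian": {"ازيك", "عايز", "دلوقتي"},
}


def detect_dialect_sketch(text: str) -> dict:
    """Hypothetical detector: vote by counting dialect marker words."""
    tokens = text.split()
    hits = Counter()
    for label, words in MARKERS.items():
        hits[label] = sum(1 for t in tokens if t in words)
    total = sum(hits.values())
    if total == 0:
        return {"dialect": "unknown", "confidence": 0.0}
    label, count = hits.most_common(1)[0]
    return {"dialect": label, "confidence": count / total}


print(detect_dialect_sketch("شكد ساعة الحين"))
# → {'dialect': 'iraqi', 'confidence': 1.0}
```

A text with no marker words falls back to `unknown`, which is one reason production systems tend to use character n-gram or learned models instead of fixed word lists.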

`sentiment(text: str) -> dict`

Rule- and lexicon-based sentiment scoring for Iraqi text.

from iraqi_nlp import sentiment

sentiment("هواية زين هالشي ماكو مثله")
# → {'polarity': 'positive', 'score': 0.88}

sentiment("ماخوش، ندمت لما اشتريته")
# → {'polarity': 'negative', 'score': -0.74}
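
A lexicon-based scorer can be sketched in a few lines: count positive and negative words and normalize the difference to the range -1 to 1. The version below is a toy under stated assumptions. The function `sentiment_sketch`, the two mini-lexicons, and the `neutral` label are hypothetical, so its scores will not match the values shown above.

```python
import re

# Hypothetical mini-lexicons; a usable lexicon would be far larger.
POSITIVE = {"زين", "خوش"}
NEGATIVE = {"ماخوش", "ندمت"}


def sentiment_sketch(text: str) -> dict:
    """Hypothetical scorer: (pos - neg) / (pos + neg) over lexicon hits."""
    tokens = re.findall(r"\w+", text)  # drops punctuation such as the Arabic comma
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    if pos == neg:
        return {"polarity": "neutral", "score": 0.0}
    score = (pos - neg) / (pos + neg)
    return {"polarity": "positive" if score > 0 else "negative", "score": score}


print(sentiment_sketch("ماخوش، ندمت لما اشتريته"))
# → {'polarity': 'negative', 'score': -1.0}
```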

`segment(token: str) -> dict`

Strips affixes from a token and returns its prefix, stem, and suffix.

from iraqi_nlp import segment

segment("وبكتابهم")
# → {'prefix': 'و+ب', 'stem': 'كتاب', 'suffix': 'هم'}
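
The affix stripping shown above can be approximated with greedy matching against short prefix and suffix lists, guarded by a minimum stem length so that root letters are not removed. This is a sketch of the general technique, not the toolkit's algorithm: `segment_sketch`, the partial affix lists, and the three-letter minimum are assumptions.

```python
PREFIXES = ("ال", "و", "ب", "ف", "ل")  # hypothetical, partial list
SUFFIXES = ("هم", "ها", "ك")           # hypothetical, partial list
MIN_STEM = 3                           # never shrink the stem below this length


def segment_sketch(token: str) -> dict:
    """Hypothetical segmenter: strip stacked prefixes, then one suffix."""
    prefixes = []
    stem = token
    stripped = True
    while stripped:  # prefixes can stack, e.g. a conjunction plus a preposition
        stripped = False
        for p in PREFIXES:
            if stem.startswith(p) and len(stem) - len(p) >= MIN_STEM:
                prefixes.append(p)
                stem = stem[len(p):]
                stripped = True
                break
    suffix = ""
    for s in SUFFIXES:
        if stem.endswith(s) and len(stem) - len(s) >= MIN_STEM:
            suffix, stem = s, stem[: -len(s)]
            break
    return {"prefix": "+".join(prefixes), "stem": stem, "suffix": suffix}


print(segment_sketch("وبكتابهم"))
# → {'prefix': 'و+ب', 'stem': 'كتاب', 'suffix': 'هم'}
```

The length guard is a blunt heuristic: it keeps a short word like "ولد" intact but will still over-strip longer words whose first root letter happens to match a prefix.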

🎯 Real-World Use Cases

Sentiment monitoring of Iraqi social media campaigns
Chatbots that understand "اكو/ماكو" instead of failing
Search engines that match "شلونك" ↔ "شلونچ" ↔ "شلونچن"
Academic research on Iraqi corpora
EdTech apps localized for Iraqi students

🤝 Contributing

The Iraqi dialect is rich and varies across Baghdad, Basra, Mosul, and other regions. We welcome contributions:

🆕 Adding regional vocabulary
📊 Annotating training data
🐛 Filing dialect edge cases
🌍 Adding more sub-dialects

See CONTRIBUTING.md.

📚 Citation

If you use this in research, please cite:

@software{kasim2026iraqi,
  author = {Mohammed, Kasim},
  title = {Iraqi Arabic NLP: A Toolkit for the Iraqi Dialect},
  year = {2026},
  url = {https://github.com/kasimmj/iraqi-arabic-nlp}
}

📜 License

🇮🇶 In Arabic

The first NLP library dedicated to the Iraqi dialect

Conventional tools such as tokenizers and sentiment models fail on words like "شلونك", "اكو/ماكو", "هواية", and "شكد". This library handles the Iraqi dialect specifically.

Features:

🔤 Text normalization (hamzas, diacritics, numerals)
✂️ Dialect-aware word tokenization
🧠 Iraqi sentiment analysis
🌍 Dialect identification (Iraqi/Egyptian/Gulf/Levantine/MSA)
🗂️ Iraqi stopwords (200+ words)

Example:

from iraqi_nlp import sentiment
sentiment("هواية زين هالشي")
# → {'polarity': 'positive', 'score': 0.88}

Built with ❤️ in Iraq 🇮🇶 by Kasim Mohammed

