kasimmj/iraqi-arabic-nlp

Iraqi Arabic NLP

A Python toolkit dedicated to the Iraqi Arabic dialect: tokenization, normalization, sentiment analysis, dialect detection, and more.

Features • Install • Quick Start • API • العربية


✨ Why Iraqi Arabic NLP?

Modern Standard Arabic (MSA) tools fail spectacularly on Iraqi dialect. Words like "شلونك", "اكو/ماكو", "هواية", and "شكد" are completely missed by standard tokenizers and sentiment models.

This is the first toolkit purpose-built for Iraqi Arabic. It handles:

  • ✅ Iraqi-specific stopwords (over 200 curated)
  • ✅ Dialect-vs-MSA detection
  • ✅ Normalization (Arabic-Indic numerals, punctuation, diacritics, kashida)
  • ✅ Sentiment analysis trained on Iraqi text
  • ✅ Word segmentation with prefix/suffix stripping (و، ال، ـك، ـها...)
  • ✅ Text cleaning utilities for social media (Twitter/X, Facebook, WhatsApp dumps)

🚀 Features

| Module | Description |
| --- | --- |
| tokenize | Word and sentence tokenization aware of Iraqi conjunctions |
| normalize | Unify forms of letters (أ/إ/آ → ا), strip diacritics, normalize numerals |
| dialect | Detect Iraqi vs MSA vs other Arabic dialects |
| sentiment | Rule-based + model-based sentiment scoring |
| segment | Affix-aware morphological segmentation |
| clean | Remove URLs, mentions, emojis, repeated chars (شكككككرا → شكرا) |
| stopwords | Curated Iraqi + MSA stopword sets |
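
To make the clean module's description concrete, here is a minimal regex-based sketch of that kind of cleaning. It is not taken from the library's source: `clean_text` and every pattern below are hypothetical, and the emoji range is only a rough approximation.

```python
import re

# Illustrative patterns only; none of these come from the library itself.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
# Rough emoji coverage: main pictograph blocks plus misc symbols and dingbats.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")
# Any non-digit character repeated three or more times in a row.
REPEAT_RE = re.compile(r"(\D)\1{2,}")


def clean_text(text: str) -> str:
    """Hypothetical helper: drop URLs, mentions, emojis; collapse elongations."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = EMOJI_RE.sub("", text)
    text = REPEAT_RE.sub(r"\1", text)  # keep a single copy of the repeated char
    return " ".join(text.split())      # squeeze leftover whitespace


print(clean_text("شكككككرا @ali https://example.com"))  # → شكرا
```

The repeat rule only fires on runs of three or more, so legitimately doubled letters (as in الله) are left alone, and digits are excluded so that numbers such as 1000 survive.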

📦 Installation

pip install iraqi-arabic-nlp

Or from source:

git clone https://github.com/kasimmj/iraqi-arabic-nlp.git
cd iraqi-arabic-nlp
pip install -e .

⚡ Quick Start

from iraqi_nlp import normalize, tokenize, sentiment, dialect

text = "شلونك صديقي؟ اكو هواية اكل زين بهالمطعم!!! 😋"

# Normalize: unify letter forms, strip diacritics, fix elongations
clean = normalize(text)
# → "شلونك صديقي اكو هواية اكل زين بهالمطعم"

# Tokenize with Iraqi-aware splitting
tokens = tokenize(clean)
# → ['شلونك', 'صديقي', 'اكو', 'هواية', 'اكل', 'زين', 'بهالمطعم']

# Detect dialect
print(dialect(text))
# → {'dialect': 'iraqi', 'confidence': 0.94}

# Sentiment
print(sentiment(text))
# → {'polarity': 'positive', 'score': 0.82}

📖 API Reference

normalize(text: str) -> str

Applies the full normalization pipeline:

from iraqi_nlp import normalize

normalize("الْأَخُ شَكَرَكَ")        # → "الاخ شكرك"
normalize("شكككككرا")                # → "شكرا"
normalize("١٢٣")                     # → "123"
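
The three examples above correspond to well-known normalization steps. The sketch below shows one plausible way to chain them. It is an assumption about how such a pipeline could work, not the library's implementation, and `normalize_sketch` is a hypothetical name.

```python
import re

# Tanween, fatha, damma, kasra, shadda, sukun, plus superscript alef.
DIACRITICS_RE = re.compile(r"[\u064B-\u0652\u0670]")
TATWEEL = "\u0640"  # kashida
# أ / إ / آ all map to bare alef ا.
ALEF_MAP = str.maketrans({"\u0623": "\u0627", "\u0625": "\u0627", "\u0622": "\u0627"})
# Arabic-Indic digits ٠..٩ map to 0..9.
DIGIT_MAP = str.maketrans({chr(0x0660 + i): str(i) for i in range(10)})
# An Arabic letter repeated three or more times in a row.
ELONGATION_RE = re.compile(r"([\u0621-\u064A])\1{2,}")


def normalize_sketch(text: str) -> str:
    """Hypothetical pipeline: diacritics, kashida, alef forms, digits, elongation."""
    text = DIACRITICS_RE.sub("", text)
    text = text.replace(TATWEEL, "")
    text = text.translate(ALEF_MAP)
    text = text.translate(DIGIT_MAP)
    return ELONGATION_RE.sub(r"\1", text)


print(normalize_sketch("شكككككرا"))  # → شكرا
print(normalize_sketch("١٢٣"))       # → 123
```

Diacritics are removed before the elongation step so that a vowel mark sitting between repeated letters does not hide the repetition.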

tokenize(text: str, mode: str = 'word') -> list[str]

from iraqi_nlp import tokenize

tokenize("شلونك زلمة", mode='word')
# → ['شلونك', 'زلمة']

tokenize("هلا. شخبارك؟ زين والله.", mode='sentence')
# → ['هلا.', 'شخبارك؟', 'زين والله.']
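
For readers curious how the two modes could behave, here is a small regex sketch. It assumes sentence boundaries are marked by ".", "!", "?" or the Arabic question mark "؟" followed by whitespace; `split_sentences` and `split_words` are hypothetical names, and the library's Iraqi-aware rules are presumably richer than this.

```python
import re

# Split on whitespace that follows a sentence-final mark (including Arabic ؟).
SENTENCE_END_RE = re.compile(r"(?<=[.!?\u061F])\s+")


def split_sentences(text: str) -> list[str]:
    """Hypothetical sentence splitter; keeps the closing punctuation."""
    return [s for s in SENTENCE_END_RE.split(text.strip()) if s]


def split_words(text: str) -> list[str]:
    """Hypothetical word splitter: runs of letters and digits."""
    return re.findall(r"\w+", text)


print(split_sentences("هلا. شخبارك؟ زين والله."))
print(split_words("شلونك زلمة"))
```

Arabic diacritics are combining marks and do not match `\w`, so text should be normalized before `split_words` is applied; otherwise vocalized words are broken apart.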

dialect(text: str) -> dict

Classifies text as Iraqi, MSA, Gulf, Egyptian, or Levantine.

from iraqi_nlp import dialect

dialect("شكد ساعة الحين")
# → {'dialect': 'iraqi', 'confidence': 0.97}

dialect("ما هو الوقت الآن")
# → {'dialect': 'msa', 'confidence': 0.91}
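
The README does not say how the classifier works. As a purely illustrative baseline, dialect marker words can be counted per dialect, as sketched below. The Iraqi markers are the words quoted earlier in this README. The Egyptian and Levantine lists, the `guess_dialect` name, and the confidence formula are assumptions made for this example.

```python
import re

# Marker words per dialect. Only the Iraqi set comes from this README;
# the other two sets are small illustrative assumptions.
MARKERS = {
    "iraqi": {"شلونك", "اكو", "ماكو", "هواية", "شكد"},
    "egyptian": {"ازيك", "عايز", "دلوقتي"},
    "levantine": {"كيفك", "هلق", "بدي"},
}


def guess_dialect(text: str) -> dict:
    """Hypothetical baseline: pick the dialect with the most marker hits."""
    tokens = re.findall(r"\w+", text)
    hits = {name: sum(t in words for t in tokens) for name, words in MARKERS.items()}
    total = sum(hits.values())
    if total == 0:
        # No marker found: this sketch cannot tell MSA from anything else.
        return {"dialect": "unknown", "confidence": 0.0}
    best = max(hits, key=hits.get)
    # Confidence here is just the winning share of all marker hits.
    return {"dialect": best, "confidence": hits[best] / total}


print(guess_dialect("شلونك اكو هواية شغل"))
```

A marker count like this says nothing about text that contains no marker words, which is why a trained model is the usual choice for separating MSA from the dialects.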

sentiment(text: str) -> dict

Rule- and lexicon-based sentiment scoring for Iraqi text.

from iraqi_nlp import sentiment

sentiment("هواية زين هالشي ماكو مثله")
# → {'polarity': 'positive', 'score': 0.88}

sentiment("ماخوش، ندمت لما اشتريته")
# → {'polarity': 'negative', 'score': -0.74}
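
A lexicon-based scorer of the kind described can be sketched in a few lines. The word lists and weights below are hypothetical and tiny, and `lexicon_sentiment` is not the library's function, so its scores will not match the values shown above.

```python
import re

# Hypothetical mini-lexicons with equal weights, for illustration only.
POSITIVE = {"زين": 1.0, "خوش": 1.0}
NEGATIVE = {"ماخوش": 1.0, "ندمت": 1.0}


def lexicon_sentiment(text: str) -> dict:
    """Hypothetical scorer: (positive - negative) / (positive + negative)."""
    tokens = re.findall(r"\w+", text)  # also drops punctuation such as ،
    pos = sum(POSITIVE.get(t, 0.0) for t in tokens)
    neg = sum(NEGATIVE.get(t, 0.0) for t in tokens)
    total = pos + neg
    if total == 0:
        return {"polarity": "neutral", "score": 0.0}
    score = (pos - neg) / total
    if score > 0:
        polarity = "positive"
    elif score < 0:
        polarity = "negative"
    else:
        polarity = "neutral"
    return {"polarity": polarity, "score": score}


print(lexicon_sentiment("ماخوش، ندمت لما اشتريته"))
```

Rules for intensifiers such as هواية or idioms such as ماكو مثله would sit on top of this lookup; they are left out of the sketch.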

segment(token: str) -> dict

Strips prefixes and suffixes from a token and returns the stem.

from iraqi_nlp import segment

segment("وبكتابهم")
# → {'prefix': 'و+ب', 'stem': 'كتاب', 'suffix': 'هم'}
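
The example above can be reproduced with a naive greedy affix stripper. The affix lists, the minimum stem length, and the `segment_sketch` name are assumptions for illustration; the library's segmenter may work quite differently.

```python
# Hypothetical affix inventories; suffixes are ordered longest first.
PREFIXES = ["ال", "و", "ب", "ل", "ف"]
SUFFIXES = ["هم", "ها", "ك", "ه"]


def segment_sketch(token: str, min_stem: int = 3) -> dict:
    """Hypothetical greedy segmentation into prefix chain, stem, and suffix."""
    prefixes = []
    stem = token
    stripped = True
    # Peel prefixes one at a time while at least min_stem letters remain.
    while stripped:
        stripped = False
        for p in PREFIXES:
            if stem.startswith(p) and len(stem) - len(p) >= min_stem:
                prefixes.append(p)
                stem = stem[len(p):]
                stripped = True
                break
    suffix = ""
    # Remove at most one suffix, again keeping min_stem letters.
    for s in SUFFIXES:
        if stem.endswith(s) and len(stem) - len(s) >= min_stem:
            suffix = s
            stem = stem[:-len(s)]
            break
    return {"prefix": "+".join(prefixes), "stem": stem, "suffix": suffix}


print(segment_sketch("وبكتابهم"))
# → {'prefix': 'و+ب', 'stem': 'كتاب', 'suffix': 'هم'}
```

Without a lexicon this over-strips words whose first letter only looks like a prefix (بنات would lose its ب), so it is a starting point rather than a usable segmenter.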

🎯 Real-World Use Cases

  • Sentiment monitoring of Iraqi social media campaigns
  • Chatbots that understand "اكو/ماكو" instead of failing
  • Search engines that match "شلونك" ↔ "شلونچ" ↔ "شلونچن"
  • Academic research on Iraqi corpora
  • EdTech apps localized for Iraqi students
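
The search use case above amounts to mapping spelling variants to one lookup key. A minimal sketch follows, assuming that چ is folded to ك and that a final ن after ك is dropped. Both rules and the `search_key` name are illustrative assumptions, not the library's behavior.

```python
def search_key(word: str) -> str:
    """Hypothetical index key that folds common Iraqi spelling variants."""
    key = word.replace("\u0686", "\u0643")  # چ → ك
    # Drop a final ن that follows ك, so the -چن ending matches -ك.
    if len(key) > 3 and key.endswith("\u0643\u0646"):
        key = key[:-1]
    return key


variants = ["شلونك", "شلونچ", "شلونچن"]
print({search_key(v) for v in variants})  # a single shared key
```

The suffix rule is deliberately crude and would also truncate unrelated words that end in كن, such as ممكن, so a real index would restrict it to known pronoun endings.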

🤝 Contributing

The Iraqi dialect is rich and varies across Baghdad, Basra, Mosul, and other regions. We welcome contributions:

  • 🆕 Adding regional vocabulary
  • 📊 Annotating training data
  • 🐛 Filing dialect edge cases
  • 🌍 Adding more sub-dialects

See CONTRIBUTING.md.


📚 Citation

If you use this in research, please cite:

@software{kasim2026iraqi,
  author = {Mohammed, Kasim},
  title = {Iraqi Arabic NLP: A Toolkit for the Iraqi Dialect},
  year = {2026},
  url = {https://github.com/kasimmj/iraqi-arabic-nlp}
}

📜 License

MIT © 2026 Kasim Mohammed


🇮🇶 Arabic summary (بالعربية)

The first NLP library dedicated to the Iraqi dialect

Conventional tools such as tokenizers and sentiment models fail on words like "شلونك", "اكو/ماكو", "هواية", and "شكد". This library targets the Iraqi dialect specifically.

Features:

  • 🔤 Text normalization (hamzas, diacritics, numerals)
  • ✂️ Dialect-aware word tokenization
  • 🧠 Iraqi sentiment analysis
  • 🌍 Dialect identification (Iraqi/Egyptian/Gulf/Levantine/MSA)
  • 🗂️ Iraqi stopwords (200+ words)

Example:

from iraqi_nlp import sentiment
sentiment("هواية زين هالشي")
# → {'polarity': 'positive', 'score': 0.88}

Built with ❤️ in Iraq 🇮🇶 by Kasim Mohammed
