Welcome to the Tamazight Open Dataset (TODa), an open-source project dedicated to preserving and advancing the Tamazight language. TODa is a collaborative collection of linguistic data for Tamazight <=> English translation, designed specifically for Natural Language Processing applications.
TODa's unique approach combines both semantic and syntactic categorization methods, offering a rich representation of words in their various contexts and forms. The dataset encompasses a comprehensive collection of linguistic elements, including detailed verb conjugations across different tenses, noun variations, and an extensive compilation of translated expressions that capture the language's nuances.
What sets TODa apart is its inclusive approach to Tamazight's writing systems. The dataset thoughtfully incorporates Latin alphabets, acknowledging and preserving the diverse writing traditions practiced across Amazigh communities. This dual-script approach ensures broader accessibility and cultural authenticity.
Our vision is to establish TODa as the cornerstone resource for Tamazight Natural Language Processing. Through this meticulously curated dataset, we strive to empower developers and researchers to create innovative NLP solutions that authentically serve the Amazigh-speaking community.
We take pride in our current progress, yet acknowledge that language documentation is an evolving journey. We actively encourage participation from the Amazigh technology community to contribute their expertise in expanding and refining the dataset. Through collaborative effort, we can create a robust foundation for technological innovations that honor and advance Amazigh linguistic heritage.
The Tamazight alphabet is fundamental to proper pronunciation and daily conversation. Below is a comprehensive guide to the alphabet, including English sound equivalents and pronunciation examples:
Letter | Sound | Example |
---|---|---|
a | [a] | as in "father" |
b | [b] | as in "bay" |
c | [ʨ] | ch as in "chay" |
d | [d] | as in "day" |
e | [ɛ] | as in "elephant" |
f | [f] | as in "fine" |
g | [ɡ] | as in "gold" |
h | [h] | as in "house" |
i | [i] | 'ee' as in "meat" |
j | [ʥ] | as in "job" |
k | [k] | as in "kitchen" |
l | [l] | as in "life" |
m | [m] | as in "man" |
n | [n] | as in "nice" |
o | [o] | as in "olive" |
p | [p] | as in "pool" |
q | [k] | as in "kiss" |
r | [r] | as in "rice" |
s | [s] | as in "smile" |
t | [t] | as in "time" |
u | [u] | 'oo' as in "mood" |
v | [f] | f as in "free" |
w | [w] | as in "wind" |
x | [ks] | as in "wax" |
y | [j] | as in "year" |
z | [z] | as in "Zulu" |
Special Combinations:
Letter | Sound | Example |
---|---|---|
ng | eng | as in "hanging" |
ny | nye | as in "mañana" |
kh | kha | as in German "bach" |
sy | sya | as in "shield" |
nng | nng | as in "bingo" |
Understanding these sounds and their combinations is essential for achieving clear and accurate pronunciation in Tamazight.
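As a rough illustration of how the two tables above could be used when processing Latin-script text, the sketch below splits a word into single letters and special combinations. The longest-match strategy and the function name `split_units` are assumptions made for this example and are not part of TODa; the test strings are arbitrary letter sequences, not real Tamazight words.

```python
# Special combinations from the table above, ordered longest first so that
# "nng" is tried before "ng". The longest-match rule is an assumption.
COMBINATIONS = ("nng", "ng", "ny", "kh", "sy")
LETTERS = set("abcdefghijklmnopqrstuvwxyz")


def split_units(word: str) -> list[str]:
    """Split a Latin-script word into letters and special combinations."""
    text = word.lower()
    units = []
    i = 0
    while i < len(text):
        for combo in COMBINATIONS:
            if text.startswith(combo, i):
                units.append(combo)
                i += len(combo)
                break
        else:
            # No combination starts here, so take a single letter.
            if text[i] not in LETTERS:
                raise ValueError(f"unexpected character {text[i]!r} in {word!r}")
            units.append(text[i])
            i += 1
    return units


# Arbitrary strings chosen only to exercise the combinations.
print(split_units("sykhang"))  # ['sy', 'kh', 'a', 'ng']
print(split_units("banngo"))   # ['b', 'a', 'nng', 'o']
```

Each unit can then be looked up in the tables to find its sound. Whether a sequence such as "ng" should always be read as one unit is a linguistic question this sketch does not settle.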
The dataset is organized into the following components:
Category | Description | Format | Size | Quality Control |
---|---|---|---|---|
Basic Grammar | • Pronouns • Articles • Adjectives • Adverbs | CSV | 170+ entries | Native speaker verification |
Numbers & Counting | • Cardinal numbers • Ordinal numbers • Numerical expressions | CSV | 45+ entries | Native speaker verification |
Gender & Plurality | • Feminine forms • Plural formations • Gender-specific terms | CSV | 100+ entries | Native speaker verification |
Questions & Negation | • Question structures • Negative constructions • Interrogative patterns | CSV | 40+ entries | Native speaker verification |
Common Phrases | • Everyday expressions • Greetings • Basic conversations | CSV | 180+ entries | Native speaker verification |
Sentences | • Translated sentences | CSV | 700+ entries translated (40k+ sentences still need translation) | Native speaker verification |
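Since every component is distributed as CSV, a file can be read with Python's standard `csv` module. The following is a minimal sketch only: the column names `tamazight` and `english` are assumptions, so check the header row of the actual files, and the sample rows are placeholders rather than real dataset entries.

```python
import csv
import io
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    tamazight: str  # hypothetical column name
    english: str    # hypothetical column name


def load_entries(handle) -> list[Entry]:
    """Read entries from an open CSV file that has a header row."""
    reader = csv.DictReader(handle)
    return [
        Entry(row["tamazight"].strip(), row["english"].strip())
        for row in reader
    ]


# Placeholder content standing in for a real TODa file.
SAMPLE = (
    "tamazight,english\n"
    "placeholder_1,first placeholder\n"
    "placeholder_2,second placeholder\n"
)

entries = load_entries(io.StringIO(SAMPLE))
print(len(entries))  # 2
```

For a file on disk, open it with `open(path, newline="", encoding="utf-8")` and pass the handle to `load_entries`.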
We welcome contributions from the community to help expand and improve TODa. Here's how you can contribute:
- Fork the repository
- Create a new branch for your contribution
- Add your data following our CSV format guidelines
- Submit a pull request with a clear description of your additions
- Contribute directly to sentence translation in the Google Sheet at the following link: https://docs.google.com/spreadsheets/d/1GNIdVlno6HcNtYlb0LHkp5vdp8k6c9h6y8T76IDNmJ8/edit?usp=sharing
- Ensure translations are accurate and verified by native speakers
- Follow the established CSV format structure
- Include both Latin script and English translations
- Provide context and usage examples where applicable
- Use the GitHub Issues tab to report any errors or inconsistencies
- Clearly describe the issue and provide examples
- Tag issues appropriately (e.g., 'data correction', 'new entries', 'formatting')
- Help improve documentation by suggesting clarifications
- Add usage examples and explanations
- Translate documentation into other languages
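Before opening a pull request, contributors may find it useful to run a quick sanity check on a new CSV file. The sketch below is one possible check, not an official TODa tool: the required column names are assumptions, and it only flags empty fields and exact duplicate rows. It cannot replace verification by a native speaker.

```python
import csv
import io

# Hypothetical column names; adjust to match the project's CSV guidelines.
REQUIRED_COLUMNS = ("tamazight", "english")


def find_problems(handle) -> list[str]:
    """Return human-readable problems found in an open CSV file."""
    reader = csv.DictReader(handle)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        return [f"missing column(s): {', '.join(missing)}"]

    problems = []
    seen = set()
    # Line 1 is the header; numbering assumes no multi-line fields.
    for line_no, row in enumerate(reader, start=2):
        values = tuple((row[c] or "").strip() for c in REQUIRED_COLUMNS)
        for column, value in zip(REQUIRED_COLUMNS, values):
            if not value:
                problems.append(f"line {line_no}: empty {column!r} field")
        if all(values) and values in seen:
            problems.append(f"line {line_no}: duplicate entry")
        seen.add(values)
    return problems


# Placeholder rows: one empty field and one duplicate.
sample = "tamazight,english\na,b\n,c\na,b\n"
for problem in find_problems(io.StringIO(sample)):
    print(problem)
```

An empty result means the file passed these basic checks only.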
- Research and Personal Use: Feel free to use TODa for research, personal projects, or educational purposes. It is completely free as long as you follow the open-source license.
- Commercial Use: If you plan to use TODa for commercial purposes or in ways not covered by the open-source license, get in touch with me, Abdeljalil. I'd be happy to discuss licensing options and permissions with you.