TODa: Tamazight Open Dataset

Welcome to the Tamazight Open Dataset (TODa), an open-source project dedicated to preserving and advancing the Tamazight language. TODa is a collaborative collection of linguistic data for Tamazight <=> English translation, designed specifically for Natural Language Processing applications.

TODa's unique approach combines both semantic and syntactic categorization methods, offering a rich representation of words in their various contexts and forms. The dataset encompasses a comprehensive collection of linguistic elements, including detailed verb conjugations across different tenses, noun variations, and an extensive compilation of translated expressions that capture the language's nuances.

What sets TODa apart is its inclusive approach to Tamazight's writing systems. The dataset thoughtfully incorporates Latin alphabets, acknowledging and preserving the diverse writing traditions practiced across Amazigh communities. This dual-script approach ensures broader accessibility and cultural authenticity.

Our vision is to establish TODa as the cornerstone resource for Tamazight Natural Language Processing. Through this meticulously curated dataset, we strive to empower developers and researchers to create innovative NLP solutions that authentically serve the Amazigh-speaking community.

We take pride in our current progress, yet acknowledge that language documentation is an evolving journey. We actively encourage participation from the Amazigh technology community to contribute their expertise in expanding and refining the dataset. Through collaborative effort, we can create a robust foundation for technological innovations that honor and advance Amazigh linguistic heritage.

Tamazight Pronunciation

The Tamazight alphabet is fundamental to proper pronunciation and daily conversation. Below is a comprehensive guide to the alphabet, including English sound equivalents and pronunciation examples:

Letter	Sound	Example
a	[a]	as in "father"
b	[b]	as in "bay"
c	[ʨ]	ch as in "chay"
d	[d]	as in "day"
e	[ɛ]	as in "elephant"
f	[f]	as in "fine"
g	[ɡ]	as in "gold"
h	[h]	as in "house"
i	[i]	'ee' as in "meat"
j	[ʥ]	as in "job"
k	[k]	as in "kitchen"
l	[l]	as in "life"
m	[m]	as in "man"
n	[n]	as in "nice"
o	[o]	as in "olive"
p	[p]	as in "pool"
q	[k]	as in "kiss"
r	[r]	as in "rice"
s	[s]	as in "smile"
t	[t]	as in "time"
u	[u]	'oo' as in "mood"
v	[f]	f as in "free"
w	[w]	as in "wind"
x	[ks]	as in "wax"
y	[j]	as in "year"
z	[z]	as in "Zulu"

Special Combinations:

Letter	Sound	Example
ng	eng	as in "hanging"
ny	nye	as in "mañana"
kh	kha	as in German "bach"
sy	sya	as in "shield"
nng	nng	as in "bingo"

Understanding these sounds and their combinations is essential for achieving clear and accurate pronunciation in Tamazight.
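
As a minimal sketch, the two tables above can be turned into a lookup that splits a Latin-script word into letters and special combinations and returns the listed sound for each. The names `split_units` and `sounds` and the greedy longest-match strategy are assumptions made for illustration; they are not part of TODa.

```python
# Sound values copied from the two tables above.
# Function names and the matching strategy are illustrative assumptions.
LETTER_SOUNDS = {
    "a": "a", "b": "b", "c": "ʨ", "d": "d", "e": "ɛ", "f": "f", "g": "ɡ",
    "h": "h", "i": "i", "j": "ʥ", "k": "k", "l": "l", "m": "m", "n": "n",
    "o": "o", "p": "p", "q": "k", "r": "r", "s": "s", "t": "t", "u": "u",
    "v": "f", "w": "w", "x": "ks", "y": "j", "z": "z",
}
COMBINATION_SOUNDS = {"ng": "eng", "ny": "nye", "kh": "kha", "sy": "sya", "nng": "nng"}

# Try longer combinations first so that "nng" wins over "ng".
_COMBINATIONS = sorted(COMBINATION_SOUNDS, key=len, reverse=True)


def split_units(word):
    """Split a word into single letters and special combinations."""
    word = word.lower()
    units = []
    i = 0
    while i < len(word):
        for combo in _COMBINATIONS:
            if word.startswith(combo, i):
                units.append(combo)
                i += len(combo)
                break
        else:
            # No combination starts here, so take a single character.
            units.append(word[i])
            i += 1
    return units


def sounds(word):
    """Return the Sound column entry for each unit; unknown characters pass through."""
    table = {**LETTER_SOUNDS, **COMBINATION_SOUNDS}
    return [table.get(unit, unit) for unit in split_units(word)]
```

Calling `split_units` on an arbitrary letter sequence such as "khan" yields the units "kh", "a", "n". The sequence is only a test string, not a dataset entry.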

Data

The dataset is organized into the following components:

Category	Description	Format	Size	Quality Control
Basic Grammar	• Pronouns • Articles • Adjectives • Adverbs	CSV	170+ entries	Native speaker verification
Numbers & Counting	• Cardinal numbers • Ordinal numbers • Numerical expressions	CSV	45+ entries	Native speaker verification
Gender & Plurality	• Feminine forms • Plural formations • Gender-specific terms	CSV	100+ entries	Native speaker verification
Questions & Negation	• Question structures • Negative constructions • Interrogative patterns	CSV	40+ entries	Native speaker verification
Common Phrases	• Everyday expressions • Greetings • Basic conversations	CSV	180+ entries	Native speaker verification
Sentences	• Translated sentences	CSV	700+ entries (40k+ more awaiting translation)	Native speaker verification
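
A minimal sketch of reading one of these CSV files into translation pairs with Python's standard `csv` module. The column names `tamazight` and `english` are assumptions, since the header layout is not documented here; adjust them to the actual files. Rows with an empty field, such as sentences still awaiting translation, are skipped.

```python
import csv
import io


def load_pairs(csv_file, source_col="tamazight", target_col="english"):
    """Read (Tamazight, English) pairs from an open CSV file.

    Column names are assumed, not taken from the dataset.
    Rows where either field is empty are skipped.
    """
    pairs = []
    for row in csv.DictReader(csv_file):
        source = (row.get(source_col) or "").strip()
        target = (row.get(target_col) or "").strip()
        if source and target:
            pairs.append((source, target))
    return pairs


# Placeholder rows for demonstration only, not real dataset entries.
sample = io.StringIO("tamazight,english\nword_a,meaning a\nword_b,\n")
print(load_pairs(sample))
```

For a file on disk, open it with `open(path, newline="", encoding="utf-8")` and pass the file object to `load_pairs`.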

How to contribute

We welcome contributions from the community to help expand and improve TODa. Here's how you can contribute:

Adding New Data

Fork the repository
Create a new branch for your contribution
Add your data following our CSV format guidelines
Submit a pull request with a clear description of your additions
Alternatively, contribute sentence translations directly in the Google Sheet: https://docs.google.com/spreadsheets/d/1GNIdVlno6HcNtYlb0LHkp5vdp8k6c9h6y8T76IDNmJ8/edit?usp=sharing

Quality Guidelines

Ensure translations are accurate and verified by native speakers
Follow the established CSV format structure
Include both Latin script and English translations
Provide context and usage examples where applicable
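
Part of these guidelines can be checked automatically before a pull request is opened. The sketch below flags missing columns and empty fields. It assumes header names `tamazight` and `english`, which are hypothetical, and it cannot replace verification by native speakers.

```python
import csv
import io

REQUIRED_COLUMNS = ("tamazight", "english")  # assumed header names


def find_problems(csv_file, required=REQUIRED_COLUMNS):
    """Return a list of (line number, message) for structural problems in a CSV file."""
    reader = csv.DictReader(csv_file)
    header = reader.fieldnames or []
    missing = [column for column in required if column not in header]
    if missing:
        return [(1, "missing column(s): " + ", ".join(missing))]
    problems = []
    for row in reader:
        for column in required:
            if not (row.get(column) or "").strip():
                # line_num is the last source line read for the current row.
                problems.append((reader.line_num, "empty " + column))
    return problems


# Placeholder rows for demonstration only, not real dataset entries.
sample = io.StringIO("tamazight,english\nword_a,meaning a\nword_b,\n")
for line_number, message in find_problems(sample):
    print(f"line {line_number}: {message}")
```

An empty result only means the file is structurally complete; translation accuracy still needs human review.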

Reporting Issues

Use the GitHub Issues tab to report any errors or inconsistencies
Clearly describe the issue and provide examples
Tag issues appropriately (e.g., 'data correction', 'new entries', 'formatting')

Documentation

Help improve documentation by suggesting clarifications
Add usage examples and explanations
Translate documentation into other languages

Usage Terms

Research and Personal Use: TODa is free to use for research, personal projects, or educational purposes, as long as you follow the open-source license.
Commercial Use: If you plan to use TODa commercially or in ways not covered by the open-source license, get in touch with me, Abdeljalil, to discuss licensing options and permissions.
