Skip to content

Elma-dev/TODa

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

8 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

TODa: Tamazight Open Dataset

Huggingface

Welcome to the Tamazight Open Dataset (TODa), a groundbreaking open-source project dedicated to preserving and advancing the Tamazight language. With its extensive collection of linguistic data, TODa stands as a pioneering collaborative project for Tamazight <=> Englis translation, specifically designed for Natural Language Processing applications.

TODa's unique approach combines both semantic and syntactic categorization methods, offering a rich representation of words in their various contexts and forms. The dataset encompasses a comprehensive collection of linguistic elements, including detailed verb conjugations across different tenses, noun variations, and an extensive compilation of translated expressions that capture the language's nuances.

What sets TODa apart is its inclusive approach to Tamazight's writing systems. The dataset thoughtfully incorporates Latin alphabets, acknowledging and preserving the diverse writing traditions practiced across Amazigh communities. This dual-script approach ensures broader accessibility and cultural authenticity.

Our vision is to establish TODa as the cornerstone resource for Tamazight Natural Language Processing. Through this meticulously curated dataset, we strive to empower developers and researchers to create innovative NLP solutions that authentically serve the Amazigh-speaking community.

We take pride in our current progress, yet acknowledge that language documentation is an evolving journey. We actively encourage participation from the Amazigh technology community to contribute their expertise in expanding and refining the dataset. Through collaborative effort, we can create a robust foundation for technological innovations that honor and advance Amazigh linguistic heritage.

Tamazight Pronunciation

The Tamazight alphabet is fundamental to proper pronunciation and daily conversation. Below is a comprehensive guide to the alphabet, including English sound equivalents and pronunciation examples:

Letter Sound Example
a [a] as in "father"
b [b] as in "bay"
c [ʨ] ch as in "chay"
d [d] as in "day"
e [ɛ] as in "elephant"
f [f] as in "fine"
g [ɡ] as in "gold"
h [h] as in "house"
i [i] 'ee' as in "meat"
j [ʥ] as in "job"
k [k] as in "kitchen"
l [l] as in "life"
m [m] as in "man"
n [n] as in "nice"
o [o] as in "olive"
p [p] as in "pool"
q [k] as in "kiss"
r [r] as in "rice"
s [s] as in "smile"
t [t] as in "time"
u [u] 'oo' as in "mood"
v [f] f as in "free"
w [w] as in "wind"
x [ks] as in "wax"
y [j] as in "year"
z [z] as in "Zulu"

Special Combinations:

Letter Sound Example
ng eng as in "hanging"
ny nye as in "mañana"
kh kha as in German "bach"
sy sya as in "shield"
nng nng as in "bingo"

Understanding these sounds and their combinations is essential for achieving clear and accurate pronunciation in Tamazight.

Data

The dataset is organized into the following components:

Category Description Format Size Quality Control
Basic Grammar • Pronouns
• Articles
• Adjectives
• Adverbs
CSV 170+ entries Native speaker verification
Numbers & Counting • Cardinal numbers
• Ordinal numbers
• Numerical expressions
CSV 45+ entries Native speaker verification
Gender & Plurality • Feminine forms
• Plural formations
• Gender-specific terms
CSV 100+ entries Native speaker verification
Questions & Negation • Question structures
• Negative constructions
• Interrogative patterns
CSV 40+ entries Native speaker verification
Common Phrases • Everyday expressions
• Greetings
• Basic conversations
CSV 180+ entries Native speaker verification
Sentences • Translated sentences CSV 700+ entries / +40k need translation Native speaker verification

How to contribute

We welcome contributions from the community to help expand and improve TODa. Here's how you can contribute:

Adding New Data

  1. Fork the repository
  2. Create a new branch for your contribution
  3. Add your data following our CSV format guidelines
  4. Submit a pull request with a clear description of your additions
  5. Contribute directly to sentence translation in the Google Sheet at the following link: https://docs.google.com/spreadsheets/d/1GNIdVlno6HcNtYlb0LHkp5vdp8k6c9h6y8T76IDNmJ8/edit?usp=sharing

Quality Guidelines

  • Ensure translations are accurate and verified by native speakers
  • Follow the established CSV format structure
  • Include both Latin script and English translations
  • Provide context and usage examples where applicable

Reporting Issues

  • Use the GitHub Issues tab to report any errors or inconsistencies
  • Clearly describe the issue and provide examples
  • Tag issues appropriately (e.g., 'data correction', 'new entries', 'formatting')

Documentation

  • Help improve documentation by suggesting clarifications
  • Add usage examples and explanations
  • Translate documentation into other languages

Usage Terms

  • Research and Personal Use: Feel free to use TODa for research, personal projects, or educational purposes—it's completely free as long as you follow the open-source license.

  • Commercial Use: If you're planning to use TODa for commercial purposes or in ways not covered by the open-source license, just get in touch with me, Abdeljalil. We'd be happy to discuss licensing options and permissions with you.

About

TODa: Tamazight Open Dataset

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published