This project was developed as part of the Mistral AI fine-tuning hackathon, which took place from June 5 to June 30, 2024. The primary goal was to use Mistral's fine-tuning API to build a robust translation system for Sango, the lingua franca of the Central African Republic. Sango has limited online resources, and this project aims to bridge that digital language gap, empowering Sango speakers across the region and fostering education, information access, and a stronger sense of community.
The dataset used for this project was manually built and consists of 38,000 pairs of French-Sango translations. The sources for the dataset include:
- The French-Sango dictionary
- Personal translations
- Sentences from learning websites
Building this dataset was a time-consuming process due to the scarcity of online resources. The manual effort to compile and verify translations ensures a high level of accuracy and relevance.
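Mistral's fine-tuning API expects training data as JSONL with chat-formatted messages. A minimal sketch of how translation pairs can be converted into that format (the file name, prompt wording, and `pairs_to_jsonl` helper are illustrative assumptions, not the project's actual pipeline):

```python
import json

def pairs_to_jsonl(pairs, path):
    """Write French-Sango pairs as chat-formatted JSONL records,
    one {"messages": [...]} object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for fr, sg in pairs:
            record = {
                "messages": [
                    # Prompt wording is an assumption for illustration.
                    {"role": "user", "content": f"Translate to Sango: {fr}"},
                    {"role": "assistant", "content": sg},
                ]
            }
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

pairs = [
    ("S'il t'offre une bière, refuse-la.",
     "Tongana lo mu samba na ala, ala ken"),
]
pairs_to_jsonl(pairs, "train.jsonl")
```

Keeping `ensure_ascii=False` matters here: Sango orthography uses diacritics (e.g. "mû", "biëre") that should be stored as-is rather than escaped.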
The model was trained on the 38,000 translation pairs and evaluated on 100 example pairs taken from the FLORES-200 benchmark. Because of limited credits, training was constrained to 200 steps; within those resources, multiple models were trained to make the comparison as comprehensive as possible.
The model used for fine-tuning was open-mistral-7b.
The performance of the model was evaluated using several metrics. Here are the results:
- BLEU: 0.005
- ROUGE-1: 0.250
- ROUGE-2: 0.037
- ROUGE-L: 0.182
- METEOR: 0.076
- TER: 95.782
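To make one of these numbers concrete, here is a pure-Python sketch of ROUGE-1 as unigram-overlap F1. This is a simplification for illustration (whitespace tokenization only; real ROUGE implementations add tokenization and stemming rules), not the exact scorer used for the table above:

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between a candidate and a reference string."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

score = rouge1_f1("Mû biëre na mo, kîri nî.",
                  "Tongana lo mu samba na ala, ala ken")
# Only "na" overlaps, so the score is low (about 0.14).
```

Scores around 0.25 for ROUGE-1 alongside a near-zero BLEU fit the pattern of output that shares individual words with the reference but rarely matches longer n-grams.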
The model shows promising results for some translations but struggles with complex sentences. The corpus lacked a wide range of sentences to help the model learn the nuances of the Sango language.
Good Translation:
- French: S'il t'offre une bière, refuse-la. (If he offers you a beer, refuse it.)
- Result (Sango): Mû biëre na mo, kîri nî.
- Expected (Sango): Tongana lo mu samba na ala, ala ken
Poor Translation:
- French: Les gens l'achètent seulement. (People just buy it.)
- Result (Sango): Âzo ayeke vo yê hîo hîo. (Literal: People buy it quickly.)
- Expected (Sango): Azo avo gi vongo.
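The TER score of 95.782 means nearly one edit per reference word. A simplified word-level sketch of the idea (full TER also counts block shifts, which this Levenshtein-only version omits):

```python
def simple_ter(hypothesis: str, reference: str) -> float:
    """Word-level edit distance divided by reference length, as a
    percentage. A simplification of TER without block-shift moves."""
    hyp, ref = hypothesis.split(), reference.split()
    # Standard Levenshtein dynamic program over word tokens.
    d = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i in range(len(hyp) + 1):
        d[i][0] = i
    for j in range(len(ref) + 1):
        d[0][j] = j
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return 100 * d[len(hyp)][len(ref)] / len(ref)

ter = simple_ter("Âzo ayeke vo yê hîo hîo.", "Azo avo gi vongo.")
# No word matches exactly here (even "Âzo" vs "Azo" differ by a
# diacritic), so the score exceeds 100.
```

Scores above 100 are possible whenever the hypothesis needs more edits than the reference has words, which is common when the model over-generates.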
To improve the Sango translation model, the following steps are recommended:
- Building a Larger Dataset: Expanding the dataset with more diverse sentences to cover a broader spectrum of the language.
- Improving Data Pre-processing: Enhancing the quality and consistency of the dataset through better pre-processing techniques.
- Training for Longer: Allocating more time and resources to train the model for a longer duration to improve its learning capabilities.
- Fine-tuning Hyperparameters: Experimenting with different hyperparameters to optimize the model's performance.
- Incorporating Feedback: Utilizing user feedback to refine and improve the translation quality.
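For the hyperparameter recommendation, a small grid is usually the first step. A sketch of enumerating candidate configurations (the parameter names mirror common fine-tuning settings, and the values are illustrative examples, not tuned recommendations):

```python
from itertools import product

# Illustrative search space; values are examples, not tuned settings.
learning_rates = [1e-5, 5e-5, 1e-4]
training_steps = [200, 500, 1000]

configs = [
    {"learning_rate": lr, "training_steps": steps}
    for lr, steps in product(learning_rates, training_steps)
]
# Each config would be submitted as a separate fine-tuning job and
# compared on the same held-out FLORES-200 pairs.
```

With limited credits, evaluating a coarse grid like this on a fixed held-out set gives a fair comparison before committing the budget to one long run.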
The resources used for creating the dataset are believed to be open access. The primary sources include:
- French-Sango dictionary
- Various Sango learning websites
- Personal translations
The resources used to create the dataset are believed to be open-source and freely usable. However, if there are any licensing concerns, please contact me, and the data will be promptly removed from the training set.
- Mistral AI for organizing the fine-tuning hackathon and providing free API access
- Contributors to the French-Sango dictionary and language learning resources
- The FLORES-200 benchmark team for providing evaluation data
This project is open-source and available under Apache 2.0. The training data is believed to be from open-access sources. If you identify any licensing issues, please contact the maintainer immediately.
For any questions or concerns regarding this project, please feel free to reach out to me at habib.adoum01@gmail.com.