This project explores the use of Mistral's fine-tuning API to build a machine translation system for Sango, a Central African language with limited online resources.

Mistral AI Hackathon Project - LLM Fine-tuning for Sango Translation

Table of Contents

  1. Project Description
  2. Dataset
  3. Model Training
  4. Results
  5. Next Steps
  6. Resources and Licensing
  7. Acknowledgments
  8. License
  9. Contact

Project Description

This project was developed for the Mistral AI fine-tuning hackathon, which ran from June 5 to June 30, 2024. The primary goal was to use Mistral's fine-tuning API to build a robust translation system for Sango, the lingua franca of the Central African Republic. Sango has limited online resources, and this project aims to help bridge that digital language gap for Sango speakers across the region by supporting education, access to information, and a stronger sense of community.

Dataset

The dataset used for this project was manually built and consists of 38,000 pairs of French-Sango translations. The sources for the dataset include:

  • The French-Sango dictionary
  • Personal translations
  • Sentences from learning websites

Building this dataset was a time-consuming process due to the scarcity of online resources. The manual effort to compile and verify translations ensures a high level of accuracy and relevance.
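Before training, the pairs have to be serialized into the chat-style JSONL format that Mistral's fine-tuning API consumes (one JSON object per line, each with a user/assistant message pair). A minimal sketch of that conversion is below; the prompt wording and the sample pairs are illustrative assumptions, not taken from the actual dataset.

```python
import json

def to_finetune_jsonl(pairs, path):
    """Serialize French-Sango pairs into the chat-style JSONL format
    used by Mistral's fine-tuning API: one JSON object per line, each
    holding a user/assistant message pair."""
    with open(path, "w", encoding="utf-8") as f:
        for fr, sg in pairs:
            record = {
                "messages": [
                    {"role": "user", "content": f"Translate to Sango: {fr}"},
                    {"role": "assistant", "content": sg},
                ]
            }
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

# Illustrative pairs only -- not taken from the actual dataset.
pairs = [
    ("Bonjour.", "Bara ala."),
    ("Merci beaucoup.", "Singila mingi."),
]
to_finetune_jsonl(pairs, "sango_train.jsonl")
```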

Model Training

The base model, open-mistral-7b, was fine-tuned on the 38,000 translation pairs. Because of limited credits, training was constrained to 200 steps; within that budget, several models were trained to make the most of the available resources. The resulting models were tested on 100 example pairs taken from the FLORES-200 benchmark.
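A rough sketch of the job configuration is below. Only the base model name and the 200-step budget come from this README; the learning rate, the field names, and the commented-out SDK calls are assumptions based on the mistralai Python client as it existed around the hackathon.

```python
# Sketch of the fine-tuning job configuration. Only the base model name
# and the 200-step budget come from this README; the learning rate and
# the field names are illustrative assumptions.
job_config = {
    "model": "open-mistral-7b",
    "hyperparameters": {
        "training_steps": 200,   # constrained by available credits
        "learning_rate": 1e-4,   # illustrative, not a recorded value
    },
}

# With an API key and an uploaded JSONL training file, launching the job
# with the mistralai Python client (mid-2024) looked roughly like:
#   from mistralai.client import MistralClient
#   client = MistralClient(api_key=api_key)
#   job = client.jobs.create(
#       model=job_config["model"],
#       training_files=[training_file_id],
#       hyperparameters=job_config["hyperparameters"],
#   )
print(job_config)
```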

Results

The model's performance was evaluated using several metrics (higher is better for all except TER, where lower is better). Here are the results:

  • BLEU: 0.005
  • ROUGE-1: 0.250
  • ROUGE-2: 0.037
  • ROUGE-L: 0.182
  • METEOR: 0.076
  • TER: 95.782

Explanation of Results

The model produces promising translations for some inputs but struggles with complex sentences; the corpus did not cover a wide enough range of sentences for the model to learn the nuances of Sango.

Example Translations:

  • Good Translation:

    • French: S'il t'offre une bière, refuse-la. (If he offers you a beer, refuse it.)
    • Result (Sango): Mû biëre na mo, kîri nî.
    • Expected (Sango): Tongana lo mu samba na ala, ala ken
  • Poor Translation:

    • French: Les gens l'achètent seulement. (People just buy it.)
    • Result (Sango): Âzo ayeke vo yê hîo hîo. (Literal: People buy it quickly.)
    • Expected (Sango): Azo avo gi vongo.
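The near-zero BLEU score is consistent with these examples: BLEU multiplies clipped n-gram precisions, so a hypothesis that shares no bigram with its reference scores 0 regardless of word-level overlap. The unsmoothed sentence-level BLEU below (a simplification of what evaluation toolkits compute) makes this concrete.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(hypothesis, reference, max_n=4):
    """Unsmoothed sentence-level BLEU: geometric mean of clipped n-gram
    precisions times a brevity penalty. If any n-gram order has zero
    overlap, the whole score collapses to 0."""
    hyp, ref = hypothesis.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        hyp_ng, ref_ng = ngrams(hyp, n), ngrams(ref, n)
        overlap = sum((hyp_ng & ref_ng).values())   # clipped matches
        precisions.append(overlap / max(sum(hyp_ng.values()), 1))
    if min(precisions) == 0:
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / max_n
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * math.exp(log_mean)

# The "good translation" example above shares almost no n-grams with its
# reference, so its BLEU is exactly 0 despite being partly reasonable.
print(sentence_bleu("Mû biëre na mo, kîri nî.",
                    "Tongana lo mu samba na ala, ala ken"))  # → 0.0
```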

Next Steps

To improve the Sango translation model, the following steps are recommended:

  1. Building a Larger Dataset: Expanding the dataset with more diverse sentences to cover a broader spectrum of the language.
  2. Improving Data Pre-processing: Enhancing the quality and consistency of the dataset through better pre-processing techniques.
  3. Training for Longer: Allocating more time and resources to train the model for a longer duration to improve its learning capabilities.
  4. Fine-tuning Hyperparameters: Experimenting with different hyperparameters to optimize the model's performance.
  5. Incorporating Feedback: Utilizing user feedback to refine and improve the translation quality.
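As a sketch of step 2, a pre-processing pass might normalize whitespace, drop empty or duplicate pairs, and filter pairs whose length ratio suggests a misaligned translation. The threshold and sample pairs below are illustrative assumptions.

```python
def clean_pairs(pairs, max_ratio=3.0):
    """Normalize whitespace, drop empty or duplicate pairs, and drop
    pairs whose character-length ratio suggests a misaligned
    translation. max_ratio=3.0 is an illustrative threshold."""
    seen, cleaned = set(), []
    for fr, sg in pairs:
        fr, sg = " ".join(fr.split()), " ".join(sg.split())
        if not fr or not sg:
            continue
        key = (fr.lower(), sg.lower())
        if key in seen:
            continue
        if max(len(fr), len(sg)) / min(len(fr), len(sg)) > max_ratio:
            continue
        seen.add(key)
        cleaned.append((fr, sg))
    return cleaned

raw = [
    ("Bonjour. ", "Bara ala."),
    ("Bonjour.", "Bara ala."),   # duplicate after normalization
    ("Merci.", ""),              # empty target: dropped
]
print(clean_pairs(raw))  # → [('Bonjour.', 'Bara ala.')]
```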

Resources and Licensing

The resources used to create the dataset (the French-Sango dictionary, personal translations, and sentences from language-learning websites) are believed to be open access and freely usable. If there are any licensing concerns, please contact me and the affected data will be promptly removed from the training set.

Acknowledgments

  • Mistral AI for organizing the fine-tuning hackathon and providing free API access
  • Contributors to the French-Sango dictionary and language learning resources
  • The FLORES-200 benchmark team for providing evaluation data

License

This project is open source and available under the Apache 2.0 license. The training data is believed to come from open-access sources; if you identify any licensing issue, please contact the maintainer immediately.

Contact

For any questions or concerns regarding this project, please feel free to reach out to me at habib.adoum01@gmail.com.
