Opportunities and Challenges of Natural Language Processing for Low-Resource Senegalese Languages in Social Science Research 🇸🇳
cf. Survey Paper
This work presents a comprehensive survey of natural language processing (NLP) for six major Senegalese national languages: Wolof, Pulaar, Sereer, Mandinka, Soninké, and Joola. We provide an overview of the current state of research across key NLP tasks and highlight persistent challenges related to data scarcity, orthographic variation, and linguistic diversity. In addition, we introduce this centralized and openly accessible repository that compiles existing datasets, benchmarks, and tools available for these languages. The repository is designed as a living resource to be periodically expanded through community contributions. Our objective is to map existing efforts, identify critical research gaps, and encourage the development of sustainable, inclusive NLP research for Senegal's national languages.
If you are interested in the state of the art of NLP research in African languages more broadly, you can take a look at this comprehensive, two-decade survey of AfricaNLP research (2005–2025), analyzing publications, authors, affiliations, supporters, NLP topics, and tasks. The major linguistic and sociopolitical challenges that hinder the development of NLP technologies for African languages are discussed in this Afrocentric NLP paper.
Our taxonomy is inspired by the one proposed in the Awesome AI Papers repository. We set subjective thresholds on the number of citations and use a set of icons to show which criteria each paper meets.
⭐ Important Paper : more than 50 citations and state of the art results.
⏫ Trend : 1 to 50 citations, innovative paper with growing adoption.
📰 Important Article : decisive work that was not accompanied by a research paper.
An added π icon means that no open version of the article was found (locked access). The π icon means the paper was published by non-Senegalese African authors, and the π icon indicates papers published by foreign authors (from outside Africa). We consider a paper to be regional π or international π if:
- the paper's first author is African or foreign; or
- the project leading to the paper was launched by Africans or foreigners.
For example, the MasakhaPOS paper has a Senegalese first author, but the overall research project was launched by Masakhane.
Finally, since Senegal is a French-speaking country, some of the articles were written in French. We thus added the 🇫🇷 icon to highlight those papers.
A paper may also appear in multiple sections if it covers various domains, tasks, and/or modalities.
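Because every entry follows the same `- <icons> MM/YYYY: <title>` pattern, the list can be filtered or counted mechanically. The snippet below is only a minimal sketch of that idea: the names `CATEGORY_ICONS`, `Entry` and `parse_entry` are hypothetical and not part of this repository, and it assumes the three category icons are ⭐, ⏫ and 📰. Any other leading symbol (language flag, access or origin marker) is kept verbatim in `other_icons` rather than interpreted.

```python
import re
from dataclasses import dataclass

# Assumed legend: the three category icons described above (illustrative names).
CATEGORY_ICONS = {
    "⭐": "important_paper",
    "⏫": "trend",
    "📰": "important_article",
}

# "- <icons> MM/YYYY: <title>"; the icon run is any sequence of non-digit characters.
ENTRY_RE = re.compile(
    r"^-\s*(?P<icons>\D*?)\s*(?P<month>\d{2})/(?P<year>\d{4}):\s*(?P<title>.+)$"
)


@dataclass
class Entry:
    categories: list   # category names found in the icon run
    other_icons: str   # remaining symbols, left uninterpreted
    year: int
    month: int
    title: str


def parse_entry(line):
    """Parse one bibliography line; return None if it is not an entry."""
    match = ENTRY_RE.match(line.strip())
    if match is None:
        return None
    icons = match.group("icons")
    categories = [name for icon, name in CATEGORY_ICONS.items() if icon in icons]
    other = icons
    for icon in CATEGORY_ICONS:
        other = other.replace(icon, "")
    return Entry(
        categories=categories,
        other_icons=other.strip(),
        year=int(match.group("year")),
        month=int(match.group("month")),
        title=match.group("title").strip(),
    )
```

For instance, `parse_entry("- ⏫ 08/2019: Developing Universal Dependencies for Wolof")` yields an `Entry` whose `categories` is `["trend"]`. Lines that do not match the pattern, such as notes under an entry, return `None`.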
- The Online Wolof Data repository tracks and centralizes all openly accessible datasets as well as potential data sources on the Wolof language.
- We extended this Wolof repository to the other five national languages in the Datasets file.
| Name | Covered tasks | Languages supported |
|---|---|---|
| Wolof keyboards | Keyboards for macOS, Android and Apple mobile devices | Wolof |
| Stanza (Qi et al., 2020) | Part-of-speech (POS) and morphological feature tagging, dependency parsing | Wolof |
| MorphScore (Arnett et al., 2025) | Morphological alignment evaluation | Wolof |
| Wolof | Fill-mask (masked language modeling) | Wolof |
| Common Voice (Ardila et al., 2019), DVoice (Allak et al., 2021) | Speech data collection | Wolof |
| AfroLID, GlotLID, AfroScope | Language Identification | Wolof |
For some articles, we were unable to find an open version; these are marked with the π icon.
If you need access to any of the locked papers, feel free to reach out at derguenembaye[at]esp[dot]sn.
- ⏫ 05/2020: Digraph of Senegal's local languages: issues, challenges and prospects of their transliteration
- ⏫🇫🇷 05/2020: Digraphie des langues ouest africaines : Latin2Ajami : un algorithme de translitteration automatique
- ⏫ 01/2025: The Best of Both Worlds: Exploring Wolofal in the Context of NLP
- ⏫π🇫🇷 06/2025: Réhabiliter l'écriture Ajami : un levier technologique pour l'alphabétisation en Afrique
- ⏫ 05/2012: A Morphological Analyzer For Wolof Using Finite-State Techniques
- ⏫π 06/2013: Handling Wolof clitics in LFG
- ⏫ 08/2013: ParGramBank: The ParGram Parallel Treebank
- ⏫ 05/2014: Pruning the Search Space of the Wolof LFG Grammar Using a Probabilistic and a Constraint Grammar Parser
- ⏫ 08/2014: LFG parse disambiguation for Wolof
- ⏫ 11/2017: Finite-State Tokenization for a Deep Wolof LFG Grammar
- ⏫ 08/2019: Developing Universal Dependencies for Wolof
- ⏫ 05/2020: Implementation and Evaluation of an LFG-based Parser for Wolof
- ⏫ 12/2020: From LFG To UD: A Combined Approach
- ⏫ 08/2021: Multilingual Dependency Parsing for Low-Resource African Languages: Case Studies on Bambara, Wolof, and Yoruba
- ⏫π 07/2025: Evaluating Morphological Alignment of Tokenizers in 70 Languages
- ⏫π 10/2022: AfroLID: A Neural Language Identification Tool for African Languages
- ⭐π 12/2023: GlotLID: Language Identification for Low-Resource Languages
- ⏫π 01/2026: AfroScope: A Framework for Studying the Linguistic Landscape of Africa π
- ⭐π 11/2020: Extending Multilingual BERT to Low-Resource Languages π
- ⭐π 07/2025: Analyzing the Effect of Linguistic Similarity on Cross-Lingual Transfer: Tasks and Experimental Setups Matter
- ⏫π 01/2026: Can Embedding Similarity Predict Cross-Lingual Transfer? A Systematic Study on African Languages
- ⏫ 02/2026: Cross-lingual Matryoshka Representation Learning across Speech and Text π
- ⏫π 03/2026: Omnilingual SONAR: Cross-Lingual and Cross-Modal Sentence Embeddings Bridging Massively Multilingual Text and Speech π
- ⏫ 05/2010: Design and Development of Part-of-Speech-Tagging Resources for Wolof (Niger-Congo, spoken in Senegal)
- ⏫π 07/2023: MasakhaPOS: Part-of-Speech Tagging for Typologically Diverse African languages
- ⏫π 03/2021: MasakhaNER: Named Entity Recognition for African Languages
- ⏫π 12/2022: MasakhaNER 2.0: Africa-centric Transfer Learning for Named Entity Recognition
- ⏫ 06/2018: A Novel Term Weighting Scheme Model
- ⏫π 03/2019: FWLSA-score: French and Wolof Lexicon-based for Sentiment Analysis
- ⏫π 12/2019: Improved Bilingual Sentiment Analysis Lexicon Using Word-level Trigram
- ⏫π 07/2020: SenOpinion: a new lexicon for opinion tagging in Senegalese news comments
- ⏫🇫🇷 06/2022: COMFO : Corpus Multilingue pour la Fouille d'Opinions (COMFO: Multilingual Corpus for Opinion Mining)
The English version of this article is available on Springer Nature Link.
- ⏫π 06/2023: Markov Model for French-Wolof Text Analysis
- ⏫ 08/2024: A lexicon-based sentiment analysis approach using a graph structure for modeling relationships between opinion words in French and Wolof corpora
- ⏫ 10/2025: Sentiment Analysis on the Young People's Perception About the Mobile
- ⏫π 06/2023: Towards hate speech detection in low-resource languages: Comparing ASR to acoustic word embeddings on Wolof and Swahili
- ⏫ 06/2025: Annotated tweet corpus of mixed Wolof-French for detecting obnoxious messages
- ⏫π 07/2025: Comparative Study of Machine Learning Models for the Detection of Abusive Messages: Case of Wolof-French Codes Mixing Data
- ⏫ 09/2025: AbuseBERT-WoFr: refined BERT model for detecting abusive messages on tweets mixing Wolof-French codes
Page 10 of the Proceedings of Digital Avenues for Low-Resource Languages of Sub-Saharan Africa (DASSA'2025).
- ⏫π 02/2025: INJONGO: A Multicultural Intent Detection and Slot-filling Dataset for 16 African Languages
Note: Intent classification is generally performed jointly with slot filling (a token classification task) to maximize performance on both tasks simultaneously.
- ⏫ 09/2025: WolBanking77: Wolof Banking Speech Intent Classification Dataset
- ⏫π🇫🇷 01/2015: DILAF : des dictionnaires africains en ligne et une méthodologie
- ⏫🇫🇷 03/2016: Dictionnaires wolof en ligne: État de l'art et perspectives
- ⏫🇫🇷 03/2016: Production et mise en ligne d'un dictionnaire électronique du wolof
- ⏫🇫🇷 03/2016: iBaatukaay : un projet de base lexicale multilingue contributive sur le web à structure pivot pour les langues africaines notamment sénégalaises
- ⏫🇫🇷 07/2016: Correction orthographique pour la langue wolof: état de l'art et perspectives
- ⏫🇫🇷 09/2018: Manipulation de dictionnaires d'origines diverses pour des langues peu dotées: la méthodologie iBaatukaay
- ⏫ 05/2023: Automatic Spell Checker and Correction for Under-represented Spoken Languages: Case Study on Wolof
- ⏫ 05/2024: Advancing language diversity and inclusion: Towards a neural network-based spell checker and correction for wolof
- ⏫ 07/2024: Beqi: Revitalize the senegalese wolof language with a robust spelling corrector
- 📰🇫🇷 09/2025: SenTermino - Banque Terminologique Scientifique du Sénégal
- ⏫ 03/2020: Using LSTM Networks to Translate French to Senegalese Local Languages: Wolof as a Case Study
- ⏫ 05/2020: Sencorpus: A french-wolof parallel corpus
- ⏫ 08/2020: Building word representations for wolof using neural networks
- ⭐π 10/2020: Beyond English-Centric Multilingual Machine Translation
- ⏫ 03/2022: SenTekki: Online Platform and Restful Web Service for Translation Between Wolof and French
- ⏫ 06/2022: Low-resource neural machine translation: Benchmarking state-of-the-art transformer for Wolof<->French
- ⭐π 07/2022: A Few Thousand Translations Go a Long Way! Leveraging Pre-trained Models for African News Translation
- ⭐π 07/2022: No Language Left Behind: Scaling Human-Centered Machine Translation
- ⭐π 11/2022: NTREX-128 – News Test References for MT Evaluation of 128 Languages
- ⏫π 12/2022: SMaLL-100: Introducing Shallow Multilingual Machine Translation Model for Low-Resource Languages
- ⏫ 02/2023: Low-Resourced machine translation for Senegalese Wolof language
- 📰 03/2023: Kàllaama NMT: un ensemble d'outils IA pour rendre le numérique plus inclusif en Afrique
- ⭐π 09/2023: MADLAD-400: A Multilingual And Document-Level Large Audited Dataset
- 📰π 06/2024: 110 new languages are coming to Google Translate
- 📰 12/2024: LAfricaMobile NMT
- ⭐⏫ 02/2025: SMOL: Professionally translated parallel data for 115 under-represented languages
- 📰π 11/2025: Wolof among supported languages in DeepL
- 📰 12/2025: GalsenAI French-Wolof Translator
- 📰 11/2025: CLAD FirilMa Traducteur
- ⏫π 01/2026: TranslateGemma Technical Report
- ⏫π 01/2026: AfriNLLB: Efficient Translation Models for African Languages π
- ⏫π 03/2026: Omnilingual MT: Machine Translation for 1,600 Languages π
- ⏫π 05/2022: AfriWOZ: Corpus for Exploiting Cross-Lingual Transfer for Dialogue Generation in Low-Resource, African Languages
- ⏫🇫🇷 06/2022: Preuve de concept d'un bot vocal dialoguant en wolof (Proof-of-Concept of a Voicebot Speaking Wolof)
- ⭐π 11/2022: BLOOM: A 176B-Parameter Open-Access Multilingual Language Model π
- ⭐π 12/2022: AfroLM: A Self-Active Learning-based Multilingual Pretrained Language Model for 23 African Languages
- 📰 03/2023: Local Partnership Launches Digital Health Tool to Decrease Hypertension in Senegal
More info on: https://saytutension.sante.sn.
- ⏫π 05/2023: AfriQA: Cross-lingual Open-Retrieval Question Answering for African Languages
- ⏫ 06/2023: Design, development and usability of an educational AI chatbot for People with Haemophilia in Senegal π
- ⏫π 07/2023: SERENGETI: Massively Multilingual Language Models for Africa
- ⏫π 01/2024: Cheetah: Natural Language Generation for 517 African Languages π
- ⏫π 06/2024: IrokoBench: A New Benchmark for African Languages in the Age of Large Language Models
- ⏫π 06/2024: Fumbling in Babel: An Investigation into ChatGPT's Language Identification Ability π
- ⭐π 08/2024: Aya Dataset: An Open-Access Collection for Multilingual Instruction Tuning
Wolof was the additional language in the Aya dataset that had to be excluded from training (Üstün et al., 2024).
- ⏫π 08/2024: The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants
- ⏫π 08/2024: Goldfish: Monolingual Language Models for 350 Languages π
- ⏫ 01/2025: Task-Oriented Dialog Systems for the Senegalese Wolof Language
- ⏫ππ 01/2025: A Comprehensive Von Willebrand Disease Awareness and Support Chatbot for Senegalese Communities
- 📰 12/2024: AWA: Senegalese start-up's AI muse speaks in Wolof
A subsequent Awa-Milkyway model has also been announced but has not been published since.
- 📰 01/2025: Oolel: A High-Performing Open LLM for Wolof
- ⏫π 04/2025: MultiBLiMP 1.0: A Massively Multilingual Benchmark of Linguistic Minimal Pairs π
- ⏫π 06/2025: The State of Large Language Models for African Languages: Progress and Challenges
Reports that AfriTeva and AfroXLMR support Wolof, but this is not the case; it might be a mistake.
- ⏫π 07/2025: Where Are We? Evaluating LLM Performance on African Languages
- ⏫π 09/2025: MERLIN: Multi-Stage Curriculum Alignment for Multilingual Encoder-LLM Integration in Cross-Lingual Reasoning π
- 📰π 02/2026: Tiny AYA: Making Multilingual AI Accessible π
- ⭐π 01/2022: Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets
- ⭐π 09/2023: MADLAD-400: A Multilingual And Document-Level Large Audited Dataset
- ⏫π 06/2025: FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language
- ⏫ 04/2011: Speech Recognition and Text-to-speech Solution for Vernacular Languages
- ⏫π 09/2015: Speech Technologies for African Languages: Example of a Multilingual Calculator for Education
- ⏫π 07/2016: Automatic speech recognition for African languages with vowel length contrast
- ⏫π 07/2016: Speed perturbation and vowel duration modeling for ASR in Hausa and Wolof languages
- ⏫π 06/2017: Machine Assisted Analysis of Vowel Length Contrasts in Wolof
- ⭐π 08/2016: Collecting Resources in Sub-Saharan African Languages for Automatic Speech Recognition: a Case Study of Wolof
- ⏫π 04/2021: AI4D -- African Language Program
- 📰 06/2022: Wav2vec 2.0 with CTC/Attention trained on DVoice Wolof (No LM)
- 📰 03/2023: Kàllaama ASR: un ensemble d'outils IA pour rendre le numérique plus inclusif en Afrique
- ⭐π 05/2023: Scaling Speech Technology to 1,000+ Languages
- ⏫π 06/2023: Towards hate speech detection in low-resource languages: Comparing ASR to acoustic word embeddings on Wolof and Swahili
- 📰 08/2023: Wolof Subtitles Generator
- 📰 11/2023: OpenAI Whisper and Meta MMS models on fula language
- ⏫π 04/2024: Kallaama: A Transcribed Speech Dataset about Agriculture in the Three Most Widely Spoken Languages in Senegal
- ⏫π 04/2024: Africa-centric self-supervised pre-training for multilingual speech representation in a sub-saharan context
- ⏫π 04/2024: Self-supervised and multilingual learning applied to the Wolof, Swahili and Fongbe
- 📰 05/2024: Senegalese startup Lengo brings AI to informal retailers
- ⏫ 06/2024: State-of-the-Art Review on Recent Trends in Automatic Speech Recognition
- ⏫π🇫🇷 07/2024: Représentation de la parole multilingue par apprentissage auto-supervisé dans un contexte subsaharien
- 📰 08/2024: ASR-Africa's Collections - Fula
- 📰 09/2024: ASR-Africa's Collections - Wolof
- ⏫π 11/2024: Multilingual speech recognition initiative for African languages
- 📰 11/2024: Orange to expand open-source AI models to African regional languages for digital inclusion
- 📰 12/2024: LAfricaMobile STT
- 📰 01/2025: Caytu Whosper-large-v2
- ⏫π 07/2025: Synthetic Voice Data for Automatic Speech Recognition in African Languages
- 📰 09/2025: Breaking Language Barriers in African Healthcare: Fine-Tuning Speech Recognition for Wolof and Hausa in Maternal and Reproductive Health
The poster can be viewed here.
- 📰 09/2025: Benchmarking Automatic Speech Recognition Models for African Languages
- ⏫ 09/2025: WolBanking77: Wolof Banking Speech Intent Classification Dataset
- ⏫ 09/2025: Speech Language Models for Under-Represented Languages: Insights from Wolof
- ⏫π 11/2025: Omnilingual ASR: Open-Source Multilingual Speech Recognition for 1600+ Languages
- ⏫π 11/2025: Voice of a Continent: Mapping Africa's Speech Technology Frontier π
- 📰π 12/2025: ElevenLabs' Scribe v2: Wolof Speech to Text Transcription π
- 📰π 02/2026: PazaBench: ASR leaderboard for low-resource languages π
- ⏫ 04/2011: Speech Recognition and Text-to-speech Solution for Vernacular Languages
- 📰 10/2020: Building Wolof Text To Speech System
- ⏫π 07/2022: Building African Voices
- 📰 03/2023: Kàllaama TTS: un ensemble d'outils IA pour rendre le numérique plus inclusif en Afrique
- 📰 09/2024: Wolof TTS
- 📰 12/2024: LAfricaMobile TTS
- 📰 02/2025: Adia_TTS Wolof
- 📰 06/2025: TTS-WOLOF : Building Inclusive Voice AI for African Languages – The Wolof Case
- 📰 03/2026: Oolel-Voice: a high-quality text-to-speech model purpose-built for Wolof π
- 📰 09/2020: Keyword Spotting with African Languages
So far, the first and only research project to target all six main Senegalese languages.
- ⏫🇫🇷 06/2022: Preuve de concept d'un bot vocal dialoguant en wolof (Proof-of-Concept of a Voicebot Speaking Wolof)
- ⏫π 06/2023: Towards hate speech detection in low-resource languages: Comparing ASR to acoustic word embeddings on Wolof and Swahili
- ⏫ 09/2025: WolBanking77: Wolof Banking Speech Intent Classification Dataset
- ⏫π 05/2023: XTREME-UP: A User-Centric Scarce-Data Benchmark for Under-Represented Languages
Reports that it covers ASR, NER and MT tasks for Wolof, but no Wolof training data was found in the dataset for translation.
If this work was useful for your research, please cite the paper as:
@article{sen-nlp-survey,
doi = {10.20944/preprints202601.1124.v1},
url = {https://doi.org/10.20944/preprints202601.1124.v1},
year = 2026,
month = {January},
publisher = {Preprints},
author = {Derguene Mbaye and Tatiana D. P. Mbengue and Madoune R. Seye and Moussa Diallo and Mamadou L. Ndiaye and Dimitri S. Adjanohoun and Djiby Sow and Cheikh S. Wade and Jean-Claude B. Munyaka and Jerome Chenal},
title = {Opportunities and Challenges of Natural Language Processing for Low-Resource Senegalese Languages in Social Science Research},
journal = {Preprints}
}
Feel free to also leave a star 🌟