Opportunities and Challenges of Natural Language Processing for Low-Resource Senegalese Languages in Social Science Research 🇸🇳 cf Survey Paper

This work presents a comprehensive survey of natural language processing (NLP) for six major Senegalese national languages: Wolof, Pulaar, Sereer, Mandinka, Soninké, and Joola. We provide an overview of the current state of research across key NLP tasks and highlight persistent challenges related to data scarcity, orthographic variation, and linguistic diversity. In addition, we introduce this centralized and openly accessible repository that compiles existing datasets, benchmarks, and tools available for these languages. The repository is designed as a living resource to be periodically expanded through community contributions. Our objective is to map existing efforts, identify critical research gaps, and encourage the development of sustainable, inclusive NLP research for Senegal’s national languages.

If you are interested in the state of the art of NLP research in African languages more broadly, you can take a look at this comprehensive, two-decade survey of AfricaNLP research (2005–2025), analyzing publications, authors, affiliations, supporters, NLP topics, and tasks. The major linguistic and sociopolitical challenges that hinder the development of NLP technologies for African languages are discussed in this Afrocentric NLP paper.

Taxonomy

This taxonomy is inspired by the one proposed in the Awesome AI Papers repository. We set subjective thresholds on the number of citations and use a set of icons to indicate which criteria each paper meets.

⭐ Important Paper : more than 50 citations and state of the art results.

⏫ Trend : 1 to 50 citations, innovative paper with growing adoption.

📰 Important Article : decisive work that was not accompanied by a research paper.

An added 🔐 icon means that no open version of the article was found (locked access). The 🌍 icon means the paper was published by non-Senegalese African authors, and the 🌐 icon indicates papers published by foreign authors (outside of Africa). We consider a paper to be regional 🌍 or international 🌐 if:

the first author of the paper is a non-Senegalese African or a foreigner; or
the project leading to the paper was launched by Africans or foreigners.

For example, the MasakhaPOS paper has a Senegalese first author, but the overall research project was launched by Masakhane.

Finally, since Senegal is a French-speaking country, some of the articles were written in French. We thus added the 🇫🇷 icon to highlight those papers.

A paper may also appear in multiple sections if it covers various domains, tasks, and/or modalities.
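
The icon rules above can be sketched in a few lines of Python. This is only an illustrative sketch, not a script used to build this list: the function names, the `state_of_the_art` and `has_paper` flags, and the origin labels (`"senegal"`, `"africa"`, `"foreign"`) are all hypothetical. Cases the taxonomy does not define, such as a paper with zero citations, return `None`, and giving 🌐 precedence over 🌍 when both apply is an assumption.

```python
from typing import Optional


def citation_icon(citations: int,
                  state_of_the_art: bool = False,
                  has_paper: bool = True) -> Optional[str]:
    """Map an entry to its citation-based icon, or None if the rules do not cover it."""
    if not has_paper:
        # Decisive work without an accompanying research paper.
        return "📰"
    if citations > 50 and state_of_the_art:
        # Important Paper: more than 50 citations and state-of-the-art results.
        return "⭐"
    if 1 <= citations <= 50:
        # Trend: 1 to 50 citations.
        return "⏫"
    # Not specified by the taxonomy (e.g. 0 citations) -> no icon.
    return None


def origin_icon(first_author_origin: str, project_origin: str) -> str:
    """Return 🌍 / 🌐 if either the first author or the project is non-Senegalese.

    Origins are hypothetical labels: "senegal", "africa" or "foreign".
    Assumption: "foreign" takes precedence over "africa" when both occur.
    """
    origins = {first_author_origin, project_origin}
    if "foreign" in origins:
        return "🌐"
    if "africa" in origins:
        return "🌍"
    return ""


if __name__ == "__main__":
    # MasakhaPOS-style case: Senegalese first author, project launched elsewhere in Africa.
    print(citation_icon(12), origin_icon("senegal", "africa"))
```

Either condition alone is enough to mark a paper as regional or international, which is why the MasakhaPOS-style case resolves to 🌍 despite the Senegalese first author.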

Datasets 🔃

The Online Wolof Data repository tracks and centralizes all openly accessible datasets as well as potential data sources on the Wolof language.
We extended this Wolof repository to the five other national languages in the Datasets file.

NLP Tools

Name	Covered tasks	Languages supported
Wolof keyboards	Keyboards for MacOS, Android and Apple mobile	Wolof
Stanza (Qi et al., 2020)	Part-of-speech (POS) tagging, morphological feature tagging, dependency parsing	Wolof
MorphScore (Arnett et al., 2025)	Morphological alignment evaluation	Wolof
Wolof	Fill_mask (Masked Language Modeling)	Wolof
Common Voice (Ardila et al., 2019) , DVoice (Allak et al., 2021)	Speech Data Collection	Wolof
AfroLID, GlotLID, AfroScope	Language Identification	Wolof

Publications

For some articles, we were unable to find open versions; these are marked with the 🔐 icon.

If you need access to some of the locked papers, feel free to reach out at derguenembaye[at]esp[dot]sn.

Token Classification

POS Tagging

⏫ 05/2010: Design and Development of Part-of-Speech-Tagging Resources for Wolof (Niger-Congo, spoken in Senegal)
⏫🌍 07/2023: MasakhaPOS: Part-of-Speech Tagging for Typologically Diverse African languages

Named Entity Recognition

⏫🌍 03/2021: MasakhaNER: Named Entity Recognition for African Languages
⏫🌍 12/2022: MasakhaNER 2.0: Africa-centric Transfer Learning for Named Entity Recognition

Text Classification

Opinion Mining / Sentiment Analysis

⏫ 06/2018: A Novel Term Weighting Scheme Model
⏫🔐 03/2019: FWLSA-score: French and Wolof Lexicon-based for Sentiment Analysis
⏫🔐 12/2019: Improved Bilingual Sentiment Analysis Lexicon Using Word-level Trigram
⏫🔐 07/2020: SenOpinion: a new lexicon for opinion tagging in Senegalese news comments
⏫🇫🇷 06/2022: COMFO : Corpus Multilingue pour la Fouille d’Opinions (COMFO: Multilingual Corpus for Opinion Mining)

The English version of this article is available on Springer Nature Link.
⏫🔐 06/2023: Markov Model for French-Wolof Text Analysis
⏫ 08/2024: A lexicon-based sentiment analysis approach using a graph structure for modeling relationships between opinion words in French and Wolof corpora
⏫ 10/2025: Sentiment Analysis on the Young People's Perception About the Mobile

Hate Speech Detection

⏫🌍 06/2023: Towards hate speech detection in low-resource languages: Comparing ASR to acoustic word embeddings on Wolof and Swahili
⏫ 06/2025: Annotated tweet corpus of mixed Wolof-French for detecting obnoxious messages
⏫🔐 07/2025: Comparative Study of Machine Learning Models for the Detection of Abusive Messages: Case of Wolof-French Codes Mixing Data
⏫ 09/2025: AbuseBERT-WoFr: refined BERT model for detecting abusive messages on tweets mixing Wolof-French codes

Page 10 of the Proceedings of Digital Avenues for Low-Resource Languages of Sub-Saharan Africa (DASSA’2025).

Intent Classification

⏫🌍 02/2025: INJONGO: A Multicultural Intent Detection and Slot-filling Dataset for 16 African Languages

Note: Intent classification is generally performed with slot-filling (which is token classification) as a joint task to maximize performance in both tasks simultaneously.
⏫ 09/2025: WolBanking77: Wolof Banking Speech Intent Classification Dataset

Lexicons and Spell Checking

⏫🌐🇫🇷 01/2015: DILAF : des dictionnaires africains en ligne et une méthodologie
⏫🇫🇷 03/2016: Dictionnaires wolof en ligne: État de l'art et perspectives
⏫🇫🇷 03/2016: Production et mise en ligne d’un dictionnaire électronique du wolof
⏫🇫🇷 03/2016: iBaatukaay : un projet de base lexicale multilingue contributive sur le web à structure pivot pour les langues africaines notamment sénégalaises
⏫🇫🇷 07/2016: Correction orthographique pour la langue wolof: état de l'art et perspectives
⏫🇫🇷 09/2018: Manipulation de dictionnaires d'origines diverses pour des langues peu dotées: la méthodologie iBaatukaay
⏫ 05/2023: Automatic Spell Checker and Correction for Under-represented Spoken Languages: Case Study on Wolof
⏫ 05/2024: Advancing language diversity and inclusion: Towards a neural network-based spell checker and correction for wolof
⏫ 07/2024: Beqi: Revitalize the senegalese wolof language with a robust spelling corrector
📰🇫🇷 09/2025: SenTermino - Banque Terminologique Scientifique du Sénégal

Machine Translation

⏫ 03/2020: Using LSTM Networks to Translate French to Senegalese Local Languages: Wolof as a Case Study
⏫ 05/2020: Sencorpus: A french-wolof parallel corpus
⏫ 08/2020: Building word representations for wolof using neural networks
⭐🌐 10/2020: Beyond English-Centric Multilingual Machine Translation
⏫ 03/2022: SenTekki: Online Platform and Restful Web Service for Translation Between Wolof and French
⏫ 06/2022: Low-resource neural machine translation: Benchmarking state-of-the-art transformer for Wolof<->French
⭐🌍 07/2022: A Few Thousand Translations Go a Long Way! Leveraging Pre-trained Models for African News Translation
⭐🌐 07/2022: No Language Left Behind: Scaling Human-Centered Machine Translation
⭐🌐 11/2022: NTREX-128 – News Test References for MT Evaluation of 128 Languages
⏫🌐 12/2022: SMaLL-100: Introducing Shallow Multilingual Machine Translation Model for Low-Resource Languages
⏫ 02/2023: Low-Resourced machine translation for Senegalese Wolof language
📰 03/2023: Kàllaama NMT: un ensemble d'outils IA pour rendre le numérique plus inclusif en Afrique
⭐🌐 09/2023: MADLAD-400: A Multilingual And Document-Level Large Audited Dataset
📰🌐 06/2024: 110 new languages are coming to Google Translate
📰 12/2024: LAfricaMobile NMT
⭐⏫ 02/2025: SMOL: Professionally translated parallel data for 115 under-represented languages
📰🌐 11/2025: Wolof among supported languages in DeepL
📰 12/2025: GalsenAI French-Wolof Translator
📰 11/2025: CLAD FirilMa Traducteur
⏫🌐 01/2026: TranslateGemma Technical Report
⏫🌐 01/2026: AfriNLLB: Efficient Translation Models for African Languages 🔃
⏫🌐 03/2026: Omnilingual MT: Machine Translation for 1,600 Languages 🔃

Question Answering and Dialogue Systems [+LLMs]

⏫🌍 05/2022: AfriWOZ: Corpus for Exploiting Cross-Lingual Transfer for Dialogue Generation in Low-Resource, African Languages
⏫🇫🇷 06/2022: Preuve de concept d’un bot vocal dialoguant en wolof (Proof-of-Concept of a Voicebot Speaking Wolof)
⭐🌐 11/2022: BLOOM: A 176B-Parameter Open-Access Multilingual Language Model 🔃
⭐🌍 12/2022: AfroLM: A Self-Active Learning-based Multilingual Pretrained Language Model for 23 African Languages
📰 03/2023: Local Partnership Launches Digital Health Tool to Decrease Hypertension in Senegal

More info on: https://saytutension.sante.sn.
⏫🌍 05/2023: AfriQA: Cross-lingual Open-Retrieval Question Answering for African Languages
⏫ 06/2023: Design, development and usability of an educational AI chatbot for People with Haemophilia in Senegal 🔃
⏫🌍 07/2023: SERENGETI: Massively Multilingual Language Models for Africa
⏫🌍 01/2024: Cheetah: Natural Language Generation for 517 African Languages 🔃
⏫🌍 06/2024: IrokoBench: A New Benchmark for African Languages in the Age of Large Language Models
⏫🌐 06/2024: Fumbling in Babel: An Investigation into ChatGPT’s Language Identification Ability 🔃
⭐🌐 08/2024: Aya Dataset: An Open-Access Collection for Multilingual Instruction Tuning

Wolof was the additional language in the Aya dataset that had to be excluded from training (Üstün et al., 2024).
⏫🌐 08/2024: The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants
⏫🌐 08/2024: Goldfish: Monolingual Language Models for 350 Languages 🔃
⏫ 01/2025: Task-Oriented Dialog Systems for the Senegalese Wolof Language
⏫🌐🔐 01/2025: A Comprehensive Von Willebrand Disease Awareness and Support Chatbot for Senegalese Communities
📰 12/2024: AWA: Senegalese start-up's AI muse speaks in Wolof

A subsequent Awa-Milkyway model has also been announced but has not been published so far.
📰 01/2025: Oolel: A High-Performing Open LLM for Wolof
⏫🌐 04/2025: MultiBLiMP 1.0: A Massively Multilingual Benchmark of Linguistic Minimal Pairs 🔃
⏫🌍 06/2025: The State of Large Language Models for African Languages: Progress and Challenges

This paper reports that AfriTeva and AfroXLMR support Wolof, which is not the case; this may be a mistake.
⏫🌐 07/2025: Where Are We? Evaluating LLM Performance on African Languages
⏫🌐 09/2025: MERLIN: Multi-Stage Curriculum Alignment for Multilingual Encoder-LLM Integration in Cross-Lingual Reasoning 🔃
📰🌐 02/2026: Tiny AYA: Making Multilingual AI Accessible 🔃

Pre-training corpus

⭐🌐 01/2022: Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets
⭐🌐 09/2023: MADLAD-400: A Multilingual And Document-Level Large Audited Dataset
⏫🌐 06/2025: FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language

Speech Processing

Automatic Speech Recognition (ASR)

⏫ 04/2011: Speech Recognition and Text-to-speech Solution for Vernacular Languages
⏫🌐 09/2015: Speech Technologies for African Languages: Example of a Multilingual Calculator for Education
⏫🌐 07/2016: Automatic speech recognition for African languages with vowel length contrast
⏫🌐 07/2016: Speed perturbation and vowel duration modeling for ASR in Hausa and Wolof languages
⏫🌐 06/2017: Machine Assisted Analysis of Vowel Length Contrasts in Wolof
⭐🌐 08/2016: Collecting Resources in Sub-Saharan African Languages for Automatic Speech Recognition: a Case Study of Wolof
⏫🌍 04/2021: AI4D -- African Language Program
📰 06/2022: Wav2vec 2.0 with CTC/Attention trained on DVoice Wolof (No LM)
📰 03/2023: Kàllaama ASR: un ensemble d'outils IA pour rendre le numérique plus inclusif en Afrique
⭐🌐 05/2023: Scaling Speech Technology to 1,000+ Languages
⏫🌍 06/2023: Towards hate speech detection in low-resource languages: Comparing ASR to acoustic word embeddings on Wolof and Swahili
📰 08/2023: Wolof Subtitles Generator
📰 11/2023: OpenAI Whisper and Meta MMS models on fula language
⏫🌐 04/2024: Kallaama: A Transcribed Speech Dataset about Agriculture in the Three Most Widely Spoken Languages in Senegal
⏫🌐 04/2024: Africa-centric self-supervised pre-training for multilingual speech representation in a sub-saharan context
⏫🌐 04/2024: Self-supervised and multilingual learning applied to the Wolof, Swahili and Fongbe
📰 05/2024: Senegalese startup Lengo brings AI to informal retailers
⏫ 06/2024: State-of-the-Art Review on Recent Trends in Automatic Speech Recognition
⏫🌐🇫🇷 07/2024: Représentation de la parole multilingue par apprentissage auto-supervisé dans un contexte subsaharien
📰 08/2024: ASR-Africa's Collections - Fula
📰 09/2024: ASR-Africa's Collections - Wolof
⏫🌍 11/2024: Multilingual speech recognition initiative for African languages
📰 11/2024: Orange to expand open-source AI models to African regional languages for digital inclusion
📰 12/2024: LAfricaMobile STT
📰 01/2025: Caytu Whosper-large-v2
⏫🌍 07/2025: Synthetic Voice Data for Automatic Speech Recognition in African Languages
📰 09/2025: Breaking Language Barriers in African Healthcare: Fine-Tuning Speech Recognition for Wolof and Hausa in Maternal and Reproductive Health

The poster can be viewed here.
📰 09/2025: Benchmarking Automatic Speech Recognition Models for African Languages
⏫ 09/2025: WolBanking77: Wolof Banking Speech Intent Classification Dataset
⏫ 09/2025: Speech Language Models for Under-Represented Languages: Insights from Wolof
⏫🌐 11/2025: Omnilingual ASR: Open-Source Multilingual Speech Recognition for 1600+ Languages
⏫🌐 11/2025: Voice of a Continent: Mapping Africa’s Speech Technology Frontier 🔃
📰🌐 12/2025: ElevenLabs' Scribe v2: Wolof Speech to Text Transcription 🔃
📰🌍 02/2026: PazaBench: ASR leaderboard for low-resource languages 🔃

Speech Synthesis / Text To Speech (TTS)

⏫ 04/2011: Speech Recognition and Text-to-speech Solution for Vernacular Languages
📰 10/2020: Building Wolof Text To Speech System
⏫🌍 07/2022: Building African Voices
📰 03/2023: Kàllaama TTS: un ensemble d'outils IA pour rendre le numérique plus inclusif en Afrique
📰 09/2024: Wolof TTS
📰 12/2024: LAfricaMobile TTS
📰 02/2025: Adia_TTS Wolof
📰 06/2025: TTS-WOLOF : Building Inclusive Voice AI for African Languages – The Wolof Case
📰 03/2026: Oolel-Voice: a high-quality text-to-speech model purpose-built for Wolof 🔃

Spoken Dialogue Systems

Spoken Language Understanding (SLU)

📰 09/2020: Keyword Spotting with African Languages

The first and, so far, only research project that targeted all six main Senegalese languages.
⏫🇫🇷 06/2022: Preuve de concept d’un bot vocal dialoguant en wolof (Proof-of-Concept of a Voicebot Speaking Wolof)
⏫🌍 06/2023: Towards hate speech detection in low-resource languages: Comparing ASR to acoustic word embeddings on Wolof and Swahili
⏫ 09/2025: WolBanking77: Wolof Banking Speech Intent Classification Dataset

Speech Language Models (SLMs)

⏫ 09/2025: Speech Language Models for Under-Represented Languages: Insights from Wolof

Multi-task Benchmark

⏫🌐 05/2023: XTREME-UP: A User-Centric Scarce-Data Benchmark for Under-Represented Languages

The paper reports that it covers ASR, NER, and MT tasks for Wolof, but no Wolof training data for translation was found in the dataset.

Citation

If this work was useful for your research, please cite the paper as:

@article{sen-nlp-survey,
	doi = {10.20944/preprints202601.1124.v1},
	url = {https://doi.org/10.20944/preprints202601.1124.v1},
	year = 2026,
	month = {January},
	publisher = {Preprints},
	author = {Derguene Mbaye and Tatiana D. P. Mbengue and Madoune R. Seye and Moussa Diallo and Mamadou L. Ndiaye and Dimitri S. Adjanohoun and Djiby Sow and Cheikh S. Wade and Jean-Claude B. Munyaka and Jerome Chenal},
	title = {Opportunities and Challenges of Natural Language Processing for Low-Resource Senegalese Languages in Social Science Research},
	journal = {Preprints}
}

Feel free to also leave a star 🌟️
