DerXter/State-of-NLP-Research-in-Senegal

Opportunities and Challenges of Natural Language Processing for Low-Resource Senegalese Languages in Social Science Research 🇸🇳 (cf. Survey Paper)

This work presents a comprehensive survey of natural language processing (NLP) for six major Senegalese national languages: Wolof, Pulaar, Sereer, Mandinka, Soninké, and Joola. We provide an overview of the current state of research across key NLP tasks and highlight persistent challenges related to data scarcity, orthographic variation, and linguistic diversity. In addition, we introduce this centralized and openly accessible repository that compiles existing datasets, benchmarks, and tools available for these languages. The repository is designed as a living resource to be periodically expanded through community contributions. Our objective is to map existing efforts, identify critical research gaps, and encourage the development of sustainable, inclusive NLP research for Senegal's national languages.

If you are interested in the state of the art of NLP research in African languages more broadly, you can take a look at this comprehensive, two-decade survey of AfricaNLP research (2005–2025), analyzing publications, authors, affiliations, supporters, NLP topics, and tasks. The major linguistic and sociopolitical challenges that hinder the development of NLP technologies for African languages are discussed in this Afrocentric NLP paper.

Taxonomy

Our taxonomy is inspired by the one proposed in the Awesome AI Papers repository. We chose subjective thresholds on the number of citations and use a set of icons to indicate which criteria each paper meets.

⭐ Important Paper : more than 50 citations and state of the art results.

⏫ Trend : 1 to 50 citations, innovative paper with growing adoption.

πŸ“° Important Article : decisive work that was not accompanied by a research paper.

An added 🔐 icon means that no open version of the article was found (locked access). The 🌍 icon means the paper was published by non-Senegalese African authors, and the 🌐 icon indicates papers published by foreign authors (outside of Africa). We consider a paper to be regional 🌍 or international 🌐 if:

  • the first author of the paper is African (non-Senegalese) or foreign; or
  • the project leading to the paper was launched by Africans or foreigners.

    e.g. The MasakhaPOS paper has a Senegalese first author, but the overall research project was launched by Masakhane.

Finally, since Senegal is a French-speaking country, some of the articles were written in French. We thus added the 🇫🇷 icon to highlight those papers.

A paper may also appear in multiple sections if it covers various domains, tasks, and/or modalities.
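The citation thresholds and origin rules above can be sketched in a few lines of Python. This is only an illustration, not code from this repository: the `Paper` class, its field names, and the origin labels `"senegal"`, `"africa"`, and `"foreign"` are hypothetical, and the choice to let 🌐 take precedence over 🌍 when both apply is an assumption. The qualitative criteria (state-of-the-art results, innovation, growing adoption) are left to human judgment and are not modeled.

```python
from dataclasses import dataclass


@dataclass
class Paper:
    """Hypothetical record for one entry in the publication list."""
    title: str
    citations: int
    has_paper: bool = True                 # False for work published without a research paper
    first_author_origin: str = "senegal"   # "senegal", "africa", or "foreign" (assumed labels)
    project_origin: str = "senegal"        # who launched the underlying project


def citation_icon(paper: Paper) -> str:
    """Map citation counts to the tier icons (thresholds taken from the taxonomy)."""
    if not paper.has_paper:
        return "📰"   # decisive work not accompanied by a research paper
    if paper.citations > 50:
        return "⭐"   # more than 50 citations
    if paper.citations >= 1:
        return "⏫"   # 1 to 50 citations
    return ""         # uncited papers get no tier icon


def scope_icon(paper: Paper) -> str:
    """Regional or international if either the first author or the project qualifies."""
    origins = {paper.first_author_origin, paper.project_origin}
    if "foreign" in origins:
        return "🌐"   # assumption: international takes precedence when both apply
    if "africa" in origins:
        return "🌍"
    return ""


# Mirrors the MasakhaPOS case: Senegalese first author, project launched
# by a non-Senegalese African initiative. The citation count is a placeholder.
example = Paper("Example paper", citations=10, project_origin="africa")
print(citation_icon(example), scope_icon(example))
```

Note that the thresholds are strict at the boundary: a paper with exactly 50 citations still falls in the Trend tier.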


Datasets 🔃

  • The Online Wolof Data repository tracks and centralizes all openly accessible datasets as well as potential data sources on the Wolof language.
  • We extended this Wolof repository to the other five national languages in the Datasets file.

NLP Tools

| Name | Covered tasks | Languages supported |
|---|---|---|
| Wolof keyboards | Keyboards for macOS, Android and Apple mobile | Wolof |
| Stanza (Qi et al., 2020) | Part-of-speech (POS) tagging, morphological features tagging, dependency parsing | Wolof |
| MorphScore (Arnett et al., 2025) | Morphological alignment evaluation | Wolof |
| Wolof Fill_mask | Masked Language Modeling | Wolof |
| Common Voice (Ardila et al., 2019), DVoice (Allak et al., 2021) | Speech data collection | Wolof |
| AfroLID, GlotLID, AfroScope | Language identification | Wolof |

Publications

For some articles, we were unable to find open versions; these are highlighted by the 🔐 icon.

If you need access to some of the locked papers, feel free to reach out at derguenembaye[at]esp[dot]sn.

Digraphs

Parsing & Tokenization

Language Identification

Linguistic Similarity, Embeddings & Cross-Lingual Transfer

Token Classification

POS Tagging

Named Entity Recognition

Text Classification

Opinion Mining / Sentiment Analysis

Hate Speech Detection

Intent Classification

Lexicons and Spell Checking

Machine Translation

Question Answering and Dialogue Systems [+LLMs]

Pre-training corpus

Speech Processing

Automatic Speech Recognition (ASR)

Speech Synthesis / Text To Speech (TTS)

Spoken Dialogue Systems

Spoken Language Understanding (SLU)
Speech Language Models (SLMs)

Multi-task Benchmark

Citation

If this work was useful for your research, please cite the paper as:

@article{sen-nlp-survey,
	doi = {10.20944/preprints202601.1124.v1},
	url = {https://doi.org/10.20944/preprints202601.1124.v1},
	year = 2026,
	month = {January},
	publisher = {Preprints},
	author = {Derguene Mbaye and Tatiana D. P. Mbengue and Madoune R. Seye and Moussa Diallo and Mamadou L. Ndiaye and Dimitri S. Adjanohoun and Djiby Sow and Cheikh S. Wade and Jean-Claude B. Munyaka and Jerome Chenal},
	title = {Opportunities and Challenges of Natural Language Processing for Low-Resource Senegalese Languages in Social Science Research},
	journal = {Preprints}
}

Feel free to also leave a star 🌟

About

First comprehensive survey of NLP work on Senegalese languages, covering various tasks and applications in the social sciences.
