Where and how language models (LMs) are deployed determines who can benefit from them. However, there are several challenges that prevent effective deployment of LMs in non-English-speaking and hardware-constrained communities in the Global South. We call this challenge the last mile: the intersection of multilinguality and edge deployment, where the goals are aligned but the technical requirements often compete. Studying these two fields together is both a need, as linguistically diverse communities often face the most severe infrastructure constraints, and an opportunity, as edge and multilingual NLP research remain largely siloed.
*Figure: Competing requirements at each step of the language modelling pipeline. Edge LM deployment imposes constraints on memory, compute, and energy, which can conflict with the requirements for building capable multilingual models.*
To understand the state of the art and the challenges of combining the two areas, we survey 232 papers that tackle this problem across the language modelling pipeline, from data collection to development and deployment. This repository contains all the experiments and analyses for this work and serves as an accompaniment for the survey paper, "Multilinguality at the Edge: Developing Language Models for the Global South".
This project uses uv for dependency management.
```sh
git clone git@github.com:ljvmiranda921/multilingual-edge-nlp.git
cd multilingual-edge-nlp
uv sync
```

To generate figures, run `python -m analysis.<script_name>`.
To scrape papers from Semantic Scholar, run `python -m scripts.s2_scrape`.
You can optionally set `S2_API_KEY` in `.env` for higher rate limits, but unauthenticated requests work fine for moderate volumes.
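For reference, the corresponding `.env` entry might look like the following (the value is a placeholder; obtain a real key from Semantic Scholar):

```shell
# .env (placeholder value)
S2_API_KEY=your-semantic-scholar-api-key
```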
Use `--query_names` to select specific queries (`multilingual`, `efficient`, `intersection`), or omit it to run all three.

```sh
python -m scripts.s2_scrape --year 2020 --min_citations 5
```

To annotate papers using an LLM, run `python -m scripts.llm_annotate`.
We use Azure OpenAI by default. Set `AZURE_OPENAI_ENDPOINT` and `AZURE_OPENAI_API_KEY` in `.env`, then pass `--use_azure` with the deployment name as `--model`:

```sh
python -m scripts.llm_annotate --use_azure --model gpt-4.1-mini
```

You can also use the OpenAI API directly by setting `OPENAI_API_KEY` in `.env` and omitting the `--use_azure` flag.
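For reference, the relevant `.env` entries might look like the following (all values are placeholders; set either the Azure pair or `OPENAI_API_KEY`, depending on which backend you use):

```shell
# .env (placeholder values)
AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com/
AZURE_OPENAI_API_KEY=your-azure-key
# Or, when using the OpenAI API directly:
OPENAI_API_KEY=sk-your-openai-key
```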
In this section, we list the data sources used to build some of the supporting figures.
You should be able to replicate these figures by running `python -m analysis.<script_name>`.
- Share of population in range of mobile network: International Telecommunication Union, processed by Our World in Data
- Living languages per country: SIL International, Ethnologue (28th edition), processed by Our World in Data
- ICT adoption per 100 people: International Telecommunication Union via World Bank World Development Indicators, processed by Our World in Data
- World Bank income groups: World Bank Country and Lending Groups, processed by Our World in Data
- Research papers: downloaded using the Semantic Scholar API
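To illustrate the kind of request involved, here is a minimal sketch that builds a query URL for the public Semantic Scholar Graph API paper-search endpoint. The query string and fields below are illustrative assumptions, not the exact ones used by `scripts/s2_scrape.py`:

```python
from urllib.parse import urlencode

BASE = "https://api.semanticscholar.org/graph/v1/paper/search"

def build_search_url(query: str, year: int, fields: list[str]) -> str:
    """Build a paper-search URL; `year-` means papers from `year` onward."""
    params = {
        "query": query,
        "year": f"{year}-",
        "fields": ",".join(fields),
    }
    return f"{BASE}?{urlencode(params)}"

url = build_search_url(
    "multilingual edge deployment",  # illustrative query, not the survey's
    year=2020,
    fields=["title", "year", "citationCount"],
)
print(url)
```

Fetching this URL (e.g. with `requests.get(url).json()`) returns a paginated list of matching papers.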
LJVM and AK acknowledge the support of the UKRI Frontier Grant EP/Y031350/1 (EQUATE). LJVM would also like to thank the Microsoft Research Grant for the Azure credits used to access GPT-4.1.
