Where and how language models (LMs) are deployed determines who can benefit from them. However, there are several challenges that prevent effective deployment of LMs in non-English-speaking and hardware-constrained communities in the Global South. We call this challenge the last mile: the intersection of multilinguality and edge deployment, where the goals are aligned but the technical requirements often compete. Studying these two fields together is both a need, as linguistically diverse communities often face the most severe infrastructure constraints, and an opportunity, as edge and multilingual NLP research remain largely siloed.
*Figure: Competing requirements at each step of the language modelling pipeline. Edge LM deployment imposes constraints on memory, compute, and energy, which can conflict with the requirements for building capable multilingual models.*
To understand the state of the art and the challenges of combining the two areas, we survey 232 papers that tackle this problem across the language modelling pipeline, from data collection to development and deployment. This repository contains all the experiments and analyses for this work and serves as an accompaniment for the survey paper, "Multilinguality at the Edge: Developing Language Models for the Global South".
This project uses uv for dependency management.
```sh
git clone git@github.com:ljvmiranda921/multilingual-edge-nlp.git
cd multilingual-edge-nlp
uv sync
```

To generate figures, run `python -m analysis.<script_name>`.
To scrape papers from Semantic Scholar, run `python -m scripts.s2_scrape`.
You can optionally set `S2_API_KEY` in `.env` for higher rate limits, but unauthenticated requests work fine for moderate volumes.
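For reference, the corresponding `.env` entry might look like the following (the value is a placeholder; obtain a real key from Semantic Scholar):

```shell
# .env (placeholder value)
S2_API_KEY=your-semantic-scholar-api-key
```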
Use `--query_names` to select specific queries (`multilingual`, `efficient`, `intersection`), or omit it to run all three.

```sh
python -m scripts.s2_scrape --year 2020 --min_citations 5
```

To annotate papers using an LLM, run `python -m scripts.llm_annotate`.
We use Azure OpenAI by default. Set `AZURE_OPENAI_ENDPOINT` and `AZURE_OPENAI_API_KEY` in `.env`, then pass `--use_azure` with the deployment name as `--model`:

```sh
python -m scripts.llm_annotate --use_azure --model gpt-4.1-mini
```

You can also use the OpenAI API directly by setting `OPENAI_API_KEY` in `.env` and omitting the `--use_azure` flag.
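For reference, the relevant `.env` entries might look like the following (all values are placeholders; set either the Azure pair or `OPENAI_API_KEY`, depending on which backend you use):

```shell
# .env (placeholder values)
AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com/
AZURE_OPENAI_API_KEY=your-azure-key
# Or, when using the OpenAI API directly:
OPENAI_API_KEY=sk-your-openai-key
```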
In this section, we list the data sources used to build some of the supporting figures.
You should be able to replicate these figures by running `python -m analysis.<script_name>`.
- Share of population in range of mobile network: International Telecommunication Union, processed by Our World in Data
- Living languages per country: SIL International, Ethnologue (28th edition), processed by Our World in Data
- ICT adoption per 100 people: International Telecommunication Union via World Bank World Development Indicators, processed by Our World in Data
- World Bank income groups: World Bank Country and Lending Groups, processed by Our World in Data
- Research papers: downloaded using the Semantic Scholar API
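To illustrate the kind of request involved, here is a minimal sketch that builds a query URL for the public Semantic Scholar Graph API paper-search endpoint. The query string and fields below are illustrative assumptions, not the exact ones used by `scripts/s2_scrape.py`:

```python
from urllib.parse import urlencode

BASE = "https://api.semanticscholar.org/graph/v1/paper/search"

def build_search_url(query: str, year: int, fields: list[str]) -> str:
    """Build a paper-search URL; `year-` means papers from `year` onward."""
    params = {
        "query": query,
        "year": f"{year}-",
        "fields": ",".join(fields),
    }
    return f"{BASE}?{urlencode(params)}"

url = build_search_url(
    "multilingual edge deployment",  # illustrative query, not the survey's
    year=2020,
    fields=["title", "year", "citationCount"],
)
print(url)
```

Fetching this URL (e.g. with `requests.get(url).json()`) returns a paginated list of matching papers.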
LJVM and AK acknowledge the support of the UKRI Frontier Grant EP/Y031350/1 (EQUATE). LJVM would also like to thank the Microsoft Research Grant for the Azure credits used to access GPT-4.1.
