# Multilinguality at the Edge: Developing Language Models for the Global South

[Paper] [Website]

Where and how language models (LMs) are deployed determines who can benefit from them. However, there are several challenges that prevent effective deployment of LMs in non-English-speaking and hardware-constrained communities in the Global South. We call this challenge the last mile: the intersection of multilinguality and edge deployment, where the goals are aligned but the technical requirements often compete. Studying these two fields together is both a need, as linguistically diverse communities often face the most severe infrastructure constraints, and an opportunity, as edge and multilingual NLP research remain largely siloed.

*Figure: Competing requirements at each step of the language modelling pipeline. Edge LM deployment imposes constraints on memory, compute, and energy, which can conflict with the requirements for building capable multilingual models.*

To understand the state of the art and the challenges of combining the two areas, we survey 232 papers that tackle this problem across the language modelling pipeline, from data collection to development and deployment. This repository contains all the experiments and analyses for this work and serves as a companion to the survey paper, "Multilinguality at the Edge: Developing Language Models for the Global South".

## Installation & Usage

This project uses `uv` for dependency management.

```sh
git clone git@github.com:ljvmiranda921/multilingual-edge-nlp.git
cd multilingual-edge-nlp
uv sync
```

To generate figures, run `python -m analysis.<script_name>`.

To scrape papers from Semantic Scholar, run `python -m scripts.s2_scrape`. You can optionally set `S2_API_KEY` in `.env` for higher rate limits, but unauthenticated requests work fine for moderate volumes. Use `--query_names` to select specific queries (`multilingual`, `efficient`, `intersection`), or omit it to run all three.

```sh
python -m scripts.s2_scrape --year 2020 --min_citations 5
```

To annotate papers using an LLM, run `python -m scripts.llm_annotate`. We use Azure OpenAI by default: set `AZURE_OPENAI_ENDPOINT` and `AZURE_OPENAI_API_KEY` in `.env`, then pass `--use_azure` with the deployment name as `--model`:

```sh
python -m scripts.llm_annotate --use_azure --model gpt-4.1-mini
```

You can also use the OpenAI API directly by setting `OPENAI_API_KEY` in `.env` and omitting the `--use_azure` flag.
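Putting the above together, a minimal `.env` might look like the sketch below. The variable names are the ones documented above; the placeholder values are illustrative, and you only need to fill in the variables for the mode you actually use:

```shell
# Azure OpenAI (used together with the --use_azure flag)
AZURE_OPENAI_ENDPOINT=https://<your-resource>.openai.azure.com/
AZURE_OPENAI_API_KEY=<your-azure-key>

# OpenAI API (used when --use_azure is omitted)
OPENAI_API_KEY=<your-openai-key>

# Optional: higher Semantic Scholar rate limits for scripts.s2_scrape
S2_API_KEY=<your-s2-key>
```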

## Data Sources

This section lists the data sources used to build some of the supporting figures. You should be able to replicate the figures by running `python -m analysis.<script_name>`.

## Acknowledgements

LJVM and AK acknowledge the support of the UKRI Frontier Grant EP/Y031350/1 (EQUATE). LJVM would also like to thank the Microsoft Research Grant for the Azure credits used to access GPT-4.1.
