[Paper] [HuggingFace Collection]
Supervised finetuning (SFT) has been a dominant approach in building multilingual language models (LMs). Central to its success is the availability of high-quality multilingual datasets. However, collecting this data from native speakers demands substantial human effort and resources, creating a bottleneck in LM development.
Synthetic data generation has been an appealing alternative to human annotation, in part due to its cost-efficiency (you just need an API call) and scalability (you can generate thousands of examples in a day). However, research on how to leverage synthetic data pipelines to create high-quality multilingual data remains scarce.
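To make the "just an API call" point concrete, here is a minimal sketch of requesting one synthetic training example from a teacher model over an OpenAI-compatible chat endpoint. The model name, prompt, and target language are illustrative placeholders, not the pipeline used in this work:

# Minimal sketch: ask a teacher model for one synthetic SFT example.
# Model, prompt, and target language are placeholders.
curl https://api.openai.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -d '{
    "model": "gpt-4o-mini",
    "messages": [
      {"role": "user",
       "content": "Write one instruction and its response in Swahili, suitable for supervised finetuning."}
    ]
  }'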
In this work, we ask: "What makes a good multilingual teacher for synthetic data generation?" Specifically, we perform a comprehensive analysis of several language models, evaluating both the quality of the data they generate as teachers and the performance gains of the resulting student models on downstream benchmarks.
Figure: Overview of the Polyglot Score and how it fits into the distillation workflow.
Make sure that you have uv on your system (see the download instructions).
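If uv is not yet installed, the standalone installer is one option (see the uv documentation for platform-specific alternatives):

curl -LsSf https://astral.sh/uv/install.sh | sh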
To install all dependencies, run the following commands:
git submodule update --init --recursive --depth 1
uv sync --dev
# When training on TPUs and developing models via tunix
# uv sync --extra tpu
# When doing evaluations via lighteval
# uv sync --extra eval
source .venv/bin/activate
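As a quick sanity check that the sync worked (assuming torch is part of the default dependency set, as the Isambard notes below suggest):

uv run python -c "import torch; print(torch.__version__)"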
Isambard is a bit different because the login node doesn't have a GPU: if you sync normally, uv will install the CPU builds of PyTorch, which will mess up your virtual environment. Instead, run these commands:

uv sync \
    --no-install-package triton \
    --no-install-package torch \
    --no-install-package torchaudio \
    --no-install-package torchvision \
    --no-install-package vllm \
    --no-install-package llama-cpp-python \
    --no-install-package ctranslate2
Then submit the sync job via SLURM:

sbatch experiments/slurm_submit.isambard experiments/jobs/sync_isambard.sh
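The submitted job can be monitored with standard SLURM commands (not specific to this repository), for example:

squeue -u $USER   # list your queued and running jobs
sacct -j <jobid>  # check a job's state after it finishes (substitute the job ID)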
For more information on how to use this codebase, please refer to the documentation. For information on running experiments on the cluster, see the experiment documentation.

LJVM and AK acknowledge the support of the UKRI Frontier Grant EP/Y031350/1 (EQUATE). This work was performed using joint resources provided by the Cambridge Service for Data Driven Discovery (CSD3) EP/T022159/1, the Isambard AI National AI Research Resource (AIRR) ST/AIRR/I-A-I/1023, and the Microsoft Research Grant.
@misc{miranda2026polyglotteachersevaluatinglanguage,
title={Polyglot Teachers: Evaluating Language Models for Multilingual Synthetic Data Generation},
author={Lester James V. Miranda and Ivan Vulić and Anna Korhonen},
year={2026},
eprint={2604.11290},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2604.11290},
}
