GitHub - bootphon/discophon: The Phoneme Discovery benchmark

Benchmarking the Unsupervised Discovery of Phoneme Inventories With Discrete Speech Units

DiscoPhon is a multilingual benchmark evaluating unsupervised phoneme discovery from discrete speech units. Given only 10 hours of speech in an unseen language, models must produce discrete units that map to a predefined phoneme inventory.
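
To make the task concrete, here is a toy sketch of one simple way to map discrete units onto a phoneme inventory: a many-to-one majority vote over frame-aligned labels. This is an illustration only and is not taken from the DiscoPhon code; the function names `majority_vote_mapping` and `frame_accuracy` and the example data are hypothetical, and the benchmark's actual mapping and metrics may differ.

```python
from collections import Counter, defaultdict


def majority_vote_mapping(units, phonemes):
    """Map each discrete unit to the phoneme it co-occurs with most often.

    Hypothetical helper for illustration; assumes `units` and `phonemes`
    are aligned frame by frame.
    """
    if len(units) != len(phonemes):
        raise ValueError("units and phonemes must have the same length")
    counts = defaultdict(Counter)
    for unit, phoneme in zip(units, phonemes):
        counts[unit][phoneme] += 1
    # Keep only the most frequent phoneme for every unit (many-to-one).
    return {unit: c.most_common(1)[0][0] for unit, c in counts.items()}


def frame_accuracy(units, phonemes, mapping):
    """Fraction of frames whose mapped unit equals the reference phoneme."""
    hits = sum(mapping[u] == p for u, p in zip(units, phonemes))
    return hits / len(units)


# Made-up frame-level data: unit ids from a model, reference phoneme labels.
units = [3, 3, 7, 7, 7, 1, 1, 3]
phonemes = ["a", "a", "t", "t", "a", "s", "s", "a"]

mapping = majority_vote_mapping(units, phonemes)
print(mapping)
print(frame_accuracy(units, phonemes, mapping))
```

In this toy example several units could share the same phoneme, which is why the mapping is many-to-one rather than a bijection.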

Getting started

DiscoPhon requires Python ≥ 3.12 and has no system dependencies.

Install this package:

pip install discophon              # core: data preparation and phoneme discovery
pip install "discophon[abx]"       # adds ABX discriminability (fastabx)
pip install "discophon[baselines]" # adds the baseline models
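
After installing, a quick environment check can confirm the requirements stated above. The following is a minimal sketch using only the Python standard library; `check_environment` is a hypothetical helper, not part of DiscoPhon, and it assumes the distribution is named `discophon` and that the ABX extra exposes an importable `fastabx` module.

```python
import sys
from importlib import metadata, util

MIN_PYTHON = (3, 12)  # minimum version stated in this README


def check_environment(version_info=sys.version_info):
    """Report whether Python is recent enough and which packages are present."""
    report = {"python_ok": tuple(version_info[:2]) >= MIN_PYTHON}
    try:
        # Installed version of the core package, if any.
        report["discophon"] = metadata.version("discophon")
    except metadata.PackageNotFoundError:
        report["discophon"] = None
    # Assumption: the [abx] extra installs a module importable as `fastabx`.
    report["fastabx"] = util.find_spec("fastabx") is not None
    return report


print(check_environment())
```

A `None` value for `discophon` means the core package is not installed in the active environment.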

Follow the tutorials to download data, evaluate models, and prepare your submission.
See also the current leaderboard.

References

@misc{poli2026discophon,
  title={{DiscoPhon}: Benchmarking the Unsupervised Discovery of Phoneme Inventories With Discrete Speech Units},
  author={Maxime Poli and Manel Khentout and Angelo Ortiz Tandazo and Ewan Dunbar and Emmanuel Chemla and Emmanuel Dupoux},
  year={2026},
  eprint={2603.18612},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2603.18612},
}

