cv-datasheets

Datasheets are documents that can be offered to help make datasets more valuable to the developers, researchers or other people who use the dataset. We're working with the Common Voice language communities to support the creation of a datasheet for each of the languages in the Common Voice dataset.

Community contribution

The Common Voice team believes that language communities are the experts who should steer the direction of their languages on Common Voice. We're asking you, the members of the hundreds of language communities across Common Voice, to help create datasheets for your language(s), because you are best qualified to describe their features.

Example datasheet: The datasheet for Norsk Nynorsk (Norwegian Nynorsk, nn-NO) can be a helpful guide to what kind of information to include in a new datasheet. The recorded hours, validated hours and demographics for this dataset are generated automatically.

As Common Voice Scripted Speech and Spontaneous Speech are separate datasets, separate datasheets will need to be created for languages that collect data in both of these modes.

Right now, the information needed to generate new datasheets can be submitted via a Google Form or via a Pull Request on this GitHub repository. If you haven't used GitHub before, we recommend following the instructions below to submit a datasheet for your language via the Google Form.

If you can't access the Google form or the GitHub contribution process, please email the Common Voice team at [email protected] for support.

Via a form

To add a new datasheet for your language in either the Scripted Speech or the Spontaneous Speech dataset, please click the appropriate link below and follow the instructions in the form.

Currently these forms are available in English and Spanish.

Scripted/Read Speech:

(en) MCV Datasheet: Scripted speech
(es) MCV Datasheet: Habla leída

Spontaneous Speech:

(en) MCV Datasheet: Spontaneous speech
(es) MCV Datasheet: Habla espontánea

Directly on GitHub

Fork and Clone: Start by forking this repository and then cloning your fork to your local machine.
Locate your Language Datasheet: Navigate to the appropriate directory:
- scs/ for Scripted Speech datasheets.
- sps/ for Spontaneous Speech datasheets.
Edit the Draft: Open the draft .md file of your language. Follow the pointers and descriptions in the file's comments to add all required information.
Submit a Pull Request: Commit your changes to your fork and open a pull request to add your language's .md datasheet to the main repository.
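The directory layout described in the steps above can be sketched as a small Python helper, which may be handy for checking that you are editing the right file in your clone. The scs/ and sps/ directory names come from this README; the function name, the mode keys and the "<locale>.md" file naming are assumptions made for illustration and may not match the actual draft file names.

```python
from pathlib import Path

# Directory names are taken from the steps above.
# The mode keys and the "<locale>.md" file name are assumptions.
DATASHEET_DIRS = {"scripted": "scs", "spontaneous": "sps"}


def draft_path(repo_root, mode, locale):
    """Return the assumed location of a draft datasheet in a local clone."""
    if mode not in DATASHEET_DIRS:
        expected = ", ".join(sorted(DATASHEET_DIRS))
        raise ValueError(f"unknown mode {mode!r}; expected one of: {expected}")
    return Path(repo_root) / DATASHEET_DIRS[mode] / f"{locale}.md"


if __name__ == "__main__":
    # Hypothetical clone location; adjust to wherever you cloned your fork.
    print(draft_path("cv-datasheets", "scripted", "nn-NO"))
```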

Alternative Method (Email): If you are unable to use GitHub, you may download the draft file, edit it locally, and email the completed .md file to [email protected].

Please note that fields marked OPTIONAL in the guidance comments don't need to be completed as part of your datasheet submission. Fields marked AUTOMATICALLY GENERATED in the guidance comments will be generated by our systems and you can skip these sections.
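As a rough aid for working through a draft, the sketch below counts guidance comments by kind so you can see how many sections actually need your input. It assumes the guidance comments are HTML-style comments (`<!-- ... -->`) inside the .md file, which this README does not confirm; the function name and the sample text are hypothetical.

```python
import re

# Assumption: guidance comments are written as HTML comments in the draft.
COMMENT_RE = re.compile(r"<!--(.*?)-->", re.DOTALL)


def classify_guidance(markdown_text):
    """Count guidance comments by the marker they carry."""
    counts = {"automatic": 0, "optional": 0, "required": 0}
    for match in COMMENT_RE.finditer(markdown_text):
        body = match.group(1)
        if "AUTOMATICALLY GENERATED" in body:
            counts["automatic"] += 1  # filled in by the Common Voice systems
        elif "OPTIONAL" in body:
            counts["optional"] += 1   # may be left empty
        else:
            counts["required"] += 1   # needs community input
    return counts


if __name__ == "__main__":
    # Hypothetical draft content, for illustration only.
    sample = (
        "<!-- Describe the language variants. -->\n"
        "<!-- OPTIONAL: list recommended post-processing. -->\n"
        "<!-- AUTOMATICALLY GENERATED: validated hours. -->\n"
    )
    print(classify_guidance(sample))
```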

Internal process

The instructions below are for use by the Common Voice team. Community members contributing datasheets don't need to worry about this section.

The draft datasheets are generated from a template and the language metadata.
The draft datasheets are then given to community members to edit and adapt.
The edited datasheets that we receive back are added to the repository in the final/ directory.
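The first step above, combining a template with per-language metadata, can be illustrated with a minimal Python sketch. This is not the logic of generate_datasheet.py, which may work quite differently; the function name, the `$name`/`$locale` placeholders and the TSV column names are assumptions chosen for the example.

```python
import csv
import io
from string import Template


def render_drafts(template_text, metadata_tsv):
    """Render one draft per metadata row, keyed by the assumed 'locale' column."""
    template = Template(template_text)
    reader = csv.DictReader(io.StringIO(metadata_tsv), delimiter="\t")
    # safe_substitute leaves unknown placeholders in place instead of raising,
    # so missing metadata stays visible in the draft.
    return {row["locale"]: template.safe_substitute(row) for row in reader}


if __name__ == "__main__":
    # Hypothetical template and metadata, for illustration only.
    template_text = "# $name ($locale)\n"
    metadata_tsv = "locale\tname\nnn-NO\tNorsk Nynorsk\n"
    for locale, draft in render_drafts(template_text, metadata_tsv).items():
        print(locale, draft, sep=": ", end="")
```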

Scripts

Usage:

Generate the draft datasheets:

python3 generate_datasheet.py metadata/metadata.tsv metadata/datasheet-languages.tsv 23.0-2025-09-17 cv-corpus
