common-voice/cv-datasheets

cv-datasheets

Datasheets are documents that accompany a dataset and make it more useful to the developers, researchers, and others who work with it. We're working with the Common Voice language communities to support the creation of a datasheet for each of the languages in the Common Voice dataset.

Community contribution

The Common Voice team believes that language communities are the experts who should steer the direction of their languages on Common Voice. We're asking members of the hundreds of language communities across Common Voice to help create datasheets for their language(s), because they are best qualified to describe the features of those languages.

Example datasheet: The datasheet for Norsk Nynorsk (Norwegian Nynorsk, nn-NO) is a helpful guide to the kind of information to include in a new datasheet. The recorded hours, validated hours, and demographics reported in a datasheet are generated automatically.

Because Common Voice Scripted Speech and Spontaneous Speech are separate datasets, languages that collect data in both modes need a separate datasheet for each.

Right now, the information needed to generate new datasheets can be submitted via a Google Form or via a pull request on this GitHub repository. If you haven't used GitHub before, we recommend following the instructions below to submit a datasheet for your language via the Google Form.

If you can't use either the Google Form or the GitHub contribution process, please email the Common Voice team at [email protected] for support.

Via a form

To add a new datasheet for your language in either the Scripted Speech or the Spontaneous Speech dataset, please click the appropriate link below and follow the instructions in the form.

Currently these forms are available in English and Spanish.

Scripted/Read Speech:

Spontaneous Speech:

Directly on GitHub

  1. Fork and Clone: Start by forking this repository and then cloning your fork to your local machine.
  2. Locate your Language Datasheet: Navigate to the appropriate directory:
    • scs/ for Scripted Speech datasheets.
    • sps/ for Spontaneous Speech datasheets.
  3. Edit the Draft: Open the draft .md file of your language. Follow the pointers and descriptions in the file's comments to add all required information.
  4. Submit a Pull Request: Commit your changes to your fork and open a pull request to add your language's .md datasheet to the main repository.
  • Alternative Method (Email): If you are unable to use GitHub, you may download the draft file, edit it locally, and email the completed .md file to [email protected].

Please note that fields marked OPTIONAL in the guidance comments don't need to be completed as part of your datasheet submission. Fields marked AUTOMATICALLY GENERATED in the guidance comments will be generated by our systems and you can skip these sections.
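Before submitting, it can help to list which guidance comments in a draft still need your attention. The sketch below is only an illustration: it assumes the guidance comments are HTML comments (`<!-- ... -->`) and that the words OPTIONAL and AUTOMATICALLY GENERATED appear verbatim inside them. Both are assumptions about the draft format, and `classify_guidance` is a hypothetical helper, not part of this repository.

```python
import re

# Assumption: guidance comments in the draft .md files are HTML comments.
COMMENT_RE = re.compile(r"<!--(.*?)-->", re.DOTALL)


def classify_guidance(markdown_text):
    """Group guidance comments into 'auto', 'optional' and 'required'."""
    groups = {"auto": [], "optional": [], "required": []}
    for match in COMMENT_RE.finditer(markdown_text):
        comment = " ".join(match.group(1).split())  # collapse whitespace
        if "AUTOMATICALLY GENERATED" in comment:
            groups["auto"].append(comment)
        elif "OPTIONAL" in comment:
            groups["optional"].append(comment)
        else:
            # Anything not marked otherwise is treated as required.
            groups["required"].append(comment)
    return groups


if __name__ == "__main__":
    # Made-up sample text, not copied from a real draft datasheet.
    sample = """
## Demographics
<!-- AUTOMATICALLY GENERATED: you can skip this section -->

## Variants
<!-- OPTIONAL: describe the variants of your language -->

## Writing system
<!-- Describe the writing system used in the dataset -->
"""
    for kind, comments in classify_guidance(sample).items():
        print(kind, len(comments))
```

Comments that land in the `required` group are the ones to fill in; the other two groups can be skipped as described above.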

Internal process

The instructions below are for use by the Common Voice team. Community members contributing datasheets don't need to worry about this section.

  • The draft datasheets are generated from a template and the language metadata
  • The draft datasheets are then given to community members to edit and adapt
  • The edited datasheets that we receive back are added to the repository in the final/ directory
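
The first step above, filling a template from language metadata, can be sketched roughly as follows. This is only an illustration: the column names (`locale`, `name`), the `$placeholder` template syntax, and the `render_drafts` helper are hypothetical and are not taken from `generate_datasheet.py` or the metadata files.

```python
import csv
import io
from string import Template

# Hypothetical template; the real template and its placeholders may differ.
TEMPLATE = Template("# $name ($locale)\n\n<!-- Describe the language here -->\n")


def render_drafts(tsv_text):
    """Return {locale: draft_markdown} for each row of a TSV string."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    # Each row is a dict keyed by the (assumed) column headers.
    return {row["locale"]: TEMPLATE.substitute(row) for row in reader}


if __name__ == "__main__":
    # Made-up metadata with hypothetical columns.
    metadata = "locale\tname\nnn-NO\tNorsk Nynorsk\n"
    for locale, draft in render_drafts(metadata).items():
        print(f"--- {locale}.md ---")
        print(draft)
```

Keeping the template separate from the metadata means a change to the datasheet layout only has to be made once and then regenerated for every language.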

Scripts

Usage:

Generate the draft datasheets:

python3 generate_datasheet.py metadata/metadata.tsv metadata/datasheet-languages.tsv 23.0-2025-09-17 cv-corpus 
