44 changes: 41 additions & 3 deletions cv-corpus/scs/23.0-2025-09-05/final/en/bnn.md
@@ -5,21 +5,35 @@
for Bunun (`bnn`). The dataset contains 8220 clips representing 12 hours of recorded
speech (11 hours validated) from 18 speakers.

The recordings cover the Indigenous Languages Curriculum (K-12) textbook materials, levels 1 to 9.

## Language
<!-- {{LANGUAGE_DESCRIPTION}} -->
<!-- Provide a brief (1-2 paragraph) description of your language -->
Bunun is the language of the Bunun people, one of the Indigenous peoples of Taiwan.

### Variants
<!-- {{VARIANT_DESCRIPTION}} -->
<!-- @ OPTIONAL @ -->
<!-- Describe the variants (MCV variants) of your language -->

This speech corpus includes the following dialect groups:

- Takituduh (`takitudu`)
- Takibakha (`bakha`)
- Isbukun (`bubukun`)
- Takbanuaz (`banuaz`)
- Takivatan (`vatan`)

## Demographic information
The dataset includes the following distribution of age and gender.
<!-- You can get a lot of the information in this section from https://analyzer.cv-toolbox.web.tr/browse -->

### Gender
Self-declared gender information, percentage refers to the number of clips annotated with this gender.

(The MozTW / Wikimedia Taiwan Indigenous language recording project in early 2025 did not collect this information; therefore, these figures may be relatively inaccurate.)

| Gender | Percentage |
|-|-|
| Undefined | 18.0% |
@@ -36,6 +50,9 @@ Self-declared gender information, percentage refers to the number of clips annot

### Age
Self-declared age information, percentage refers to the number of clips annotated with this age band.

(The MozTW / Wikimedia Taiwan Indigenous language recording project in early 2025 did not collect this information; therefore, these figures may be relatively inaccurate.)

| Age Band | Percentage |
|-|-|
| Undefined | 60.0% |
@@ -94,6 +111,12 @@ Asa tu makadim mas haini-anan, aupa kadukuun laupakadau a hanian
<!-- @ OPTIONAL @ -->
<!-- A list of sentence sources, can be curated to the top-N -->

The recording texts are taken from the Indigenous Languages Curriculum (K-12) textbook content (in romanization) for stages 1 to 9. They were uploaded by Wikimedia Taiwan under authorization from the K-12 Education Administration, Ministry of Education (Taiwan, ROC) at https://www.k12ea.gov.tw. Special thanks to Deputy Minister Ping-Cheng Yeh for facilitating the authorization.

For the Isbukun dialect, 118 selected verses from the Bunun Bible (see https://cb.fhl.net) are also included, thanks to authorization by The Bible Society in Taiwan.

During the recording project, we noticed that some text contained semantic mismatches or typographical/spelling mistakes. Due to Common Voice system constraints, these issues could not be fixed in advance and recordings proceeded as-is. Recorders and textbook providers collaborated closely; we note this for transparency.

### Text domains
| Domain | Count |
|-|-|
@@ -136,21 +159,34 @@ information.
## Get involved!

### Community links
* [Common Voice translators on Pontoon](https://pontoon.mozilla.org/bnn/common-voice/contributors/)
<!-- {{COMMUNITY_LINKS_LIST}} -->
<!-- @ OPTIONAL @ -->
<!-- Links to community chats / fora -->

MozTW (Mozilla Taiwan) Common Voice project site: https://moztw.org/commonvoice/

For questions, suggestions, outreach, donating text, or collaboration, please reach out via:

- Telegram group: https://t.me/+gvmHEcAtd-IwNzFl
- Line group: https://line.me/ti/g/_PLyjCSe_8

Communities involved in the 2025 Indigenous language recording project:

- Wikimedia Taiwan: https://www.facebook.com/wikimedia.tw
- Special thanks to Bunun teacher Aping (Wu A-Hao) for her support

### Discussions
<!-- {{DISCUSSION_LINKS_LIST}} -->
<!-- @ OPTIONAL @ -->
<!-- Any links to discussions, for example on Discourse or other fora or blogs can be included here -->

* Discourse forum (zh-TW): https://discourse.mozilla.org/c/voice/zh-tw/286
* Related news: https://hackmd.io/@moztw/common-voice-news

### Contribute
* [Speak](https://commonvoice.mozilla.org/bnn/speak)
* [Write](https://commonvoice.mozilla.org/bnn/write)
* [Listen](https://commonvoice.mozilla.org/bnn/listen)
* [Review](https://commonvoice.mozilla.org/bnn/review)

<!-- {{CONTRIBUTE_LINKS_LIST}} -->
<!-- Here you can include links for how to contribute to the dataset -->

@@ -160,6 +196,8 @@ information.
<!-- {{DATASHEET_AUTHORS_LIST}} -->
<!-- A list in the format of: Your Name <[email protected]> -->

- Irvin Chen (MozTW Community Contact) <[email protected]>

### Citation guidelines
<!-- {{CITATION_DESCRIPTION}} -->
<!-- @ OPTIONAL @ -->
47 changes: 43 additions & 4 deletions cv-corpus/scs/23.0-2025-09-05/final/en/dru.md
@@ -2,24 +2,37 @@
> This datasheet has been generated automatically, we would love to include more information, if you would like to help out, [get in touch](https://github.com/common-voice/common-voice/blob/main/docs/COMMUNITIES.md)!

This datasheet is for version 23.0 of the Mozilla Common Voice *Scripted Speech* dataset
for Rukai (`dru`). The dataset contains 6692 clips representing 11 hours of recorded
for Rukai (`dru`) (including Teldreka and 'Oponoho). The dataset contains 6692 clips representing 11 hours of recorded
speech (11 hours validated) from 20 speakers.

This dataset includes 20 speakers from the Rukai community recruited with support from Payuan Classic Studio, as well as speakers from the Teldreka and Oponoho communities. The recordings cover the Indigenous Languages Curriculum (K-12) textbook materials, levels 1 to 9.

## Language
<!-- {{LANGUAGE_DESCRIPTION}} -->
<!-- Provide a brief (1-2 paragraph) description of your language -->
Rukai (Drekay), together with Teldreka and Oponoho, is one of the Indigenous languages of the Rukai people in Taiwan.

### Variants
<!-- {{VARIANT_DESCRIPTION}} -->
<!-- @ OPTIONAL @ -->
<!-- Describe the variants (MCV variants) of your language -->
This speech corpus includes the following dialect groups:

- Eastern Rukai (`taromak`)
- Wutai Rukai (`veday`)
- Dawu Rukai (`labuane`)
- Oponoho (`oponoho`)
- Teldreka (`teldreka`)

## Demographic information
The dataset includes the following distribution of age and gender.
<!-- You can get a lot of the information in this section from https://analyzer.cv-toolbox.web.tr/browse -->

### Gender
Self-declared gender information, percentage refers to the number of clips annotated with this gender.

(The MozTW / Wikimedia Taiwan Indigenous language recording project in early 2025 did not collect this information; therefore, these figures may be relatively inaccurate.)

| Gender | Percentage |
|-|-|
| Female Feminine | 12.0% |
@@ -34,6 +47,9 @@ Self-declared gender information, percentage refers to the number of clips annot

### Age
Self-declared age information, percentage refers to the number of clips annotated with this age band.

(The MozTW / Wikimedia Taiwan Indigenous language recording project in early 2025 did not collect this information; therefore, these figures may be relatively inaccurate.)

| Age Band | Percentage |
|-|-|
| Undefined | 88.0% |
@@ -91,6 +107,14 @@ Taiwan ka ’iyakay nyani madraw na egeege?
<!-- @ OPTIONAL @ -->
<!-- A list of sentence sources, can be curated to the top-N -->

The recording texts are taken from the Indigenous Languages Curriculum (K-12) textbook content (in romanization) for levels 1 to 9, uploaded by Wikimedia Taiwan under authorization from the K-12 Education Administration, Ministry of Education (Taiwan, ROC): https://www.k12ea.gov.tw. Special thanks to Deputy Minister Ping-Cheng Yeh for facilitating the authorization.

Some Wutai Rukai (`veday`) texts are provided by the NTU Corpus of Formosan Languages (Graduate Institute of Linguistics, National Taiwan University): https://corpus.linguistics.ntu.edu.tw/ (thanks to Prof. Li-May Sung for assistance).

For Tona, Oponoho and Teldreka, 59 selected verses each from the Gospel of Mark (see https://cb.fhl.net) are included, with authorization from The Bible Society in Taiwan.

During the recording project, we noticed certain semantic mismatches and typographical/spelling issues in parts of the text. Due to Common Voice system constraints, these were not corrected in advance and recordings proceeded as-is. Recorders and textbook providers collaborated closely; this note is provided for transparency.

### Text domains
| Domain | Count |
|-|-|
@@ -134,7 +158,20 @@ information.
## Get involved!

### Community links
* [Common Voice translators on Pontoon](https://pontoon.mozilla.org/dru/common-voice/contributors/)

MozTW (Mozilla Taiwan) Common Voice project site: https://moztw.org/commonvoice/

For questions, suggestions, outreach, donating text, or collaboration, please reach out via:

- Telegram group: https://t.me/+gvmHEcAtd-IwNzFl
- Line group: https://line.me/ti/g/_PLyjCSe_8

Communities involved in the 2025 Indigenous language recording project:

- Wikimedia Taiwan: https://www.facebook.com/wikimedia.tw
- Payuan Classic: https://www.facebook.com/PayuanClassic/
- Special thanks to Kuliw from Payuan Classic for recruitment and recording support
<!-- {{COMMUNITY_LINKS_LIST}} -->
<!-- @ OPTIONAL @ -->
<!-- Links to community chats / fora -->
@@ -144,11 +181,12 @@ information.
<!-- @ OPTIONAL @ -->
<!-- Any links to discussions, for example on Discourse or other fora or blogs can be included here -->

* Discourse forum (zh-TW): https://discourse.mozilla.org/c/voice/zh-tw/286
* Related news: https://hackmd.io/@moztw/common-voice-news

### Contribute
* [Speak](https://commonvoice.mozilla.org/dru/speak)
* [Write](https://commonvoice.mozilla.org/dru/write)
* [Listen](https://commonvoice.mozilla.org/dru/listen)
* [Review](https://commonvoice.mozilla.org/dru/review)
<!-- {{CONTRIBUTE_LINKS_LIST}} -->
<!-- Here you can include links for how to contribute to the dataset -->

@@ -157,6 +195,7 @@ information.
### Datasheet authors
<!-- {{DATASHEET_AUTHORS_LIST}} -->
<!-- A list in the format of: Your Name <[email protected]> -->
- Irvin Chen (MozTW Community Contact) <[email protected]>

### Citation guidelines
<!-- {{CITATION_DESCRIPTION}} -->
51 changes: 47 additions & 4 deletions cv-corpus/scs/23.0-2025-09-05/final/en/nan-tw.md
@@ -5,15 +5,29 @@
for Taiwanese (Minnan) (`nan-tw`). The dataset contains 31951 clips representing 24 hours of recorded
speech (21 hours validated) from 290 speakers.

Please note: this dataset pairs speech recordings with text written in Han characters.
The text corpus primarily uses Han characters, with bracketed TL/POJ romanization given as a reference pronunciation.

The corpus and speakers are primarily contributed by individual volunteers in Taiwan.

## Language
<!-- {{LANGUAGE_DESCRIPTION}} -->
<!-- Provide a brief (1-2 paragraph) description of your language -->

Taiwanese (POJ: Tâi-oân-ōe; TL: Tâi-uân-uē), also known as Tai-gi (台語/臺語) or Taiwanese Minnan (臺灣閩南語), is spoken across Taiwan and the Penghu archipelago, and is one of Taiwan’s national languages.

### Variants
<!-- {{VARIANT_DESCRIPTION}} -->
<!-- @ OPTIONAL @ -->
<!-- Describe the variants (MCV variants) of your language -->

Starting from v23.0, the Taiwanese (`nan-tw`) locale lets contributors optionally select one of the following writing variants; most existing texts, however, still mix both systems:

- POJ (`pehoeji`)
- TL (`tailo`)

If you’d like to help curate or update the existing texts, please see the Community section below.

## Demographic information
The dataset includes the following distribution of age and gender.
<!-- You can get a lot of the information in this section from https://analyzer.cv-toolbox.web.tr/browse -->
@@ -70,6 +84,12 @@ The text corpus contains `27277` sentences, of which `26907` are validated, `370
<!-- @ OPTIONAL @ -->
<!-- An overview of the text corpus, with information such as average length (in characters and words) of validated sentences. -->

Most of the text corpus was curated in the MozTW CC0 Sentences repository: https://github.com/moztw/cc0-sentences, primarily contributed by MozTW and g0v community members.

Due to the lack of publicly licensed Taiwanese sentences, recordings in the Taiwanese locale are currently dominated by single-word prompts.

We welcome donations of everyday sentences written in Taiwanese. Please reach out via the community channels below.

### Writing system
<!-- {{WRITING_SYSTEM_DESCRIPTION}} -->
<!-- @ OPTIONAL @ -->
@@ -97,6 +117,10 @@ There follows a randomly selected sample of five sentences from the corpus.
<!-- @ OPTIONAL @ -->
<!-- A list of sentence sources, can be curated to the top-N -->

The text corpus is built by the Mozilla Taiwan community, the g0v community, and other open-source contributors.

Earlier Taiwanese entries were primarily sourced from the “2016-itaigi Mandarin–Taiwanese Dictionary”. See Sources and Licensing at: https://github.com/moztw/cc0-sentences/tree/master/nan-TW#%E8%B3%87%E6%96%99%E4%BE%86%E6%BA%90%E8%88%87%E6%8E%88%E6%AC%8A

### Text domains
| Domain | Count |
|-|-|
@@ -111,6 +135,10 @@ There follows a randomly selected sample of five sentences from the corpus.
<!-- @ OPTIONAL @ -->
<!-- What text domains are represented in the corpus? -->

Due to the lack of publicly licensed sentence data, recordings in Taiwanese remain predominantly single words.

We welcome more everyday sentences. If you would like to donate your own text (e.g., your original writing), please contact us via the community links below.

### Processing
<!-- {{PROCESSING_DESCRIPTION}} -->
<!-- @ OPTIONAL @ -->
@@ -121,6 +149,10 @@ There follows a randomly selected sample of five sentences from the corpus.
<!-- @ OPTIONAL @ -->
<!-- What should people do before they use the data, for example Unicode normalisation -->

Because the bracketed romanization is for reference only and a) mixes TL and POJ systems, and b) does not consistently mark all tones, it cannot be treated as an authoritative phonetic annotation for the recordings.

We recommend removing bracketed pronunciations `()` and using only the Han character portion when pre-processing the text.
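
The recommendation above could be applied with a small helper like the one below. This is only a sketch under assumptions: the function name `strip_romanization` is hypothetical, the sample prompt is made up for illustration rather than taken from the corpus, and handling full-width brackets `（）` in addition to `()` is an extra assumption not stated in this datasheet.

```python
import re

# One bracketed span, ASCII "(...)" or full-width "（...）", plus any
# whitespace directly before it. Full-width handling is an assumption;
# the datasheet only mentions `()`. Nested brackets are not handled.
_BRACKETED = re.compile(r"\s*[(（][^()（）]*[)）]")


def strip_romanization(sentence: str) -> str:
    """Return the sentence with bracketed pronunciations removed."""
    cleaned = _BRACKETED.sub("", sentence)
    # Collapse any leftover runs of whitespace.
    return re.sub(r"\s+", " ", cleaned).strip()


# Hypothetical prompt for illustration only, not taken from the corpus.
print(strip_romanization("食飯(tsia̍h-pn̄g)"))
```

Sentences without brackets pass through unchanged, so the helper can be applied to every row of the `sentence` text before training.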

### Fields
Each row of a `tsv` file represents a single audio clip, and contains the following information:

@@ -144,22 +176,30 @@ information.
## Get involved!

### Community links
* [Common Voice translators on Pontoon](https://pontoon.mozilla.org/nan-tw/common-voice/contributors/)
* [Original language request on GitHub](https://github.com/common-voice/common-voice/issues/3194)
<!-- {{COMMUNITY_LINKS_LIST}} -->
<!-- @ OPTIONAL @ -->
<!-- Links to community chats / fora -->

MozTW (Mozilla Taiwan) Common Voice project site: https://moztw.org/commonvoice/

For questions, suggestions, outreach, donating text, or collaboration, please reach out via:

- Telegram group: https://t.me/+gvmHEcAtd-IwNzFl
- Line group: https://line.me/ti/g/_PLyjCSe_8

### Discussions
<!-- {{DISCUSSION_LINKS_LIST}} -->
<!-- @ OPTIONAL @ -->
<!-- Any links to discussions, for example on Discourse or other fora or blogs can be included here -->

* Discourse forum (zh-TW): https://discourse.mozilla.org/c/voice/zh-tw/286
* Related news: https://hackmd.io/@moztw/common-voice-news

### Contribute
* [Speak](https://commonvoice.mozilla.org/nan-tw/speak)
* [Write](https://commonvoice.mozilla.org/nan-tw/write)
* [Listen](https://commonvoice.mozilla.org/nan-tw/listen)
* [Review](https://commonvoice.mozilla.org/nan-tw/review)
* Donate your sentences — If you would like to donate text you own (e.g., original writing) for recording, please contact Irvin ([email protected]) or discuss in the Line/Telegram groups above.

<!-- {{CONTRIBUTE_LINKS_LIST}} -->
<!-- Here you can include links for how to contribute to the dataset -->

@@ -169,6 +209,9 @@ information.
<!-- {{DATASHEET_AUTHORS_LIST}} -->
<!-- A list in the format of: Your Name <[email protected]> -->

- Irvin Chen (MozTW Community Contact) <[email protected]>
- Dennis Chen (Common Voice Community Facilitator, Wikimedia Taiwan) <[email protected]>

### Citation guidelines
<!-- {{CITATION_DESCRIPTION}} -->
<!-- @ OPTIONAL @ -->