diff --git a/cv-corpus/scs/23.0-2025-09-05/final/en/bnn.md b/cv-corpus/scs/23.0-2025-09-05/final/en/bnn.md index c8b9179c..0f916b2a 100644 --- a/cv-corpus/scs/23.0-2025-09-05/final/en/bnn.md +++ b/cv-corpus/scs/23.0-2025-09-05/final/en/bnn.md @@ -5,21 +5,35 @@ for Bunun (`bnn`). The dataset contains 8220 clips representing 12 hours of recorded speech (11 hours validated) from 18 speakers. +The recordings cover the Indigenous Languages Curriculum (K-12) textbook materials, levels 1 to 9. + ## Language +Bunun is the language of the Bunun people, one of the Indigenous peoples of Taiwan. ### Variants +This speech corpus includes the following dialect groups: + +- Takituduh (`takitudu`) +- Takibakha (`bakha`) +- Isbukun (`bubukun`) +- Takbanuaz (`banuaz`) +- Takivatan (`vatan`) + ## Demographic information The dataset includes the following distribution of age and gender. ### Gender Self-declared gender information, percentage refers to the number of clips annotated with this gender. + +(The MozTW / Wikimedia Taiwan Indigenous language recording project in early 2025 did not collect this information; therefore, these figures may be relatively inaccurate.) + | Gender | Pertentage | |-|-| | Undefined | 18.0% | @@ -36,6 +50,9 @@ Self-declared gender information, percentage refers to the number of clips annot ### Age Self-declared age information, percentage refers to the number of clips annotated with this age band. + +(The MozTW / Wikimedia Taiwan Indigenous language recording project in early 2025 did not collect this information; therefore, these figures may be relatively inaccurate.) + | Age Band | Percentage | |-|-| | Undefined | 60.0% | @@ -94,6 +111,12 @@ Asa tu makadim mas haini-anan, aupa kadukuun laupakadau a hanian +The recording texts are taken from the Indigenous Languages Curriculum (K-12) textbook content (in romanization) for levels 1 to 9.
They were uploaded by Wikimedia Taiwan under authorization from the K-12 Education Administration, Ministry of Education (Taiwan, ROC) at https://www.k12ea.gov.tw. Special thanks to Deputy Minister Ping-Cheng Yeh for facilitating the authorization. + +For the Isbukun dialect, 118 selected verses from the Bunun Bible (see https://cb.fhl.net) are also included, thanks to authorization by The Bible Society in Taiwan. + +During the recording project, we noticed that some text contained semantic mismatches or typographical/spelling mistakes. Due to Common Voice system constraints, these issues could not be fixed in advance and recordings proceeded as-is. Recorders and textbook providers collaborated closely; we note this for transparency. + ### Text domains | Domain | Count | |-|-| @@ -136,21 +159,34 @@ information. ## Get involved! ### Community links -* [Common Voice translators on Pontoon](https://pontoon.mozilla.org/bnn/common-voice/contributors/) +MozTW (Mozilla Taiwan) Common Voice project site: https://moztw.org/commonvoice/ + +For questions, suggestions, outreach, donating text, or collaboration, please reach out via: + +- Telegram group: https://t.me/+gvmHEcAtd-IwNzFl +- Line group: https://line.me/ti/g/_PLyjCSe_8 + +Communities involved in the 2025 Indigenous language recording project: + +- Wikimedia Taiwan: https://www.facebook.com/wikimedia.tw +- Special thanks to Bunun teacher Aping (Wu A-Hao) for her support + ### Discussions +* Discourse forum (zh-TW): https://discourse.mozilla.org/c/voice/zh-tw/286 +* Related news: https://hackmd.io/@moztw/common-voice-news + ### Contribute * [Speak](https://commonvoice.mozilla.org/bnn/speak) -* [Write](https://commonvoice.mozilla.org/bnn/write) * [Listen](https://commonvoice.mozilla.org/bnn/listen) -* [Review](https://commonvoice.mozilla.org/bnn/review) + @@ -160,6 +196,8 @@ information. 
+- Irvin Chen (MozTW Community Contact) + ### Citation guidelines diff --git a/cv-corpus/scs/23.0-2025-09-05/final/en/dru.md b/cv-corpus/scs/23.0-2025-09-05/final/en/dru.md index 01844e90..fb9e58cb 100644 --- a/cv-corpus/scs/23.0-2025-09-05/final/en/dru.md +++ b/cv-corpus/scs/23.0-2025-09-05/final/en/dru.md @@ -2,17 +2,27 @@ > This datasheet has been generated automatically, we would love to include more information, if you would like to help out, [get in touch](https://github.com/common-voice/common-voice/blob/main/docs/COMMUNITIES.md)! This datasheet is for version 23.0 of the the Mozilla Common Voice *Scripted Speech* dataset -for Rukai (`dru`). The dataset contains 6692 clips representing 11 hours of recorded +for Rukai (`dru`) (including Teldreka and 'Oponoho). The dataset contains 6692 clips representing 11 hours of recorded speech (11 hours validated) from 20 speakers. +This dataset includes 20 speakers from the Rukai community recruited with support from Payuan Classic Studio, as well as speakers from the Teldreka and Oponoho communities. The recordings cover the Indigenous Languages Curriculum (K-12) textbook materials, levels 1 to 9. + ## Language +Rukai (Drekay), including Teldreka and Oponoho, are Indigenous languages of the Rukai people in Taiwan. ### Variants +This speech corpus includes the following dialect groups: + +- Eastern Rukai (`taromak`) +- Wutai Rukai (`veday`) +- Dawu Rukai (`labuane`) +- Oponoho (`oponoho`) +- Teldreka (`teldreka`) ## Demographic information The dataset includes the following distribution of age and gender. @@ -20,6 +30,9 @@ The dataset includes the following distribution of age and gender. ### Gender Self-declared gender information, percentage refers to the number of clips annotated with this gender. + +(The MozTW / Wikimedia Taiwan Indigenous language recording project in early 2025 did not collect this information; therefore, these figures may be relatively inaccurate.)
+ | Gender | Pertentage | |-|-| | Female Feminine | 12.0% | @@ -34,6 +47,9 @@ Self-declared gender information, percentage refers to the number of clips annot ### Age Self-declared age information, percentage refers to the number of clips annotated with this age band. + +(The MozTW / Wikimedia Taiwan Indigenous language recording project in early 2025 did not collect this information; therefore, these figures may be relatively inaccurate.) + | Age Band | Percentage | |-|-| | Undefined | 88.0% | @@ -91,6 +107,14 @@ Taiwan ka ’iyakay nyani madraw na egeege? +The recording texts are taken from the Indigenous Languages Curriculum (K-12) textbook content (in romanization) for levels 1 to 9, uploaded by Wikimedia Taiwan under authorization from the K-12 Education Administration, Ministry of Education (Taiwan, ROC): https://www.k12ea.gov.tw. Special thanks to Deputy Minister Ping-Cheng Yeh for facilitating the authorization. + +Some Wutai Rukai (`veday`) texts are provided by the NTU Corpus of Formosan Languages (Graduate Institute of Linguistics, National Taiwan University): https://corpus.linguistics.ntu.edu.tw/ (thanks to Prof. Li-May Sung for assistance). + +For Tona, Oponoho and Teldreka, 59 selected verses each from the Gospel of Mark (see https://cb.fhl.net) are included, with authorization from The Bible Society in Taiwan. + +During the recording project, we noticed certain semantic mismatches and typographical/spelling issues in parts of the text. Due to Common Voice system constraints, these were not corrected in advance and recordings proceeded as-is. Recorders and textbook providers collaborated closely; this note is provided for transparency. + ### Text domains | Domain | Count | |-|-| @@ -134,7 +158,20 @@ information. ## Get involved! 
### Community links -* [Common Voice translators on Pontoon](https://pontoon.mozilla.org/dru/common-voice/contributors/) + +MozTW (Mozilla Taiwan) Common Voice project site: https://moztw.org/commonvoice/ + +For questions, suggestions, outreach, donating text, or collaboration, please reach out via: + +- Telegram group: https://t.me/+gvmHEcAtd-IwNzFl +- Line group: https://line.me/ti/g/_PLyjCSe_8 + +Communities involved in the 2025 Indigenous language recording project: + +- Wikimedia Taiwan: https://www.facebook.com/wikimedia.tw +- Payuan Classic: https://www.facebook.com/PayuanClassic/ +- Special thanks to Kuliw from Payuan Classic for recruitment and recording support +- @@ -144,11 +181,12 @@ information. +* Discourse forum (zh-TW): https://discourse.mozilla.org/c/voice/zh-tw/286 +* Related news: https://hackmd.io/@moztw/common-voice-news + ### Contribute * [Speak](https://commonvoice.mozilla.org/dru/speak) -* [Write](https://commonvoice.mozilla.org/dru/write) * [Listen](https://commonvoice.mozilla.org/dru/listen) -* [Review](https://commonvoice.mozilla.org/dru/review) @@ -157,6 +195,7 @@ information. ### Datasheet authors + - Irvin Chen (MozTW Community Contact) ### Citation guidelines diff --git a/cv-corpus/scs/23.0-2025-09-05/final/en/nan-tw.md b/cv-corpus/scs/23.0-2025-09-05/final/en/nan-tw.md index a0e997b5..d8ded7a3 100644 --- a/cv-corpus/scs/23.0-2025-09-05/final/en/nan-tw.md +++ b/cv-corpus/scs/23.0-2025-09-05/final/en/nan-tw.md @@ -5,15 +5,29 @@ for Taiwanese (Minnan) (`nan-tw`). The dataset contains 31951 clips representing 24 hours of recorded speech (21 hours validated) from 290 speakers. +Please note: This speech dataset is a Han character–speech dataset. +The text corpus primarily uses Han characters, with bracketed TL/POJ romanization as reference pronunciations. + +The corpus and speakers are primarily contributed by individual volunteers in Taiwan. 
+ ## Language +Taiwanese (POJ: Tâi-oân-ōe; TL: Tâi-uân-uē), also known as Tai-gi (台語/臺語) or Taiwanese Minnan (臺灣閩南語), is spoken across Taiwan and the Penghu archipelago, and is one of Taiwan’s national languages. + ### Variants +Starting from v23.0, the Taiwanese (nan-tw) locale allows (optional) selection of the following writing variants; however, most texts are still a mixture of both systems: + +- POJ (`pehoeji`) +- TL (`tailo`) + +If you’d like to help curate or update the existing texts, please see the Community section below. + ## Demographic information The dataset includes the following distribution of age and gender. @@ -70,6 +84,12 @@ The text corpus contains `27277` sentences, of which `26907` are validated, `370 +Most of the text corpus was curated in the MozTW CC0 Sentences repository: https://github.com/moztw/cc0-sentences, primarily contributed by MozTW and g0v community members. + +Due to the lack of publicly licensed Taiwanese sentences, recordings in the Taiwanese locale are currently dominated by single-word prompts. + +We welcome donations of everyday sentences written in Taiwanese. Please reach out via the community channels below. + ### Writing system @@ -97,6 +117,10 @@ There follows a randomly selected sample of five sentences from the corpus. +The text corpus is built by Mozilla Taiwan community, the g0v community, and other open-source contributors. + +Earlier Taiwanese entries were primarily sourced from the “2016-itaigi Mandarin–Taiwanese Dictionary”. See Sources and Licensing at: https://github.com/moztw/cc0-sentences/tree/master/nan-TW#%E8%B3%87%E6%96%99%E4%BE%86%E6%BA%90%E8%88%87%E6%8E%88%E6%AC%8A + ### Text domains | Domain | Count | |-|-| @@ -111,6 +135,10 @@ There follows a randomly selected sample of five sentences from the corpus. +Due to the lack of publicly licensed sentence data, recordings in Taiwanese remain predominantly single words. + +We welcome more everyday sentences. 
If you would like to donate your own text (e.g., your original writing), please contact us via the community links below. + ### Processing @@ -121,6 +149,10 @@ There follows a randomly selected sample of five sentences from the corpus. +Because the bracketed romanization is for reference only and a) mixes TL and POJ systems, and b) does not consistently mark all tones, it cannot be treated as an authoritative phonetic annotation for the recordings. + +We recommend removing bracketed pronunciations `()` and using only the Han character portion when pre-processing the text. + ### Fields Each row of a `tsv` file represents a single audio clip, and contains the following information: @@ -144,22 +176,30 @@ information. ## Get involved! ### Community links -* [Common Voice translators on Pontoon](https://pontoon.mozilla.org/nan-tw/common-voice/contributors/) -* [Original language request on GitHub](https://github.com/common-voice/common-voice/issues/3194) +MozTW (Mozilla Taiwan) Common Voice project site: https://moztw.org/commonvoice/ + +For questions, suggestions, outreach, donating text, or collaboration, please reach out via: + +- Telegram group: https://t.me/+gvmHEcAtd-IwNzFl +- Line group: https://line.me/ti/g/_PLyjCSe_8 + ### Discussions +* Discourse forum (zh-TW): https://discourse.mozilla.org/c/voice/zh-tw/286 +* Related news: https://hackmd.io/@moztw/common-voice-news + ### Contribute * [Speak](https://commonvoice.mozilla.org/nan-tw/speak) -* [Write](https://commonvoice.mozilla.org/nan-tw/write) * [Listen](https://commonvoice.mozilla.org/nan-tw/listen) -* [Review](https://commonvoice.mozilla.org/nan-tw/review) +* Donate your sentences — If you would like to donate text you own (e.g., original writing) for recording, please contact Irvin (irvin@moztw.org) or discuss in the Line/Telegram groups above. + @@ -169,6 +209,9 @@ information. 
+ - Irvin Chen (MozTW Community Contact) + - Dennis Chen (Common Voice Community Facilitator, Wikimedia Taiwan) + ### Citation guidelines diff --git a/cv-corpus/scs/23.0-2025-09-05/final/en/pwn.md b/cv-corpus/scs/23.0-2025-09-05/final/en/pwn.md index ec404878..8ba258a7 100644 --- a/cv-corpus/scs/23.0-2025-09-05/final/en/pwn.md +++ b/cv-corpus/scs/23.0-2025-09-05/final/en/pwn.md @@ -5,14 +5,23 @@ for Paiwan (`pwn`). The dataset contains 10938 clips representing 15 hours of recorded speech (15 hours validated) from 27 speakers. +This dataset includes 27 speakers recruited with the support of Payuan Classic Studio. The recordings cover the Indigenous Languages Curriculum (K-12) textbook materials, levels 1 to 9. + ## Language +Paiwan (Pinayuanan) is an Indigenous language of the Paiwan people in Taiwan. ### Variants +This speech corpus includes the following dialect groups: + +- Central Paiwan (`central`) +- Eastern Paiwan (`eastern`) +- Northern Paiwan (`northern`) +- Southern Paiwan (`southern`) ## Demographic information The dataset includes the following distribution of age and gender. @@ -20,6 +29,9 @@ The dataset includes the following distribution of age and gender. ### Gender Self-declared gender information, percentage refers to the number of clips annotated with this gender. + +(The MozTW / Wikimedia Taiwan Indigenous language recording project in early 2025 did not collect this information; therefore, these figures may be relatively inaccurate.) + | Gender | Pertentage | |-|-| | Undefined | 49.0% | @@ -36,6 +48,9 @@ Self-declared gender information, percentage refers to the number of clips annot ### Age Self-declared age information, percentage refers to the number of clips annotated with this age band. + +(The MozTW / Wikimedia Taiwan Indigenous language recording project in early 2025 did not collect this information; therefore, these figures may be relatively inaccurate.)
+ | Age Band | Percentage | |-|-| | Undefined | 29.0% | @@ -94,6 +109,9 @@ keman a ken tu tjanu ita tua udung +The recording texts are taken from the Indigenous Languages Curriculum (K-12) textbook content (in romanization) for levels 1 to 9, uploaded by Wikimedia Taiwan under authorization from the K-12 Education Administration, Ministry of Education (Taiwan, ROC): https://www.k12ea.gov.tw. Special thanks to Deputy Minister Ping-Cheng Yeh for facilitating the authorization. + +During the recording project, we noticed certain semantic mismatches and typographical/spelling issues in parts of the text. Due to Common Voice system constraints, these were not corrected in advance and recordings proceeded as-is. Recorders and textbook providers collaborated closely; this note is provided for transparency. ### Text domains | Domain | Count | @@ -136,7 +154,18 @@ information. ## Get involved! ### Community links -* [Common Voice translators on Pontoon](https://pontoon.mozilla.org/pwn/common-voice/contributors/) +MozTW (Mozilla Taiwan) Common Voice project site: https://moztw.org/commonvoice/ + +For questions, suggestions, outreach, donating text, or collaboration, please reach out via: + +- Telegram group: https://t.me/+gvmHEcAtd-IwNzFl +- Line group: https://line.me/ti/g/_PLyjCSe_8 + +Communities involved in the 2025 Indigenous language recording project: + +- Wikimedia Taiwan: https://www.facebook.com/wikimedia.tw +- Payuan Classic: https://www.facebook.com/PayuanClassic/ +- Special thanks to Kuliw for recruitment and recording support @@ -145,12 +174,13 @@ information. +* Discourse forum (zh-TW): https://discourse.mozilla.org/c/voice/zh-tw/286 +* Related news: https://hackmd.io/@moztw/common-voice-news ### Contribute * [Speak](https://commonvoice.mozilla.org/pwn/speak) -* [Write](https://commonvoice.mozilla.org/pwn/write) * [Listen](https://commonvoice.mozilla.org/pwn/listen) -* [Review](https://commonvoice.mozilla.org/pwn/review) + @@ -159,6 +189,7 @@ information. 
### Datasheet authors + - Irvin Chen (MozTW Community Contact) ### Citation guidelines diff --git a/cv-corpus/scs/23.0-2025-09-05/final/en/szy.md b/cv-corpus/scs/23.0-2025-09-05/final/en/szy.md index 9fba6b22..b5a77776 100644 --- a/cv-corpus/scs/23.0-2025-09-05/final/en/szy.md +++ b/cv-corpus/scs/23.0-2025-09-05/final/en/szy.md @@ -5,9 +5,12 @@ for Sakizaya (`szy`). The dataset contains 9643 clips representing 15 hours of recorded speech (14 hours validated) from 26 speakers. +This dataset includes 26 speakers from the Hualien Sakizaya Wikimedia Association. The recordings cover the Indigenous Languages Curriculum (K-12) textbook materials, levels 1 to 9. + ## Language +Sakizaya is the language of the Sakizaya people, an Indigenous people of Taiwan. ### Variants @@ -20,6 +23,9 @@ The dataset includes the following distribution of age and gender. ### Gender Self-declared gender information, percentage refers to the number of clips annotated with this gender. + +(The MozTW / Wikimedia Taiwan Indigenous language recording project in early 2025 did not collect this information; therefore, these figures may be relatively inaccurate.) + | Gender | Pertentage | |-|-| | Undefined | 26.0% | @@ -35,6 +41,9 @@ Self-declared gender information, percentage refers to the number of clips annot ### Age Self-declared age information, percentage refers to the number of clips annotated with this age band. + +(The MozTW / Wikimedia Taiwan Indigenous language recording project in early 2025 did not collect this information; therefore, these figures may be relatively inaccurate.) + | Age Band | Percentage | |-|-| | Undefined | 49.0% | @@ -95,6 +104,11 @@ micakay kaku tu sapisulit atu sasulitan +The recording texts are taken from the Indigenous Languages Curriculum (K-12) textbook content (in romanization) for levels 1 to 9, uploaded by Wikimedia Taiwan under authorization from the K-12 Education Administration, Ministry of Education (Taiwan, ROC): https://www.k12ea.gov.tw.
Special thanks to Deputy Minister Ping-Cheng Yeh for facilitating the authorization. + +A small portion of the texts is provided by the NTU Corpus of Formosan Languages (Graduate Institute of Linguistics, National Taiwan University): https://corpus.linguistics.ntu.edu.tw/ (thanks to Prof. Li-May Sung for assistance). + +During the recording project, we noticed certain semantic mismatches and typographical/spelling issues in parts of the text. Due to Common Voice system constraints, these were not corrected in advance and recordings proceeded as-is. Recorders and textbook providers collaborated closely; this note is provided for transparency. ### Text domains | Domain | Count | @@ -139,7 +153,17 @@ information. ## Get involved! ### Community links -* [Common Voice translators on Pontoon](https://pontoon.mozilla.org/szy/common-voice/contributors/) +MozTW (Mozilla Taiwan) Common Voice project site: https://moztw.org/commonvoice/ + +For questions, suggestions, outreach, donating text, or collaboration, please reach out via: + +- Telegram group: https://t.me/+gvmHEcAtd-IwNzFl +- Line group: https://line.me/ti/g/_PLyjCSe_8 + +Communities involved in the 2025 Indigenous language recording project: + +- Wikimedia Taiwan: https://www.facebook.com/wikimedia.tw +- Hualien County Sakizaya Wikimedia Association @@ -148,12 +172,12 @@ information. +* Discourse forum (zh-TW): https://discourse.mozilla.org/c/voice/zh-tw/286 +* Related news: https://hackmd.io/@moztw/common-voice-news ### Contribute * [Speak](https://commonvoice.mozilla.org/szy/speak) -* [Write](https://commonvoice.mozilla.org/szy/write) * [Listen](https://commonvoice.mozilla.org/szy/listen) -* [Review](https://commonvoice.mozilla.org/szy/review) @@ -162,6 +186,7 @@ information. 
### Datasheet authors + - Irvin Chen (MozTW Community Contact) ### Citation guidelines diff --git a/cv-corpus/scs/23.0-2025-09-05/final/en/tay.md b/cv-corpus/scs/23.0-2025-09-05/final/en/tay.md index 544e2d84..7492c664 100644 --- a/cv-corpus/scs/23.0-2025-09-05/final/en/tay.md +++ b/cv-corpus/scs/23.0-2025-09-05/final/en/tay.md @@ -5,14 +5,25 @@ for Atayal (`tay`). The dataset contains 7857 clips representing 13 hours of recorded speech (12 hours validated) from 18 speakers. +This dataset includes 18 speakers from Atayal language promotion organizations. The recordings cover the Indigenous Languages Curriculum (K-12) textbook materials, levels 1 to 9. + ## Language +Atayal (Tayal) is the language of the Atayal people, an Indigenous people of Taiwan. ### Variants +This speech corpus includes the following dialect groups: + +- Squliq Atayal (`squliq`) +- Sʼuli Atayal (`ciuli`) +- Klesan Atayal (`klesan`) +- Skikun Atayal (`cquliq`) +- Matuʼuwal Atayal (`matuuwal`) +- Plngawan Atayal (`pingawan`) ## Demographic information The dataset includes the following distribution of age and gender. @@ -20,6 +31,9 @@ The dataset includes the following distribution of age and gender. ### Gender Self-declared gender information, percentage refers to the number of clips annotated with this gender. + +(The MozTW / Wikimedia Taiwan Indigenous language recording project in early 2025 did not collect this information; therefore, these figures may be relatively inaccurate.) + | Gender | Pertentage | |-|-| | Undefined | 22.0% | @@ -35,6 +49,9 @@ Self-declared gender information, percentage refers to the number of clips annot ### Age Self-declared age information, percentage refers to the number of clips annotated with this age band. + +(The MozTW / Wikimedia Taiwan Indigenous language recording project in early 2025 did not collect this information; therefore, these figures may be relatively inaccurate.)
+ | Age Band | Percentage | |-|-| | Undefined | 33.0% | @@ -94,6 +111,11 @@ bali nanak ku’ muwani iwana makibaq ’i’ matiq +The recording texts are taken from the Indigenous Languages Curriculum (K-12) textbook content (in romanization) for levels 1 to 9, uploaded by Wikimedia Taiwan under authorization from the K-12 Education Administration, Ministry of Education (Taiwan, ROC): https://www.k12ea.gov.tw. Special thanks to Deputy Minister Ping-Cheng Yeh for facilitating the authorization. + +For Squliq Atayal, 115 selected verses from the Tayal Bible (see https://cb.fhl.net) are included, with authorization from The Bible Society in Taiwan. + +During the recording project, we noticed certain semantic mismatches and typographical/spelling issues in parts of the text. Due to Common Voice system constraints, these were not corrected in advance and recordings proceeded as-is. Recorders and textbook providers collaborated closely; this note is provided for transparency. ### Text domains | Domain | Count | @@ -137,7 +159,18 @@ information. ## Get involved! ### Community links -* [Common Voice translators on Pontoon](https://pontoon.mozilla.org/tay/common-voice/contributors/) +MozTW (Mozilla Taiwan) Common Voice project site: https://moztw.org/commonvoice/ + +For questions, suggestions, outreach, donating text, or collaboration, please reach out via: + +- Telegram group: https://t.me/+gvmHEcAtd-IwNzFl +- Line group: https://line.me/ti/g/_PLyjCSe_8 + +Communities involved in the 2025 Indigenous language recording project: + +- Wikimedia Taiwan: https://www.facebook.com/wikimedia.tw +- Atayal language promotion organizations: https://www.kenatayal.org.tw and https://www.facebook.com/p/%E6%B3%B0%E9%9B%85%E6%97%8F%E8%AA%9E%E6%8E%A8%E5%8B%95%E7%B5%84%E7%B9%94-100064743246737/ +- Special thanks to teacher Sugiy‧Tosi for assistance @@ -146,12 +179,12 @@ information. 
+* Discourse forum (zh-TW): https://discourse.mozilla.org/c/voice/zh-tw/286 +* Related news: https://hackmd.io/@moztw/common-voice-news ### Contribute * [Speak](https://commonvoice.mozilla.org/tay/speak) -* [Write](https://commonvoice.mozilla.org/tay/write) * [Listen](https://commonvoice.mozilla.org/tay/listen) -* [Review](https://commonvoice.mozilla.org/tay/review) @@ -160,6 +193,7 @@ information. ### Datasheet authors + - Irvin Chen (MozTW Community Contact) ### Citation guidelines diff --git a/cv-corpus/scs/23.0-2025-09-05/final/en/trv.md b/cv-corpus/scs/23.0-2025-09-05/final/en/trv.md index b97ad3d4..fb9cd3e5 100644 --- a/cv-corpus/scs/23.0-2025-09-05/final/en/trv.md +++ b/cv-corpus/scs/23.0-2025-09-05/final/en/trv.md @@ -5,14 +5,22 @@ for Seediq (`trv`). The dataset contains 6490 clips representing 11 hours of recorded speech (10 hours validated) from 10 speakers. +This dataset includes 10 speakers from Seediq language promotion organizations. The recordings cover the Indigenous Languages Curriculum (K-12) textbook materials, levels 1 to 9. + ## Language +Seediq is the language of the Seediq people, an Indigenous people of Taiwan. ### Variants +This speech corpus includes the following dialect groups: + +- Tgdaya Seediq (`tgdaya`) +- Toda Sediq (`toda`) +- Truku Seejiq (`truku`) ## Demographic information The dataset includes the following distribution of age and gender. @@ -20,6 +28,9 @@ The dataset includes the following distribution of age and gender. ### Gender Self-declared gender information, percentage refers to the number of clips annotated with this gender. + +(The MozTW / Wikimedia Taiwan Indigenous language recording project in early 2025 did not collect this information; therefore, these figures may be relatively inaccurate.)
+ | Gender | Pertentage | |-|-| | Undefined | 68.0% | @@ -36,6 +47,9 @@ Self-declared gender information, percentage refers to the number of clips annot ### Age Self-declared age information, percentage refers to the number of clips annotated with this age band. + +(The MozTW / Wikimedia Taiwan Indigenous language recording project in early 2025 did not collect this information; therefore, these figures may be relatively inaccurate.) + | Age Band | Percentage | |-|-| | Undefined | 93.0% | @@ -92,6 +106,13 @@ Iya bay tuting durang kadi probo ha! +The recording texts are taken from the Indigenous Languages Curriculum (K-12) textbook content (in romanization) for levels 1 to 9, uploaded by Wikimedia Taiwan under authorization from the K-12 Education Administration, Ministry of Education (Taiwan, ROC): https://www.k12ea.gov.tw. Special thanks to Deputy Minister Ping-Cheng Yeh for facilitating the authorization. + +For Tgdaya Seediq, 119 selected verses from the Seediq Tgdaya Bible (see https://cb.fhl.net) are included, with authorization from The Bible Society in Taiwan. + +A portion of the Tgdaya (`tgdaya`) texts is provided by the NTU Corpus of Formosan Languages (Graduate Institute of Linguistics, National Taiwan University): https://corpus.linguistics.ntu.edu.tw/ (thanks to Prof. Li-May Sung for assistance). + +During the recording project, we noticed certain semantic mismatches and typographical/spelling issues in parts of the text. Due to Common Voice system constraints, these were not corrected in advance and recordings proceeded as-is. Recorders and textbook providers collaborated closely; this note is provided for transparency. ### Text domains | Domain | Count | @@ -135,7 +156,17 @@ information. ## Get involved! 
### Community links -* [Common Voice translators on Pontoon](https://pontoon.mozilla.org/trv/common-voice/contributors/) +MozTW (Mozilla Taiwan) Common Voice project site: https://moztw.org/commonvoice/ + +For questions, suggestions, outreach, donating text, or collaboration, please reach out via: + +- Telegram group: https://t.me/+gvmHEcAtd-IwNzFl +- Line group: https://line.me/ti/g/_PLyjCSe_8 + +Communities involved in the 2025 Indigenous language recording project: + +- Seediq language promotion organizations: https://www.facebook.com/Kari3s4t/ +- Special thanks to Bilaq Watan for recruitment and recording support @@ -144,12 +175,12 @@ information. +* Discourse forum (zh-TW): https://discourse.mozilla.org/c/voice/zh-tw/286 +* Related news: https://hackmd.io/@moztw/common-voice-news ### Contribute * [Speak](https://commonvoice.mozilla.org/trv/speak) -* [Write](https://commonvoice.mozilla.org/trv/write) * [Listen](https://commonvoice.mozilla.org/trv/listen) -* [Review](https://commonvoice.mozilla.org/trv/review) @@ -158,6 +189,7 @@ information. ### Datasheet authors + - Irvin Chen (MozTW Community Contact) ### Citation guidelines diff --git a/cv-corpus/scs/23.0-2025-09-05/final/en/zh-TW.md b/cv-corpus/scs/23.0-2025-09-05/final/en/zh-TW.md index 02c6072d..7cecb48b 100644 --- a/cv-corpus/scs/23.0-2025-09-05/final/en/zh-TW.md +++ b/cv-corpus/scs/23.0-2025-09-05/final/en/zh-TW.md @@ -8,6 +8,7 @@ speech (77 hours validated) from 2291 speakers. ## Language +Taiwan Mandarin in Traditional Chinese script. ### Variants @@ -69,6 +70,16 @@ The text corpus contains `21589` sentences, of which `20748` are validated, `841 +Most of the Traditional Chinese text corpus is curated in the MozTW CC0 Sentences repository: https://github.com/moztw/cc0-sentences. + +Summary statistics (see the repo for methods): + +> There are 3573 characters in the corpus, covering about 85.6% of the MOU 2015 common chars data (MoE 2015 common characters 99.75% (3593 chars)). 
+> +> 1046 phonetics are covered, about 66.75% of the total phonetics in CnsPhonetic2016-08v2.cin. + +We welcome more everyday sentences in Mandarin (Traditional Chinese). Please reach out via the community links below. + ### Writing system @@ -96,6 +107,8 @@ There follows a randomly selected sample of five sentences from the corpus. +The text corpus is built by the Mozilla Taiwan community, the g0v community, and other open-source contributors. + ### Text domains | Domain | Count | |-|-| @@ -149,21 +162,31 @@ information. ## Get involved! ### Community links -* [Common Voice translators on Pontoon](https://pontoon.mozilla.org/zh-TW/common-voice/contributors/) +MozTW (Mozilla Taiwan) Common Voice project site: [https://moztw.org/commonvoice/](https://moztw.org/commonvoice/) + +For questions, suggestions, outreach, donating text, or collaboration, please reach out via: + +- [Telegram group](https://t.me/+gvmHEcAtd-IwNzFl) +- [Line group](https://line.me/ti/g/_PLyjCSe_8) +- + ### Discussions +* Discourse forum (zh-TW): https://discourse.mozilla.org/c/voice/zh-tw/286 +* Related news: https://hackmd.io/@moztw/common-voice-news + ### Contribute * [Speak](https://commonvoice.mozilla.org/zh-TW/speak) -* [Write](https://commonvoice.mozilla.org/zh-TW/write) * [Listen](https://commonvoice.mozilla.org/zh-TW/listen) -* [Review](https://commonvoice.mozilla.org/zh-TW/review) +* Donate your sentences: if you would like to donate text you own (e.g., original writing) for recording, please contact Irvin (irvin@moztw.org) or discuss in the Line/Telegram groups above. + @@ -173,6 +196,8 @@ information. +- Irvin Chen (MozTW Community Contact) + ### Citation guidelines
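
The Processing note added to `nan-tw.md` in this diff recommends removing bracketed pronunciations `()` and keeping only the Han character portion when pre-processing the text. A minimal Python sketch of that step follows. The function name `strip_romanization` and the sample strings are hypothetical, not taken from the corpus or its tooling; handling of full-width brackets `（）` is an assumption beyond what the datasheet states. Note that this also removes any parenthesised text that is not a pronunciation hint, so it may be worth inspecting a sample of the output first.

```python
import re

# One parenthesised group, plus any whitespace directly before it.
# ASCII "()" follows the datasheet; full-width "（）" is an added assumption.
_BRACKETED = re.compile(r"\s*[(（][^()（）]*[)）]")


def strip_romanization(sentence: str) -> str:
    """Return the sentence with bracketed pronunciation hints removed."""
    return _BRACKETED.sub("", sentence).strip()


if __name__ == "__main__":
    # Illustrative input only; not a sentence taken from the dataset.
    print(strip_romanization("食飯(tsia̍h-pn̄g)"))
```

The pattern is deliberately non-nested: it removes each innermost bracket pair independently, which is enough if every romanization hint is a single flat group.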