Open
Labels: help wanted
TensorFlowASR makes it quite easy to train and deploy near-SOTA ASR models, but it provides a pretrained model only in English. On the other hand, FAIR has recently published an open and free dataset in 8 languages (see the paper). It is in the public domain, large, and of the same quality as LibriSpeech. So my suggestion is to form a volunteer working group to collaborate on training ASR models in multiple languages and share them publicly.
If this idea makes sense, maintainers of the repo can pin the issue and label it with help-wanted for visibility.
nglehuy commented on Dec 30, 2020
This is a great idea 😆
monatis commented on Dec 30, 2020
Hi @usimarit! Thanks for such a great project and for your support in boosting the visibility of this issue 😻 I will start by writing a helper script that automatically downloads the MLS dataset in a given language and prepares the transcription and alphabet files, and I will open a PR to add it to the repo. Then, as a first step, I can train a Conformer model in German using this script 🚀
nglehuy commented on Dec 31, 2020
Hi @monatis, just for your information, we should train using subwords instead of characters for a performance boost.
monatis commented on Dec 31, 2020
Hi @usimarit, yeah, I know that training with subwords yields better performance, but I'm automatically generating an alphabet file in #92 for those who want to use characters anyway.
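For readers unfamiliar with the alphabet-file step, here is a minimal sketch of how a character alphabet could be derived from a transcript file. It is not the script from #92. The function names `build_alphabet` and `write_alphabet` are hypothetical, the input is assumed to have one `utterance_id<TAB>text` pair per line, and the output layout (one character per line) is an assumption that may not match what TensorFlowASR expects.

```python
from typing import Iterable, List


def build_alphabet(transcript_lines: Iterable[str]) -> List[str]:
    """Return the sorted list of distinct characters found in the transcripts.

    Assumption: each line is "<utterance_id><TAB><text>".
    Lines without a tab are skipped.
    """
    chars = set()
    for line in transcript_lines:
        line = line.rstrip("\n")
        if "\t" not in line:
            continue
        _, text = line.split("\t", 1)
        # Lowercase so that "A" and "a" map to the same symbol.
        chars.update(text.lower())
    return sorted(chars)


def write_alphabet(chars: List[str], path: str) -> None:
    """Write one character per line (hypothetical output layout)."""
    with open(path, "w", encoding="utf-8") as f:
        for ch in chars:
            f.write(ch + "\n")


if __name__ == "__main__":
    sample = ["id1\tGuten Tag", "id2\tab"]
    print(build_alphabet(sample))
```

Note that the space character ends up in the alphabet as well; whether it should be kept, and how it should be encoded in the file, depends on the format the training code expects.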
nglehuy commented on Jan 26, 2021
Hi everyone, if you want to share your pretrained models, just upload the .h5 or .pb files along with config.yml to any drive (Google Drive, Dropbox, etc.). Then, in the README.md of the example directory that belongs to each model, write a section, add the link, and add your contact to a subsection in that section, as in the image (the one I made).
And finally open a pull request to merge it into the repo 😄
monatis commented on Jan 26, 2021
@usimarit Great. I've been quite busy for some time, but I'll be more active in this repo in the following days and will contribute pretrained models in other languages. Thanks
JStumpp commented on Feb 4, 2021
@monatis, did you already train a Conformer model in German?
monatis commented on Feb 5, 2021
Hi @JStumpp, I have started training it and hope to release it next week.
christina284 commented on Dec 15, 2022
Do you know which subset of LibriSpeech the English pretrained model was trained on?