I've been training quite a few models recently, and after getting through about 18 Common Voice languages I realised that most of the data wasn't being included. The issue surfaced when I was looking for an additional data point with more training data than Tatar to fill out the following graph:

It seemed odd to me that Portuguese only had 7 hours of data, but not odd enough. Then I looked at Basque:

```
test: Final amount of imported audio: 8:08:13 from 8:11:21.
dev: Final amount of imported audio: 7:47:56 from 7:48:58.
train: Final amount of imported audio: 10:51:34 from 10:51:44.
validated: Final amount of imported audio: 89:35:24 from 89:43:41.
```
The total amount of data in the train, dev and test splits, about 27 hours, is only around 30% of the nearly 90 hours of validated audio.
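A quick back-of-the-envelope check of the Basque numbers, with the HH:MM:SS figures copied straight from the log above:

```python
# Sum the released splits and compare against the validated total.
def to_hours(hms):
    h, m, s = map(int, hms.split(":"))
    return h + m / 60 + s / 3600

released = sum(map(to_hours, ("8:08:13", "7:47:56", "10:51:34")))
validated = to_hours("89:35:24")
print(f"{released:.1f} h released of {validated:.1f} h validated "
      f"({released / validated:.0%})")
# -> 26.8 h released of 89.6 h validated (30%)
```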
The obvious solution is for everyone to go and make their own splits, but that is a bit unsatisfactory because people's results then won't be comparable. I imagine one of the desiderata of the dataset releases and splits is that they be standard and comparable.
Another approach would be to add flags:

- `--strict-speaker`: a speaker only lives in one file
- `--strict-sentence`: a sentence only lives in one file
- `--strict-audio`: only a single recording per sentence
`--strict-speaker` and `--strict-sentence` should be turned on by default; they ensure the model doesn't get to peek at either the speaker or the sentence.
`--strict-audio` should be turned off by default; it is more about model optimisation, e.g. you could treat multiple recordings of the same sentence as a kind of augmentation. A sketch of how these constraints might be enforced follows below.
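To make the proposal concrete, here is a minimal sketch of enforcing `--strict-speaker` and `--strict-sentence` together. Since every clip ties a speaker to a sentence, satisfying both constraints means each connected component of the speaker-sentence graph has to land wholly in one split. The column names (`client_id`, `sentence`) follow the Common Voice TSV schema, but the greedy component assignment is just an illustration, not the tool's actual algorithm:

```python
import random
from collections import defaultdict

def disjoint_split(rows, ratios=(0.8, 0.1, 0.1), strict_audio=False, seed=0):
    """Split clips into train/dev/test so that no speaker and no
    sentence crosses a split boundary."""
    # Union-find over speaker and sentence nodes: each clip links its
    # speaker to its sentence, so connected components are indivisible.
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for row in rows:
        parent[find(("spk", row["client_id"]))] = find(("sent", row["sentence"]))

    components = defaultdict(list)
    for row in rows:
        components[find(("spk", row["client_id"]))].append(row)

    # Greedily hand each component to whichever split is furthest
    # below its target share of clips.
    comps = list(components.values())
    random.Random(seed).shuffle(comps)
    splits = [[], [], []]
    total = len(rows)
    for comp in comps:
        deficits = [ratios[i] - len(splits[i]) / total for i in range(3)]
        splits[deficits.index(max(deficits))].extend(comp)

    if strict_audio:
        # --strict-audio: keep at most one recording per sentence.
        for i, split in enumerate(splits):
            seen, kept = set(), []
            for row in split:
                if row["sentence"] not in seen:
                    seen.add(row["sentence"])
                    kept.append(row)
            splits[i] = kept

    return dict(zip(("train", "dev", "test"), splits))
```

One consequence of moving whole components is that the split sizes can only approximate the requested ratios: a prolific speaker drags every sentence they recorded, and every other speaker of those sentences, along with them.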
It would also be worth looking into balancing the train/dev/test by gender, but that is certainly another issue.