I've been training quite a few models recently, and after getting through about 18 Common Voice languages I realised that most of the data wasn't being included. The issue surfaced when I was looking for an additional data point with more training data than Tatar to fill out the following graph:

It seemed odd to me that Portuguese only had 7 hours of data, but not odd enough. Then I looked at Basque:

```
test: Final amount of imported audio: 8:08:13 from 8:11:21.
dev: Final amount of imported audio: 7:47:56 from 7:48:58.
train: Final amount of imported audio: 10:51:34 from 10:51:44.
validated: Final amount of imported audio: 89:35:24 from 89:43:41.
```
The total amount of data in the train, dev and test splits, about 27 hours, is only around 30% of the nearly 90 hours of validated audio.
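A quick back-of-the-envelope check of the Basque numbers, with the HH:MM:SS figures copied straight from the log above:

```python
# Sum the released splits and compare against the validated total.
def to_hours(hms):
    h, m, s = map(int, hms.split(":"))
    return h + m / 60 + s / 3600

released = sum(map(to_hours, ("8:08:13", "7:47:56", "10:51:34")))
validated = to_hours("89:35:24")
print(f"{released:.1f} h released of {validated:.1f} h validated "
      f"({released / validated:.0%})")
# -> 26.8 h released of 89.6 h validated (30%)
```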
The obvious solution is for everyone to go and make their own splits, but that is a bit unsatisfactory because people's results then won't be comparable. I imagine one of the desiderata of the dataset releases and splits is that they be standard and comparable.
Another approach would be to add flags:

- `--strict-speaker`: a speaker only lives in one file
- `--strict-sentence`: a sentence only lives in one file
- `--strict-audio`: only a single recording per sentence
`--strict-speaker` and `--strict-sentence` should be turned on by default; they ensure the model doesn't get to peek at either the speaker or the sentence.
`--strict-audio` should be turned off by default; it is more about model optimisation, e.g. you could treat multiple recordings of the same sentence as a kind of augmentation. A sketch of how these constraints might be enforced follows below.
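To make the proposal concrete, here is a minimal sketch of enforcing `--strict-speaker` and `--strict-sentence` together. Since every clip ties a speaker to a sentence, satisfying both constraints means each connected component of the speaker-sentence graph has to land wholly in one split. The column names (`client_id`, `sentence`) follow the Common Voice TSV schema, but the greedy component assignment is just an illustration, not the tool's actual algorithm:

```python
import random
from collections import defaultdict

def disjoint_split(rows, ratios=(0.8, 0.1, 0.1), strict_audio=False, seed=0):
    """Split clips into train/dev/test so that no speaker and no
    sentence crosses a split boundary."""
    # Union-find over speaker and sentence nodes: each clip links its
    # speaker to its sentence, so connected components are indivisible.
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for row in rows:
        parent[find(("spk", row["client_id"]))] = find(("sent", row["sentence"]))

    components = defaultdict(list)
    for row in rows:
        components[find(("spk", row["client_id"]))].append(row)

    # Greedily hand each component to whichever split is furthest
    # below its target share of clips.
    comps = list(components.values())
    random.Random(seed).shuffle(comps)
    splits = [[], [], []]
    total = len(rows)
    for comp in comps:
        deficits = [ratios[i] - len(splits[i]) / total for i in range(3)]
        splits[deficits.index(max(deficits))].extend(comp)

    if strict_audio:
        # --strict-audio: keep at most one recording per sentence.
        for i, split in enumerate(splits):
            seen, kept = set(), []
            for row in split:
                if row["sentence"] not in seen:
                    seen.add(row["sentence"])
                    kept.append(row)
            splits[i] = kept

    return dict(zip(("train", "dev", "test"), splits))
```

One consequence of moving whole components is that the split sizes can only approximate the requested ratios: a prolific speaker drags every sentence they recorded, and every other speaker of those sentences, along with them.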
It would also be worth looking into balancing the train/dev/test by gender, but that is certainly another issue.