-
We were also frustrated when we realized that Emilia is restricted with CC-BY-NC. It would be very welcome to share datasets that we might not have noticed before.
-
I can imagine your frustration. I don't think the Share-Alike clause is a problem: in the end, small businesses/studios like us usually use the model and the system, and the closed part is a separate part of the code. I'm not sure about all the possible implications, but I don't think it would be problematic.

As for GPL-v3: since the communication usually goes through disk or through sockets, I don't think it's problematic either, mainly because your code is MIT and the GPL affects the model, which will be used to generate an audio file. That file ends up written to disk, and the GPL ends there. So unless we do some direct memory communication, the GPL should not affect any commercial development in this case.

I will add what everyone adds in these situations: I'm not a lawyer, so I cannot be 100% sure. However, I've been working with Blender for a long time and have investigated the GPL license quite a lot, and in this case I think it won't be problematic for anyone and will ensure the model and its derivatives are kept open :)

One of the main things would be to keep the project commercially usable. That would enable small projects to compete in a market where Eleven Labs is the biggest player and has a TON of resources, while we don't :)
-
As the main source of this question regarding the licence and the Emilia dataset: I think it is worth pointing out that the team behind it, who released the dataset itself (https://huggingface.co/datasets/amphion/Emilia), don't seem to feel that training on the dataset requires the models themselves to also be non-commercial.

Even if in the end that is judged to be different from the publicly listed licence, given they are the ones who compiled the dataset and released it with that licence in the first place, they have the ability to OK a different usage.

And in terms of training on "in the wild" content as a whole: Whisper is incredibly widely used, MIT licensed, and trained on in-the-wild audio from across the web. Llama and all open-source LLMs are similarly trained on data scraped from public web pages.
-
Regarding open data sources: they all seem to be much smaller, with one exception released this month. The fully open ones I could find are:

Public Domain - use it however you want:
It's unlabelled audio data, but the US Library of Congress website has a collection of public domain audio recordings, filterable to "vocal": https://www.loc.gov/collections/national-jukebox/?dates=1800/1922&fa=subject:vocal

cc-by-4.0 / MIT / BSD - say you used it:

cc-by-sa-4.0 - copyleft licence, not as freely available.

FBK-MT/mosel
The big standout here is FBK-MT/mosel, which was released within this month to solve exactly this problem. It is a collection of either public-domain or Creative Commons BY 3.0/4.0 audio datasets. From the paper, it seems they filled in a lot of the extra hours by running ASR on the Creative Commons audio datasets they found. Notably, both Emilia and MOSEL publish their dataset statistics in hours.

Their intro/abstract makes it clear this sort of problem was exactly what they set out to solve, as they complain several times in the opening section that the licences of other datasets are not really open.
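For anyone cataloguing candidate training corpora, a minimal sketch of a license filter in Python. The dataset IDs and license strings below are taken from this thread and are illustrative assumptions only; always verify the actual license on each Hugging Face dataset card before training:

```python
# Candidate corpora mentioned in this thread, keyed by Hugging Face dataset ID.
# License strings are illustrative; confirm them on each dataset card.
CANDIDATES = {
    "facebook/multilingual_librispeech": "cc-by-4.0",
    "ylacombe/cml-tts": "cc-by-4.0",
    "FBK-MT/mosel": "cc-by-4.0",
    "amphion/Emilia": "cc-by-nc-4.0",
}

# Licenses that generally permit commercial model training and redistribution.
PERMISSIVE = {"public-domain", "cc-by-3.0", "cc-by-4.0", "mit", "bsd"}

def commercially_usable(candidates: dict[str, str]) -> list[str]:
    """Return the dataset IDs whose recorded license is in the permissive set."""
    return sorted(name for name, lic in candidates.items()
                  if lic.lower() in PERMISSIVE)
```

With the assumptions above, `commercially_usable(CANDIDATES)` would keep the MLS, CML-TTS, and MOSEL entries and drop Emilia because of its NC clause.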
-
Which training datasets are currently eligible for retraining? Will these then support European languages?
-
I am new to AI TTS overall and to F5. I have looked through the license and I am a little confused about where the lines are drawn. I do some content creation (just a side thing) and wanted to use AI to make a voice other than my own for some shorts, announcements, and things like that. I used one of the StyleTTS2 (Pinokio install) voices to generate about 25 seconds of voice. It's OK, but I really like what I see F5 doing for multi-speech, emotional response, and more natural-sounding voices.

If I use F5 to generate that 25 seconds in a monetized video, does that break the NC part of the rule? When I read the license, it sounded like using F5 or the models/dataset directly as part of what I am doing breaks the rule. I was not sure whether what is produced as output from those models breaks the rule as well.

In addition, I have been trying to find "training prompts" for the multi-speech to provide some of that more "emotional content". The best I have found is at https://www.microsoft.com/en-us/research/project/e2-tts/#:~:text=Changing%20the%20speech%20rate in the RAVDESS section. It has multiple male and female prompts that the site says are for demo purposes. If I download those and use them as my "training prompts" to produce the output, does that break any rules as well? If this is a problem, are there places to get some of those "emotional training prompts" for the multi-speech with different voices (male, female, younger, older, with accents, etc.) that are open source?

I really like what I see; I just want to make sure I am not breaking any of the rules and can do it for free (for now, anyway).
-
@SWivid You mentioned here that you are also planning CC-BY models. Is this coming sometime in the near future? It would be really helpful for us.
-
|
Thanks for reaching out and asking about this.

I did not end up using F5 at this point. License issues aside, I was having problems getting the results I was hoping for. I am new to the AI world, so I suspect I was not understanding things and need to learn a little more about how everything works. I will probably revisit this in the future; just not sure when at this point.
Scott Harris
-
This license change is a pity. The CC-BY is not a problem, but the NC puts this in the same boat as Fish Speech; basically we are back to the starting point, which is a pity.

We don't have the resources to train a new model ourselves; that's why we are looking for open-source projects. Since we are very small, we have to rely on this.

Is there a possibility that you retrain a model with a less restrictive license, using a non-restrictive dataset?

There is a multilingual dataset from Facebook under the CC-BY 4.0 license that wouldn't be so restrictive, and it supports several languages:
https://huggingface.co/datasets/facebook/multilingual_librispeech

There is also this one:
https://huggingface.co/datasets/ylacombe/cml-tts

In case it's useful for a future training run to avoid the licensing problem.

In any case, thanks for your work!