Download the official KiSing dataset [here](http://shijt.site/index.php/2021/05/16/kising-the-first-open-source-mandarin-singing-voice-synthesis-corpus/). The file structure looks like below:
In this recipe, we will show how to train VITS using Amphion's infrastructure. [VITS](https://arxiv.org/abs/2106.06103) is an end-to-end TTS architecture that utilizes a conditional variational autoencoder with adversarial learning.

There are four stages in total:

1. Data preparation
2. Features extraction
3. Training
4. Inference

## 1. Data Preparation
### Dataset Download

You can use any of the commonly used TTS datasets to train the TTS model, e.g., LJSpeech, VCTK, Hi-Fi TTS, LibriTTS, etc. We strongly recommend using LJSpeech when training a single-speaker TTS model for the first time, and Hi-Fi TTS when training a multi-speaker TTS model for the first time. How to download these datasets is detailed [here](../../datasets/README.md).

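If you just need LJSpeech for a first run, a minimal download sketch looks like the following (assuming the standard `LJSpeech-1.1` archive URL; the extracted folder is what you later point `dataset_path` at):

```bash
# Download and unpack LJSpeech-1.1 (~2.6 GB compressed); the resulting
# LJSpeech-1.1/ directory is the "[LJSpeech dataset path]" used in exp_config.json.
wget https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2
tar -xjf LJSpeech-1.1.tar.bz2
```
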
### Configuration

After downloading the dataset, you can set the dataset paths in `exp_config.json`:

```json
    "dataset": [
        "LJSpeech",
        //"hifitts"
    ],
    "dataset_path": {
        // TODO: Fill in your dataset path
        "LJSpeech": "[LJSpeech dataset path]",
        //"hifitts": "[Hi-Fi TTS dataset path]"
    },
```
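
For example, if you want to run the multi-speaker recipe on Hi-Fi TTS instead, the same fields are simply switched over (a sketch; the bracketed path is a placeholder for your local copy):

```json
    "dataset": [
        "hifitts"
    ],
    "dataset_path": {
        "hifitts": "[Hi-Fi TTS dataset path]"
    },
```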

## 2. Features Extraction

### Configuration

In `exp_config.json`, specify the `log_dir` for saving the checkpoints and logs, and the `processed_dir` for saving the processed data. For preprocessing a multi-speaker TTS dataset, set `extract_audio` and `use_spkid` to `true`:

```json
    // TODO: Fill in the output log path. The default value is "Amphion/ckpts/tts"
    "log_dir": "ckpts/tts",
    "preprocess": {
        //"extract_audio": true,
        "use_phone": true,
        // linguistic features
        "extract_phone": true,
        "phone_extractor": "espeak", // "espeak, pypinyin, pypinyin_initials_finals, lexicon (only for language=en-us right now)"
        // TODO: Fill in the output data path. The default value is "Amphion/data"
        "processed_dir": "data",
        "sample_rate": 22050, // target sampling rate
        "valid_file": "valid.json", // validation set
        //"use_spkid": true, // use speaker ID to train multi-speaker TTS model
    },
```
### Run

Run the `run.sh` as the preprocess stage (set `--stage 1`):

```bash
sh egs/tts/VITS/run.sh --stage 1
```
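
After this stage finishes, the processed metadata should land under the `processed_dir` configured above; a quick way to sanity-check it (a sketch; exact file names may vary by dataset) is:

```bash
# Expect per-dataset metadata such as the train/valid json files under processed_dir.
ls data/LJSpeech
```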

## 3. Training

### Configuration

We provide the default hyperparameters in `exp_config.json`. They can work on a single NVIDIA GPU with 24 GB of memory. You can adjust them based on your GPU machines.

For training the multi-speaker TTS model, specify `n_speakers` to be greater than or equal to the number of speakers in your dataset(s) (setting it larger leaves room for fine-tuning on new speakers), and set `multi_speaker_training` to `true`.

```json
    "model": {
        //"n_speakers": 10 // Number of speakers in the dataset(s) used. The default value is 0 if not specified.
    },
    "train": {
        "batch_size": 16,
        //"multi_speaker_training": true,
    }
```
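
As an illustration, a hypothetical multi-speaker setup on the 10-speaker Hi-Fi TTS subset listed in the inference table below might combine the preprocessing and training switches like this (a sketch, not the shipped defaults):

```json
    "preprocess": {
        "extract_audio": true,
        "use_spkid": true
    },
    "model": {
        "n_speakers": 10
    },
    "train": {
        "batch_size": 16,
        "multi_speaker_training": true
    }
```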
### Train From Scratch

Run the `run.sh` as the training stage (set `--stage 2`). Specify an experimental name to run the following command. The tensorboard logs and checkpoints will be saved in `Amphion/ckpts/tts/[YourExptName]`.

```bash
sh egs/tts/VITS/run.sh --stage 2 --name [YourExptName]
```
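
Since the tensorboard logs are written to that directory, you can monitor training with a standard tensorboard invocation (assuming tensorboard is installed in your environment):

```bash
# Point tensorboard at the experiment directory created by the training stage.
tensorboard --logdir ckpts/tts/[YourExptName]
```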

## 4. Inference

### Configuration

For inference, you need to specify the following configurations when running `run.sh`:

| Parameters | Description | Default value |
| --- | --- | --- |
| `--infer_expt_dir` | The experimental directory which contains `checkpoint` | `Amphion/ckpts/tts/[YourExptName]` |
| `--infer_output_dir` | The output directory to save inferred audios. | `Amphion/ckpts/tts/[YourExptName]/result` |
| `--infer_mode` | The inference mode, e.g., "`single`", "`batch`". | "`single`" to generate a clip of speech, "`batch`" to generate a batch of speech at a time. |
| `--infer_dataset` | The dataset used for inference. | For LJSpeech dataset, the inference dataset would be `LJSpeech`.<br> For Hi-Fi TTS dataset, the inference dataset would be `hifitts`. |
| `--infer_testing_set` | The subset of the inference dataset used for inference, e.g., train, test, golden_test | For LJSpeech dataset, the testing set would be "`test`" split from LJSpeech at the feature extraction, or "`golden_test`" cherry-picked from the test set as template testing set.<br>For Hi-Fi TTS dataset, the testing set would be "`test`" split from Hi-Fi TTS during the feature extraction process. |
| `--infer_text` | The text to be synthesized. | "`This is a clip of generated speech with the given text from a TTS model.`" |
| `--infer_speaker_name` | The target speaker whose voice is to be synthesized.<br> (***Note: only applicable to multi-speaker TTS model***) | For Hi-Fi TTS dataset, the list of available speakers includes: "`hifitts_11614`", "`hifitts_11697`", "`hifitts_12787`", "`hifitts_6097`", "`hifitts_6670`", "`hifitts_6671`", "`hifitts_8051`", "`hifitts_9017`", "`hifitts_9136`", "`hifitts_92`". <br> You may find the list of available speakers in the `spk2id.json` file generated under the `log_dir/[YourExptName]` that you have specified in `exp_config.json`. |
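
To check which speaker names you can pass to `--infer_speaker_name`, you can simply inspect the generated mapping file (a sketch; the path assumes the default `log_dir` of `ckpts/tts`):

```bash
# The keys of spk2id.json are the speaker names accepted by --infer_speaker_name.
cat ckpts/tts/[YourExptName]/spk2id.json
```
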
### Run

#### Single text inference:

For the single-speaker TTS model, if you want to generate a single clip of speech from a given text, run `run.sh` as the inference stage (set `--stage 3`) in "`single`" mode:

```bash
sh egs/tts/VITS/run.sh --stage 3 \
    --infer_expt_dir Amphion/ckpts/tts/[YourExptName] \
    --infer_output_dir Amphion/ckpts/tts/[YourExptName]/result \
    --infer_mode "single" \
    --infer_text "This is a clip of generated speech with the given text from a TTS model."
```

For the multi-speaker TTS model, additionally pass `--infer_speaker_name` (e.g., "`hifitts_92`") to choose the target voice.

#### Batch inference:

For the single-speaker TTS model, if you want to generate speech for the whole testing set split from LJSpeech, run the same command in "`batch`" mode with `--infer_dataset "LJSpeech"` and `--infer_testing_set "test"`. For the multi-speaker TTS model, the same procedure follows from above, with `LJSpeech` replaced by `hifitts`:

```bash
sh egs/tts/VITS/run.sh --stage 3 \
    --infer_expt_dir Amphion/ckpts/tts/[YourExptName] \
    --infer_output_dir Amphion/ckpts/tts/[YourExptName]/result \
    --infer_mode "batch" \
    --infer_dataset "hifitts" \
    --infer_testing_set "test"
```

We have released a pre-trained Amphion VITS model trained on LJSpeech, so you can download it [here](https://huggingface.co/amphion/vits-ljspeech) and generate speech following the inference instructions above. Meanwhile, a pre-trained multi-speaker VITS model trained on Hi-Fi TTS will be released soon. Stay tuned.
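
One way to fetch the released LJSpeech checkpoint is to clone its Hugging Face repository (a sketch; it assumes `git-lfs` is available and that you later point `--infer_expt_dir` at the cloned folder):

```bash
# Clone the pre-trained Amphion VITS (LJSpeech) checkpoint.
git lfs install
git clone https://huggingface.co/amphion/vits-ljspeech ckpts/tts/vits-ljspeech
```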

```bibtex
@inproceedings{kim2021conditional,
  title={Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech},
  author={Kim, Jaehyeon and Kong, Jungil and Son, Juhee},
  booktitle={International Conference on Machine Learning},
  pages={5530--5540},
  year={2021},
  organization={PMLR}
}
```

`egs/tts/VITS/exp_config.json`:

```json
    "base_config": "config/vits.json",
    "model_type": "VITS",
    "dataset": [
        "LJSpeech",
        //"hifitts"
    ],
    "dataset_path": {
        // TODO: Fill in your dataset path
        "LJSpeech": "[LJSpeech dataset path]",
        //"hifitts": "[Hi-Fi TTS dataset path]"
    },
    // TODO: Fill in the output log path. The default value is "Amphion/ckpts/tts"
    "log_dir": "ckpts/tts",
    "preprocess": {
        //"extract_audio": true,
        "use_phone": true,
        // linguistic features
        "extract_phone": true,
        "phone_extractor": "espeak", // "espeak, pypinyin, pypinyin_initials_finals, lexicon (only for language=en-us right now)"
        // TODO: Fill in the output data path. The default value is "Amphion/data"
        "processed_dir": "data",
        "sample_rate": 22050, // target sampling rate
        "valid_file": "valid.json", // validation set
        //"use_spkid": true // use speaker ID to train multi-speaker TTS model
    },
    "model": {
        //"n_speakers": 10 // number of speakers, greater than or equal to the number of speakers in the dataset(s) used. The default value is 0 if not specified.
    },
```

`egs/tts/VITS/run.sh`:

```bash
        # [Only for Inference] The inference mode. It can be "batch" to generate speech by batch, or "single" to generate a single clip of speech.
        --infer_mode) shift; infer_mode=$1;shift ;;
        # [Only for Inference] The inference dataset. It is only used when the inference mode is "batch".
        --infer_dataset) shift; infer_dataset=$1;shift ;;
        # [Only for Inference] The inference testing set. It is only used when the inference mode is "batch". It can be "test" set split from the dataset, or "golden_test" carefully selected from the testing set.
        --infer_testing_set) shift; infer_testing_set=$1;shift ;;
```