Commit 15104cc
Refactor: Integrate Whisper for speech unit extraction
This commit revamps the speech resynthesis pipeline by replacing the
mHuBERT-based self-supervised unit extraction (via textlesslib) with
a supervised Whisper-based approach. This change significantly impacts
data processing, model configuration, and training/evaluation workflows.
Key changes include:
- **Core Unit Extraction:**
  - Removed `textlesslib` dependency and its associated `SpeechEncoder`
    (mHuBERT + K-means).
  - Integrated `WhisperFeatureExtractor` and `WhisperEncoder` from
    `src.flow_matching.utils.whisper` for supervised discrete unit
    extraction.
  - Updated Python version from 3.9 to 3.10 and added `faiss-gpu`
    to the environment.
- **Dataset and Configuration:**
  - Default dataset changed to `ryota-komatsu/LibriTTS-R-whisper-large-v3-4096units`.
  - Vocabulary size updated from 2000 to 4096 units.
  - Replaced mHuBERT-based config files with
    `whisper-large-v3-4096-bigvgan.yaml`, reflecting new model
    parameters (e.g., `dim_cond_emb`, tokenizer settings) and ASR model
    (`openai/whisper-large-v3`).
- **Pipeline Simplification:**
  - Removed `tokenize` and `extract_features` stages from data preprocessing.
  - Removed the `train_hifigan` stage and the `ConditionalFlowMatchingWithHifiGan`
    model and its configuration.
  - Removed the dedicated `src/flow_matching/eval.py` module and the
    `evaluate` task in `main_resynth.py`.
  - Removed custom ASR evaluation utilities (`src/flow_matching/utils/phi/`)
    and textless utilities (`src/flow_matching/utils/textless.py`).
  - Removed the entire `src/hifigan/` directory.
- **Data Handling and Training:**
  - Replaced `UnitDataset` with direct loading from Hugging Face datasets
    using `load_dataset` and a new `get_collate_fn` in `src/flow_matching/data.py`.
  - Updated validation in `src/flow_matching/train.py` to use
    `transformers.pipeline` with `openai/whisper-large-v3` for ASR and
    `processor.tokenizer.normalize` for text normalization.
  - Updated unit embedding in `train_flow_matching` to use
    `WhisperEncoder.from_pretrained(...).quantizer`.
- **Usage and Demo:**
  - Updated `README.md` and `demo.ipynb` to reflect the new setup,
    usage patterns, and Whisper integration for unit encoding.
  - Updated `scripts/setup.sh` to remove `textlesslib` cloning.
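
The collate-function factory mentioned under "Data Handling and Training" might look roughly like the sketch below. This is not the code from `src/flow_matching/data.py`, which is not shown on this page: the `units` field name, the `pad_id` parameter, and the returned keys are all assumptions for illustration, and plain Python lists stand in for the tensors a real training loop would presumably use.

```python
from typing import Callable, Dict, List, Sequence

Example = Dict[str, List[int]]


def get_collate_fn(pad_id: int = 0) -> Callable[[Sequence[Example]], Dict[str, list]]:
    """Return a collate function that right-pads unit sequences.

    Hypothetical signature: the real `get_collate_fn` may take different
    arguments and return tensors rather than lists.
    """

    def collate_fn(batch: Sequence[Example]) -> Dict[str, list]:
        # Record the true length of each sequence before padding.
        lengths = [len(example["units"]) for example in batch]
        max_len = max(lengths)
        # Right-pad every sequence to the longest one in the batch.
        padded = [
            list(example["units"]) + [pad_id] * (max_len - n)
            for example, n in zip(batch, lengths)
        ]
        return {"units": padded, "lengths": lengths}

    return collate_fn


# Two toy examples standing in for rows of a Hugging Face dataset.
batch = [{"units": [5, 7, 9]}, {"units": [42]}]
collated = get_collate_fn(pad_id=0)(batch)
print(collated)
```

A function built this way can be passed as the `collate_fn` argument of a data loader, which is one common reason to expose it as a factory rather than a fixed function.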
This refactoring aims to leverage supervised Whisper models for potentially
higher-quality unit extraction and to simplify the overall codebase by relying
more on Hugging Face's ecosystem for common ASR and data handling tasks.

1 parent 1a54261 commit 15104cc
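
The ASR-based validation described in the message can be illustrated with a small sketch. The commit does not say which metric is computed, so word error rate is an assumption here, as are the function names `edit_distance` and `corpus_wer`. The `transcribe` and `normalize` callables are stand-ins: in the actual code they would presumably wrap the `transformers.pipeline` ASR model and `processor.tokenizer.normalize`, respectively.

```python
from typing import Callable, Iterable, Sequence, Tuple


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Word-level Levenshtein distance computed one row at a time."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def corpus_wer(
    pairs: Iterable[Tuple[object, str]],
    transcribe: Callable[[object], str],
    normalize: Callable[[str], str],
) -> float:
    """Total word errors divided by total reference words (assumed metric)."""
    errors = 0
    words = 0
    for audio, reference in pairs:
        ref = normalize(reference).split()
        hyp = normalize(transcribe(audio)).split()
        errors += edit_distance(ref, hyp)
        words += len(ref)
    return errors / max(words, 1)


# Stand-ins so the sketch runs without downloading a model.
fake_transcribe = lambda audio: audio["fake_transcript"]
fake_normalize = lambda text: text.lower()

pairs = [
    ({"fake_transcript": "hello word"}, "Hello world"),
    ({"fake_transcript": "Good morning"}, "good morning"),
]
wer = corpus_wer(pairs, fake_transcribe, fake_normalize)
print(wer)
```

Normalizing both the reference and the hypothesis before comparison keeps casing and punctuation differences from being counted as recognition errors.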
28 files changed
Lines changed: 159 additions & 4020 deletions
File tree
- configs/unit2speech
- scripts
- src
  - flow_matching
    - utils
      - phi
  - hifigan
Lines changed: 0 additions & 102 deletions
This file was deleted.
This file was deleted.
Lines changed: 11 additions & 15 deletions