
Commit 693d84a

Add Consistency-Regularized CTC (#1766)
* support consistency-regularized CTC
* update arguments of cr-ctc
* set default value of cr_loss_masked_scale to 1.0
* minor fix
* refactor codes
* update RESULTS.md
1 parent f84270c commit 693d84a

5 files changed: +556 −20 lines changed

egs/librispeech/ASR/README.md (+7 −1)
@@ -50,11 +50,17 @@ We place an additional Conv1d layer right after the input embedding layer.
| `conformer-ctc2` | Reworked Conformer | Use auxiliary attention head |
| `conformer-ctc3` | Reworked Conformer | Streaming version + delay penalty |
| `zipformer-ctc` | Zipformer | Use auxiliary attention head |
| `zipformer` | Upgraded Zipformer | Use auxiliary transducer head / attention-decoder head (the latest recipe) |

# MMI

| | Encoder | Comment |
|------------------------------|-----------|---------------------------------------------------|
| `conformer-mmi` | Conformer | |
| `zipformer-mmi` | Zipformer | CTC warmup + use HP as decoding graph for decoding |

# CR-CTC

| | Encoder | Comment |
|------------------------------|--------------------|--------------------------------------------------------------------------------------------------|
| `zipformer` | Upgraded Zipformer | Can also serve as an auxiliary loss to improve transducer or CTC/AED models (the latest recipe) |
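For readers new to the method: CR-CTC feeds two differently augmented views of the same utterance through the model and adds a consistency term between the two frame-level CTC posterior distributions. Below is a minimal PyTorch sketch of that idea; it is an illustration only, not the icefall implementation, and `encoder`, `ctc_head`, and `augment` are assumed placeholders:

```python
# Minimal sketch of the CR-CTC idea (hypothetical, not icefall's code).
# `encoder`, `ctc_head`, and `augment` are assumed user-provided callables.
import torch
import torch.nn.functional as F

def cr_ctc_loss(encoder, ctc_head, augment, x, out_lens, targets, target_lens,
                cr_scale: float = 0.2):
    # out_lens: number of encoder output frames per utterance (assumed to
    # already account for any subsampling done by the encoder).
    view1, view2 = augment(x), augment(x)  # two independently augmented views
    # We assume ctc_head(...) yields (T, N, C) logits as F.ctc_loss expects.
    logp1 = ctc_head(encoder(view1)).log_softmax(-1)
    logp2 = ctc_head(encoder(view2)).log_softmax(-1)

    # Standard CTC loss on each view.
    ctc = 0.5 * (F.ctc_loss(logp1, targets, out_lens, target_lens)
                 + F.ctc_loss(logp2, targets, out_lens, target_lens))

    # Consistency regularization: symmetric KL between the two frame-level
    # posteriors; each direction's target is detached so the two views
    # regularize each other without a degenerate shortcut.
    cr = 0.5 * (
        F.kl_div(logp1, logp2.detach(), log_target=True, reduction="batchmean")
        + F.kl_div(logp2, logp1.detach(), log_target=True, reduction="batchmean")
    )
    return ctc + cr_scale * cr
```

The weight `cr_scale` plays the role of the recipes' `--cr-loss-scale` flag.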

egs/librispeech/ASR/RESULTS.md (+310 −0)
@@ -1,5 +1,315 @@
## Results

### zipformer (zipformer + pruned-transducer w/ CR-CTC)

See <https://github.com/k2-fsa/icefall/pull/1766> for more details.

[zipformer](./zipformer)

#### Non-streaming

##### large-scale model, number of model parameters: 148824074, i.e., 148.8 M

You can find a pretrained model, training logs, decoding logs, and decoding results at:
<https://huggingface.co/Zengwei/icefall-asr-librispeech-zipformer-large-transducer-with-CR-CTC-20241019>

You can use <https://github.com/k2-fsa/sherpa> to deploy it.

| decoding method      | test-clean | test-other | comment             |
|----------------------|------------|------------|---------------------|
| greedy_search        | 1.9        | 3.96       | --epoch 50 --avg 26 |
| modified_beam_search | 1.88       | 3.95       | --epoch 50 --avg 26 |

The training command using 2 80G-A100 GPUs is:
```bash
export CUDA_VISIBLE_DEVICES="0,1"
# For non-streaming model training:
./zipformer/train.py \
  --world-size 2 \
  --num-epochs 50 \
  --start-epoch 1 \
  --use-fp16 1 \
  --exp-dir zipformer/exp-large-cr-ctc-rnnt \
  --use-cr-ctc 1 \
  --use-ctc 1 \
  --use-transducer 1 \
  --use-attention-decoder 0 \
  --num-encoder-layers 2,2,4,5,4,2 \
  --feedforward-dim 512,768,1536,2048,1536,768 \
  --encoder-dim 192,256,512,768,512,256 \
  --encoder-unmasked-dim 192,192,256,320,256,192 \
  --ctc-loss-scale 0.1 \
  --enable-spec-aug 0 \
  --cr-loss-scale 0.02 \
  --time-mask-ratio 2.5 \
  --full-libri 1 \
  --max-duration 1400 \
  --master-port 12345
```
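Two flags above change the augmentation: `--enable-spec-aug 0` disables the regular dataset-level SpecAugment, while `--time-mask-ratio 2.5` increases the amount of time masking used when building the two augmented views for the consistency loss; as we read these flags, CR-CTC applies stronger time masking of its own to each view. A rough, hypothetical sketch of producing two masked copies of a feature batch (the `time_mask` helper and its defaults are assumptions, not the recipe's actual implementation):

```python
# Hypothetical sketch: build two independently time-masked views of a
# feature batch of shape (N, T, F). Not icefall's code; num_masks and
# max_frames_per_mask stand in for the real settings, which the recipe
# scales via --time-mask-ratio.
import torch

def time_mask(x: torch.Tensor, num_masks: int = 10,
              max_frames_per_mask: int = 25) -> torch.Tensor:
    x = x.clone()
    n, t, _ = x.shape
    for i in range(n):
        for _ in range(num_masks):
            w = int(torch.randint(0, max_frames_per_mask + 1, ()))
            if w == 0 or w >= t:
                continue
            start = int(torch.randint(0, t - w, ()))
            x[i, start:start + w] = 0.0  # zero out a span of frames
    return x

def two_views(x: torch.Tensor):
    # Each view gets its own random masks, so the model sees two
    # differently corrupted versions of the same utterance.
    return time_mask(x), time_mask(x)
```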
The decoding command is:
```bash
export CUDA_VISIBLE_DEVICES="0"
for m in greedy_search modified_beam_search; do
  ./zipformer/decode.py \
    --epoch 50 \
    --avg 26 \
    --exp-dir zipformer/exp-large-cr-ctc-rnnt \
    --use-cr-ctc 1 \
    --use-ctc 1 \
    --use-transducer 1 \
    --use-attention-decoder 0 \
    --num-encoder-layers 2,2,4,5,4,2 \
    --feedforward-dim 512,768,1536,2048,1536,768 \
    --encoder-dim 192,256,512,768,512,256 \
    --encoder-unmasked-dim 192,192,256,320,256,192 \
    --max-duration 300 \
    --decoding-method $m
done
```
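Note that `--epoch 50 --avg 26` does not pick a single checkpoint: it decodes with a model averaged over the checkpoints of the last 26 epochs, which typically gives a small but consistent WER gain. A minimal sketch of plain parameter averaging; icefall ships its own, more refined averaging utilities, and the checkpoint paths and the `"model"` key below are assumptions:

```python
# Minimal sketch of checkpoint (parameter) averaging, the idea behind
# --epoch 50 --avg 26. Hypothetical paths; not icefall's utility code.
import torch

def average_checkpoints(paths):
    avg = None
    for p in paths:
        # The "model" key is an assumption about the checkpoint layout.
        state = torch.load(p, map_location="cpu")["model"]
        if avg is None:
            avg = {k: v.clone().float() for k, v in state.items()}
        else:
            for k in avg:
                avg[k] += state[k].float()
    return {k: v / len(paths) for k, v in avg.items()}

# e.g. average the checkpoints of the last 26 epochs:
# model.load_state_dict(average_checkpoints(
#     [f"zipformer/exp-large-cr-ctc-rnnt/epoch-{e}.pt" for e in range(25, 51)]))
```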
### zipformer (zipformer + CR-CTC-AED)

See <https://github.com/k2-fsa/icefall/pull/1766> for more details.

[zipformer](./zipformer)

#### Non-streaming

##### large-scale model, number of model parameters: 174319650, i.e., 174.3 M

You can find a pretrained model, training logs, decoding logs, and decoding results at:
<https://huggingface.co/Zengwei/icefall-asr-librispeech-zipformer-large-cr-ctc-aed-20241020>

You can use <https://github.com/k2-fsa/sherpa> to deploy it.

| decoding method                      | test-clean | test-other | comment             |
|--------------------------------------|------------|------------|---------------------|
| attention-decoder-rescoring-no-ngram | 1.96       | 4.08       | --epoch 50 --avg 20 |

The training command using 2 80G-A100 GPUs is:
```bash
export CUDA_VISIBLE_DEVICES="0,1"
# For non-streaming model training:
./zipformer/train.py \
  --world-size 2 \
  --num-epochs 50 \
  --start-epoch 1 \
  --use-fp16 1 \
  --exp-dir zipformer/exp-large-cr-ctc-aed \
  --use-cr-ctc 1 \
  --use-ctc 1 \
  --use-transducer 0 \
  --use-attention-decoder 1 \
  --num-encoder-layers 2,2,4,5,4,2 \
  --feedforward-dim 512,768,1536,2048,1536,768 \
  --encoder-dim 192,256,512,768,512,256 \
  --encoder-unmasked-dim 192,192,256,320,256,192 \
  --ctc-loss-scale 0.1 \
  --attention-decoder-loss-scale 0.9 \
  --enable-spec-aug 0 \
  --cr-loss-scale 0.02 \
  --time-mask-ratio 2.5 \
  --full-libri 1 \
  --max-duration 1200 \
  --master-port 12345
```
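With `--ctc-loss-scale 0.1` and `--attention-decoder-loss-scale 0.9`, the training objective is, schematically, a weighted sum of the CTC, attention-decoder, and consistency losses. A one-line sketch, with the individual per-batch loss values as assumed placeholders:

```python
# Schematic weighted combination of the objectives in the CR-CTC + AED
# setup. `ctc_loss`, `aed_loss`, and `cr_loss` are assumed placeholders
# for the per-batch loss values; the defaults mirror the flags above.
def total_loss(ctc_loss, aed_loss, cr_loss,
               ctc_scale=0.1, aed_scale=0.9, cr_scale=0.02):
    return ctc_scale * ctc_loss + aed_scale * aed_loss + cr_scale * cr_loss
```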
The decoding command is:
```bash
export CUDA_VISIBLE_DEVICES="0"
./zipformer/ctc_decode.py \
  --epoch 50 \
  --avg 20 \
  --exp-dir zipformer/exp-large-cr-ctc-aed/ \
  --use-cr-ctc 1 \
  --use-ctc 1 \
  --use-transducer 0 \
  --use-attention-decoder 1 \
  --num-encoder-layers 2,2,4,5,4,2 \
  --feedforward-dim 512,768,1536,2048,1536,768 \
  --encoder-dim 192,256,512,768,512,256 \
  --encoder-unmasked-dim 192,192,256,320,256,192 \
  --max-duration 200 \
  --decoding-method attention-decoder-rescoring-no-ngram
```
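The `attention-decoder-rescoring-no-ngram` method first collects N-best hypotheses from the CTC output and then re-ranks them with the attention decoder. A simplified, hypothetical sketch of the re-ranking step; `ctc_nbest` and `aed_score` are assumed helpers, not icefall functions:

```python
# Hypothetical sketch of rescoring CTC N-best hypotheses with an
# attention decoder. `ctc_nbest(encoder_out)` is assumed to yield
# (hypothesis, ctc_score) pairs; `aed_score(encoder_out, hyp)` is
# assumed to return the attention decoder's log-probability of the
# token sequence given the encoder output.
def rescore(encoder_out, ctc_nbest, aed_score, aed_weight: float = 0.9):
    best_hyp, best_score = None, float("-inf")
    for hyp, ctc_score in ctc_nbest(encoder_out):
        score = ctc_score + aed_weight * aed_score(encoder_out, hyp)
        if score > best_score:
            best_hyp, best_score = hyp, score
    return best_hyp
```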
### zipformer (zipformer + CR-CTC)

See <https://github.com/k2-fsa/icefall/pull/1766> for more details.

[zipformer](./zipformer)

#### Non-streaming

##### small-scale model, number of model parameters: 22118279, i.e., 22.1 M

You can find a pretrained model, training logs, decoding logs, and decoding results at:
<https://huggingface.co/Zengwei/icefall-asr-librispeech-zipformer-small-cr-ctc-20241018>

You can use <https://github.com/k2-fsa/sherpa> to deploy it.

| decoding method   | test-clean | test-other | comment             |
|-------------------|------------|------------|---------------------|
| ctc-greedy-search | 2.57       | 5.95       | --epoch 50 --avg 25 |

The training command using 2 32G-V100 GPUs is:
```bash
export CUDA_VISIBLE_DEVICES="0,1"
# For non-streaming model training:
./zipformer/train.py \
  --world-size 2 \
  --num-epochs 50 \
  --start-epoch 1 \
  --use-fp16 1 \
  --exp-dir zipformer/exp-small/ \
  --use-cr-ctc 1 \
  --use-ctc 1 \
  --use-transducer 0 \
  --use-attention-decoder 0 \
  --num-encoder-layers 2,2,2,2,2,2 \
  --feedforward-dim 512,768,768,768,768,768 \
  --encoder-dim 192,256,256,256,256,256 \
  --encoder-unmasked-dim 192,192,192,192,192,192 \
  --base-lr 0.04 \
  --enable-spec-aug 0 \
  --cr-loss-scale 0.2 \
  --time-mask-ratio 2.5 \
  --full-libri 1 \
  --max-duration 850 \
  --master-port 12345
```
The decoding command is:
```bash
export CUDA_VISIBLE_DEVICES="0"
for m in ctc-greedy-search; do
  ./zipformer/ctc_decode.py \
    --epoch 50 \
    --avg 25 \
    --exp-dir zipformer/exp-small \
    --use-cr-ctc 1 \
    --use-ctc 1 \
    --use-transducer 0 \
    --use-attention-decoder 0 \
    --num-encoder-layers 2,2,2,2,2,2 \
    --feedforward-dim 512,768,768,768,768,768 \
    --encoder-dim 192,256,256,256,256,256 \
    --encoder-unmasked-dim 192,192,192,192,192,192 \
    --max-duration 600 \
    --decoding-method $m
done
```
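`ctc-greedy-search` is the simplest CTC decoding method: take the arg-max token at each frame, collapse consecutive repeats, and remove blanks. A self-contained sketch; blank id 0 is an assumption, matching the common convention:

```python
# Minimal CTC greedy search: frame-wise argmax, collapse repeated
# tokens, remove blanks. `log_probs` has shape (T, C).
import torch

def ctc_greedy_search(log_probs: torch.Tensor, blank: int = 0):
    ids = log_probs.argmax(dim=-1).tolist()  # best token per frame
    out, prev = [], None
    for i in ids:
        if i != blank and i != prev:
            out.append(i)  # keep first occurrence of each run
        prev = i
    return out
```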
##### medium-scale model, number of model parameters: 64250603, i.e., 64.3 M

You can find a pretrained model, training logs, decoding logs, and decoding results at:
<https://huggingface.co/Zengwei/icefall-asr-librispeech-zipformer-medium-cr-ctc-20241018>

You can use <https://github.com/k2-fsa/sherpa> to deploy it.

| decoding method   | test-clean | test-other | comment             |
|-------------------|------------|------------|---------------------|
| ctc-greedy-search | 2.12       | 4.62       | --epoch 50 --avg 24 |

The training command using 4 32G-V100 GPUs is:
```bash
export CUDA_VISIBLE_DEVICES="0,1,2,3"
# For non-streaming model training:
./zipformer/train.py \
  --world-size 4 \
  --num-epochs 50 \
  --start-epoch 1 \
  --use-fp16 1 \
  --exp-dir zipformer/exp \
  --use-cr-ctc 1 \
  --use-ctc 1 \
  --use-transducer 0 \
  --use-attention-decoder 0 \
  --enable-spec-aug 0 \
  --cr-loss-scale 0.2 \
  --time-mask-ratio 2.5 \
  --full-libri 1 \
  --max-duration 700 \
  --master-port 12345
```

The decoding command is:
```bash
export CUDA_VISIBLE_DEVICES="0"
for m in ctc-greedy-search; do
  ./zipformer/ctc_decode.py \
    --epoch 50 \
    --avg 24 \
    --exp-dir zipformer/exp \
    --use-cr-ctc 1 \
    --use-ctc 1 \
    --use-transducer 0 \
    --use-attention-decoder 0 \
    --max-duration 600 \
    --decoding-method $m
done
```
##### large-scale model, number of model parameters: 147010094, i.e., 147.0 M

You can find a pretrained model, training logs, decoding logs, and decoding results at:
<https://huggingface.co/Zengwei/icefall-asr-librispeech-zipformer-large-cr-ctc-20241018>

You can use <https://github.com/k2-fsa/sherpa> to deploy it.

| decoding method   | test-clean | test-other | comment             |
|-------------------|------------|------------|---------------------|
| ctc-greedy-search | 2.03       | 4.37       | --epoch 50 --avg 26 |

The training command using 2 80G-A100 GPUs is:
```bash
export CUDA_VISIBLE_DEVICES="0,1"
# For non-streaming model training:
./zipformer/train.py \
  --world-size 2 \
  --num-epochs 50 \
  --start-epoch 1 \
  --use-fp16 1 \
  --exp-dir zipformer/exp-large \
  --use-cr-ctc 1 \
  --use-ctc 1 \
  --use-transducer 0 \
  --use-attention-decoder 0 \
  --num-encoder-layers 2,2,4,5,4,2 \
  --feedforward-dim 512,768,1536,2048,1536,768 \
  --encoder-dim 192,256,512,768,512,256 \
  --encoder-unmasked-dim 192,192,256,320,256,192 \
  --enable-spec-aug 0 \
  --cr-loss-scale 0.2 \
  --time-mask-ratio 2.5 \
  --full-libri 1 \
  --max-duration 1400 \
  --master-port 12345
```

The decoding command is:
```bash
export CUDA_VISIBLE_DEVICES="0"
for m in ctc-greedy-search; do
  ./zipformer/ctc_decode.py \
    --epoch 50 \
    --avg 26 \
    --exp-dir zipformer/exp-large \
    --use-cr-ctc 1 \
    --use-ctc 1 \
    --use-transducer 0 \
    --use-attention-decoder 0 \
    --num-encoder-layers 2,2,4,5,4,2 \
    --feedforward-dim 512,768,1536,2048,1536,768 \
    --encoder-dim 192,256,512,768,512,256 \
    --encoder-unmasked-dim 192,192,256,320,256,192 \
    --max-duration 600 \
    --decoding-method $m
done
```
### zipformer (zipformer + CTC/AED)

See <https://github.com/k2-fsa/icefall/pull/1389> for more details.
