|
| 1 | +# WeNet Hotword |
1 | 2 |
|
2 | | - |
3 | | -# WeNet Hotword |
4 | | - |
5 | | -**Hotword-biased decoding for the [WeNet](https://github.com/wenet-e2e/wenet) C++ runtime.** |
6 | | - |
7 | | -[](https://opensource.org/licenses/Apache-2.0) |
| 3 | +[](https://opensource.org/licenses/Apache-2.0) |
8 | 4 | [](https://en.cppreference.com/w/cpp/17) |
9 | | -[](https://pytorch.org/cppdocs/) |
10 | | - |
11 | | -[**Eval Writeup**](runtime/libtorch/eval_runs/HOTWORD_EVAL.md) |
12 | 5 |
|
| 6 | +**Hotword-biased decoding for the [WeNet](https://github.com/wenet-e2e/wenet) C++ runtime.** |
13 | 7 |
|
| 8 | +> **Based On**: |
| 9 | +> **Model**: `wenet/u2pp_conformer-asr-cn-16k-online` |
| 10 | +> **Tune**: `AISHELL-1 hotword test` (235 utts, 187 hotwords) |
14 | 11 |
|
15 | | -**tune set** (235 utts): recall ↑ 5.6× CER ↓ 55% |
16 | | -<br> |
17 | | -**test set** (115 utts): recall ↑ 3.5× CER ↓ 47% |
18 | | - |
19 | | -| | baseline (tune) | baseline (test) | ours (tune) | ours (test) | |
20 | | -|--|:--:|:--:|:--:|:--:| |
21 | | -| hotword recall | 15.96% | 25.93% | **90.07%** | **91.11%** | |
22 | | -| CER | 14.20% | 13.76% | **6.32%** | **7.33%** | |
23 | | - |
24 | | -<sub>Model: `wenet/u2pp_conformer-asr-cn-16k-online`</sub> |
25 | | -<br> |
26 | | -<sub>Tune: `AISHELL-1 hotword test` </sub> |
27 | | -<br> |
28 | | -<sub>Test: `aishell1_indep_hotword`</sub> |
| 12 | +| | Baseline | Ours (Ultra) | |
| 13 | +|--|--|--| |
| 14 | +| **CER** | 5.14% | **4.82%** | |
| 15 | +| **Recall** | 81.08% | **95.95%** | |
| 16 | +| **Precision** | **95.24%** | 93.01% | |
| 17 | +| **F1** | 87.59% | **94.46%** | |
29 | 18 |
|
| 19 | +**Test**: `AISHELL-2 iOS eval` (1000 utts, 301 hotwords) |
30 | 20 |
|
31 | | -## 🌟 Features |
| 21 | +| | Baseline | Ours (Ultra) | |
| 22 | +|--|--|--| |
| 23 | +| **CER** | 5.14% | **4.83%** | |
| 24 | +| **Recall** | 42.03% | **88.41%** | |
| 25 | +| **Precision** | **100.00%** | 92.42% | |
| 26 | +| **F1** | 59.18% | **90.37%** | |
32 | 27 |
|
33 | | -- **Phoneme Corrector** — fuzzy hotword matching via G2P phoneme edit-distance on the n-best. |
34 | | -- **Confidence-Weighted Match Bonus** — per-hotword reward scaled by acoustic confidence. |
35 | | -- **LRU Hotword Cache** — recurring hotwords get a lowered fuzzy threshold in streaming. |
36 | | -- **Multi-Objective Autotuner** — Optuna TPE over decoder + hotword knobs, optimizing recall and CER jointly with early-exit stagnation detection. |
| 28 | +**Test**: `AISHELL-2 iOS eval` (1000 utts, 27 hard hotwords) |
37 | 29 |
|
38 | | ---- |
| 30 | +## Highlights |
39 | 31 |
|
40 | | -## 🚀 Quick Start |
| 32 | +* **Phoneme Corrector** — fuzzy hotword matching via G2P phoneme edit-distance on the n-best |
| 33 | +* **Confidence-Weighted Match Bonus** — per-hotword reward scaled by acoustic confidence |
| 34 | +* **Multi-Objective Autotuner** — 2D/3D Pareto over decoder + hotword knobs, with early-exit stagnation detection |
41 | 35 |
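The phoneme corrector idea from the highlights can be sketched in a few lines. This is a minimal illustration, not the code in `corrector.{cc,h}`: the toy G2P table, the helper names (`to_phonemes`, `phoneme_similarity`, `fuzzy_match`) and the `0.5` threshold are all assumptions, and the distance is computed over whole pinyin syllables rather than whatever phoneme units the runtime actually uses.

```python
from typing import Dict, List, Sequence

# Hypothetical toy G2P table; a real pipeline would call a G2P library instead.
TOY_G2P: Dict[str, str] = {
    "张": "zhang", "章": "zhang", "三": "san", "山": "shan", "李": "li",
}


def to_phonemes(word: str) -> List[str]:
    """Map each character to a pinyin syllable (unknown characters map to themselves)."""
    return [TOY_G2P.get(ch, ch) for ch in word]


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance over token sequences, keeping one DP row at a time."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def phoneme_similarity(hyp: str, hotword: str) -> float:
    """Return 1 minus the length-normalized edit distance between syllable sequences."""
    p, q = to_phonemes(hyp), to_phonemes(hotword)
    longest = max(len(p), len(q))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(p, q) / longest


def fuzzy_match(hyp: str, hotword: str, threshold: float = 0.5) -> bool:
    """Accept the hotword when similarity reaches the (assumed) threshold."""
    return phoneme_similarity(hyp, hotword) >= threshold
```

With this sketch, a homophone such as `章三` scores 1.0 against the hotword `张三`, so a character-level mismatch can still be recovered at the phoneme level.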
|
42 | | -### 1. Install Python deps |
| 36 | +## Install |
43 | 37 |
|
44 | 38 | ```bash |
45 | | -cd /path/to/wenet-main |
46 | | - |
47 | | -# Create and activate virtual environment |
| 39 | +# Python environment |
48 | 40 | uv venv .venv --python 3.12 |
49 | 41 | source .venv/bin/activate |
| 42 | +uv pip install torch torchaudio pyyaml dacite optuna soundfile pypinyin jieba modelscope |
50 | 43 |
|
51 | | -# Install PyTorch (adjust CUDA version as needed) |
52 | | -uv pip install torch torchaudio \ |
53 | | - --index-url https://download.pytorch.org/whl/cu121 \ |
54 | | - --extra-index-url https://pypi.tuna.tsinghua.edu.cn/simple |
55 | | - |
56 | | -# Install remaining dependencies |
57 | | -uv pip install pyyaml dacite optuna soundfile pypinyin \ |
58 | | - -i https://pypi.tuna.tsinghua.edu.cn/simple |
| 44 | +# C++ runtime (requires cmake >= 3.14) |
| 45 | +cd runtime/libtorch |
| 46 | +cmake -B build -DGRAPH_TOOLS=ON -DTORCH=ON |
| 47 | +cmake --build build -j --target decoder_main |
| 48 | +cd ../.. |
59 | 49 | ``` |
60 | 50 |
|
61 | | -### 2. Download model + test set |
| 51 | +## Quick Start |
| 52 | + |
| 53 | +### 1. Download Model |
62 | 54 |
|
63 | 55 | ```bash |
64 | 56 | modelscope download --model wenet/u2pp_conformer-asr-cn-16k-online \ |
65 | 57 | --local_dir ~/userspace/wenet/models/u2pp_conformer-asr-cn-16k-online |
66 | | -bash tools/prepare_aishell_hotwords.sh ~/userspace/wenet/aishell_test |
67 | 58 | ``` |
68 | 59 |
|
69 | | -> **Other models (optional)** |
70 | | -> |
71 | | -> Verified models and download commands: |
72 | | -> |
73 | | -> | Model | ModelScope ID | |
74 | | -> |------|--------------| |
75 | | -> | `u2pp_conformer-asr-cn-16k-online` (default) | `wenet/u2pp_conformer-asr-cn-16k-online` | |
76 | | -> | `multi_cn` | `wenet/multi_cn` | |
77 | | -> |
78 | | -> After switching models, re-run Step 5 (confusion matrix) and Step 6 (autotune). |
| 60 | +### 2. Download Datasets |
79 | 61 |
|
80 | | -### 3. Build decoder_main |
 | 62 | +Run the preparation script (downloads AISHELL-1 and AISHELL-2, builds hotword lists):
81 | 63 |
|
82 | 64 | ```bash |
83 | | -cd runtime/libtorch |
84 | | -cmake -B build -DGRAPH_TOOLS=ON -DTORCH=ON |
85 | | -cmake --build build -j --target decoder_main |
86 | | -cd ../.. |
| 65 | +bash tools/prepare_benchmark.sh ~/userspace/wenet |
87 | 66 | ``` |
88 | 67 |
|
89 | | -### 4. Smoke test |
| 68 | +### 3. Learn Confusion Matrix (per-model, one-time) |
90 | 69 |
|
91 | 70 | ```bash |
92 | | -head -1 ~/userspace/wenet/aishell_test/wav.scp > /tmp/one.scp |
93 | | -runtime/libtorch/build/bin/decoder_main \ |
94 | | - --model_path ~/userspace/wenet/models/u2pp_conformer-asr-cn-16k-online/final.zip \ |
95 | | - --unit_path ~/userspace/wenet/models/u2pp_conformer-asr-cn-16k-online/units.txt \ |
96 | | - --wav_scp /tmp/one.scp \ |
97 | | - --hotword_path ~/userspace/wenet/aishell_test/hotwords.txt \ |
98 | | - --pinyin_dict_path runtime/libtorch/build/bin/dict \ |
99 | | - --result /dev/stdout |
| 71 | +python3 tools/learn_confusion.py \ |
| 72 | + --model_dir ~/userspace/wenet/models/u2pp_conformer-asr-cn-16k-online \ |
| 73 | + --wav_scp ~/userspace/wenet/aishell_test/wav.scp \ |
| 74 | + --text ~/userspace/wenet/aishell_test/text \ |
| 75 | + --out_csv runtime/libtorch/configs/confusion.csv \ |
| 76 | + --device cpu |
100 | 77 | ``` |
101 | 78 |
|
102 | | -### 5. Prepare confusion matrix |
| 79 | +### 4. Autotune — Four Modes |
103 | 80 |
|
104 | | -The confusion matrix is learned from **this model's** CTC posteriors and is not portable across models. |
| 81 | +| Mode | Config | Hotwords | Objective | When to use | |
| 82 | +|------|--------|----------|-----------|-------------| |
| 83 | +| **Aggressive** | `mode_aggressive.yaml` | 187 original | recall↑ + CER↓ | Hotword-dense domains | |
| 84 | +| **Balanced** | `mode_balanced.yaml` | 187 original | F1↑ + CER↓ | General voice assistant, balanced R/P | |
| 85 | +| **Conservative** | `mode_conservative.yaml` | 349 (+distractors) | F1↑ + CER↓ | Open-domain dialogue, precision matters | |
| 86 | +| **Ultra** | `mode_ultra.yaml` | 349 (+distractors) | F1↑ + CER↓ + Precision↑ | Financial/legal — false positive cost is high | |
105 | 87 |
|
106 | | -For the example model, run on a development set (e.g. WeNetSpeech dev): |
107 | | -```bash |
108 | | -python3 tools/learn_confusion.py \ |
109 | | - --model_dir ~/userspace/wenet/models/u2pp_conformer-asr-cn-16k-online \ |
110 | | - --wav_scp ~/userspace/wenet/wenetspeech_calibration/dev/wav.scp \ |
111 | | - --text ~/userspace/wenet/wenetspeech_calibration/dev/text \ |
112 | | - --out_csv runtime/libtorch/configs/confusion.csv \ |
113 | | - --device cpu |
114 | | -``` |
 | 88 | +> **No free lunch**: Aggressive maximizes recall at the cost of precision (64% on the 301-hotword test). Ultra reaches 93% precision on that test (+29 points), but on the 27 hard hotwords it gives up about 10 recall points (98.55% → 88.41%). Choose based on your domain's tolerance for false positives.
115 | 89 |
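To make the multi-objective trade-off concrete, here is a small sketch of Pareto dominance over three objectives (CER minimized, F1 and precision maximized). It is only an illustration of the concept, not how `tools/autotune.py` is implemented; the function names are hypothetical, and the numbers are the 301-hotword test results reported in the Results section of this README, not tuning trials.

```python
from typing import Dict, List, Tuple

Score = Tuple[float, float, float]  # (CER%, F1%, Precision%)

# 301-hotword test results copied from the Results section.
RESULTS: Dict[str, Score] = {
    "Baseline": (5.14, 87.59, 95.24),
    "Aggressive": (6.00, 75.60, 63.78),
    "Balanced": (5.27, 84.21, 76.47),
    "Conservative": (4.98, 88.27, 83.81),
    "Ultra": (4.82, 94.46, 93.01),
}


def dominates(a: Score, b: Score) -> bool:
    """True if a is at least as good as b on every objective and differs somewhere.

    CER is minimized; F1 and precision are maximized, so they are negated
    to turn every objective into a cost.
    """
    a_cost = (a[0], -a[1], -a[2])
    b_cost = (b[0], -b[1], -b[2])
    return all(x <= y for x, y in zip(a_cost, b_cost)) and a_cost != b_cost


def pareto_front(results: Dict[str, Score]) -> List[str]:
    """Names of all entries that no other entry dominates."""
    return [
        name
        for name, score in results.items()
        if not any(dominates(other, score) for other in results.values())
    ]


print(pareto_front(RESULTS))  # ['Baseline', 'Ultra']
```

On these numbers only the baseline (highest precision) and Ultra (best CER and F1) are non-dominated; the other three modes are each dominated by Ultra on this particular test.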
|
116 | | -### 6. Autotune |
| 90 | +Run one (or all) modes: |
117 | 91 |
|
118 | 92 | ```bash |
| 93 | +# Aggressive |
119 | 94 | python3 tools/autotune.py \ |
120 | | - --config runtime/libtorch/configs/default.yaml \ |
| 95 | + --config runtime/libtorch/configs/mode_aggressive.yaml \ |
121 | 96 | --search-space runtime/libtorch/configs/search_space.yaml |
122 | | -``` |
123 | 97 |
|
124 | | -Autotune writes the best configuration to `runtime/libtorch/configs/default.tuned.yaml`. |
| 98 | +# Balanced |
| 99 | +python3 tools/autotune.py \ |
| 100 | + --config runtime/libtorch/configs/mode_balanced.yaml \ |
| 101 | + --search-space runtime/libtorch/configs/search_space.yaml |
125 | 102 |
|
126 | | -### 7. Evaluate on held-out |
| 103 | +# Conservative |
| 104 | +python3 tools/autotune.py \ |
| 105 | + --config runtime/libtorch/configs/mode_conservative.yaml \ |
| 106 | + --search-space runtime/libtorch/configs/search_space.yaml |
127 | 107 |
|
128 | | -Evaluate the tuned configuration on the **held-out test** |
129 | | -```bash |
130 | | -TUNED_YAML=runtime/libtorch/configs/default.tuned.yaml \ |
131 | | -TESTSET=~/userspace/wenet/aishell1_indep_hotword \ |
132 | | -bash runtime/libtorch/eval_runs/run_ablations.sh |
133 | | -column -ts $'\t' runtime/libtorch/eval_runs/summary.tsv |
| 108 | +# Ultra (3-objective Pareto) |
| 109 | +python3 tools/autotune.py \ |
| 110 | + --config runtime/libtorch/configs/mode_ultra.yaml \ |
| 111 | + --search-space runtime/libtorch/configs/search_space.yaml |
134 | 112 | ``` |
135 | 113 |
|
136 | | -`run_ablations.sh` automatically loads the tuned config for the **F_autotune** condition. |
137 | | - |
138 | | -## ⚙️ Configuration |
139 | | - |
140 | | -Edit `runtime/libtorch/configs/default.yaml` |
141 | | - |
142 | | -```yaml |
143 | | -paths: |
144 | | - model_dir: ~/userspace/wenet/models/u2pp_conformer-asr-cn-16k-online |
145 | | - testset_dir: ~/userspace/wenet/aishell_test |
146 | | - eval_testset_dir: ~/userspace/wenet/aishell1_indep_hotword |
147 | | - pinyin_dict_dir: runtime/libtorch/build/bin/dict |
148 | | - |
149 | | -decode: |
150 | | - chunk_size: -1 |
151 | | - ctc_weight: 0.5 |
152 | | - rescoring_weight: 1.0 |
153 | | - reverse_weight: 0.0 |
154 | | - nbest: 10 |
155 | | - |
156 | | -hotword: |
157 | | - hotword_path: hotwords.txt |
158 | | - fuzzy_threshold: 0.5 |
159 | | - max_append_path: 20 |
160 | | - use_confidence_reward: true |
161 | | - enable_hotword_cache: true |
162 | | - confusion_matrix_path: runtime/libtorch/configs/confusion.csv |
163 | | - bonus_weight: 2.0 |
164 | | - confidence_floor: 0.4 |
165 | | - neighbor_threshold: 0.5 |
166 | | - fuzzy_reject_ratio: 0.8 |
167 | | - confidence_weight_min: 0.2 |
168 | | - bonus_length_scale: 0.5 |
169 | | - |
170 | | -autotune: |
171 | | - n_trials: 100 |
172 | | - sampler: tpe |
173 | | - cer_baseline: 14.20 |
| 114 | +### 5. Copy Hotword Lists |
| 115 | + |
| 116 | +Hotword lists are shipped in `runtime/libtorch/configs/`. Copy them to your test set directory before evaluation: |
| 117 | + |
| 118 | +```bash |
| 119 | +cp runtime/libtorch/configs/hotwords_all.txt \ |
| 120 | + ~/userspace/wenet/aishell2_eval/test1000/ |
| 121 | +cp runtime/libtorch/configs/hotwords_hard.txt \ |
| 122 | + ~/userspace/wenet/aishell2_eval/test1000/ |
174 | 123 | ``` |
175 | 124 |
|
176 | | -Search space: `runtime/libtorch/configs/search_space.yaml`. |
| 125 | +### 6. Evaluate on Held-Out |
177 | 126 |
|
178 | | ---- |
| 127 | +```bash |
| 128 | +# Evaluate on 301-hotword list (mixed easy + hard) |
| 129 | +python3 tools/evaluate_modes.py \ |
| 130 | + --test-dir ~/userspace/wenet/aishell2_eval/test1000 \ |
| 131 | + --hotwords hotwords_all.txt |
| 132 | + |
| 133 | +# Evaluate on 27-hard hotword subset (baseline recall < 90%) |
| 134 | +python3 tools/evaluate_modes.py \ |
| 135 | + --test-dir ~/userspace/wenet/aishell2_eval/test1000 \ |
| 136 | + --hotwords hotwords_hard.txt |
| 137 | +``` |
179 | 138 |
|
180 | | -## 📊 Results |
| 139 | +## Results |
181 | 140 |
|
182 | | -`u2pp_conformer-asr-cn-16k-online` on AISHELL hotword test (235 utts, 187 hotwords). |
| 141 | +**Model**: `wenet/u2pp_conformer-asr-cn-16k-online` |
| 142 | +**Tune**: AISHELL-1 hotword test |
| 143 | +**Test**: AISHELL-2 iOS eval subset |
183 | 144 |
|
184 | | -| Condition | What it is | CER% | recall% | precision% | F1% | |
185 | | -|-----------|-----------|------:|--------:|-----------:|----:| |
186 | | -| A_baseline | Plain CTC + attention rescoring, no hotword | 14.20 | 15.96 | 97.83 | 27.44 | |
187 | | -| B_phoneme | + phoneme corrector (G2P + fuzzy match) | 12.62 | 32.62 | 98.92 | 49.07 | |
188 | | -| D_confidence | + confidence-weighted match bonus | 12.04 | 36.17 | 99.03 | 52.99 | |
189 | | -| E_cache | + LRU hotword cache | 12.04 | 36.17 | 99.03 | 52.99 | |
190 | | -| F_autotune | E_cache + TPE-autotuned knobs (12 params) | 6.32 | 90.07 | 96.21 | 93.04 | |
191 | | -| G_wenet_native | Upstream WeNet character-FST biasing only | 10.97 | 46.45 | 99.24 | 63.29 | |
| 145 | +### 301-Hotword Test (mixed easy + hard) |
192 | 146 |
|
| 147 | +| Mode | CER% | Recall% | Precision% | F1% | |
| 148 | +|------|------:|--------:|-----------:|----:| |
| 149 | +| Baseline (no hotword) | 5.14 | 81.08 | 95.24 | 87.59 | |
| 150 | +| **Aggressive** | 6.00 | 92.79 | 63.78 | 75.60 | |
| 151 | +| **Balanced** | 5.27 | 93.69 | 76.47 | 84.21 | |
| 152 | +| **Conservative** | 4.98 | 93.24 | 83.81 | 88.27 | |
| 153 | +| **Ultra** | **4.82** | **95.95** | **93.01** | **94.46** | |
193 | 154 |
|
194 | | -**Held-out** (`aishell1_indep_hotword`, 115 utts — never seen during tuning): |
| 155 | +### 27-Hard Hotword Test (baseline recall < 90%) |
195 | 156 |
|
196 | | -| Condition | CER% | recall% | precision% | F1% | |
197 | | -|-----------|------:|--------:|-----------:|----:| |
198 | | -| D_confidence | 11.88 | 48.15 | 98.48 | 64.68 | |
199 | | -| F_autotune | 7.33 | 91.11 | 98.40 | 94.62 | |
200 | | -| G_wenet_native | 10.49 | 59.26 | 98.77 | 74.07 | |
| 157 | +| Mode | CER% | Recall% | Precision% | F1% | |
| 158 | +|------|------:|--------:|-----------:|----:| |
| 159 | +| Baseline | 5.14 | 42.03 | 100.00 | 59.18 | |
| 160 | +| **Aggressive** | 5.10 | 98.55 | 64.76 | 78.16 | |
| 161 | +| **Balanced** | 4.92 | 98.55 | 73.91 | 84.47 | |
| 162 | +| **Conservative** | **4.68** | 94.20 | 86.67 | 90.28 | |
| 163 | +| **Ultra** | 4.83 | 88.41 | **92.42** | **90.37** | |
201 | 164 |
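The F1 columns above can be sanity-checked from the precision and recall columns using the usual harmonic mean, F1 = 2PR / (P + R). The helper below is a hypothetical sketch working on the rounded percentages in the tables; the repository's metric script may compute F1 from raw counts instead.

```python
def f1_score(precision_pct: float, recall_pct: float) -> float:
    """Harmonic mean of precision and recall, both given in percent."""
    total = precision_pct + recall_pct
    if total == 0:
        return 0.0
    return 2 * precision_pct * recall_pct / total


# Rows taken from the tables above: (precision%, recall%, reported F1%).
rows = {
    "Baseline (301)": (95.24, 81.08, 87.59),
    "Ultra (301)": (93.01, 95.95, 94.46),
    "Baseline (hard)": (100.00, 42.03, 59.18),
    "Ultra (hard)": (92.42, 88.41, 90.37),
}

for name, (p, r, reported) in rows.items():
    print(f"{name}: computed {f1_score(p, r):.2f}, reported {reported:.2f}")
```

The computed values agree with the reported F1 to two decimals for these rows.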
|
202 | | -Full write-up: [`HOTWORD_EVAL.md`](runtime/libtorch/eval_runs/HOTWORD_EVAL.md) |
| 165 | +### Key Findings |
203 | 166 |
|
204 | | ---- |
 | 167 | +1. **Conservative and Ultra improve CER** over the no-hotword baseline on both tests (5.14% → 4.68–4.98%). Aggressive (6.00%) and Balanced (5.27%) raise CER on the 301-hotword test, so the effect on general ASR depends on the mode.
 | 168 | +2. **On 27 hard-case hotwords** (foreign names the baseline misses), Ultra reaches **88% recall** vs. the baseline's **42%**, and Aggressive and Balanced reach 98.55%. The phoneme corrector closes the gap where character-level matching fails.
| 169 | +3. **Ultra mode is the overall best**: highest F1 (94.46% on 301-hot, 90.37% on hard-case) via 3-objective Pareto optimization — no hard-coded precision floor needed. |
| 170 | +4. **Conservative mode is the practical sweet spot**: lowest CER on hard-case (4.68%) with strong F1 (90.28%), making it suitable for precision-sensitive domains. |
205 | 171 |
|
206 | | -## 📂 Project Structure |
| 172 | +## Project Structure |
207 | 173 |
|
208 | 174 | ```text |
209 | | -wenet-main/ |
210 | | -├── runtime/core/decoder/ |
211 | | -│ ├── corrector.{cc,h} # PhonemeCorrector + fuzzy match + confusion matrix |
212 | | -│ ├── hotword_cache.{cc,h} # LRU hotword cache |
213 | | -│ ├── asr_decoder.{cc,h} # CalculateMatchBonus + n-best correction wiring |
214 | | -│ ├── params.h # gflags (bonus_weight, confidence_floor, etc.) |
215 | | -│ └── context_graph.{cc,h} # upstream WeNet character-FST context graph |
216 | | -├── runtime/core/bin/ |
217 | | -│ └── decoder_main.cc # decoder binary (+ daemon mode for autotune) |
218 | | -├── runtime/libtorch/configs/ |
219 | | -│ ├── default.yaml # base config (includes 12-knob autotune) |
220 | | -│ └── search_space.yaml # Optuna search space |
221 | | -├── runtime/libtorch/eval_runs/ |
222 | | -│ ├── run_ablations.sh # A→G ablation runner |
223 | | -│ └── HOTWORD_EVAL.md # full evaluation report |
224 | | -└── tools/ # autotune, metrics, data prep scripts |
| 175 | +runtime/core/decoder/ |
| 176 | + corrector.{cc,h} # PhonemeCorrector + fuzzy match + confusion matrix |
| 177 | + hotword_cache.{cc,h} # LRU hotword cache |
| 178 | + asr_decoder.{cc,h} # CalculateMatchBonus + n-best correction wiring |
| 179 | + params.h # gflags (bonus_weight, confidence_floor, etc.) |
| 180 | +runtime/core/bin/ |
| 181 | + decoder_main.cc # decoder binary (+ daemon mode for autotune) |
| 182 | +runtime/libtorch/configs/ |
| 183 | + mode_{aggressive,balanced,conservative,ultra}.yaml # four mode configs |
| 184 | + default.yaml # base config |
| 185 | + search_space.yaml # Optuna search space |
| 186 | +tools/ |
| 187 | + autotune.py # multi-objective Pareto tuner |
| 188 | + compute-hotword-metrics.py |
| 189 | + prepare_hotwords.py # extract 500-hot / filter hard-case |
| 190 | + evaluate_modes.py # batch evaluate all 4 tuned configs |
225 | 191 | ``` |
226 | 192 |
|
227 | | ---- |
228 | | - |
229 | | -## 🙏 Acknowledgements |
230 | | - |
231 | | -- **[WeNet](https://github.com/wenet-e2e/wenet)** — base ASR runtime. |
232 | | -- **[cpp-pinyin](https://github.com/wolfgitpr/cpp-pinyin)** — runtime G2P. |
233 | | -- **[CapsWriter-Offline](https://github.com/HaujetZhao/CapsWriter-Offline)** — inspired the corrector design. |
| 193 | +## Acknowledgements |
234 | 194 |
|
235 | | ---- |
| 195 | +* [WeNet](https://github.com/wenet-e2e/wenet) — base ASR runtime |
| 196 | +* [cpp-pinyin](https://github.com/wolfgitpr/cpp-pinyin) — runtime G2P |
| 197 | +* [CapsWriter-Offline](https://github.com/HaujetZhao/CapsWriter-Offline) — inspired the corrector design |
236 | 198 |
|
237 | | -## 📜 License |
| 199 | +## License |
238 | 200 |
|
239 | | -Apache License 2.0, inherited from upstream WeNet. |
| 201 | +Apache License 2.0 |