## Question
Thank you for releasing the excellent ReazonSpeech models.
I would like to fine-tune `reazonspeech-k2-v2` on a domain-specific Japanese speech dataset (medical terminology) using the icefall recipe at `egs/reazonspeech/ASR/zipformer`. However, the Hugging Face repository [reazon-research/reazonspeech-k2-v2](https://huggingface.co/reazon-research/reazonspeech-k2-v2) contains only ONNX files (`encoder/decoder/joiner-epoch-99-avg-1.onnx` and `tokens.txt`). The icefall recipe expects a PyTorch checkpoint (`pretrained.pt` or `epoch-*.pt`) under `zipformer/exp/`, and I could not find one anywhere.
### What I have checked
- Hugging Face `reazon-research/*` repos (only `nemo-v2` ships PyTorch weights; `k2-v2` and its variants are ONNX-only)
- `icefall/egs/reazonspeech/ASR/RESULTS.md` (lists training/decoding commands but no download URL for the released checkpoint)
- `ReazonSpeech/pkg/k2-asr` (uses sherpa-onnx for inference)
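
The repository check above can be reproduced with a short script. This is only a minimal sketch: `classify_repo_files` and `check_repo` are hypothetical helper names, and the extension lists are my own assumption about what would count as a PyTorch checkpoint. The only library call is `huggingface_hub.list_repo_files`, which needs network access.

```python
from typing import Dict, Iterable, List

# Assumed extensions; adjust if a checkpoint is shipped in another format.
ONNX_EXTS = (".onnx",)
TORCH_EXTS = (".pt", ".pth", ".ckpt")


def classify_repo_files(names: Iterable[str]) -> Dict[str, List[str]]:
    """Group file names into ONNX exports, PyTorch checkpoints, and everything else."""
    groups: Dict[str, List[str]] = {"onnx": [], "torch": [], "other": []}
    for name in names:
        lower = name.lower()
        if lower.endswith(ONNX_EXTS):
            groups["onnx"].append(name)
        elif lower.endswith(TORCH_EXTS):
            groups["torch"].append(name)
        else:
            groups["other"].append(name)
    return groups


def check_repo(repo_id: str) -> Dict[str, List[str]]:
    """List a Hugging Face repo (network required) and classify its files."""
    # Imported lazily so the pure helper above works without the dependency.
    from huggingface_hub import list_repo_files

    return classify_repo_files(list_repo_files(repo_id))
```

Calling `check_repo("reazon-research/reazonspeech-k2-v2")` and inspecting the `"torch"` group is one way to confirm whether any PyTorch checkpoint is present in the repository.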
### Questions
1. Is the PyTorch checkpoint (`pretrained.pt` / `epoch-*.pt` / `bpe.model`) used to produce the released ONNX models published anywhere?
2. If it is not currently public, is there a plan to release it so that users can fine-tune via icefall?
3. If no release is planned, do you have a recommended approach for domain adaptation of `reazonspeech-k2-v2` (e.g. contextual biasing in `modified_beam_search`, or LODR / shallow fusion that works in practice with this model)?
### Context
- Use case: real-time Japanese ASR for medical terminology.
- Baseline measured with the released ONNX models (`reazonspeech-k2-asr` 3.0.0, greedy, CPU): CER 18.53% on 50 short medical utterances.
- We have tried sherpa-onnx hotwords and dictionary-based post-correction, but both made CER worse (switching to `modified_beam_search` alone raised CER from 18.53% to 32.79% on this data).
- The natural next step is fine-tuning, which requires the PyTorch checkpoint.
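
For reference, CER here means character error rate in the usual sense: character-level edit distance divided by the number of reference characters. The sketch below pools edits and characters over all utterances and strips whitespace only. Both choices are assumptions for illustration, and the script behind the figures above may normalize or aggregate differently. `edit_distance`, `normalize`, and `corpus_cer` are illustrative names, not part of `reazonspeech-k2-asr` or sherpa-onnx.

```python
from typing import Sequence


def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference character
                            curr[j - 1] + 1,      # insert a hypothesis character
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def normalize(text: str) -> str:
    """Assumed normalization: remove all whitespace, nothing else."""
    return "".join(text.split())


def corpus_cer(refs: Sequence[str], hyps: Sequence[str]) -> float:
    """Total edits divided by total reference characters over all utterances."""
    if len(refs) != len(hyps):
        raise ValueError("refs and hyps must have the same length")
    edits = 0
    chars = 0
    for ref, hyp in zip(refs, hyps):
        ref_n, hyp_n = normalize(ref), normalize(hyp)
        edits += edit_distance(ref_n, hyp_n)
        chars += len(ref_n)
    if chars == 0:
        raise ValueError("reference text is empty")
    return edits / chars
```

For example, a reference `心筋梗塞` recognized as `心筋高速` has two substituted characters out of four, so `corpus_cer(["心筋梗塞"], ["心筋高速"])` gives 0.5, i.e. a CER of 50%.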
Any guidance would be very much appreciated. Thank you!
<details>
<summary>日本語版</summary>
## 質問
ReazonSpeech モデルを公開いただきありがとうございます。
`reazonspeech-k2-v2` を医療ドメインの日本語音声でファインチューニングしたく、icefall の `egs/reazonspeech/ASR/zipformer` recipe での学習を試みていますが、Hugging Face リポジトリ [reazon-research/reazonspeech-k2-v2](https://huggingface.co/reazon-research/reazonspeech-k2-v2) には ONNX ファイル (`encoder/decoder/joiner-epoch-99-avg-1.onnx` と `tokens.txt`) のみが公開されているように見受けられます。icefall recipe が要求する PyTorch チェックポイント (`pretrained.pt` または `epoch-*.pt`) と `bpe.model` が見つからず、finetune を開始できない状況です。
### 確認済みの内容
- Hugging Face `reazon-research/*` の各リポジトリ (`nemo-v2` のみ PyTorch 重み公開、`k2-v2` 系は ONNX のみ)
- `icefall/egs/reazonspeech/ASR/RESULTS.md` (学習・デコードコマンドは記載されているが ckpt の配布 URL はなし)
- `ReazonSpeech/pkg/k2-asr` (sherpa-onnx ベースの推論実装)
### 質問
1. 公開済み ONNX の元となった PyTorch チェックポイント (`pretrained.pt` / `epoch-*.pt` / `bpe.model`) は公開されていますか?
2. 現状未公開の場合、icefall で finetune できるよう公開予定はありますか?
3. 公開予定がない場合、`reazonspeech-k2-v2` のドメイン適応に推奨される手法 (`modified_beam_search` の contextual biasing や LODR / shallow fusion 等で実用的に機能するもの) はありますか?
### 背景
- 用途: 医療用語のリアルタイム日本語 ASR。
- 公開 ONNX (`reazonspeech-k2-asr` 3.0.0, greedy, CPU) でのベースライン: 医療用語短文 50 件で CER 18.53%。
- sherpa-onnx の hotwords / 辞書ベース後補正を試したが、いずれも CER が悪化 (`modified_beam_search` 自体がこのデータでは 32.79% に劣化)。
- 次の打ち手として finetune を計画しているが、PyTorch ckpt が必要。
ご教示いただけますと幸いです。
</details>