
Commit 9ab8ade

Merge branch 'main' into patch-2

2 parents cbce927 + 9b95719

35 files changed: +1830 −590 lines

README.md

Lines changed: 34 additions & 13 deletions

@@ -18,6 +18,11 @@ Omnilingual ASR is an open-source speech recognition system supporting over 1,60
 </div>
 
 
+## December 2025 Update
+We release two suites of models:
+- Checkpoints with improved accuracy (CER) for the CTC and LLM-ASR models, compared to our existing releases (`omniASR_{CTC,LLM}_{300M,1B,3B,7B}_v2`).
+- A new variant of the LLM-ASR model that supports decoding audio of unlimited length (`omniASR_LLM_Unlimited_{300M,1B,3B,7B}_v2`). The unlimited audio length models are briefly described in the [architecture overview section](src/omnilingual_asr/models/README.md). Their accuracy is comparable to that of the limited audio length models; however, finetuning recipes for them are not currently supported.
+
 ## Documentation
 
 ### Quick Start
@@ -54,16 +59,15 @@ uv add omnilingual-asr
 ```python
 from omnilingual_asr.models.inference.pipeline import ASRInferencePipeline
 
-pipeline = ASRInferencePipeline(model_card="omniASR_LLM_7B")
-
+pipeline = ASRInferencePipeline(model_card="omniASR_LLM_Unlimited_7B_v2")
 audio_files = ["/path/to/eng_audio1.flac", "/path/to/deu_audio2.wav"]
 lang = ["eng_Latn", "deu_Latn"]
 transcriptions = pipeline.transcribe(audio_files, lang=lang, batch_size=2)
 ```
 
 More details on running specific models can be found in the [src/omnilingual_asr/models/inference](/src/omnilingual_asr/models/inference/README.md) directory.
 
-> **⚠️ Important:** Currently only audio files shorter than 40 seconds are accepted for inference. We plan to add support for transcribing unlimited-length audio files shortly.
+> **⚠️ Important:** Currently, only audio files shorter than 40 seconds are accepted for inference with the CTC and LLM model suites.
 
 ### Supported Languages
 
@@ -105,7 +109,7 @@ audio_data = [{"waveform": x["array"], "sample_rate": x["sampling_rate"]}
               for x in batch["audio"]]
 
 # Run inference
-pipeline = ASRInferencePipeline(model_card="omniASR_LLM_7B")
+pipeline = ASRInferencePipeline(model_card="omniASR_LLM_7B_v2")
 transcriptions = pipeline.transcribe(audio_data, batch_size=2)
 
 # Display results
@@ -117,7 +121,7 @@ for i, (transcription, original_text) in enumerate(zip(transcriptions, batch["ra
 
 
 ## Model Architectures
-<!-- TODO : add new tokenizer, we'll get two tokenizer, add missing speed numbers-->
+
 | Model Name | Features | Parameters | Download Size (FP32) | Inference VRAM¹ | Real-Time Factor¹ (relative speed)² |
 |---------------------|---------------|------------:|---------------:|---------------:|-----------:|
 | [`omniASR_W2V_300M`](https://dl.fbaipublicfiles.com/mms/omniASR-W2V-300M.pt) | SSL | 317_390_592 | 1.2 GiB | | |
@@ -128,18 +132,32 @@ for i, (transcription, original_text) in enumerate(zip(transcriptions, batch["ra
 | [`omniASR_CTC_1B`](https://dl.fbaipublicfiles.com/mms/omniASR-CTC-1B.pt) | ASR | 975_065_300 | 3.7 GiB | ~3 GiB | 0.002 (48x) |
 | [`omniASR_CTC_3B`](https://dl.fbaipublicfiles.com/mms/omniASR-CTC-3B.pt) | ASR | 3_080_423_636 | 12.0 GiB | ~8 GiB | 0.003 (32x) |
 | [`omniASR_CTC_7B`](https://dl.fbaipublicfiles.com/mms/omniASR-CTC-7B.pt) | ASR | 6_504_786_132 | 25.0 GiB | ~15 GiB | 0.006 (16x) |
+| [`omniASR_CTC_300M_v2`](https://dl.fbaipublicfiles.com/mms/omniASR-CTC-300M-v2.pt) | ASR | 325_494_996 | 1.3 GiB | ~2 GiB | 0.001 (96x) |
+| [`omniASR_CTC_1B_v2`](https://dl.fbaipublicfiles.com/mms/omniASR-CTC-1B-v2.pt) | ASR | 975_065_300 | 3.7 GiB | ~3 GiB | 0.002 (48x) |
+| [`omniASR_CTC_3B_v2`](https://dl.fbaipublicfiles.com/mms/omniASR-CTC-3B-v2.pt) | ASR | 3_080_423_636 | 12.0 GiB | ~8 GiB | 0.003 (32x) |
+| [`omniASR_CTC_7B_v2`](https://dl.fbaipublicfiles.com/mms/omniASR-CTC-7B-v2.pt) | ASR | 6_504_786_132 | 25.0 GiB | ~15 GiB | 0.006 (16x) |
 | [`omniASR_LLM_300M`](https://dl.fbaipublicfiles.com/mms/omniASR-LLM-300M.pt) | ASR with optional language conditioning | 1_627_603_584 | 6.1 GiB | ~5 GiB | 0.090 (~1x) |
 | [`omniASR_LLM_1B`](https://dl.fbaipublicfiles.com/mms/omniASR-LLM-1B.pt) | ASR with optional language conditioning | 2_275_710_592 | 8.5 GiB | ~6 GiB | 0.091 (~1x) |
 | [`omniASR_LLM_3B`](https://dl.fbaipublicfiles.com/mms/omniASR-LLM-3B.pt) | ASR with optional language conditioning | 4_376_679_040 | 17.0 GiB | ~10 GiB | 0.093 (~1x) |
 | [`omniASR_LLM_7B`](https://dl.fbaipublicfiles.com/mms/omniASR-LLM-7B.pt) | ASR with optional language conditioning | 7_801_041_536 | 30.0 GiB | ~17 GiB | 0.092 (~1x) |
+| [`omniASR_LLM_300M_v2`](https://dl.fbaipublicfiles.com/mms/omniASR-LLM-300M-v2.pt) | ASR with optional language conditioning | 1_627_603_584 | 6.1 GiB | ~5 GiB | 0.090 (~1x) |
+| [`omniASR_LLM_1B_v2`](https://dl.fbaipublicfiles.com/mms/omniASR-LLM-1B-v2.pt) | ASR with optional language conditioning | 2_275_710_592 | 8.5 GiB | ~6 GiB | 0.091 (~1x) |
+| [`omniASR_LLM_3B_v2`](https://dl.fbaipublicfiles.com/mms/omniASR-LLM-3B-v2.pt) | ASR with optional language conditioning | 4_376_679_040 | 17.0 GiB | ~10 GiB | 0.093 (~1x) |
+| [`omniASR_LLM_7B_v2`](https://dl.fbaipublicfiles.com/mms/omniASR-LLM-7B-v2.pt) | ASR with optional language conditioning | 7_801_041_536 | 30.0 GiB | ~17 GiB | 0.092 (~1x) |
+| [`omniASR_LLM_Unlimited_300M_v2`](https://dl.fbaipublicfiles.com/mms/omniASR-LLM-Unlimited-300M-v2.pt) | omniASR_LLM_300M + unlimited audio length | 1_627_603_584 | 6.1 GiB | ~5 GiB | 0.092 (~1x) (0.206)³ |
+| [`omniASR_LLM_Unlimited_1B_v2`](https://dl.fbaipublicfiles.com/mms/omniASR-LLM-Unlimited-1B-v2.pt) | omniASR_LLM_1B + unlimited audio length | 2_275_710_592 | 8.5 GiB | ~6 GiB | 0.097 (~1x) (0.207)³ |
+| [`omniASR_LLM_Unlimited_3B_v2`](https://dl.fbaipublicfiles.com/mms/omniASR-LLM-Unlimited-3B-v2.pt) | omniASR_LLM_3B + unlimited audio length | 4_376_679_040 | 17.0 GiB | ~10 GiB | 0.095 (~1x) (0.208)³ |
+| [`omniASR_LLM_Unlimited_7B_v2`](https://dl.fbaipublicfiles.com/mms/omniASR-LLM-Unlimited-7B-v2.pt) | omniASR_LLM_7B + unlimited audio length | 7_801_041_536 | 30.0 GiB | ~17 GiB | 0.097 (~1x) (0.208)³ |
 | [`omniASR_LLM_7B_ZS`](https://dl.fbaipublicfiles.com/mms/omniASR-LLM-7B-ZS.pt) | Zero-Shot ASR | 7_810_900_608 | 30.0 GiB | ~20 GiB | 0.194 (~0.5x) |
-| [`omniASR_tokenizer`](https://dl.fbaipublicfiles.com/mms/omniASR_tokenizer.model) | Tokenizer for most of architectures (except omniASR_LLM_7B) | - | 100 KiB | - |
-| [`omniASR_tokenizer_v7`](https://dl.fbaipublicfiles.com/mms/omniASR_tokenizer_v7.model) | Tokenizer for omniASR_LLM_7B model | - | 100 KiB | - ||
+| [`omniASR_tokenizer_v1`](https://dl.fbaipublicfiles.com/mms/omniASR_tokenizer.model) | Tokenizer for all non-v2 models except omniASR_LLM_7B | - | 100 KiB | - | |
+| [`omniASR_tokenizer_v1_variant7`](https://dl.fbaipublicfiles.com/mms/omniASR_tokenizer_v7.model) | Tokenizer for the omniASR_LLM_7B architecture | - | 100 KiB | - | |
+| [`omniASR_tokenizer_written_v2`](https://dl.fbaipublicfiles.com/mms/omniASR_tokenizer_written_v2.model) | Tokenizer for all v2 architectures | - | 100 KiB | - | |
 
 ¹ (batch=1, audio_len=30s, BF16, A100)
 
 ² Relative speed to `omniASR_LLM_7B`
 
+³ (batch=1, audio_len=15min, BF16, A100)
 
 ### Model Download & Storage
 
@@ -165,12 +183,15 @@ Omnilingual ASR code and models are released under the [Apache 2.0](./LICENSE).
 
 ## Citation
 
-If you use the omnilingual ASR model suite in your research and wish to cite us, please use the following BibTeX entry (arxiv version will be added soon)!
+If you use the Omnilingual ASR model suite in your research and wish to cite us, please use the following BibTeX entry!
 ```bibtex
-@misc{omnilingualasr2025,
-      title={{Omnilingual ASR}: Open-Source Multilingual Speech Recognition for 1600+ Languages},
-      author={{Omnilingual ASR Team} and Keren, Gil and Kozhevnikov, Artyom and Meng, Yen and Ropers, Christophe and Setzler, Matthew and Wang, Skyler and Adebara, Ife and Auli, Michael and Balioglu, Can and Chan, Kevin and Cheng, Chierh and Chuang, Joe and Droof, Caley and Duppenthaler, Mark and Duquenne, Paul-Ambroise and Erben, Alexander and Gao, Cynthia and Mejia Gonzalez, Gabriel and Lyu, Kehan and Miglani, Sagar and Pratap, Vineel and Sadagopan, Kaushik Ram and Saleem, Safiyyah and Turkatenko, Arina and Ventayol-Boada, Albert and Yong, Zheng-Xin and Chung, Yu-An and Maillard, Jean and Moritz, Rashel and Mourachko, Alexandre and Williamson, Mary and Yates, Shireen},
-      year={2025},
-      url={https://ai.meta.com/research/publications/omnilingual-asr-open-source-multilingual-speech-recognition-for-1600-languages/},
+@misc{omnilingualasrteam2025omnilingualasropensourcemultilingual,
+      title={Omnilingual ASR: Open-Source Multilingual Speech Recognition for 1600+ Languages},
+      author={Omnilingual ASR team and Gil Keren and Artyom Kozhevnikov and Yen Meng and Christophe Ropers and Matthew Setzler and Skyler Wang and Ife Adebara and Michael Auli and Can Balioglu and Kevin Chan and Chierh Cheng and Joe Chuang and Caley Droof and Mark Duppenthaler and Paul-Ambroise Duquenne and Alexander Erben and Cynthia Gao and Gabriel Mejia Gonzalez and Kehan Lyu and Sagar Miglani and Vineel Pratap and Kaushik Ram Sadagopan and Safiyyah Saleem and Arina Turkatenko and Albert Ventayol-Boada and Zheng-Xin Yong and Yu-An Chung and Jean Maillard and Rashel Moritz and Alexandre Mourachko and Mary Williamson and Shireen Yates},
+      year={2025},
+      eprint={2511.09690},
+      archivePrefix={arXiv},
+      primaryClass={cs.CL},
+      url={https://arxiv.org/abs/2511.09690},
 }
 ```
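The Download Size (FP32) column in the table above tracks the parameter counts: an FP32 checkpoint stores roughly four bytes per parameter. A quick arithmetic sketch to sanity-check the figures (names and counts copied from the table; real files carry some extra serialization overhead, so estimates land slightly below the listed sizes):

```python
# FP32 stores ~4 bytes per parameter, so download size ≈ params * 4 bytes.
# Counts below are copied from the README table; on-disk checkpoints are a
# bit larger than this estimate because of serialization overhead.
PARAM_COUNTS = {
    "omniASR_W2V_300M": 317_390_592,   # table: 1.2 GiB
    "omniASR_CTC_1B": 975_065_300,     # table: 3.7 GiB
    "omniASR_LLM_7B": 7_801_041_536,   # table: 30.0 GiB
}

def fp32_gib(n_params: int) -> float:
    """Estimated FP32 checkpoint size in GiB (4 bytes per parameter)."""
    return n_params * 4 / 2**30

for name, n in PARAM_COUNTS.items():
    print(f"{name}: ~{fp32_gib(n):.1f} GiB")
```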

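The updated note keeps the 40-second cap for the CTC and LLM suites, while the new `omniASR_LLM_Unlimited_*` cards lift it. A minimal pre-screening sketch, assuming WAV inputs and using a helper name of our own (the pipeline enforces its own limit regardless):

```python
import wave

MAX_LIMITED_SECONDS = 40.0  # current cap for the CTC and LLM suites

def fits_limited_models(path: str, max_seconds: float = MAX_LIMITED_SECONDS) -> bool:
    """Return True if a WAV file is short enough for the length-limited models.

    Hypothetical helper: pre-screens inputs so longer recordings can be
    routed to an omniASR_LLM_Unlimited_* model card instead.
    """
    with wave.open(path, "rb") as wav:
        duration = wav.getnframes() / wav.getframerate()
    return duration < max_seconds
```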
pyproject.toml

Lines changed: 2 additions & 2 deletions

@@ -24,9 +24,9 @@ classifiers=[
   "Development Status :: 4 - Beta",
 ]
 
-requires-python = ">=3.10"
+requires-python = ">=3.10,<3.14"
 dependencies = [
-    "fairseq2[arrow]>=0.5.2,<=0.6",
+    "fairseq2[arrow]>=0.5.2,<=0.6.0",
     "pyarrow>=20.0.0",
     "torch",
     "torchaudio",

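The new `requires-python = ">=3.10,<3.14"` pin is a half-open range. A tiny runtime sketch mirroring that constraint (our own helper, not part of the package):

```python
import sys

def python_supported(version: tuple = sys.version_info[:2]) -> bool:
    """True when (major, minor) lies in the half-open range [3.10, 3.14)."""
    return (3, 10) <= version < (3, 14)

print(python_supported((3, 13)))  # True: 3.13 is the last supported minor
```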
src/omnilingual_asr/__init__.py

Lines changed: 16 additions & 20 deletions

@@ -8,27 +8,23 @@
 
 __version__ = "0.1.0"
 
+from omnilingual_asr.models.wav2vec2_asr.config import (
+    register_omnilingual_asr_wav2vec2_asr_configs,
+)
+from omnilingual_asr.models.wav2vec2_llama import (
+    WAV2VEC2_LLAMA_FAMILY,
+    Wav2Vec2LlamaConfig,
+    Wav2Vec2LlamaModel,
+    apply_fsdp_to_wav2vec2_llama,
+    convert_wav2vec2_llama_state_dict,
+    create_wav2vec2_llama_model,
+    register_wav2vec2_llama_configs,
+)
+from omnilingual_asr.models.wav2vec2_ssl.config import (
+    register_omnilingual_asr_wav2vec2_ssl_configs,
+)
 
-def setup_fairseq2_extension(container) -> None:
-    """Register omnilingual ASR assets and models with fairseq2."""
-    from fairseq2.composition.assets import register_package_assets
-    from fairseq2.composition.models import register_model_family
-    from fairseq2.runtime.dependency import DependencyContainer
-
-    from omnilingual_asr.models.wav2vec2_asr.config import (
-        register_omnilingual_asr_wav2vec2_asr_configs,
-    )
-    from omnilingual_asr.models.wav2vec2_llama import (
-        WAV2VEC2_LLAMA_FAMILY,
-        Wav2Vec2LlamaConfig,
-        Wav2Vec2LlamaModel,
-        convert_wav2vec2_llama_state_dict,
-        create_wav2vec2_llama_model,
-        register_wav2vec2_llama_configs,
-    )
-    from omnilingual_asr.models.wav2vec2_ssl.config import (
-        register_omnilingual_asr_wav2vec2_ssl_configs,
-    )
+__version__ = "0.2.0"
 
     if not isinstance(container, DependencyContainer):
         raise TypeError("container must be a DependencyContainer")

src/omnilingual_asr/cards/README.md

Lines changed: 5 additions & 5 deletions

@@ -5,27 +5,27 @@ fairseq2 manages models, tokenizers, and datasets as [assets](https://facebookre
 For example, a model definition has the following parameters:
 
 ```yaml
-name: omniASR_CTC_300M
+name: omniASR_CTC_300M_v2
 model_family: wav2vec2_asr
 model_arch: 300m
 checkpoint: https://dl.fbaipublicfiles.com/mms/omniASR-CTC-300M.pt
-tokenizer_ref: omniASR_tokenizer
+tokenizer_ref: omniASR_tokenizer_written_v2
 ```
 
 ### Usage Examples
 
 ```python
 from fairseq2.models.hub import load_model
 
-model = load_model("omniASR_CTC_300M")
+model = load_model("omniASR_CTC_300M_v2")
 ```
 
 Or in a training recipe configuration (e.g., [`/workflows/recipes/wav2vec2/asr/configs/ctc-finetune.yaml`](/workflows/recipes/wav2vec2/asr/configs/ctc-finetune.yaml)):
 
 ```yaml
 
 model:
-  name: "omniASR_CTC_300M"
+  name: "omniASR_CTC_300M_v2"
 
 trainer:
   (...)
@@ -42,7 +42,7 @@ optimizer:
 
 * `model_arch`: Specific configuration for the model family (e.g., [`1b`](/src/omnilingual_asr/models/wav2vec2_llama/config.py) for `wav2vec2_llama`)
 
-* `checkpoint`: Model storage URI, can be a local path (`"$HOME/.cache/"`), a direct download link (`"https://dl.fbaipublicfiles.com/mms/omniASR_LLM_300M.pt"`) or a reference to a huggingface repository (`"hg://qwen/qwen2.5-7b"`) if the model is in a `.safetensors` format.
+* `checkpoint`: Model storage URI; can be a local path (`"$HOME/.cache/"`), a direct download link (`"https://dl.fbaipublicfiles.com/mms/omniASR_LLM_300M_v2.pt"`), or a reference to a huggingface repository (`"hg://qwen/qwen2.5-7b"`) if the model is in `.safetensors` format.
 
 * `tokenizer_ref`: Links to tokenizer asset for training.
 

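The card shown in that README is a flat mapping of a handful of fields. A toy illustration of its structure (real cards are YAML loaded by fairseq2; this parser is ours and only handles the simple `key: value` lines above):

```python
# Toy parser for the flat "key: value" card shown in the cards README.
# Real asset cards are YAML handled by fairseq2; this only illustrates
# which fields a model card carries.
CARD = """\
name: omniASR_CTC_300M_v2
model_family: wav2vec2_asr
model_arch: 300m
checkpoint: https://dl.fbaipublicfiles.com/mms/omniASR-CTC-300M.pt
tokenizer_ref: omniASR_tokenizer_written_v2
"""

def parse_card(text: str) -> dict:
    """Split each non-empty line on the first ':' only, so URLs survive."""
    fields = {}
    for line in text.splitlines():
        if line.strip():
            key, value = line.split(":", 1)
            fields[key.strip()] = value.strip()
    return fields
```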
src/omnilingual_asr/cards/datasets/example_dataset.yaml

Lines changed: 1 addition & 1 deletion

@@ -8,4 +8,4 @@ name: example_dataset
 dataset_family: mixture_parquet_asr_dataset
 dataset_config:
   data: /path/to/your/dataset/version=0
-tokenizer_ref: omniASR_tokenizer
+tokenizer_ref: omniASR_tokenizer_v1

src/omnilingual_asr/cards/models/rc_models.yaml renamed to src/omnilingual_asr/cards/models/rc_models_v1.yaml

Lines changed: 12 additions & 12 deletions

@@ -4,13 +4,13 @@
 # This source code is licensed under the BSD-style license found in the
 # LICENSE file in the root directory of this source tree.
 
-name: omniASR_tokenizer
+name: omniASR_tokenizer_v1
 tokenizer_family: char_tokenizer
 tokenizer: https://dl.fbaipublicfiles.com/mms/omniASR_tokenizer.model
 
 ---
 
-name: omniASR_tokenizer_v7
+name: omniASR_tokenizer_v1_variant7
 tokenizer_family: char_tokenizer
 tokenizer: https://dl.fbaipublicfiles.com/mms/omniASR_tokenizer_v7.model
 
@@ -48,68 +48,68 @@ name: omniASR_CTC_300M
 model_family: wav2vec2_asr
 model_arch: 300m
 checkpoint: https://dl.fbaipublicfiles.com/mms/omniASR-CTC-300M.pt
-tokenizer_ref: omniASR_tokenizer
+tokenizer_ref: omniASR_tokenizer_v1
 
 ---
 
 name: omniASR_CTC_1B
 model_family: wav2vec2_asr
 model_arch: 1b
 checkpoint: https://dl.fbaipublicfiles.com/mms/omniASR-CTC-1B.pt
-tokenizer_ref: omniASR_tokenizer
+tokenizer_ref: omniASR_tokenizer_v1
 
 ---
 
 name: omniASR_CTC_3B
 model_family: wav2vec2_asr
 model_arch: 3b
 checkpoint: https://dl.fbaipublicfiles.com/mms/omniASR-CTC-3B.pt
-tokenizer_ref: omniASR_tokenizer
+tokenizer_ref: omniASR_tokenizer_v1
 
 ---
 
 name: omniASR_CTC_7B
 model_family: wav2vec2_asr
 model_arch: 7b
 checkpoint: https://dl.fbaipublicfiles.com/mms/omniASR-CTC-7B.pt
-tokenizer_ref: omniASR_tokenizer
+tokenizer_ref: omniASR_tokenizer_v1
 
 ---
 
 name: omniASR_LLM_300M
 model_family: wav2vec2_llama
 model_arch: 300m
 checkpoint: https://dl.fbaipublicfiles.com/mms/omniASR-LLM-300M.pt
-tokenizer_ref: omniASR_tokenizer
+tokenizer_ref: omniASR_tokenizer_v1
 
 ---
 
 name: omniASR_LLM_1B
 model_family: wav2vec2_llama
 model_arch: 1b
 checkpoint: https://dl.fbaipublicfiles.com/mms/omniASR-LLM-1B.pt
-tokenizer_ref: omniASR_tokenizer
+tokenizer_ref: omniASR_tokenizer_v1
 
 ---
 
 name: omniASR_LLM_3B
 model_family: wav2vec2_llama
 model_arch: 3b
 checkpoint: https://dl.fbaipublicfiles.com/mms/omniASR-LLM-3B.pt
-tokenizer_ref: omniASR_tokenizer
+tokenizer_ref: omniASR_tokenizer_v1
 
 ---
 
 name: omniASR_LLM_7B
 model_family: wav2vec2_llama
-model_arch: 7b
+model_arch: 7b_v1_variant7
 checkpoint: https://dl.fbaipublicfiles.com/mms/omniASR-LLM-7B.pt
-tokenizer_ref: omniASR_tokenizer_v7
+tokenizer_ref: omniASR_tokenizer_v1_variant7
 
 ---
 
 name: omniASR_LLM_7B_ZS
 model_family: wav2vec2_llama
 model_arch: 7b_zs
 checkpoint: https://dl.fbaipublicfiles.com/mms/omniASR-LLM-7B-ZS.pt
-tokenizer_ref: omniASR_tokenizer
+tokenizer_ref: omniASR_tokenizer_v1

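Besides renaming the file itself, the commit renames the tokenizer assets it defines. For code that still holds the pre-rename names, a small alias map (our own convenience, capturing only the mapping visible in this diff) keeps lookups working:

```python
# Old asset name -> new asset name, as renamed in this commit.
TOKENIZER_RENAMES = {
    "omniASR_tokenizer": "omniASR_tokenizer_v1",
    "omniASR_tokenizer_v7": "omniASR_tokenizer_v1_variant7",
}

def resolve_tokenizer(name: str) -> str:
    """Map a pre-rename tokenizer name to its current one; pass others through."""
    return TOKENIZER_RENAMES.get(name, name)
```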