Multilingual DistilWhisper: Efficient Distillation of Multi-task Speech Models via Language-Specific Experts

Multilingual Distilwhisper allows for better ASR performance in target languages by adding lightweight CLSR modules on top of whisper-small.

The pre-trained weights for the paper experiments are available here.

Requirements

For training DistilWhisper, please install the requirements listed in full_requirements.yaml

For training the LoRA baseline, please install the requirements listed in lora_requirements.yaml

How to run training

Scripts are available at DistilWhisper/train/

LoRA baseline: DistilWhisper/train/train_lora.sh
Model using only CLSR trained on ASR loss (no distillation): DistilWhisper/train/train_clsr.sh
Distillation model with JS loss (DistilWhisper): DistilWhisper/train/train_js_distill.sh
Distillation model with KL loss (From the appendix available on arxiv): DistilWhisper/train/train_kl_distill.sh

How to run evaluation

Scripts are available at DistilWhisper/train/

LoRA baseline: DistilWhisper/train/eval_lora.sh
Model using only CLSR trained on ASR loss (no distillation): DistilWhisper/train/eval_clsr.sh
Distillation model with JS loss (DistilWhisper): DistilWhisper/train/eval_js_distill.sh
Distillation model with KL loss (From the appendix): DistilWhisper/train/eval_kl_distill.sh

Interative inference example

Check example at model/inference_example.py

from transformers import WhisperProcessor
from DistilWhisper import DistilWhisperForConditionalGeneration
from datasets import Audio, load_dataset

# 1. load model and processor
processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = DistilWhisperForConditionalGeneration.from_pretrained("naver/multilingual-distilwhisper-28k")

# 2. load language experts 
language="calatan"
model.load_experts(language)
model.set_language_task(language=language, task="transcribe")
model.config.forced_decoder_ids = processor.get_decoder_prompt_ids(language=language, task="transcribe")

# 3. load streaming dataset and read first audio sample
ds = load_dataset("google/fleurs", "ca_es", split="test")
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))
input_speech = next(iter(ds))["audio"]
input_features = processor(input_speech["array"], sampling_rate=input_speech["sampling_rate"], return_tensors="pt").input_features

# 4. generate token ids and decode it into text
predicted_ids = model.generate(input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
print(transcription)
# ["Els esports més coneguts són el futbol, el bàsquet, el voleibol, el waterpolo, l'esgrima, el rucbi, el ciclisme, hoquei sobre gel, hoquei sobre patins i l'automobilisme de Fórmula U."]

Citation

@inproceedings{ferraz2024distilwhisper,
  title={Multilingual DistilWhisper: Efficient Distillation of Multi-task Speech Models via Language-Specific Experts},
  author={Ferraz, Thomas Palmeira and Boito, Marcely Zanon and Brun, Caroline and Nikoulina, Vassilina},
  booktitle={ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2024},
  organization={IEEE}
}

Funding

This is an output of the European Project UTTER (Unified Transcription and Translation for Extended Reality) funded by European Union’s Horizon Europe Research and Innovation programme under grant agreement number 101070631.

For more information please visit https://he-utter.eu/

NAVER LABS Europe: https://europe.naverlabs.com/

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
assets		assets
data		data
model		model
train		train
DistilWhisper License.txt		DistilWhisper License.txt
README.md		README.md
full_requirements.yml		full_requirements.yml
lora_requirements.txt		lora_requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Multilingual DistilWhisper: Efficient Distillation of Multi-task Speech Models via Language-Specific Experts

Requirements

How to run training

How to run evaluation

Interative inference example

Citation

Funding

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Multilingual DistilWhisper: Efficient Distillation of Multi-task Speech Models via Language-Specific Experts

Requirements

How to run training

How to run evaluation

Interative inference example

Citation

Funding

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages