JHU-CLSP/Speech_NSF_NextGen

🔹 Introduction

Icefall K2 Lhotse

The Kaldi open-source toolkit, developed in 2009 for automatic speech recognition (ASR), has played a foundational role in speech technology research. Its evolution continued in 2021 with the emergence of a follow-up trilogy (Icefall, k2, and Lhotse), demonstrating the rapid pace of innovation in speech algorithms and system design.

In response to transformative changes in the field, such as the rise of large language models (LLMs), platforms like Hugging Face, and robust deep learning frameworks like PyTorch, this project aims to reimagine and modernize Kaldi's capabilities to meet the emerging needs of the speech research community.

The primary objective of k2 is to re-implement all core functions of Kaldi natively in generic AI/deep learning frameworks, with a focus on PyTorch. This allows the seamless integration of cutting-edge developments in deep learning (e.g., novel optimization algorithms) into speech recognition research. The primary goals of Lhotse and Icefall include delivering efficient, user-friendly tools for data preparation, recipe development, and training modern ASR models.

GitHub statistics for Lhotse, Icefall, and k2. Rows marked * cover the last month (as of September 30, 2025).

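| GitHub statistic | Lhotse | Icefall | k2 |
| --- | --- | --- | --- |
| Watch | 42 | 50 | 72 |
| Fork | 251 | 375 | 230 |
| Star | 1.1k | 1.2k | 1.3k |
| Dependent repositories | 316 | 0 | 57 |
| Merged PR* | 12 | 4 | 0 |
| Open PR* | 9 | 1 | 0 |
| Closed issue* | 2 | 6 | 0 |
| New issue* | 2 | 5 | 0 |
| Commits to master* | 12 | 4 | 0 |
| Additions* | 1,658 | 53,242 | 0 |
| Deletions* | 70 | 8 | 0 |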

Acknowledgments:

This project was supported by U.S. National Science Foundation Award Number 2120435, NSF-CCRI project: CCRI: ENS: Next Generation Tools for Spoken Language Science & Technology.

🔹 Projects

Lhotse

Lhotse develops a modern approach to speech data preparation. Its design is inspired by data libraries commonly used in the ML community, such as pandas. Lhotse's philosophy may be summarized as "simple things should be simple, complex things should be possible."

🎨 JHU Contributors

  • #1 Piotr Żelasko, 1,231 commits 🟩 112,538 ++ 🔴 43,832 --
  • #2 Desh Raj, 248 commits 🟩 29,279 ++ 🔴 12,783 --
  • #4 Jan (Yenda) Trmal, 33 commits 🟩 2,093 ++ 🔴 651 --
  • #6 Amir Hussein, 28 commits 🟩 2,747 ++ 🔴 1,766 --
  • #13 Matthew Wiesner, 13 commits 🟩 3,425 ++ 🔴 627 --
  • #23 Yiming Wang, 7 commits 🟩 215 ++ 🔴 37 --
  • #38 Dominik Klement, 2 commits 🟩 1,602 ++ 🔴 0 --
  • #56 Matthew Maciejewski, 1 commit 🟩 1,217 ++ 🔴 0 --
  • #76 Dongji Gao, 1 commit 🟩 5 ++ 🔴 3 --
  • #83 Henry Li Xinyuan, 1 commit 🟩 146 ++ 🔴 0 --

🎨 GPU-accelerated Guided Source Separation (by Desh Raj)

This is an improved implementation of guided source separation (GSS) that takes advantage of modern GPU-based pipelines, for example by batching the processing of frequencies and segments. This allows us to perform detailed ablation studies over several parameters of the GSS algorithm. Reproducible pipelines are provided for speaker-attributed transcription of popular meeting benchmarks: LibriCSS, AMI, and AliMeeting.

🎨 Lhotse recipes


Icefall

Icefall is the project where k2 and Lhotse "meet". It provides the speech and language research community with a comprehensive collection of recipes for training modern speech processing systems on most of the popular speech datasets.

🎨 JHU Contributors

  • #9 Desh Raj, 19 commits 🟩 39,142 ++ 🔴 22,339 --
  • #10 Piotr Żelasko, 18 commits 🟩 993 ++ 🔴 838 --
  • #15 Ruizhe Huang, 7 commits 🟩 95 ++ 🔴 74 --
  • #29 Amir Hussein, 3 commits 🟩 59,348 ++ 🔴 2 --
  • #31 Dongji Gao, 2 commits 🟩 9,565 ++ 🔴 9 --
  • #44 Henry Li Xinyuan, 1 commit 🟩 2,124 ++ 🔴 3 --

🎨 External Contributors

  • #2 Dan Povey, 200 commits 🟩 13,323 ++ 🔴 4,485 --

🎨 HENT-SRT (by Amir Hussein, Paola Garcia, Matthew Wiesner)

We introduced HENT-SRT (Hierarchical Efficient Neural Transducer for Speech Recognition and Translation), a novel hierarchical transducer architecture for joint speech recognition and speech translation (ST). HENT-SRT (Icefall recipe) significantly outperforms previous state-of-the-art transducer-based ST models and closes the gap with attention-based encoder-decoder architectures, while achieving superior ASR performance. Our approach offers substantial gains in streaming scenarios without introducing additional delays.

🎨 Continuous Streaming Multi-Talker ASR (by Desh Raj)

We investigated the Streaming Unmixing and Recognition Transducer (SURT) for continuous streaming multi-talker ASR, and demonstrated the effectiveness of dual-path LSTMs and Transformers for generalization to diverse session lengths (recipes for the LibriCSS, AMI, and ICSI datasets).

🎨 SPGISpeech (by Desh Raj)

We developed an Icefall recipe and trained models (zipformer and stateless transducer models on Hugging Face) for SPGISpeech, a dataset consisting of 5,000 hours of recorded company earnings calls and their respective transcriptions.

🎨 MGB-2 (by Amir Hussein)

We developed an Icefall recipe and trained a model (conformer-ctc model on Hugging Face) for the Multi-Dialect Broadcast News Arabic Speech Recognition (MGB-2) challenge, which targets recognition of Arabic multi-dialect broadcast media.

🎨 Contextual ASR (by Ruizhe Huang, Mahsa Yarmohammadi)

Developed recipes for Contextual ASR. This is the process by which an ASR system is provided with contextual information derived from metadata associated with the audio, typically in the form of a list of words or phrases likely to be spoken, with the goal of improving the recognition accuracy of named entities and other infrequent terms. Our work on Contextual ASR is recognized for introducing the ConEC dataset (ConEC), followed by a method for improving neural biasing beyond shallow language model fusion (Pull request - Neural Biasing).
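
As a rough illustration of the baseline idea that the biasing list feeds into, the sketch below re-ranks an n-best list by adding a fixed bonus for every bias phrase a hypothesis contains. This is a simplified shallow-fusion-style rescoring, not the neural biasing method from the pull request; the function names, the `bonus` value, and the example hypotheses are all hypothetical.

```python
def count_phrase_hits(words, phrase):
    """Count (possibly overlapping) occurrences of `phrase` in `words`.

    Both arguments are lists of lower-cased word strings.
    """
    n = len(phrase)
    if n == 0:
        return 0
    return sum(1 for i in range(len(words) - n + 1) if words[i:i + n] == phrase)


def rescore_with_bias(nbest, bias_phrases, bonus=2.0):
    """Re-rank (text, log_score) hypotheses using a list of bias phrases.

    Each occurrence of a bias phrase adds `bonus` to the hypothesis score.
    The bonus is a hypothetical tuning knob, not a value from the recipes.
    """
    rescored = []
    for text, score in nbest:
        words = text.lower().split()
        hits = sum(count_phrase_hits(words, p.lower().split()) for p in bias_phrases)
        rescored.append((text, score + bonus * hits))
    # Highest score first.
    return sorted(rescored, key=lambda hyp: hyp[1], reverse=True)


# Illustrative n-best list: the acoustically preferred hypothesis misses the entity.
nbest = [
    ("revenue at in video rose", -4.2),
    ("revenue at nvidia rose", -5.0),
]
best_text, best_score = rescore_with_bias(nbest, ["NVIDIA"])[0]
```

With the bias list supplied, the hypothesis containing the rare entity overtakes the one the recognizer originally preferred; with an empty list the original ranking is unchanged.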

🎨 Omni-temporal Classification (OTC) (by Dongji Gao, Paola Garcia, Matthew Wiesner)

Training ASR systems requires large amounts of well-curated paired data. However, human annotators usually perform "non-verbatim" transcription, which can result in poorly trained models. We designed and implemented Omni-temporal Classification (OTC), a novel training criterion that explicitly incorporates label uncertainties originating from such weak supervision. This allows the model to effectively learn speech-text alignments while accommodating errors present in the training transcripts (OTC w/ BPE units, OTC w/ phone units).

🎨 ASR + LID (by Amir Hussein, Paola Garcia, Matthew Wiesner)

Created a multitask learning framework that synchronizes Language Identification (LID) with ASR, utilizing a neural transducer architecture. We demonstrate the efficacy of our proposed approach on conversational multilingual (Arabic, Spanish, Mandarin) and code-switched (Spanish-English, Mandarin-English) test sets (Pull request - ASR SEAME Recipe).

🎨 Other recipes


K2

K2 brings data structures and algorithms from the field of finite state automata (FSA) into the world of deep learning. It provides efficient CPU and GPU implementations of commonly used FSA operations and integrates them seamlessly with PyTorch's tensor and automatic differentiation mechanisms, thus admitting, and benefiting from, the inherent complexity of speech recognition instead of trying to remove it.
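
To make "FSA operations" concrete, here is a minimal pure-Python sketch of one such operation: the total score of an acyclic FSA, summed over all paths in the log semiring. It does not use the k2 API. The arc tuple layout, the function names, and the assumption that states are topologically numbered are all choices made for this illustration. k2 performs this kind of computation in batched form on CPU or GPU with gradients flowing back to the arc scores.

```python
import math


def logadd(a, b):
    """Numerically stable log(exp(a) + exp(b))."""
    if a == -math.inf:
        return b
    if b == -math.inf:
        return a
    hi, lo = (a, b) if a > b else (b, a)
    return hi + math.log1p(math.exp(lo - hi))


def total_log_score(arcs, num_states):
    """Log-sum of the scores of all paths from state 0 to the last state.

    `arcs` is a list of (src, dst, label, log_score) tuples. States are
    assumed to be topologically numbered (src < dst on every arc), so
    visiting arcs in order of source state finalizes each state's forward
    score before it is used.
    """
    forward = [-math.inf] * num_states
    forward[0] = 0.0
    for src, dst, _label, score in sorted(arcs, key=lambda arc: arc[0]):
        forward[dst] = logadd(forward[dst], forward[src] + score)
    return forward[num_states - 1]


# Two parallel arcs (probability 0.5 each) followed by one arc (probability 0.25).
arcs = [
    (0, 1, 1, math.log(0.5)),
    (0, 1, 2, math.log(0.5)),
    (1, 2, 3, math.log(0.25)),
]
total = total_log_score(arcs, num_states=3)
```

In this toy example the two parallel arcs sum to probability 1.0, so `math.exp(total)` is 0.25 up to floating-point error.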

🎨 JHU Contributors

  • #12 Piotr Żelasko, 4 commits 🟩 9,458 ++ 🔴 276 --
  • #13 Jan "yenda" Trmal, 4 commits 🟩 9,314 ++ 🔴 267 --
  • #16 Mahsa Yarmohammadi, 3 commits 🟩 173 ++ 🔴 51 --
  • #19 Yiming Wang, 3 commits 🟩 234 ++ 🔴 67 --
  • #22 Desh Raj, 2 commits 🟩 435 ++ 🔴 29 --
  • #35 Dongji Gao, 1 commit 🟩 27 ++ 🔴 10 --

🎨 External Contributors

  • #2 Dan Povey, 214 commits 🟩 73,771 ++ 🔴 30,586 --

🎨 k2 codes

  • Test whether FSA is acyclic.
  • Fast parallel computation of longest common prefixes for efficient pattern matching (kmp-LCP).
  • Implementation of the Hybrid Autoregressive Transducer loss (HAT).
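
The first item in the list above can be sketched in a few lines. The following is a plain-Python depth-first search, not the k2 implementation; representing arcs as `(src, dst)` pairs and the function name are assumptions made for this example.

```python
def is_acyclic(arcs, num_states):
    """Return True if the graph given by (src, dst) arcs contains no cycle."""
    successors = [[] for _ in range(num_states)]
    for src, dst in arcs:
        successors[src].append(dst)

    # WHITE = unvisited, GREY = on the current DFS path, BLACK = fully explored.
    WHITE, GREY, BLACK = 0, 1, 2
    colour = [WHITE] * num_states

    for root in range(num_states):
        if colour[root] != WHITE:
            continue
        colour[root] = GREY
        stack = [(root, iter(successors[root]))]
        while stack:
            state, remaining = stack[-1]
            nxt = next(remaining, None)
            if nxt is None:
                # All successors explored: this state cannot be part of a cycle.
                colour[state] = BLACK
                stack.pop()
            elif colour[nxt] == GREY:
                # Back edge to a state on the current path: cycle found.
                return False
            elif colour[nxt] == WHITE:
                colour[nxt] = GREY
                stack.append((nxt, iter(successors[nxt])))
    return True


print(is_acyclic([(0, 1), (1, 2), (0, 2)], num_states=3))
print(is_acyclic([(0, 1), (1, 2), (2, 1)], num_states=3))
```

The search is iterative rather than recursive so that long chains of states do not hit Python's recursion limit. A self-loop counts as a cycle.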

🔹 Other projects

Other

🎨

The CHiME 8 submission relied on Lhotse for data preparation, audio loading, data manipulation, and constructing the PyTorch dataloaders. Once in this format, it was possible to interface with standard Whisper training recipes in the Whisper GitHub repo or on Hugging Face.

🎨 Target Speaker ASR with Whisper (by BUT in collaboration with Dominik Klement, Matthew Wiesner)

The repository contains the official implementation of the following publications: Target Speaker Whisper and DiCoW: Diarization-Conditioned Whisper for Target Speaker Automatic Speech Recognition. It relied on Lhotse to homogenize data preparation across datasets such as AMI and LibriSpeech. This project is a collaboration between Alex Polok and Lukáš Burget from BUT and Dominik Klement and Matthew Wiesner.

🎨 Long-Form Fuzzy Speech-to-Text Alignment for 1000+ Languages (by Ruizhe Huang)

Ruizhe has introduced a long-form fuzzy speech-to-text aligner built on Torchaudio that can align hours of audio with imperfect transcriptions (e.g., lectures, earnings calls, audiobooks). As part of the implementation, the aligner relies on WFSTs via k2 to allow skipping or reordering parts of the text during alignment. The library is provided open-source and can be used out-of-the-box with users' models for efficient alignment and segmentation of raw audio recordings and transcripts. A web demo is also available on Google Colab, demonstrating alignment of earnings calls and audiobooks in different languages with minimal pre-processing.
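
The "skipping" behaviour can be illustrated with a much simpler stand-in. The sketch below is not the WFST-based aligner: it assumes a first-pass recognizer has already produced timestamped words, and it uses the matching-block search from Python's standard `difflib` module to attach timestamps only to transcript words that also appear, in order, in the recognized sequence. It does not handle reordering, and all names and data are hypothetical.

```python
from difflib import SequenceMatcher


def align_transcript(transcript_words, recognized):
    """Attach start times to transcript words that match recognized words.

    `recognized` is a list of (word, start_seconds) pairs from a first-pass
    recognizer (an assumption of this sketch). Returns a list of
    (transcript_index, start_seconds) pairs; transcript words with no
    in-order match are skipped.
    """
    hyp_words = [word for word, _start in recognized]
    matcher = SequenceMatcher(a=transcript_words, b=hyp_words, autojunk=False)
    aligned = []
    for block in matcher.get_matching_blocks():
        # Each block is a run of `size` identical words starting at
        # transcript index `a` and recognized index `b`.
        for offset in range(block.size):
            aligned.append((block.a + offset, recognized[block.b + offset][1]))
    return aligned


transcript = "good morning and welcome to the call".split()
recognized = [
    ("uh", 0.0), ("good", 0.4), ("morning", 0.8),
    ("welcome", 1.5), ("to", 1.9), ("the", 2.0), ("call", 2.2),
]
for index, start in align_transcript(transcript, recognized):
    print(f"{transcript[index]:<8} {start:.1f}s")
```

In this example the transcript word "and" and the recognized filler "uh" have no counterpart, so both are left unaligned while the surrounding words still receive timestamps.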

🎨 Hugging Face

Researchers and developers increasingly rely on the open-source platform Hugging Face for pre-trained models, datasets, and tools to efficiently build and deploy AI applications. k2-fsa is available on Hugging Face. As of now, it has published one dataset (LibriSpeech) and 18 models. Additionally, 30 HF Spaces have been released, offering inference APIs and demos for tasks such as speech recognition, text-to-speech, audio tagging, and spoken language identification using Next-gen Kaldi.

About

NSF CCRI ENS Project: Next Generation Tools for Spoken Language Science & Technology
