🔹 Introduction

The Kaldi open-source toolkit, developed in 2009 for automatic speech recognition (ASR), has played a foundational role in speech technology research. Its evolution continued in 2021 with the emergence of a follow-up trilogy—Icefall, k2, and Lhotse—demonstrating the rapid pace of innovation in speech algorithms and system design.

In response to the transformative changes in the field—such as the rise of large language models (LLMs), platforms like Hugging Face, and robust deep learning frameworks like PyTorch—this project aims to reimagine and modernize Kaldi’s capabilities to meet the emerging needs of the speech research community.

The primary objective of k2 is to re-implement all core functions of Kaldi natively in generic AI/deep learning frameworks, with a focus on PyTorch. This allows the seamless integration of cutting-edge developments in deep learning (e.g., novel optimization algorithms) into speech recognition research. The primary goals of Lhotse and Icefall include delivering efficient, user-friendly tools for data preparation, recipe development, and training modern ASR models.

GitHub statistics for Lhotse, Icefall, and k2. ^*Within the last month (as of September 30, 2025).

GitHub statistic	Lhotse	Icefall	k2
Watch	42	50	72
Fork	251	375	230
Star	1.1k	1.2k	1.3k
Dependent repositories	316	0	57
Merged PR^*	12	4	0
Open PR^*	9	1	0
Closed issue^*	2	6	0
New issue^*	2	5	0
Commits to master^*	12	4	0
Additions^*	1,658	53,242	0
Deletions^*	70	8	0

Acknowledgments:

This project was supported by U.S. National Science Foundation Award Number: 2120435, NSF-CCRI project: CCRI: ENS: Next Generation Tools for Spoken Language Science & Technology.

🔹 Projects

Lhotse develops a modern approach to speech data preparation. Its design is inspired by data libraries commonly used in the ML community, such as pandas. Lhotse's philosophy may be summarized as "simple things should be simple, complex things should be possible."

🎨 JHU Contributors

#1 Piotr Żelasko, 1,231 commits 🟩 112,538 ++ 🔴 43,832 --
#2 Desh Raj, 248 commits 🟩 29,279 ++ 🔴 12,783 --
#4 Jan (Yenda) Trmal, 33 commits 🟩 2,093 ++ 🔴 651 --
#6 Amir Hussein, 28 commits 🟩 2,747 ++ 🔴 1,766 --
#13 Matthew Wiesner, 13 commits 🟩 3,425 ++ 🔴 627 --
#23 Yiming Wang, 7 commits 🟩 215 ++ 🔴 37 --
#38 Dominik Klement, 2 commits 🟩 1,602 ++ 🔴 0 --
#56 Matthew Maciejewski, 1 commit 🟩 1,217 ++ 🔴 0 --
#76 Dongji Gao, 1 commit 🟩 5 ++ 🔴 3 --
#83 Henry Li Xinyuan, 1 commit 🟩 146 ++ 🔴 0 --

🎨 GPU-accelerated Guided Source Separation (by Desh Raj)

Improved implementation of GSS that leverages the power of modern GPU-based pipelines, such as batched processing of frequencies and segments. This allows us to perform detailed ablation studies over several parameters of the GSS algorithm. There are reproducible pipelines for speaker-attributed transcription of popular meeting benchmarks: LibriCSS, AMI, and AliMeeting.

🎨 Lhotse recipes

recipes/aishell3.py speech/asr
recipes/atcosim.py speech/asr
recipes/but_reverb_db.py reverberation database
recipes/chime6.py speech/asr
recipes/csj.py speech/asr
recipes/cmu_kids.py speech/asr
recipes/dipco.py speech/asr
recipes/edacc.py speech/asr
recipes/gigast.py speech-translation
recipes/himia.py speaker-verification
recipes/librilight.py speech/asr
recipes/must_c.py speech-translation
recipes/speechcommands.py speech/hotword-detection
recipes/uwb_atcc.py speech/asr
recipes/xbmu_amdo31.py speech/asr
Fleurs speech/language-id
radio stations speech/asr speech/language-id database
SBCASE speech/diarization database

Icefall is the project where k2 and Lhotse "meet". It provides the speech and language research community with a comprehensive collection of recipes for training modern speech processing systems on most of the popular speech datasets.

🎨 JHU Contributors

#9 Desh Raj, 19 commits 🟩 39,142 ++ 🔴 22,339 --
#10 Piotr Żelasko, 18 commits 🟩 993 ++ 🔴 838 --
#15 Ruizhe Huang, 7 commits 🟩 95 ++ 🔴 74 --
#29 Amir Hussein, 3 commits 🟩 59,348 ++ 🔴 2 --
#31 Dongji Gao, 2 commits 🟩 9,565 ++ 🔴 9 --
#44 Henry Li Xinyuan, 1 commit 🟩 2,124 ++ 🔴 3 --

🎨 External Contributors

#2 Dan Povey, 200 commits 🟩 13,323 ++ 🔴 4,485 --

🎨 HENT-SRT (by Amir Hussein, Paola Garcia, Matthew Wiesner)

We introduced HENT-SRT (Hierarchical Efficient Neural Transducer for Speech Recognition and Translation), a novel hierarchical transducer architecture for joint speech recognition and translation (ST). HENT-SRT (Icefall recipe) significantly outperforms previous state-of-the-art transducer-based ST models and closes the gap with attention-based encoder-decoder architectures, while achieving superior ASR performance. Our approach offers substantial gains in streaming scenarios without introducing additional delays.

🎨 Continuous Streaming Multi-Talker ASR (by Desh Raj)

We investigated the Streaming Unmixing and Recognition Transducer (SURT) for continuous streaming multi-talker ASR, and demonstrated the effectiveness of dual-path LSTMs and Transformers for generalization to diverse session lengths (recipes for the LibriCSS, AMI, and ICSI datasets).

🎨 SPGISpeech (by Desh Raj)

We developed an Icefall recipe and trained models (zipformer and stateless transducer models on Hugging Face) for SPGISpeech, a dataset consisting of 5,000 hours of recorded company earnings calls and their respective transcriptions.

🎨 MGB-2 (by Amir Hussein)

We developed an Icefall recipe and trained a model (conformer-ctc model on Hugging Face) for the Multi-Dialect Broadcast News Arabic Speech Recognition (MGB-2) challenge, which targets recognition of Arabic multi-dialect broadcast media.

🎨 Contextual ASR (by Ruizhe Huang, Mahsa Yarmohammadi)

Developed recipes for Contextual ASR. This is the process by which an ASR system is provided with contextual information derived from metadata associated with the audio, typically in the form of a list of words or phrases likely to be spoken, with the goal of improving the recognition accuracy of named entities and other infrequent terms. Our work on Contextual ASR is recognized for introducing the ConEC dataset (ConEC), followed by a method for improving neural biasing beyond shallow language model fusion (Pull request - Neural Biasing).
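As a rough illustration of how a biasing list can help with rare terms, the toy sketch below re-ranks an n-best list by adding a fixed bonus for each context phrase that a hypothesis contains. This is not the neural biasing method from the recipes. The function name `rescore_with_context`, the bonus value, and the example hypotheses and scores are all assumptions made up for illustration.

```python
def rescore_with_context(nbest, context_phrases, bonus=2.0):
    """Re-rank (text, score) hypotheses, rewarding context phrases.

    Higher scores are better. Each distinct phrase from the biasing
    list that occurs in a hypothesis (as whole words) adds `bonus`.
    """
    rescored = []
    for text, score in nbest:
        # Pad with spaces so that only whole-word matches count.
        padded = f" {text.lower()} "
        hits = sum(1 for p in context_phrases if f" {p.lower()} " in padded)
        rescored.append((text, score + bonus * hits))
    # Sort best-first; the sort is stable, so ties keep input order.
    return sorted(rescored, key=lambda item: item[1], reverse=True)


# Hypothetical 2-best list in which the rare name was misrecognized.
nbest = [
    ("revenue grew at a corn this quarter", -4.0),
    ("revenue grew at acorn this quarter", -5.0),
]
best_text, best_score = rescore_with_context(nbest, ["Acorn"])[0]
print(best_text, best_score)
```

With the biasing list `["Acorn"]`, the second hypothesis gains the bonus and moves ahead of the first. Real systems typically apply such bias during decoding rather than after it, which this sketch does not attempt.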

🎨 Omni-temporal Classification (OTC) (by Dongji Gao, Paola Garcia, Matthew Wiesner)

Training ASR systems requires large amounts of well-curated paired data. However, human annotators usually perform "non-verbatim" transcription, which can result in poorly trained models. We designed and implemented Omni-temporal Classification (OTC), a novel training criterion that explicitly incorporates label uncertainties originating from such weak supervision. This allows the model to effectively learn speech-text alignments while accommodating errors present in the training transcripts (OTC w/ BPE units, OTC w/ phone units).

🎨 ASR + LID (by Amir Hussein, Paola Garcia, Matthew Wiesner)

Created a multitask learning framework that synchronizes Language Identification (LID) with ASR, utilizing a neural transducer architecture. We demonstrate the efficacy of our proposed approach on conversational multilingual (Arabic, Spanish, Mandarin) and code-switched (CS; Spanish-English, Mandarin-English) test sets (Pull request - ASR SEAME Recipe).

🎨 Other recipes

Kneser-Ney language model smoothing
Librispeech - partial contribution
Fluent Speech Commands recipe
Recipe for Geolocation dataset using Lhotse and Icefall
Dialectal IWSLT-Tunisian 2022 shared task ASR and ST recipes

k2 brings data structures and algorithms from the field of finite state automata (FSA) into the world of deep learning. It provides efficient CPU and GPU implementations of commonly used FSA operations and integrates them seamlessly with PyTorch's tensor and automatic differentiation mechanisms. In this way it accepts, and benefits from, the inherent complexity of speech recognition instead of trying to remove it.

🎨 JHU Contributors

#12 Piotr Żelasko, 4 commits 🟩 9,458 ++ 🔴 276 --
#13 Jan "yenda" Trmal, 4 commits 🟩 9,314 ++ 🔴 267 --
#16 Mahsa Yarmohammadi, 3 commits 🟩 173 ++ 🔴 51 --
#19 Yiming Wang, 3 commits 🟩 234 ++ 🔴 67 --
#22 Desh Raj, 2 commits 🟩 435 ++ 🔴 29 --
#35 Dongji Gao, 1 commit 🟩 27 ++ 🔴 10 --

🎨 External Contributors

#2 Dan Povey, 214 commits 🟩 73,771 ++ 🔴 30,586 --

🎨 k2 codes

Test whether FSA is acyclic.
Fast parallel computation of longest common prefixes for efficient pattern matching (kmp-LCP).
Implementation of the Hybrid Autoregressive Transducer loss (HAT).
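To make the first item in the list above concrete, here is a minimal pure-Python sketch of an acyclicity test on an FSA given as a list of arcs. It does not use k2. The `(src, dst, label)` arc tuples and the `is_acyclic` helper are hypothetical names chosen for illustration, and the method shown is Kahn's topological sort, which may differ from what the k2 code does.

```python
from collections import defaultdict, deque


def is_acyclic(num_states, arcs):
    """Return True if the directed graph of FSA states has no cycle.

    arcs is an iterable of (src_state, dst_state, label) tuples; the
    label is ignored because only the graph structure matters here.
    """
    successors = defaultdict(list)
    in_degree = [0] * num_states
    for src, dst, _label in arcs:
        successors[src].append(dst)
        in_degree[dst] += 1

    # Kahn's algorithm: repeatedly remove states with no incoming arcs.
    queue = deque(s for s in range(num_states) if in_degree[s] == 0)
    removed = 0
    while queue:
        state = queue.popleft()
        removed += 1
        for nxt in successors[state]:
            in_degree[nxt] -= 1
            if in_degree[nxt] == 0:
                queue.append(nxt)

    # Every state gets removed exactly when the graph has no cycle.
    return removed == num_states


linear_fsa = [(0, 1, 10), (1, 2, 20), (0, 2, 30)]
looping_fsa = [(0, 1, 10), (1, 1, 20), (1, 2, 30)]  # self-loop on state 1
print(is_acyclic(3, linear_fsa))
print(is_acyclic(3, looping_fsa))
```

The check runs in time linear in the number of states plus arcs. A self-loop counts as a cycle here, which is why the second example is rejected.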

🔹 Other projects

🎨

The CHiME 8 submission relied on Lhotse for data preparation, audio loading, data manipulation, and constructing the PyTorch dataloaders. Once in this format, it was possible to interface with standard Whisper training recipes in the Whisper GitHub repo or on Hugging Face.

Data preparation for k2/Icefall and ESPNet.
Usage: prepares Lhotse manifests for the CHiME-8 data.
Manifest preparation for different toolkits (datasets included: DiPCo, Mixer 6, NOTSOFAR-1, CHiME-6).

🎨 Target Speaker ASR with Whisper (by BUT in collaboration with Dominik Klement, Matthew Wiesner)

The repository contains the official implementation of the following publications: Target Speaker Whisper and DiCoW: Diarization-Conditioned Whisper for Target Speaker Automatic Speech Recognition. It relied on Lhotse to homogenize data preparation across various datasets, such as AMI and LibriSpeech. This project is a collaboration between Alex Polok and Lukás Burget from BUT and Dominik Klement and Matthew Wiesner.

Data preparation (AMI, LibriSpeech, etc.)

🎨 Long-Form Fuzzy Speech-to-Text Alignment for 1000+ Languages (by Ruizhe Huang)

Ruizhe has introduced a long-form fuzzy speech-to-text aligner built on Torchaudio that can align hours of audio with imperfect transcriptions (e.g., lectures, earnings calls, audiobooks). As part of the implementation, the aligner relies on WFSTs via k2 to allow skipping or reordering parts of the text during alignment. The library is provided open-source and can be used out-of-the-box with users’ models for efficient alignment and segmentation of raw audio recordings and transcripts. A web demo is also available on Google Colab, demonstrating alignment of earnings calls and audiobooks in different languages with minimal pre-processing.
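The idea of tolerating transcript errors can be illustrated with a much simpler, text-only sketch. It does not use Torchaudio, k2, or WFSTs. Instead it assumes that some recognizer has already produced words with start times, and it uses the standard-library `difflib.SequenceMatcher` to match those words against an imperfect transcript. The function name `transfer_timestamps` and the example data are hypothetical, and unlike the aligner described above, this sketch can only skip mismatched words, not handle reordering.

```python
from difflib import SequenceMatcher


def transfer_timestamps(recognized, transcript):
    """Map transcript words to times via matching recognized words.

    recognized: list of (word, start_seconds) from some recognizer.
    transcript: list of words from an imperfect reference text.
    Returns (transcript_index, word, start_seconds) for matched words;
    transcript words with no match are skipped.
    """
    rec_words = [w for w, _t in recognized]
    matcher = SequenceMatcher(a=rec_words, b=transcript, autojunk=False)
    aligned = []
    # Each block is a run of identical words in both sequences; the
    # final block has size 0 and therefore contributes nothing.
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            word, start = recognized[block.a + k]
            aligned.append((block.b + k, word, start))
    return aligned


recognized = [("good", 0.0), ("morning", 0.4), ("everyone", 0.9), ("welcome", 1.6)]
# The transcript omits "everyone" and inserts "and".
transcript = ["good", "morning", "and", "welcome"]
for index, word, start in transfer_timestamps(recognized, transcript):
    print(index, word, start)
```

Here the transcript words "good", "morning", and "welcome" receive start times, while the inserted "and" is left unaligned.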

🎨 Hugging Face

Researchers and developers increasingly rely on the open-source platform Hugging Face for pre-trained models, datasets, and tools to efficiently build and deploy AI applications. k2-fsa is available on Hugging Face. As of now, it has published one dataset (LibriSpeech) and 18 models. Additionally, 30 HF Spaces have been released, offering inference APIs and demos for tasks such as speech recognition, text-to-speech, audio tagging, and spoken language identification using Next-gen Kaldi.
