Sprint Goals
Speaker diarization and speaker recognition are two important speech-processing technologies with complementary strengths that can be combined to enable powerful new applications. Speaker diarization aims to segment an audio recording into homogeneous segments according to the identity of the speaker, while speaker recognition aims to verify or identify the identity of a speaker from their voice. Individually, these technologies have limitations that can be addressed by combining them.
Speaker diarization
Speaker diarization, also known as speaker segmentation or speaker clustering, refers to the process of partitioning an input audio stream into homogeneous segments according to the identity of the speaker. The goal is to answer the question “who spoke when” without having any prior knowledge about the speakers.
Traditional speaker diarization systems have typically been composed of several independent sub-modules, each serving a specific purpose in the overall process, as illustrated in the figure below:
- Voice Activity Detection (VAD): This module of speaker diarization is responsible for distinguishing speech segments from non-speech parts of audio recordings, facilitating the removal of unwanted sections such as silences and other non-speech sounds.
- Segmentation: This step involves extracting short, contiguous audio segments such that each segment usually contains only one speaker.
- Speaker embedding: This module creates speaker embeddings for the segmented audio chunks, i.e. fixed-size vector representations of each audio clip.
- Clustering: This module clusters similar embeddings and assigns generic labels to them.
In the final step of speaker diarization, we generate Rich Transcription Time-Marked (RTTM) files, which provide a comprehensive summary of the speaker segmentation and labeling information obtained through the speech segmentation and clustering stages.
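As a concrete illustration of these last two stages, the sketch below clusters hypothetical per-segment embeddings and writes the result as an RTTM file. The segment times, embedding values, file name, and clustering threshold are placeholder assumptions rather than part of the actual system, and a recent scikit-learn version is assumed.

```python
# Minimal sketch: cluster per-segment speaker embeddings and emit an RTTM file.
# Segment times and embeddings are illustrative placeholders.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Hypothetical (start, end) times in seconds and one embedding per segment.
segments = [(0.0, 2.5), (2.5, 5.1), (5.1, 7.8)]
embeddings = np.random.rand(len(segments), 192)  # stand-in for real speaker embeddings

# Cluster embeddings; the distance threshold controls how many speakers emerge.
labels = AgglomerativeClustering(
    n_clusters=None, distance_threshold=1.0, metric="cosine", linkage="average"
).fit_predict(embeddings)

# RTTM line format: SPEAKER <file-id> 1 <onset> <duration> <NA> <NA> <label> <NA> <NA>
with open("meeting.rttm", "w") as f:
    for (start, end), label in zip(segments, labels):
        f.write(
            f"SPEAKER meeting 1 {start:.3f} {end - start:.3f} "
            f"<NA> <NA> Speaker{label} <NA> <NA>\n"
        )
```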
A key advantage of speaker diarization is the lack of need for labeled training data from enrolled speakers. Systems can be developed in an unsupervised fashion by optimizing clustering performance on unlabeled audio. This allows speaker diarization to be applied to audio containing previously unseen speakers. However, the lack of speaker labels also means speaker identities cannot be determined. Diarization provides anonymous labels like “Speaker 1” without indicating who that speaker actually is.
Speaker Recognition
Speaker recognition refers to identifying or verifying the identity of speakers from their voices. Speaker identification determines which registered speaker provides a given speech sample from a set of known speakers.
Modern speaker recognition systems extract speaker embeddings using neural network models trained on large labeled datasets. During enrollment, reference embeddings are stored for each known speaker. At test time, embeddings are extracted from an input sample and compared to the references to identify speakers or verify claimed identities.
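A minimal sketch of this enroll-and-compare scheme is shown below, using cosine similarity between embeddings. The embedding dimensionality, speaker names, and acceptance threshold are illustrative assumptions, and the random vectors stand in for the output of a trained embedding model.

```python
# Minimal sketch of embedding-based speaker identification with cosine similarity.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def identify(test_emb: np.ndarray, enrolled: dict, threshold: float = 0.6) -> str:
    """Return the enrolled name with the highest similarity, or 'unknown'."""
    scores = {name: cosine(test_emb, ref) for name, ref in enrolled.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else "unknown"

# Enrollment: store one reference embedding per known speaker
# (in practice, an average over several reference clips).
enrolled = {
    "alice": np.random.rand(192),  # placeholder for a real reference embedding
    "bob": np.random.rand(192),
}
print(identify(np.random.rand(192), enrolled))
```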
Unlike speaker diarization, speaker recognition requires extensive labeled training data with multiple samples per enrolled speaker. However, a key advantage is that recognized speaker identities correspond to real names and profiles.
In my previous R&D project, the speaker diarization and recognition modules were combined as shown in the figure below.
The pipeline works by first passing the audio through the speaker diarization module to extract time boundaries and generic speaker labels for each speaker turn. These time boundaries are then used to create speaker embeddings, which are passed to the pre-trained speaker recognition module to predict speaker names. If a new speaker is encountered, the model labels them as unknown. After obtaining speaker predictions from the speaker recognition module, which is based on few-shot learning, they are integrated with the diarization output to generate an RTTM file that includes the time boundaries and corresponding speaker names.
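The sketch below outlines this combined inference step under stated assumptions: `embed()` and `recognize()` are placeholders for the trained embedding and few-shot recognition modules, and the turn format is simplified to (start, end, label) triples.

```python
# Minimal sketch of the combined inference step: re-embed each diarized turn,
# replace the anonymous label with a recognized name (or "unknown"), and
# rewrite the RTTM with real speaker names.
def label_turns(turns, audio, embed, recognize):
    """turns: list of (start, end, anon_label); returns list of (start, end, name)."""
    named = []
    for start, end, _ in turns:
        emb = embed(audio, start, end)              # embedding of this speaker turn
        named.append((start, end, recognize(emb)))  # real name or "unknown"
    return named

def write_rttm(named_turns, file_id, path):
    """Write (start, end, name) turns in RTTM format."""
    with open(path, "w") as f:
        for start, end, name in named_turns:
            f.write(f"SPEAKER {file_id} 1 {start:.3f} {end - start:.3f} "
                    f"<NA> <NA> {name} <NA> <NA>\n")
```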
The proposed model is also flexible enough to accommodate new speakers based on user feedback. For instance, if the final output contains an unknown speaker who has multiple speech turns with at least 30 seconds of data altogether, a user who recognizes the speaker can update their name. This user feedback is used by the speaker recognition module to retrain itself with the existing time boundaries and new speaker name information, allowing the model to easily learn new speakers.
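A simplified sketch of this feedback loop is given below. It illustrates the 30-second threshold and enrollment from the existing time boundaries, but for brevity it replaces the actual few-shot retraining with plain embedding averaging; `embed()` is again a placeholder for the embedding model.

```python
# Minimal sketch of user-feedback enrollment: when a user names a previously
# "unknown" speaker and their turns total at least 30 seconds, enroll them.
def enroll_from_feedback(name, turns, audio, embed, enrolled, min_seconds=30.0):
    """turns: list of (start, end, label); updates `enrolled` in place."""
    total = sum(end - start for start, end, label in turns if label == "unknown")
    if total < min_seconds:
        return False  # not enough speech to enroll reliably
    embs = [embed(audio, s, e) for s, e, label in turns if label == "unknown"]
    enrolled[name] = sum(embs) / len(embs)  # reference embedding for the new speaker
    return True
```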
However, the above approach has a few shortcomings:
- Since the speaker recognition dataset I used in the R&D project was relatively small (approx. 400 speakers), there is a high probability that the few-shot recognition module is overfitting. If we intend to scale the model, the accuracy might drop, since we use only a limited number of data samples per speaker.
- Since there was limited data to perform joint speaker diarization and speaker recognition, we used a few existing state-of-the-art diarization methodologies to infer the diarization output instead of training the model on our dataset.
- The errors made in speaker diarization were propagated to the subsequent speaker recognition step, leading to compounding inaccuracies.
This research focuses on combining speaker diarization with speaker recognition, such that the final output not only contains the speech boundaries of individual speakers but also the actual speaker name.
The main idea of this research is to address these shortcomings with the following solutions:
- Implement a speech mixture algorithm to simulate a dataset for training and evaluating a joint speaker diarization and speaker recognition module (a sketch of this idea follows this list).
- Employ attention-based speaker segmentation, eliminate the traditional clustering module of the speaker diarization task, and directly replace it with the speaker recognition module.
- Implement transformer-based few-shot learning for speaker recognition such that the model is capable of learning with limited audio data for each speaker.
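As a sketch of the first point, the snippet below simulates a multi-speaker recording by concatenating single-speaker utterances (e.g. VoxCeleb clips) with random pauses and recording the ground-truth turns. Overlapping speech and file handling, which a full mixture algorithm would need, are deliberately omitted; the function signature and sample rate are assumptions for illustration.

```python
# Minimal sketch of a speech-mixture simulation: concatenate single-speaker
# utterances with random silences and keep the ground-truth turn boundaries.
import random
import numpy as np

def simulate_mixture(utterances, sample_rate=16000, max_silence=2.0):
    """utterances: list of (speaker_name, waveform). Returns (mixture, turns)."""
    total = sum(len(w) for _, w in utterances) + int(max_silence * sample_rate) * len(utterances)
    mixture = np.zeros(total, dtype=np.float32)
    turns, cursor = [], 0
    for name, wave in utterances:
        cursor += int(random.uniform(0.0, max_silence) * sample_rate)  # pause before the turn
        mixture[cursor:cursor + len(wave)] += wave
        turns.append((cursor / sample_rate, (cursor + len(wave)) / sample_rate, name))
        cursor += len(wave)
    return mixture[:cursor], turns
```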
RQ1 Are there any datasets to evaluate the combined speaker diarization and recognition task? If not, how can the required dataset be simulated?
RQ2 How can attention-based speaker segmentation be employed to eliminate the traditional clustering module of the speaker diarization task and replace it directly with the speaker recognition module?
RQ3 How can a speaker recognition model be implemented to learn speaker representations from limited data?
RQ4 How can the scalability and adaptability of the joint diarization and recognition model be improved?
R1 Create a dataset for combined speaker diarization and speaker recognition
R2 Implement a transformer-based speaker change detection module and train on the simulated dataset
R3 Implement a transformer-based few-shot recognition and train on the VoxCeleb dataset
R4 Infer speaker-recognized diarization by sequentially combining the above modules
R5 Evaluate the combined speaker diarization and recognition
- Finalize the model architectures to be used for both the speaker change detection and speaker recognition modules
- Write a Project Proposal with a suitable title
- Meet with the professors and get the final approval
- Create a dataset of at least 25 hours for the combined speaker diarization and recognition task, with manual annotation (using YouTube audio)
- Implement the speech mixture algorithm for diarization dataset simulation
- Generate a simulated dataset for the speaker diarization task using the VoxCeleb dataset
- Submit the thesis proposal
- Create a dataset of at least 25 hours for the combined speaker diarization and recognition task, with manual annotation (using YouTube audio)
- Implement transformer-based few-shot recognition and train on the VoxCeleb dataset
- Write the dataset chapter for the final thesis report.