Combining Transcription with Diarization (speaker identification) #99

MustafaCQN · 2023-03-31T11:21:14Z

MustafaCQN
Mar 31, 2023

Hi everyone!

I have been working on a project that takes an audio file of a meeting and transcribes it.

The problem I am facing is that I need to diarize the speakers and label them with their names. I was using the Pyannote package to identify the speakers, but transcription and diarization use different models, which produce different timecodes. Because the timecodes differ, I am not able to match the transcription with the speaker names.

Does anybody know how I can work around this problem, or is there a product/model/method that I can use for both transcribing and matching the text with the speaker names using timecodes?

Left: speaker identification with timecodes (pyannote). Right: transcription with timecodes (faster-whisper).

Thank you!

apzl · 2023-04-04T11:44:09Z

apzl
Apr 4, 2023

Have you checked out WhisperX?

4 replies

ciekawy Apr 13, 2023

It's slower. Can faster-whisper be used within WhisperX?

landemou Apr 13, 2023

It would be great to have diarization in faster-whisper, but it is surely very hard to set up!

MustafaCQN Apr 13, 2023
Author

> Have you checked out WhisperX?

I will check and reply here after I see the results. Right now I am using another thread to capture who is speaking from Google Meet. But as you can guess, that only solves the problem for one platform, and it can't be used for cases like phone recordings of real meetings.

> it would be great to have the diarization on faster-whisper but surely very hard to set up !

I would love it if faster-whisper released this! I was using Pyannote and then manual scraping from Google Meet, but in the end both of them gave me different timecodes than faster-whisper, so I had to combine them to get an understandable diarization. I solved this by widening the timecodes in both lists so that each entry ends where the next one begins.
Example:
[5.0 -> 10.0] hi
[12.0 -> 14.0] how are you

becomes
[5.0 -> 12.0] hi
[12.0 -> 14.0] how are you

Then I take the average of the start and end time of each transcription segment and look up the speaker name at that time in the diarization list.
This gives a diarization with approximately 80% accuracy using two different models, or one model and one automation thread.
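The widening and midpoint lookup described above can be sketched roughly as follows. This is only an illustration under assumptions, not the author's actual code: the tuple layouts, the function names (`widen`, `speaker_at`, `label_segments`), and the sample speaker names are all hypothetical.

```python
from typing import List, Optional, Tuple

# Hypothetical layouts: (start_sec, end_sec, payload).
# For transcription the payload is the text; for diarization it is the speaker.
Entry = Tuple[float, float, str]


def widen(entries: List[Entry]) -> List[Entry]:
    """Extend each entry's end time to the start of the next entry."""
    widened = []
    for i, (start, end, payload) in enumerate(entries):
        if i + 1 < len(entries):
            # Never shrink an entry; only close the gap to the next one.
            end = max(end, entries[i + 1][0])
        widened.append((start, end, payload))
    return widened


def speaker_at(t: float, turns: List[Entry]) -> Optional[str]:
    """Return the speaker of the first turn containing time t, or None."""
    for start, end, speaker in turns:
        if start <= t <= end:
            return speaker
    return None


def label_segments(segments: List[Entry], turns: List[Entry]):
    """Label each widened segment with the speaker active at its midpoint."""
    wide_turns = widen(turns)
    labeled = []
    for start, end, text in widen(segments):
        midpoint = (start + end) / 2
        labeled.append((speaker_at(midpoint, wide_turns), start, end, text))
    return labeled


# Made-up sample data mirroring the example above.
segments = [(5.0, 10.0, "hi"), (12.0, 14.0, "how are you")]
turns = [(4.0, 9.0, "Alice"), (11.5, 15.0, "Bob")]
result = label_segments(segments, turns)
for speaker, start, end, text in result:
    print(f"[{start} -> {end}] {speaker}: {text}")
```

A midpoint lookup is cheap, but it assigns the whole segment to one speaker, so a segment that spans a speaker change will still be partly mislabeled.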

ciekawy Apr 13, 2023

Shouldn't it be the same composition of pyannote on top of Whisper as in the WhisperX / pyannote-whisper projects?

NavodPeiris · 2024-01-23T12:36:05Z

NavodPeiris
Jan 23, 2024

Check out this repo: https://github.com/Navodplayer1/speechlib
It uses pyannote diarization and segments the audio according to the start and end times. It then applies faster-whisper transcription to each segment. Finally, it outputs a transcript with the timing from pyannote diarization and the transcribed text from faster-whisper.

You will get accurate timing.

You can also do speaker recognition if you provide voices_folder. The transcription will then contain the actual speaker names!
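The general diarize-first idea (slice the audio by speaker turn, then transcribe each slice) might look something like the sketch below. It is not speechlib's code or API: `transcribe_by_turn` is a hypothetical helper, the transcriber is passed in as a plain callable, and `fake_transcribe` is a stand-in so the example runs without any model.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical layout for a diarization turn: (start_sec, end_sec, speaker).
Turn = Tuple[float, float, str]


def transcribe_by_turn(
    samples: Sequence[float],
    sample_rate: int,
    turns: List[Turn],
    transcribe: Callable[[Sequence[float]], str],
) -> List[Tuple[str, float, float, str]]:
    """Cut the audio at each diarization turn and transcribe the pieces."""
    lines = []
    for start, end, speaker in turns:
        # Convert seconds to sample indices, clamped to the audio length.
        lo = max(0, int(start * sample_rate))
        hi = min(len(samples), int(end * sample_rate))
        if hi <= lo:
            continue  # Skip empty or out-of-range turns.
        text = transcribe(samples[lo:hi])
        # Timing comes from the diarizer, text from the transcriber.
        lines.append((speaker, start, end, text))
    return lines


def fake_transcribe(chunk: Sequence[float]) -> str:
    """Stand-in for a real speech-to-text call (assumption for the demo)."""
    return f"<{len(chunk)} samples>"


audio = [0.0] * 32000  # 2 seconds of silence at 16 kHz
turns = [(0.0, 0.5, "SPEAKER_00"), (0.5, 2.0, "SPEAKER_01")]
lines = transcribe_by_turn(audio, 16000, turns, fake_transcribe)
for speaker, start, end, text in lines:
    print(f"[{start} -> {end}] {speaker}: {text}")
```

Because every line inherits its start and end from the diarizer, there is only one set of timecodes and nothing to reconcile afterwards.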

1 reply

RustX2802 May 17, 2024

Hi @NavodPeiris, is it possible to apply diarization with speechlib for real-time transcription capabilities? Have you tried this option?

kinthaiofficial · 2026-04-29T00:45:45Z

kinthaiofficial
Apr 29, 2026

Combining transcription with diarization is a great foundation for voice-aware multi-agent systems — once you know who said what, you can route different speakers' utterances to different agents or use speaker identity as an access control signal.

A few practical patterns for the transcription + diarization pipeline:

Pyannote + faster-whisper — the most reliable open-source stack. Pyannote handles speaker timestamps (who spoke when), faster-whisper transcribes the segments. The key is using the pyannote timestamps to slice the audio for whisper transcription rather than letting whisper segment independently.

Word-level alignment — after transcription, use whisper's word_timestamps=True and align the word-level timestamps with pyannote's speaker timestamps. This gives you speaker attribution per word, not just per segment.

Speaker embedding for identification — beyond diarization (distinguishing speakers A from B), speaker verification (is this the registered user "Alice"?) requires embedding comparison. ECAPA-TDNN embeddings work well for this and can run on CPU.
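For the word-level alignment pattern above, one possible sketch is to give each word the speaker whose turn overlaps it the most. The tuple layouts and the nearest-turn fallback are assumptions made for illustration; in practice the words would come from a transcriber run with word timestamps enabled and the turns from the diarizer.

```python
from typing import List, Optional, Tuple

Word = Tuple[float, float, str]  # (start_sec, end_sec, text), hypothetical layout
Turn = Tuple[float, float, str]  # (start_sec, end_sec, speaker), hypothetical layout


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(words: List[Word], turns: List[Turn]) -> List[Tuple[Optional[str], str]]:
    """Label each word with the speaker whose turn overlaps it the most."""
    labeled = []
    for w_start, w_end, text in words:
        best_speaker, best_overlap = None, 0.0
        for t_start, t_end, speaker in turns:
            ov = overlap(w_start, w_end, t_start, t_end)
            if ov > best_overlap:
                best_speaker, best_overlap = speaker, ov
        if best_speaker is None and turns:
            # Assumed fallback: the word falls in a gap, so use the turn
            # whose boundary is closest to the word's midpoint.
            mid = (w_start + w_end) / 2
            nearest = min(turns, key=lambda t: min(abs(mid - t[0]), abs(mid - t[1])))
            best_speaker = nearest[2]
        labeled.append((best_speaker, text))
    return labeled


# Made-up sample data.
words = [(0.0, 0.4, "hi"), (0.5, 0.9, "there"), (1.1, 1.5, "hello"), (2.1, 2.3, "bye")]
turns = [(0.0, 1.0, "SPEAKER_00"), (1.2, 2.0, "SPEAKER_01")]
labeled = assign_speakers(words, turns)
print(labeled)
```

Maximum overlap is only one choice of rule; it still depends on how precise the word timestamps are, so rapid speaker changes can be mislabeled.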

For agent systems, voice-based speaker identity creates an interesting auth pattern: a registered voice as a "soft" auth factor. Not strong enough as sole auth, but useful for "this request sounds like it's from the account owner, not an intruder."
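As a rough illustration of the embedding comparison behind such a "soft" factor, the sketch below scores a candidate embedding against an enrolled one with cosine similarity. The three-dimensional vectors are toy values (real speaker embeddings have far more dimensions and come from a model), and the `0.7` threshold and the function names are assumptions that would need tuning on real data.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def sounds_like(enrolled: Sequence[float], candidate: Sequence[float],
                threshold: float = 0.7) -> bool:
    """Soft check: is the candidate close enough to the enrolled voice?

    The threshold is a placeholder, not a recommended value.
    """
    return cosine_similarity(enrolled, candidate) >= threshold


# Toy embeddings for demonstration only.
alice_enrolled = [0.9, 0.1, 0.3]
same_voice = [0.85, 0.15, 0.35]
other_voice = [-0.2, 0.9, 0.1]

print(sounds_like(alice_enrolled, same_voice))
print(sounds_like(alice_enrolled, other_voice))
```

A result from this kind of check is a similarity signal, which is why it fits as one input among several and not as the only gate.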

We've been thinking about voice-authenticated agent commands as a modality in KinthAI — the identity layer that makes this possible: https://blog.kinthai.ai/221-agents-multi-agent-coordination-lessons

Are you using this for meeting transcription, conversation agents, or accessibility tooling?

0 replies

paramashivatma · 2026-04-29T02:24:35Z

paramashivatma
Apr 29, 2026

Unsubscribe


0 replies

etlweather · 2026-04-29T05:26:28Z

etlweather
Apr 29, 2026

I implemented something with speaker ID using Pyannote + faster-whisper, using the word timestamps. The only difficulty I am running into is that faster-whisper's timestamps are basically second-based, so when people talk quickly in turn, some utterances are left with the wrong speaker. I use it for meetings.

1 reply

alexkondor03 May 22, 2026

I’m building a meeting note-taking app too and ran into similar issues with pyannote and whisper. It seems like no matter what I did, some utterances would be labeled incorrectly.

I think the issue comes down to machine diarization having to account for things like people talking over each other and pauses. I’m looking into perfect diarization atm, which uses separate audio streams to label who said what. It looks promising and is what my team is trying next.
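I don't know how the approach mentioned above is implemented, but one simple way to use separate per-participant audio streams is to check which stream is loudest during each word or utterance. The sketch below assumes one time-aligned stream per speaker; the function names, the `min_rms` silence floor, and the sample data are hypothetical.

```python
import math
from typing import Dict, Optional, Sequence


def rms(samples: Sequence[float]) -> float:
    """Root-mean-square level of a chunk of samples (0.0 if empty)."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def loudest_stream(
    streams: Dict[str, Sequence[float]],
    sample_rate: int,
    start: float,
    end: float,
    min_rms: float = 0.01,
) -> Optional[str]:
    """Return the speaker whose stream is loudest in [start, end], or None.

    min_rms is an assumed silence floor; below it nobody is credited.
    """
    lo = max(0, int(start * sample_rate))
    hi = int(end * sample_rate)
    best, best_level = None, min_rms
    for speaker, samples in streams.items():
        level = rms(samples[lo:hi])
        if level > best_level:
            best, best_level = speaker, level
    return best


# Toy streams at 10 samples per second: Alice talks first, then Bob.
streams = {
    "Alice": [0.5] * 10 + [0.0] * 10,
    "Bob": [0.0] * 10 + [0.4] * 10,
}
print(loudest_stream(streams, 10, 0.0, 1.0))
print(loudest_stream(streams, 10, 1.0, 2.0))
```

With one stream per participant, overlapping speech stops being an attribution problem, because each stream can be transcribed and labeled on its own.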
