Combining Transcription with Diarization (speaker identification) #99
Replies: 5 comments 6 replies
-
|
Have you checked out WhisperX? |
-
|
Check out this repo, which gives you accurate timing: https://github.com/Navodplayer1/speechlib
You can also do speaker recognition if you provide voices_folder; the transcription will then contain actual speaker names! |
-
|
Combining transcription with diarization is a great foundation for voice-aware multi-agent systems: once you know who said what, you can route different speakers' utterances to different agents or use speaker identity as an access control signal.

A few practical patterns for the transcription + diarization pipeline:

Pyannote + faster-whisper: the most reliable open-source stack. Pyannote handles speaker timestamps (who spoke when), faster-whisper transcribes the segments. The key is using the pyannote timestamps to slice the audio for whisper transcription rather than letting whisper segment independently.

Word-level alignment: after transcription, use whisper's word_timestamps=True and align the word-level timestamps with pyannote's speaker timestamps. This gives you speaker attribution per word, not just per segment.

Speaker embedding for identification: beyond diarization (distinguishing speaker A from speaker B), speaker verification (is this the registered user "Alice"?) requires embedding comparison. ECAPA-TDNN embeddings work well for this and can run on CPU.

For agent systems, voice-based speaker identity creates an interesting auth pattern: a registered voice as a "soft" auth factor. Not strong enough as sole auth, but useful for "this request sounds like it's from the account owner, not an intruder."

We've been thinking about voice-authenticated agent commands as a modality in KinthAI, the identity layer that makes this possible: https://blog.kinthai.ai/221-agents-multi-agent-coordination-lessons

Are you using this for meeting transcription, conversation agents, or accessibility tooling? |
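To make the "embedding comparison" step concrete, here is a minimal sketch of matching one speaker embedding against a set of enrolled voices with cosine similarity. It assumes you already have fixed-length embedding vectors from some model (for example an ECAPA-TDNN model); the `identify` helper, the toy 3-dimensional vectors, and the `0.7` threshold are all hypothetical and would need tuning on real data.

```python
import math
from typing import Dict, Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors of equal length (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def identify(
    embedding: Sequence[float],
    enrolled: Dict[str, Sequence[float]],
    threshold: float = 0.7,  # hypothetical cut-off, not a recommended value
) -> Optional[str]:
    """Return the enrolled name with the highest similarity, or None if
    no enrolled voice reaches the threshold."""
    best_name, best_score = None, threshold
    for name, reference in enrolled.items():
        score = cosine_similarity(embedding, reference)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name


# Toy vectors standing in for real speaker embeddings.
enrolled = {"Alice": [0.9, 0.1, 0.0], "Bob": [0.0, 0.2, 0.9]}
print(identify([0.8, 0.2, 0.1], enrolled))  # close to Alice's reference
print(identify([0.0, 1.0, 0.0], enrolled))  # close to nobody
```

Returning `None` below the threshold is what keeps this a "soft" signal: an unknown voice is reported as unknown rather than forced onto the nearest enrolled speaker.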
-
|
Unsubscribe
|
-
|
I implemented speaker ID using Pyannote + faster-whisper with word timestamps. The only difficulty I am running into is that faster-whisper's timestamps are basically second-based, so when people take quick turns, some utterances end up attributed to the wrong speaker. I use it for meetings. |
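One thing that may help with quick turn-taking is to assign each word by its midpoint and, when the midpoint falls in a gap between diarization turns, fall back to the nearest turn. The sketch below works on plain tuples that you would build yourself from the word timestamps and the diarization output; `speaker_for_word` and `group_words` are hypothetical helper names, and this does not fix the underlying timestamp resolution.

```python
from typing import List, Optional, Tuple

Turn = Tuple[float, float, str]  # (start_sec, end_sec, speaker)
Word = Tuple[float, float, str]  # (start_sec, end_sec, word)


def speaker_for_word(word: Word, turns: List[Turn]) -> Optional[str]:
    """Pick the turn containing the word's midpoint, else the nearest turn."""
    if not turns:
        return None
    mid = (word[0] + word[1]) / 2.0

    def distance(turn: Turn) -> float:
        t_start, t_end, _ = turn
        if t_start <= mid <= t_end:
            return 0.0  # midpoint lies inside this turn
        return min(abs(mid - t_start), abs(mid - t_end))

    return min(turns, key=distance)[2]


def group_words(words: List[Word], turns: List[Turn]) -> List[Tuple[Optional[str], str]]:
    """Merge consecutive words with the same speaker into utterances."""
    utterances: List[Tuple[Optional[str], str]] = []
    for word in words:
        speaker = speaker_for_word(word, turns)
        text = word[2].strip()
        if utterances and utterances[-1][0] == speaker:
            utterances[-1] = (speaker, utterances[-1][1] + " " + text)
        else:
            utterances.append((speaker, text))
    return utterances


turns = [(0.0, 1.4, "A"), (1.6, 2.1, "B"), (2.3, 4.0, "A")]
words = [
    (0.2, 0.6, "Shall"), (0.6, 1.0, "we"), (1.0, 1.4, "start?"),
    (1.42, 1.62, "Sure."),  # midpoint falls in the gap between A and B
    (2.3, 2.7, "Great,"), (2.7, 3.2, "thanks."),
]
for speaker, text in group_words(words, turns):
    print(speaker, text)
```

In this toy input, "Sure." starts before speaker B's turn does, yet its midpoint is closer to B's turn than to A's, so it is attributed to B instead of being merged into A's utterance.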
-
Hi everyone!
I am working on a project that takes an audio file and transcribes the meeting.
The problem I am facing is that I need to diarize the speakers with their names. I was using the Pyannote package to identify speakers, but transcription and diarization use different models, which produce different timecodes. Because the timecodes differ, I am not able to match the transcription with the speaker names.
Does anybody know how I can work around this problem, or is there a product/model/method that I can use for both transcribing and matching the speaker names using timecodes?
Left: speaker identifications with timecodes (pyannote). Right: the transcription with timecodes (faster-whisper).
Thank you!
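For readers with the same mismatch: the two sets of timecodes do not have to be identical to be matched. A common approach is to give each transcript segment the speaker whose diarization turns overlap it the longest. This is a minimal sketch on plain tuples; the `label_segments` helper and the sample data are hypothetical, and you would fill the two lists from the pyannote and faster-whisper outputs yourself.

```python
from typing import Dict, List, Optional, Tuple

Turn = Tuple[float, float, str]     # (start_sec, end_sec, speaker) from diarization
Segment = Tuple[float, float, str]  # (start_sec, end_sec, text) from transcription


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two intervals (0.0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def label_segments(
    segments: List[Segment], turns: List[Turn]
) -> List[Tuple[Optional[str], str]]:
    """Attach to each transcript segment the speaker with the most overlap."""
    labeled = []
    for seg_start, seg_end, text in segments:
        totals: Dict[str, float] = {}
        for t_start, t_end, speaker in turns:
            shared = overlap(seg_start, seg_end, t_start, t_end)
            if shared > 0:
                totals[speaker] = totals.get(speaker, 0.0) + shared
        # None means no diarization turn overlapped this segment at all.
        best = max(totals, key=totals.get) if totals else None
        labeled.append((best, text))
    return labeled


turns = [(0.0, 4.2, "SPEAKER_00"), (4.2, 9.0, "SPEAKER_01")]
segments = [(0.3, 3.9, "Hi everyone."), (4.0, 8.5, "Thanks for joining.")]
result = label_segments(segments, turns)
for speaker, text in result:
    print(speaker, text)
```

The second segment starts 0.2 s before the second speaker's turn, but it is still labeled SPEAKER_01 because that is where most of its duration lies.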