Expose speaker centroid embeddings on DiarizationResult #463
leecrossley wants to merge 3 commits into argmaxinc:main
make SpeakerEmbedding and its embedding field public, compute cluster centroid vectors in postProcess() before discarding raw embeddings, and surface them via DiarizationResult.speakerCentroidEmbeddings.
Thanks for the PR. Could you please include tests for this? I'm particularly interested in any latency overhead this adds. Also, the link to your downstream consumer resolves to a 404; it may be a private repo.
a2they
left a comment
Please add a unit test to make sure the correct centroid is set.
- revert public exposure of SpeakerEmbedding; nothing in the PR needs it now that centroid compute lives in the clusterer.
- surface speakerCentroids on ClusteringResult.
- VBxClustering.cluster(...) returns post-reassignment centroids via one centroidsFromAssignments(assignments: clusters, embeddings: all, k: kFinal) pass run after clusterReassignment(...), unifying all three paths (VBx weighted, kMeans correction, AHC fallback) on a single "mean of the final cluster members" definition. one extra O(N x D) mean-pool on the final assignments.
- PyannoteDiarizer.postProcess threads ClusteringResult.speakerCentroids into DiarizationResult; removes the inline mean-pool loop.
- DiarizationResult.speakerCentroidEmbeddings is now public private(set) var with doc comments covering embedding space (raw embedder output, unnormalised, pre-PLDA), post-reassignment mean, suggested comparison, and pyannote-only applicability.
- add DiarizationResult.centroidCosineDistance(between:_:) delegating to MathOps.cosineDistance(_:_:) so numerics match MathOps.cosineDistanceMatrix used by clusterReassignment (vDSP, clamped to [0, 2]). no Accelerate import in DiarizationResult.swift.
- make calculateCentroids / centroidsFromAssignments internal so they can be exercised by @testable tests.
- new SpeakerCentroidEmbeddingsTests.swift: 3 unit tests on calculateCentroids (VBx weighted path), 3 on centroidsFromAssignments (kMeans correction + AHC fallback + empty cluster), 5 integration tests on VADAudio/VBxClustering including testCentroidValuesMatchFinalAssignmentMean, which pins that the surfaced value equals the mean of the final members after reassignment.
- new DiarizationPipelinePerformanceTests.swift with XCTClockMetric, preload + warmup outside measure{}. uses only pre-existing public API so it compiles on main for baseline comparison.
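The "mean of the final cluster members" definition in the commit message above can be sketched as a single accumulate-then-divide pass. This is a minimal illustration under assumptions, not the PR's implementation: the function below only borrows the name and argument labels quoted in the commit message, uses plain arrays rather than SpeakerKit's storage types, and omits clusters with no members (how the PR treats an empty cluster is not stated here).

```swift
// Hypothetical stand-in for the PR's centroidsFromAssignments; the real
// signature and storage types in SpeakerKit may differ.
func centroidsFromAssignments(
    assignments: [Int],
    embeddings: [[Float]],
    k: Int
) -> [Int: [Float]] {
    precondition(assignments.count == embeddings.count, "one assignment per embedding")
    guard let dimension = embeddings.first?.count else { return [:] }

    var sums = [[Float]](repeating: [Float](repeating: 0, count: dimension), count: k)
    var counts = [Int](repeating: 0, count: k)

    // Single O(N x D) pass: add each embedding into its cluster's running sum.
    for (cluster, embedding) in zip(assignments, embeddings) {
        guard cluster >= 0, cluster < k, embedding.count == dimension else { continue }
        for d in 0..<dimension {
            sums[cluster][d] += embedding[d]
        }
        counts[cluster] += 1
    }

    // Divide by member count. Clusters without members are left out (assumption).
    var centroids: [Int: [Float]] = [:]
    for cluster in 0..<k where counts[cluster] > 0 {
        let memberCount = Float(counts[cluster])
        centroids[cluster] = sums[cluster].map { $0 / memberCount }
    }
    return centroids
}

// Three windows, two of them assigned to speaker 0; cluster 2 has no members.
let centroids = centroidsFromAssignments(
    assignments: [0, 1, 0],
    embeddings: [[1, 0], [0, 2], [3, 0]],
    k: 3
)
```

Running the pass after reassignment, rather than reusing the seed centroids, is what keeps the surfaced value tied to the final labels.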
thanks both, addressed + pushed.

re 404: overshow is a private commercial repo. usage:

    let diarization = try await speakerKit.diarize(audioArray: audio)
    let centroids = diarization.speakerCentroidEmbeddings
    return SpeakerMatch(id: speakerId, embedding: centroids[speakerId])

we persist the centroid with each transcribed segment and cosine-match around chunks to keep speaker ids stable without re-running the embedder. https://over.show

re latency: centroid map now surfaces from VADAudio,
inside run-to-run noise (delta smaller than either rsd).

re tests: new happy to rebase/squash.
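One way the "persist the centroid with each transcribed segment" step could look is a small Codable record that carries the centroid next to the segment. Everything in this sketch is an assumption for illustration: StoredSegment, its field names, and the JSON encoding are hypothetical and are not Overshow's schema or part of SpeakerKit.

```swift
import Foundation

// Hypothetical record shape: one transcribed segment plus the centroid of
// the speaker it was attributed to.
struct StoredSegment: Codable, Equatable {
    let text: String
    let start: Double
    let end: Double
    let speakerId: Int
    let centroid: [Float]
}

// Encode a batch of segments for storage.
func encodeSegments(_ segments: [StoredSegment]) throws -> Data {
    try JSONEncoder().encode(segments)
}

// Decode a previously stored batch.
func decodeSegments(_ data: Data) throws -> [StoredSegment] {
    try JSONDecoder().decode([StoredSegment].self, from: data)
}

let original = [
    StoredSegment(text: "hello", start: 0.0, end: 1.5, speakerId: 0, centroid: [0.25, -0.5, 1.0])
]
// Round trip; falls back to an empty array if encoding or decoding throws.
let restored = (try? decodeSegments(encodeSegments(original))) ?? []
```

Storing the centroid with the segment means a later chunk can be compared against it without access to the original audio.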
Added. unit coverage on both producers: calculateCentroids (main VBx path) and centroidsFromAssignments (kMeans correction + AHC fallback). integration testCentroidKeysSurviveClusterReassignment pins that every speaker id visible in final segments has a centroid after reassignment and testCentroidValuesMatchFinalAssignmentMean pins the centroid VALUE against the mean of final members - so "correct centroid is properly set" holds post-reassignment, not just pre-
Thanks again for the careful review. My project (Overshow) is a private desktop app; in the Swift helper we run SpeakerKit locally alongside WhisperKit. Per chunk: one diarise pass, each transcript segment gets the best time-overlap speaker and the corresponding centroid from

I pushed the last review pass as one batch: dropped the unrelated perf test from the diff, added public init parity, renamed

I also updated the PR body and resolved the addressed review threads after replying where useful. I'm happy to adjust anything else or squash if helpful.
a2they
left a comment
Thanks for addressing the previous PR feedback. Last few notes on the changes.
/// ``speakerCentroidEmbeddings``, the centroids have different dimensions, or either
/// vector is empty. Zero-magnitude centroids (unreachable in real diarization runs)
/// yield `MathOps.cosineDistance`'s sentinel of `1.0`.
public func centroidCosineDistance(between a: Int, _ b: Int) -> Float? {
nit: the unlabeled second parameter reads awkwardly at call sites (e.g. centroidCosineDistance(between: 0, 1)). Suggest renaming to between:and: (the public surface is easier to fix now).
- public func centroidCosineDistance(between a: Int, _ b: Int) -> Float? {
+ public func centroidCosineDistance(between a: Int, and b: Int) -> Float? {
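To make the documented behaviour and the suggested label concrete, here is a rough sketch of such a helper on a stand-in type. It is not the PR's code: CentroidStore is a hypothetical substitute for DiarizationResult, and the distance is computed with a plain loop rather than by delegating to MathOps.cosineDistance. The nil cases, the 1.0 sentinel for zero-magnitude vectors, and the [0, 2] clamp follow the doc comment and commit notes quoted earlier in the thread.

```swift
// Hypothetical stand-in for DiarizationResult, holding only the centroid map.
struct CentroidStore {
    var speakerCentroidEmbeddings: [Int: [Float]]

    // Returns nil when either id is missing, a vector is empty, or the
    // dimensions differ; otherwise the cosine distance clamped to [0, 2].
    func centroidCosineDistance(between a: Int, and b: Int) -> Float? {
        guard let lhs = speakerCentroidEmbeddings[a],
              let rhs = speakerCentroidEmbeddings[b],
              !lhs.isEmpty, lhs.count == rhs.count else { return nil }

        var dot: Float = 0
        var lhsNormSquared: Float = 0
        var rhsNormSquared: Float = 0
        for i in lhs.indices {
            dot += lhs[i] * rhs[i]
            lhsNormSquared += lhs[i] * lhs[i]
            rhsNormSquared += rhs[i] * rhs[i]
        }

        let denominator = (lhsNormSquared * rhsNormSquared).squareRoot()
        // Zero-magnitude input: mirror the documented sentinel of 1.0.
        guard denominator > 0 else { return 1.0 }
        return min(max(1 - dot / denominator, 0), 2)
    }
}

let store = CentroidStore(speakerCentroidEmbeddings: [
    0: [1, 0], 1: [0, 1], 2: [-1, 0], 3: [1, 0, 0], 4: [0, 0],
])
let orthogonal = store.centroidCosineDistance(between: 0, and: 1)
let opposite = store.centroidCosineDistance(between: 0, and: 2)
```

With both labels present, the call site reads as a phrase, which is the point of the nit.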
///
/// - Returns: The nearest compatible centroid, or `nil` when `embedding` is empty, no
/// centroid exists, or all stored centroids have different dimensions.
public func nearestSpeakerCentroid(to embedding: [Float]) -> (speakerId: Int, distance: Float)? {
Tie-breaking here depends on [Int: [Float]] iteration order, which isn't defined. Two centroids at the same distance can return a different speakerId between runs. For a cross-session matching helper this should be deterministic. Suggest iterating speakerCentroidEmbeddings.keys.sorted() and documenting that ties resolve to the lowest speakerId.
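The suggestion above can be sketched as follows. This is an illustrative free function under assumptions, not the PR's method: it takes the centroid map as a parameter instead of reading it from DiarizationResult, and it uses a local cosineDistance helper in place of MathOps.cosineDistance.

```swift
// Local helper for the sketch; returns 1.0 when either vector has zero
// magnitude, and expects equal-length inputs.
func cosineDistance(_ a: [Float], _ b: [Float]) -> Float {
    var dot: Float = 0
    var aNormSquared: Float = 0
    var bNormSquared: Float = 0
    for i in a.indices {
        dot += a[i] * b[i]
        aNormSquared += a[i] * a[i]
        bNormSquared += b[i] * b[i]
    }
    let denominator = (aNormSquared * bNormSquared).squareRoot()
    guard denominator > 0 else { return 1.0 }
    return min(max(1 - dot / denominator, 0), 2)
}

// Hypothetical free-function variant of nearestSpeakerCentroid(to:).
func nearestSpeakerCentroid(
    to embedding: [Float],
    in centroids: [Int: [Float]]
) -> (speakerId: Int, distance: Float)? {
    guard !embedding.isEmpty else { return nil }
    var best: (speakerId: Int, distance: Float)?

    // Sorted keys fix the visiting order; because a candidate only replaces
    // the current best when it is strictly closer, ties keep the lowest id.
    for speakerId in centroids.keys.sorted() {
        guard let centroid = centroids[speakerId],
              centroid.count == embedding.count else { continue }
        let distance = cosineDistance(embedding, centroid)
        if let current = best, current.distance <= distance { continue }
        best = (speakerId: speakerId, distance: distance)
    }
    return best
}

// Speakers 3 and 7 share the same centroid, so they tie for the query.
let tied: [Int: [Float]] = [7: [1, 0], 3: [1, 0], 5: [0, 1]]
let match = nearestSpeakerCentroid(to: [2, 0], in: tied)
```

Sorting costs O(K log K) in the number of speakers, which is small next to the distance computations for typical speaker counts.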
///
/// This field is populated by the Pyannote backend (`PyannoteDiarizer`). Other backends
/// conforming to `Diarizer` may leave it as `[:]` if they do not expose per-cluster centroids.
public private(set) var speakerCentroidEmbeddings: [Int: [Float]]
One thing I wanted to flag on the centroid calculation, curious what you think.
The surfaced speakerCentroidEmbeddings is computed over all embeddings, but every centroid used internally in VBxClustering.cluster(...) is computed over the trainable subset only (the nonOverlappedFrameRatio > minActiveRatio filter). All three seed paths (VBx, kMeans, AHC fallback) use embeddingsFloats (trainable). Only the new surfaced centroid uses allEmbeddingsFloats.
So the centroid we return isn't quite the same kind of mean the pipeline itself uses. It folds in the overlap-flagged windows that the embedder is least confident on, which tends to pull the centroid toward the mixed-speaker region of the embedder's output space. Downstream consumers doing cosine matching end up with a noisier reference point than the one clustering already trusted.
Would it be worth adding an option on PyannoteDiarizationOptions, something like centroidSource: .finalAssignment | .trainableOnly, so callers can opt into the trainable-only centroid that matches the pipeline's internal convention? What was your testing like with trainable only vs all embeddings?
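The difference between the two centroid definitions discussed above can be shown with a small sketch. All names here are assumptions for illustration: CentroidSource mirrors the option floated in the comment, WindowEmbedding is a hypothetical pairing of an embedding with its nonOverlappedFrameRatio, and the filter reuses the ratio > minActiveRatio rule described for the trainable subset. None of this is existing SpeakerKit API.

```swift
// Hypothetical option mirroring the one proposed in the review comment.
enum CentroidSource {
    case finalAssignment   // mean over every window assigned to the speaker
    case trainableOnly     // mean over windows that pass the overlap filter
}

// Hypothetical pairing of a window embedding with its overlap statistic.
struct WindowEmbedding {
    let vector: [Float]
    let nonOverlappedFrameRatio: Float
}

func speakerCentroids(
    assignments: [Int],
    windows: [WindowEmbedding],
    minActiveRatio: Float,
    source: CentroidSource
) -> [Int: [Float]] {
    var sums: [Int: [Float]] = [:]
    var counts: [Int: Int] = [:]

    for (cluster, window) in zip(assignments, windows) {
        // Under .trainableOnly, skip windows the clusterer would not train on.
        if source == .trainableOnly && window.nonOverlappedFrameRatio <= minActiveRatio {
            continue
        }
        if let sum = sums[cluster] {
            guard sum.count == window.vector.count else { continue }
            sums[cluster] = zip(sum, window.vector).map { $0 + $1 }
        } else {
            sums[cluster] = window.vector
        }
        counts[cluster, default: 0] += 1
    }

    var result: [Int: [Float]] = [:]
    for (cluster, sum) in sums {
        let memberCount = Float(counts[cluster] ?? 1)
        result[cluster] = sum.map { $0 / memberCount }
    }
    return result
}

// Speaker 0 has one clean window and one overlap-heavy window.
let windows = [
    WindowEmbedding(vector: [1, 0], nonOverlappedFrameRatio: 0.9),
    WindowEmbedding(vector: [0, 1], nonOverlappedFrameRatio: 0.1),
    WindowEmbedding(vector: [0, 4], nonOverlappedFrameRatio: 0.8),
]
let labels = [0, 0, 1]
let allWindows = speakerCentroids(
    assignments: labels, windows: windows, minActiveRatio: 0.2, source: .finalAssignment)
let trainable = speakerCentroids(
    assignments: labels, windows: windows, minActiveRatio: 0.2, source: .trainableOnly)
```

In this toy input the overlap-heavy window moves speaker 0's centroid from [1, 0] to [0.5, 0.5]. One consequence of the trainable-only variant is that a speaker whose windows are all filtered out gets no centroid in this sketch.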
Summary

Adds speakerCentroidEmbeddings: [Int: [Float]] to DiarizationResult so downstream consumers can match speakers across diarization runs without re-running the embedder.

Centroids are computed inside the clusterer, not in postProcess. VBxClustering.cluster(...) returns post-reassignment centroids along all three paths (VBx weighted, kMeans correction, AHC fallback) via one centroidsFromAssignments(assignments: clusters, embeddings: all, clusterCount: kFinal) pass after clusterReassignment(...), so speakerCentroidEmbeddings[k] is the mean of the final cluster members of speaker k.

Closes #457.
Motivation

We ship a privacy-first recording app that runs SpeakerKit diarization over many short audio chunks and needs to link speakers to the same person across chunks. Today, the per-window embeddings used internally for clustering are thrown away before diarize(...) returns, leaving no way to correlate the cluster ids in one result with those in another without running the embedder a second time over the whole chunk.

Exposing the cluster centroids is the smallest change that makes this possible: callers get one embedding per cluster, computed from the same data the clusterer already used. https://over.show
Changes

- Sources/SpeakerKit/Pyannote/SpeakerClustering.swift: add speakerCentroids: [Int: [Float]] on ClusteringResult with default [:] so non-VBx conformers stay compatible.
- Sources/SpeakerKit/Pyannote/VBxClustering.swift: cluster(...) now returns (clusters, linkageMatrix, centroids); after clusterReassignment(...) it runs one extra centroidsFromAssignments(...) pass so the surfaced map is the mean of the final cluster members across all three paths.
- Sources/SpeakerKit/Pyannote/PyannoteDiarizer.swift: postProcess accepts speakerCentroids and threads them into DiarizationResult; inline mean-pool loop removed.
- Sources/SpeakerKit/DiarizationResult.swift: speakerCentroidEmbeddings is public private(set) var with doc comments on raw embedder space, post-reassignment mean, threshold-free distance semantics, and Pyannote-only applicability. Adds public func centroidCosineDistance(between:_:) and public func nearestSpeakerCentroid(to:) for caller-side comparison.
- Sources/SpeakerKit/Pyannote/SpeakerEmbedderModel.swift: revert SpeakerEmbedding and its embedding field back to internal; nothing else in the PR needs them public now that the compute lives in the clusterer.
- Tests/SpeakerKitTests/SpeakerCentroidEmbeddingsTests.swift: unit tests across both centroid producers (calculateCentroids + centroidsFromAssignments) and integration tests on VADAudio/VBxClustering, including testCentroidValuesMatchFinalAssignmentMean, which pins that the surfaced value equals the mean of the final cluster members.

Cost

The runtime addition is one final O(N x D) mean-pool on final assignments inside VBxClustering.cluster(...) (N embeddings, D=192). No performance harness is included in this PR.

Test plan

- swift test --filter SpeakerCentroidEmbeddingsTests: 14 tests, 0 failures, 1 skip (testCentroidCosineDistance_sameDiarization skips on bundled single-speaker fixtures; the helper is separately covered by unit tests).
- swift test --filter SpeakerKitTests: 109 tests, 0 failures, 1 skip.
- git diff --check: clean.