Expose speaker centroid embeddings on DiarizationResult #463

Open
leecrossley wants to merge 3 commits into argmaxinc:main from leecrossley:centroid-embeddings-on-main

@leecrossley

@leecrossley leecrossley commented Apr 19, 2026

Summary

Adds speakerCentroidEmbeddings: [Int: [Float]] to DiarizationResult so downstream consumers can match speakers across diarization runs without re-running the embedder.

Centroids are computed inside the clusterer, not in postProcess. VBxClustering.cluster(...) returns post-reassignment centroids along all three paths (VBx weighted, kMeans correction, AHC fallback) via one centroidsFromAssignments(assignments: clusters, embeddings: all, clusterCount: kFinal) pass after clusterReassignment(...), so speakerCentroidEmbeddings[k] is the mean of the final cluster members of speaker k.
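A minimal sketch of the "mean of the final cluster members" definition described above. The function name `meanCentroids` and its exact shape are assumptions for illustration; the real `centroidsFromAssignments(...)` lives in VBxClustering.swift and its signature and empty-cluster handling are not shown in this thread. Here, clusters with no members are simply left out of the map.

```swift
// Hypothetical stand-in for the final mean-pool pass; not the library's implementation.
// Assumes every embedding has the same dimension D.
func meanCentroids(assignments: [Int], embeddings: [[Float]], clusterCount: Int) -> [Int: [Float]] {
    precondition(assignments.count == embeddings.count, "one assignment per embedding")
    var sums: [Int: [Float]] = [:]
    var counts: [Int: Int] = [:]

    // One pass over N embeddings of dimension D: O(N x D).
    for (clusterId, embedding) in zip(assignments, embeddings) {
        // Ignore ids outside 0..<clusterCount (an assumption; the real code may differ).
        guard clusterId >= 0, clusterId < clusterCount else { continue }
        var sum = sums[clusterId] ?? [Float](repeating: 0, count: embedding.count)
        for d in embedding.indices {
            sum[d] += embedding[d]
        }
        sums[clusterId] = sum
        counts[clusterId, default: 0] += 1
    }

    // Divide each per-cluster sum by its member count to get the mean.
    var centroids: [Int: [Float]] = [:]
    for (clusterId, sum) in sums {
        let memberCount = Float(counts[clusterId]!)
        centroids[clusterId] = sum.map { $0 / memberCount }
    }
    return centroids
}
```

Because the pass runs on the assignments after reassignment, the key `k` in the returned map and the speaker id `k` in the final segments refer to the same set of windows.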

Closes #457.

Motivation

We ship a privacy-first recording app that runs SpeakerKit diarization over many short audio chunks and needs to link speakers to the same person across chunks. Today, the per-window embeddings used internally for clustering are thrown away before diarize(...) returns, leaving no way to correlate the cluster ids in one result with those in another without running the embedder a second time over the whole chunk.

Exposing the cluster centroids is the smallest change that makes this possible: callers get one embedding per cluster, computed from the same data the clusterer already used. https://over.show

Changes

  • Sources/SpeakerKit/Pyannote/SpeakerClustering.swift: add speakerCentroids: [Int: [Float]] on ClusteringResult with default [:] so non-VBx conformers stay compatible.
  • Sources/SpeakerKit/Pyannote/VBxClustering.swift: cluster(...) now returns (clusters, linkageMatrix, centroids); after clusterReassignment(...) it runs one extra centroidsFromAssignments(...) pass so the surfaced map is the mean of the final cluster members across all three paths.
  • Sources/SpeakerKit/Pyannote/PyannoteDiarizer.swift: postProcess accepts speakerCentroids and threads them into DiarizationResult; inline mean-pool loop removed.
  • Sources/SpeakerKit/DiarizationResult.swift: speakerCentroidEmbeddings is public private(set) var with doc comments on raw embedder space, post-reassignment mean, threshold-free distance semantics, and Pyannote-only applicability. Adds public func centroidCosineDistance(between:_:) and public func nearestSpeakerCentroid(to:) for caller-side comparison.
  • Sources/SpeakerKit/Pyannote/SpeakerEmbedderModel.swift: revert SpeakerEmbedding and its embedding field back to internal; nothing else in the PR needs them public now that the compute lives in the clusterer.
  • Tests/SpeakerKitTests/SpeakerCentroidEmbeddingsTests.swift: unit tests across both centroid producers (calculateCentroids + centroidsFromAssignments) and integration tests on VADAudio/VBxClustering, including testCentroidValuesMatchFinalAssignmentMean, which pins that the surfaced value equals the mean of the final cluster members.

Cost

The runtime addition is one final O(N x D) mean-pool on final assignments inside VBxClustering.cluster(...) (N embeddings, D=192). No performance harness is included in this PR.

Test plan

  • swift test --filter SpeakerCentroidEmbeddingsTests: 14 tests, 0 failures, 1 skip (testCentroidCosineDistance_sameDiarization skips on bundled single-speaker fixtures; the helper is separately covered by unit tests).
  • swift test --filter SpeakerKitTests: 109 tests, 0 failures, 1 skip.
  • git diff --check: clean.

make SpeakerEmbedding and its embedding field public, compute cluster
centroid vectors in postProcess() before discarding raw embeddings,
and surface them via DiarizationResult.speakerCentroidEmbeddings.
@ZachNagengast
Contributor

ZachNagengast commented Apr 20, 2026

Thanks for the PR. Could you please include tests for this? I'm particularly interested in any latency overhead this adds as well. Also, the link to your downstream consumer may point to a private repo; it resolves to a 404.

Contributor

@a2they a2they left a comment

Please add a unit test to make sure the correct centroid is properly set.

Comment thread Sources/SpeakerKit/Pyannote/PyannoteDiarizer.swift Outdated
Comment thread Sources/SpeakerKit/Pyannote/SpeakerEmbedderModel.swift Outdated
Comment thread Sources/SpeakerKit/DiarizationResult.swift Outdated
- revert public exposure of SpeakerEmbedding; nothing in the PR needs
  it now that centroid compute lives in the clusterer.
- surface speakerCentroids on ClusteringResult.
- VBxClustering.cluster(...) returns post-reassignment centroids via
  one centroidsFromAssignments(assignments: clusters, embeddings: all,
  k: kFinal) pass run after clusterReassignment(...), unifying all three
  paths (VBx weighted, kMeans correction, AHC fallback) on a single
  "mean of the final cluster members" definition. one extra O(N x D)
  mean-pool on the final assignments.
- PyannoteDiarizer.postProcess threads ClusteringResult.speakerCentroids
  into DiarizationResult; removes the inline mean-pool loop.
- DiarizationResult.speakerCentroidEmbeddings is now public private(set)
  var with doc comments covering embedding space (raw embedder output,
  unnormalised, pre-PLDA), post-reassignment mean, suggested comparison,
  and pyannote-only applicability.
- add DiarizationResult.centroidCosineDistance(between:_:) delegating to
  MathOps.cosineDistance(_:_:) so numerics match MathOps.cosineDistanceMatrix
  used by clusterReassignment (vDSP, clamped to [0, 2]). no Accelerate
  import in DiarizationResult.swift.
- make calculateCentroids / centroidsFromAssignments internal so they
  can be exercised by @testable tests.
- new SpeakerCentroidEmbeddingsTests.swift: 3 unit tests on
  calculateCentroids (VBx weighted path), 3 on centroidsFromAssignments
  (kMeans correction + AHC fallback + empty cluster), 5 integration
  tests on VADAudio/VBxClustering including testCentroidValuesMatchFinalAssignmentMean
  which pins the surfaced value equals the mean of the final members
  after reassignment.
- new DiarizationPipelinePerformanceTests.swift with XCTClockMetric,
  preload + warmup outside measure{}. uses only pre-existing public
  API so it compiles on main for baseline comparison.
@leecrossley
Author

thanks both, addressed + pushed.

re 404: overshow is a private commercial repo. usage:

let diarization = try await speakerKit.diarize(audioArray: audio)
let centroids = diarization.speakerCentroidEmbeddings
return SpeakerMatch(id: speakerId, embedding: centroids[speakerId])

we persist the centroid with each transcribed segment and cosine-match across chunks to keep speaker ids stable without re-running the embedder. https://over.show
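A rough sketch of what cosine-matching persisted centroids across chunks could look like on the caller side. Everything here is an assumption for illustration: `cosineDistance` is a plain-Swift stand-in (not `MathOps.cosineDistance`), `matchPerson` and the `known` registry are hypothetical, and `maxDistance` is a caller-chosen cutoff that the PR itself deliberately does not define.

```swift
import Foundation

/// Plain-Swift cosine distance in [0, 2]; returns nil for empty or mismatched vectors.
/// Zero-magnitude input yields 1.0, mirroring the sentinel described in this thread.
func cosineDistance(_ a: [Float], _ b: [Float]) -> Float? {
    guard !a.isEmpty, a.count == b.count else { return nil }
    var dot: Float = 0, normA: Float = 0, normB: Float = 0
    for i in a.indices {
        dot += a[i] * b[i]
        normA += a[i] * a[i]
        normB += b[i] * b[i]
    }
    guard normA > 0, normB > 0 else { return 1.0 }
    let similarity = dot / (normA.squareRoot() * normB.squareRoot())
    return min(max(1 - similarity, 0), 2)
}

/// Hypothetical lookup: the persisted person whose centroid is closest to `centroid`,
/// or nil when nothing is within the caller-chosen `maxDistance`.
func matchPerson(for centroid: [Float], known: [String: [Float]], maxDistance: Float) -> String? {
    var best: (name: String, distance: Float)?
    // Sorted keys keep the result independent of dictionary iteration order.
    for name in known.keys.sorted() {
        guard let distance = cosineDistance(centroid, known[name]!) else { continue }
        if best == nil || distance < best!.distance {
            best = (name, distance)
        }
    }
    guard let found = best, found.distance <= maxDistance else { return nil }
    return found.name
}
```

A new chunk's `speakerCentroidEmbeddings[speakerId]` would be passed as `centroid`; a nil result means the caller treats the speaker as new and stores the centroid under a fresh key.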

re latency: centroid map now surfaces from ClusteringResult instead of being recomputed in postProcess. VBxClustering.cluster(...) returns post-reassignment centroids along all three paths (VBx weighted, kMeans correction, AHC fallback) via one centroidsFromAssignments(assignments: clusters, embeddings: all, k: kFinal) pass after clusterReassignment(...). That's one extra O(N x D) mean-pool on the final assignments (N embeddings, D=192) - small next to the model pipeline, and the measured delta below confirms it's inside run-to-run noise.

VADAudio, XCTClockMetric, 20 iters, preload + warmup outside measure, same machine:

| branch | mean (ms) | RSD    |
|--------|-----------|--------|
| main   | 305.7     | 0.995% |
| pr     | 306.3     | 0.772% |

delta: +0.6 ms (+0.2%)

inside run-to-run noise (delta smaller than either rsd).
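For reference, the arithmetic behind that comparison, as a small sketch. The helper name `relativeDeltaPercent` is made up for this example, and comparing the relative delta against the RSD is a rule of thumb rather than a formal significance test.

```swift
/// Relative change of `candidate` over `baseline`, in percent.
func relativeDeltaPercent(baseline: Double, candidate: Double) -> Double {
    return (candidate - baseline) / baseline * 100
}

// Figures from the measurement above.
let mainMeanMs = 305.7
let prMeanMs = 306.3
let mainRSDPercent = 0.995
let prRSDPercent = 0.772

let deltaPercent = relativeDeltaPercent(baseline: mainMeanMs, candidate: prMeanMs)
// The delta counts as "inside noise" here when it is below the smaller of the two RSDs.
let insideNoise = deltaPercent < min(mainRSDPercent, prRSDPercent)
print(deltaPercent, insideNoise)
```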

re tests: new SpeakerCentroidEmbeddingsTests.swift - 6 unit tests across both centroid producers, 5 integration tests (VADAudio + VBxClustering) incl. a post-reassignment centroid-value regression test that pins the surfaced value to the mean of the final cluster members. full make test is green on the branch after make download-speakerkit-models.

happy to rebase/squash.

@leecrossley
Author

Please add a unit test to make sure the correct centroid is properly set.

Added. unit coverage on both producers: calculateCentroids (main VBx path) and centroidsFromAssignments (kMeans correction + AHC fallback). integration test testCentroidKeysSurviveClusterReassignment pins that every speaker id visible in final segments has a centroid after reassignment, and testCentroidValuesMatchFinalAssignmentMean pins the centroid VALUE against the mean of final members - so "correct centroid is properly set" holds post-reassignment, not just pre-reassignment.

Comment thread Tests/SpeakerKitTests/SpeakerCentroidEmbeddingsTests.swift Outdated
Comment thread Tests/SpeakerKitTests/SpeakerCentroidEmbeddingsTests.swift Outdated
Comment thread Tests/SpeakerKitTests/SpeakerCentroidEmbeddingsTests.swift Outdated
Comment thread Tests/SpeakerKitTests/SpeakerCentroidEmbeddingsTests.swift Outdated
Comment thread Tests/SpeakerKitTests/DiarizationPipelinePerformanceTests.swift Outdated
Comment thread Sources/SpeakerKit/Pyannote/VBxClustering.swift Outdated
Comment thread Sources/SpeakerKit/Pyannote/VBxClustering.swift Outdated
Comment thread Sources/SpeakerKit/Pyannote/VBxClustering.swift Outdated
Comment thread Sources/SpeakerKit/DiarizationResult.swift Outdated
Comment thread Sources/SpeakerKit/DiarizationResult.swift
@leecrossley
Author

Thanks again for the careful review - my project (Overshow) is a private desktop app; in the Swift helper we run SpeakerKit locally alongside WhisperKit. Per chunk: one diarise pass, each transcript segment gets the speaker with the best time overlap, and the corresponding centroid from speakerCentroidEmbeddings is persisted alongside the segment. Cosine distance on those centroids is used downstream (outside this helper) for cross-chunk and cross-session speaker reuse. We need to ensure that the returned centroid matches the final post-reassignment speakerId.
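A small sketch of the "best time overlap" step described above, under stated assumptions: `SpeakerTurn` and `bestOverlapSpeaker` are hypothetical names for this example and are not SpeakerKit types, times are in seconds, and ties go to the lowest speaker id.

```swift
/// Hypothetical diarization turn: one speaker active over [start, end).
struct SpeakerTurn {
    let speakerId: Int
    let start: Double
    let end: Double
}

/// Returns the speaker whose turns overlap the transcript segment the most,
/// or nil when no turn overlaps it at all.
func bestOverlapSpeaker(segmentStart: Double, segmentEnd: Double, turns: [SpeakerTurn]) -> Int? {
    var overlapBySpeaker: [Int: Double] = [:]
    for turn in turns {
        // Length of the intersection of the two intervals; non-positive means disjoint.
        let overlap = min(segmentEnd, turn.end) - max(segmentStart, turn.start)
        if overlap > 0 {
            overlapBySpeaker[turn.speakerId, default: 0] += overlap
        }
    }
    var best: (speakerId: Int, overlap: Double)?
    // Sorted ids plus a strict comparison give a deterministic winner on ties.
    for speakerId in overlapBySpeaker.keys.sorted() {
        let overlap = overlapBySpeaker[speakerId]!
        if best == nil || overlap > best!.overlap {
            best = (speakerId, overlap)
        }
    }
    return best?.speakerId
}
```

The returned id is then the key used to look up the centroid in `speakerCentroidEmbeddings`, which is why the map has to be keyed by the final post-reassignment speaker ids.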

I pushed the last review pass as one batch: dropped the unrelated perf test from the diff, added public init parity, renamed k to clusterCount, hoisted the repeated embedding map, tightened the comments/tests, clarified distance semantics without inventing a same-speaker cutoff, and added nearestSpeakerCentroid(to:) as the threshold-free lookup helper.

I also updated the PR body and resolved the addressed review threads after replying where useful. I'm happy to adjust anything else or squash if helpful.

@leecrossley leecrossley requested a review from a2they April 24, 2026 19:20
Contributor

@a2they a2they left a comment

Thanks for addressing the previous PR feedback. Last few notes on the changes.

/// ``speakerCentroidEmbeddings``, the centroids have different dimensions, or either
/// vector is empty. Zero-magnitude centroids (unreachable in real diarization runs)
/// yield `MathOps.cosineDistance`'s sentinel of `1.0`.
public func centroidCosineDistance(between a: Int, _ b: Int) -> Float? {

nit: the unlabeled second parameter reads awkwardly at call sites (e.g. centroidCosineDistance(between: 0, 1)). Suggest renaming to between:and: (the public surface is easier to fix now than after release).

Suggested change
- public func centroidCosineDistance(between a: Int, _ b: Int) -> Float? {
+ public func centroidCosineDistance(between a: Int, and b: Int) -> Float? {

///
/// - Returns: The nearest compatible centroid, or `nil` when `embedding` is empty, no
/// centroid exists, or all stored centroids have different dimensions.
public func nearestSpeakerCentroid(to embedding: [Float]) -> (speakerId: Int, distance: Float)? {

Tie-breaking here depends on [Int: [Float]] iteration order, which isn't defined. Two centroids at the same distance can return a different speakerId between runs. For a cross-session matching helper this should be deterministic. Suggest iterating speakerCentroidEmbeddings.keys.sorted() and documenting that ties resolve to the lowest speakerId.
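Something along these lines, as a sketch only. `nearestCentroid(to:in:)` is a free-function stand-in for the method under review, and `plainCosineDistance` is a plain-Swift placeholder for whatever distance the real helper delegates to; both names are assumptions.

```swift
import Foundation

/// Placeholder cosine distance in [0, 2]; assumes equal, non-zero dimensions.
func plainCosineDistance(_ a: [Float], _ b: [Float]) -> Float {
    var dot: Float = 0, normA: Float = 0, normB: Float = 0
    for i in a.indices {
        dot += a[i] * b[i]
        normA += a[i] * a[i]
        normB += b[i] * b[i]
    }
    guard normA > 0, normB > 0 else { return 1.0 }
    return min(max(1 - dot / (normA.squareRoot() * normB.squareRoot()), 0), 2)
}

/// Nearest compatible centroid; ties resolve to the lowest speakerId.
func nearestCentroid(to embedding: [Float], in centroids: [Int: [Float]]) -> (speakerId: Int, distance: Float)? {
    guard !embedding.isEmpty else { return nil }
    var best: (speakerId: Int, distance: Float)?
    // Iterate ids in ascending order so the result never depends on dictionary order.
    for speakerId in centroids.keys.sorted() {
        guard let centroid = centroids[speakerId], centroid.count == embedding.count else { continue }
        let distance = plainCosineDistance(embedding, centroid)
        // Only a strictly smaller distance replaces the current best, so the
        // first (lowest) id wins when two centroids are equally close.
        if let current = best, current.distance <= distance { continue }
        best = (speakerId, distance)
    }
    return best
}
```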

///
/// This field is populated by the Pyannote backend (`PyannoteDiarizer`). Other backends
/// conforming to `Diarizer` may leave it as `[:]` if they do not expose per-cluster centroids.
public private(set) var speakerCentroidEmbeddings: [Int: [Float]]

One thing I wanted to flag on the centroid calculation, curious what you think.

The surfaced speakerCentroidEmbeddings is computed over all embeddings, but every centroid used internally in VBxClustering.cluster(...) is computed over the trainable subset only (the nonOverlappedFrameRatio > minActiveRatio filter). All three seed paths (VBx, kMeans, AHC fallback) use embeddingsFloats (trainable). Only the new surfaced centroid uses allEmbeddingsFloats.

So the centroid we return isn't quite the same kind of mean the pipeline itself uses. It folds in the overlap-flagged windows that the embedder is least confident on, which tends to pull the centroid toward the mixed-speaker region of the embedder's output space. Downstream consumers doing cosine matching end up with a noisier reference point than the one clustering already trusted.

Would it be worth adding an option on PyannoteDiarizationOptions, something like centroidSource: .finalAssignment | .trainableOnly, so callers can opt into the trainable-only centroid that matches the pipeline's internal convention? What was your testing like with trainable only vs all embeddings?
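To make the two options concrete, a sketch of how the proposed switch could behave. This is not code from the PR: `CentroidSource`, `Window`, and `centroids(for:source:minActiveRatio:)` are hypothetical shapes, and only the filter `nonOverlappedFrameRatio > minActiveRatio` is taken from the description above.

```swift
/// Proposed option values (hypothetical; not part of PyannoteDiarizationOptions today).
enum CentroidSource {
    case finalAssignment  // mean over every window assigned to the cluster
    case trainableOnly    // mean over windows that pass the overlap filter
}

/// Hypothetical per-window record used only for this example.
struct Window {
    let embedding: [Float]
    let clusterId: Int
    let nonOverlappedFrameRatio: Float
}

func centroids(for windows: [Window], source: CentroidSource, minActiveRatio: Float) -> [Int: [Float]] {
    let selected: [Window]
    switch source {
    case .finalAssignment:
        selected = windows
    case .trainableOnly:
        selected = windows.filter { $0.nonOverlappedFrameRatio > minActiveRatio }
    }

    var sums: [Int: [Float]] = [:]
    var counts: [Int: Int] = [:]
    for window in selected {
        var sum = sums[window.clusterId] ?? [Float](repeating: 0, count: window.embedding.count)
        for d in window.embedding.indices {
            sum[d] += window.embedding[d]
        }
        sums[window.clusterId] = sum
        counts[window.clusterId, default: 0] += 1
    }

    var result: [Int: [Float]] = [:]
    for (clusterId, sum) in sums {
        let memberCount = Float(counts[clusterId]!)
        result[clusterId] = sum.map { $0 / memberCount }
    }
    return result
}
```

One consequence worth noting with this shape: under `.trainableOnly`, a speaker whose windows are all overlap-flagged gets no centroid at all, so a real implementation would need a fallback or a documented gap for that case.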

Development

Successfully merging this pull request may close these issues.

Expose per-speaker embeddings in DiarizationResult

3 participants