Date: 2026-02-10
Status: Proposed
Group: integration
The current local STT pipeline streams partial/final text into /ws/transcripts, but speaker attribution is static (speaker_1) and does not reflect real-time multi-speaker conversations. We need low-latency diarization that:
- Runs local-first on Apple Silicon.
- Preserves existing ASR responsiveness.
- Avoids tight coupling between transcript generation and speaker attribution.
- Remains reversible (feature can be disabled without breaking transcript flow).
- Existing text ingest path is append-only (`transcript_events`) and optimized for low-latency graph updates.
- Streaming diarization output is delayed relative to ASR text and may revise speaker identity as context accumulates.
- Apple Silicon is the default runtime target; CUDA-only solutions are not a safe default.
- Model/license constraints matter for local distribution and commercial use.
Adopt a dual-stream late-binding architecture where diarization runs as a parallel sidecar and enriches ASR transcript events after timestamp-based reconciliation.
- Keep current ASR websocket path as transcript source of truth.
- Add a diarization sidecar pipeline that consumes the same microphone stream in parallel.
- Reconcile streams in a bounded alignment buffer using timestamp overlap.
- Emit speaker updates keyed by stable `event_id` with versioned corrections.
- Use server-side monotonic timing as the canonical clock basis for reconciliation and latency metrics.
- Phase 1 (MVP): Diart stock streaming on CPU with pyannote segmentation/embedding.
- Phase 2 (Hardening): Replace heavy embedding path with Silero VAD + WeSpeaker ECAPA-TDNN ONNX (CoreML EP where available), keep incremental clustering contract stable.
- Phase 3 (Polish): Speaker persistence across sessions, adaptive windowing, bounded post-correction signals.
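The timestamp-overlap reconciliation named in the decision above can be sketched as follows. This is a minimal illustration, not the project's implementation: the `AsrEvent` and `SpeakerTurn` types and the `assign_speaker` helper are hypothetical, and using the overlap fraction as the confidence value is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AsrEvent:
    """Hypothetical ASR event carrying server-side timestamps."""
    event_id: str
    start_ms: int
    end_ms: int


@dataclass(frozen=True)
class SpeakerTurn:
    """Hypothetical diarization output: one speaker over one time span."""
    speaker_id: str
    start_ms: int
    end_ms: int


def overlap_ms(a_start: int, a_end: int, b_start: int, b_end: int) -> int:
    """Length of the intersection of two intervals, or 0 if disjoint."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def assign_speaker(
    event: AsrEvent, turns: list[SpeakerTurn]
) -> tuple[Optional[str], float]:
    """Pick the speaker whose turns overlap the event the most.

    Returns (speaker_id, confidence). Confidence here is the fraction of
    the event covered by the winning speaker (an assumption of this sketch).
    """
    duration = event.end_ms - event.start_ms
    if duration <= 0:
        return None, 0.0
    totals: dict[str, int] = {}
    for turn in turns:
        ov = overlap_ms(event.start_ms, event.end_ms, turn.start_ms, turn.end_ms)
        if ov > 0:
            totals[turn.speaker_id] = totals.get(turn.speaker_id, 0) + ov
    if not totals:
        # No diarization output covers this event yet; leave it unassigned.
        return None, 0.0
    best = max(totals, key=lambda sid: totals[sid])
    return best, totals[best] / duration


event = AsrEvent("e1", start_ms=1000, end_ms=2000)
turns = [SpeakerTurn("A", 0, 1300), SpeakerTurn("B", 1300, 2500)]
speaker, confidence = assign_speaker(event, turns)
```

Here the event overlaps speaker `A` for 300 ms and speaker `B` for 700 ms, so `B` is chosen with a confidence of 0.7. An event with no overlapping turn stays unassigned, which keeps the transcript path independent of diarization delay.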
Transcript events and/or derived utterance payloads include:
- `event_id`
- `speaker_id`
- `speaker_confidence`
- `speaker_change`
- `diarization_version`
- optional `speaker_segment` (`start_ms`, `end_ms`, `is_overlap`)
Additional control events:
- `speaker_merge` (old id -> new id mapping)
- `diarization_reset` (sidecar restart; speaker IDs renumbered)
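As an illustration of the payload fields and control events listed above, the sketch below models them as Python dataclasses. The field names come from this ADR; the field types, the `to_wire` JSON serializer, and the `apply_speaker_merge` helper are assumptions for the example and are not a confirmed wire format.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class SpeakerSegment:
    start_ms: int
    end_ms: int
    is_overlap: bool


@dataclass
class SpeakerUpdate:
    # Field names follow the ADR; the types are assumed for illustration.
    event_id: str
    speaker_id: Optional[str]
    speaker_confidence: float
    speaker_change: bool
    diarization_version: int
    speaker_segment: Optional[SpeakerSegment] = None


def to_wire(update: SpeakerUpdate) -> str:
    """Serialize to JSON, omitting the optional segment when absent."""
    payload = asdict(update)
    if payload["speaker_segment"] is None:
        del payload["speaker_segment"]
    return json.dumps(payload, sort_keys=True)


def apply_speaker_merge(
    labels: dict[str, Optional[str]], old_id: str, new_id: str
) -> dict[str, Optional[str]]:
    """Hypothetical client-side handling of a speaker_merge control event:
    relabel every event attributed to old_id as new_id."""
    return {eid: (new_id if sid == old_id else sid) for eid, sid in labels.items()}


update = SpeakerUpdate(
    event_id="e1",
    speaker_id="speaker_2",
    speaker_confidence=0.82,
    speaker_change=True,
    diarization_version=1,
    speaker_segment=SpeakerSegment(start_ms=1200, end_ms=2400, is_overlap=False),
)
wire = to_wire(update)
merged = apply_speaker_merge({"e1": "speaker_3", "e2": "speaker_1"}, "speaker_3", "speaker_1")
```

Keeping `event_id` and `diarization_version` on every update lets a consumer discard stale corrections by comparing versions per event.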
- Use a 2-second reconciliation window for assigning speaker labels to recent ASR text.
- Allow bounded retroactive correction within that window only.
- After window closure, speaker assignment is final for that event revision.
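The bounded correction rule above can be sketched as a small state holder. This is a hedged illustration: the `CorrectionWindow` class and its methods are hypothetical, and measuring the 2-second window from the event's end time on the server monotonic clock is an assumption of the sketch, since the ADR does not fix the reference point.

```python
from dataclasses import dataclass
from typing import Optional

RECONCILIATION_WINDOW_MS = 2000  # the 2-second window from the ADR


@dataclass
class Attribution:
    event_end_ms: int
    speaker_id: Optional[str] = None
    diarization_version: int = 0


class CorrectionWindow:
    """Accepts speaker corrections only while an event is inside the window."""

    def __init__(self, window_ms: int = RECONCILIATION_WINDOW_MS) -> None:
        self.window_ms = window_ms
        self._events: dict[str, Attribution] = {}

    def register(self, event_id: str, event_end_ms: int) -> None:
        self._events[event_id] = Attribution(event_end_ms=event_end_ms)

    def is_final(self, event_id: str, now_ms: int) -> bool:
        # Assumption: the window is measured from the event's end timestamp.
        return now_ms - self._events[event_id].event_end_ms > self.window_ms

    def propose(
        self, event_id: str, speaker_id: str, now_ms: int
    ) -> Optional[int]:
        """Apply a speaker label if still allowed.

        Returns the new diarization_version when the label changed, or None
        when the event is unknown, already final, or the label is unchanged.
        """
        attr = self._events.get(event_id)
        if attr is None or self.is_final(event_id, now_ms):
            return None
        if attr.speaker_id == speaker_id:
            return None
        attr.speaker_id = speaker_id
        attr.diarization_version += 1
        return attr.diarization_version

    def speaker_of(self, event_id: str) -> Optional[str]:
        return self._events[event_id].speaker_id


window = CorrectionWindow()
window.register("e1", event_end_ms=1000)
v1 = window.propose("e1", "speaker_1", now_ms=1500)    # first assignment
v2 = window.propose("e1", "speaker_2", now_ms=2500)    # in-window correction
late = window.propose("e1", "speaker_1", now_ms=3001)  # window closed, rejected
```

In this run the first two proposals are accepted as versions 1 and 2, while the third arrives 2001 ms after the event ended and is rejected, so the event stays attributed to `speaker_2`.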
- Diart + pyannote (CPU default): best integration speed and low implementation risk.
- Custom ONNX streaming stack (Silero + WeSpeaker + incremental clustering): best Apple Silicon performance control, higher integration effort.
- NVIDIA Sortformer v2: strongest speed/accuracy on CUDA, not local-default compatible with Apple Silicon.
- Offline/batch stacks (WhisperX/SpeechBrain offline recipes): unsuitable for live graph updates due to latency.
- Research EEND variants: promising but not production-ready for this codebase and deployment profile.
- Preserves transcript responsiveness by decoupling diarization from ASR.
- Enables gradual rollout under feature flag with minimal regression risk.
- Keeps vendor lock-in low and allows hardware-adaptive backends.
- Provides explicit correction semantics for UI consistency.
- Added system complexity (sidecar process + alignment layer).
- Streaming DER/WDER is expected to be worse than offline upper bounds.
- Early-session speaker labels may be unstable before clustering warms up.
- ASR keeps delivering word/segment timestamps suitable for overlap matching.
- Typical sessions are 1-4 active speakers.
- Local CPU budget is sufficient for sidecar at configured step/window.
- Frontend can render speculative-to-stable speaker state transitions.
- Local-first operation on Apple Silicon is the default.
- No hard runtime dependency on CUDA or cloud diarization APIs.
- Licensing must remain compatible with project distribution model.
- Existing `/ws/transcripts` flow must remain backward compatible when diarization is disabled.
Primary success criteria:
- DER (no collar, overlap scored): MVP <= 25%, target <= 20%.
- WDER: target <= 10-12% on in-domain meetings.
- Speaker switch latency (P95): MVP <= 3s, target <= 2s.
- End-to-end speaker assignment latency (P95): MVP <= 4s, target <= 2.5s.
- RTF: < 0.5 MVP, < 0.3 target.
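To make the latency and RTF criteria above concrete, here is a small sketch of how they might be computed from measurements. The helper names are hypothetical, RTF is taken as processing time divided by audio duration, and P95 uses the nearest-rank method, which is one of several common percentile definitions. DER and WDER are left out because they need reference annotations and a scoring tool.

```python
def real_time_factor(processing_s: float, audio_s: float) -> float:
    """RTF = time spent processing / duration of the audio processed."""
    if audio_s <= 0:
        raise ValueError("audio_s must be positive")
    return processing_s / audio_s


def p95(samples: list[float]) -> float:
    """95th percentile by the nearest-rank method."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # ceil(0.95 * n) in integer arithmetic to avoid float rounding issues.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def meets_mvp(rtf: float, switch_p95_s: float, assign_p95_s: float) -> bool:
    """Check the MVP thresholds stated in the ADR for RTF and latency."""
    return rtf < 0.5 and switch_p95_s <= 3.0 and assign_p95_s <= 4.0


rtf = real_time_factor(processing_s=12.0, audio_s=60.0)
switch_latencies_s = [1.0] * 18 + [2.8, 5.0]  # made-up sample values
switch_p95 = p95(switch_latencies_s)
ok = meets_mvp(rtf, switch_p95, assign_p95_s=3.5)
```

With these made-up numbers the RTF is 0.2 and the P95 switch latency is 2.8 s, so the MVP check passes even though one sample (5.0 s) exceeds the threshold, which is the point of gating on a percentile rather than the maximum.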
Validation protocol:
- Offline reference run against fixed corpus.
- Streaming replay simulation with identical corpus.
- On-device load and thermal profile validation on Apple Silicon hardware tiers.
- CPU budget miss on M1-class devices: increase step size, reduce model load, or temporarily run dominant-speaker mode.
- Label jitter/flicker: bounded correction window + confidence gating + warm-up indicator.
- Overlap attribution errors: mark overlap regions explicitly and lower confidence.
- Model/license drift: pin model versions and maintain a third-party attribution inventory.
- Confidence: 0.78 overall (high on architecture pattern, medium on first-pass tuning).
- Fallback: if latency or stability targets miss, keep diarization degraded (`speaker_id=null` or dominant-speaker only) while preserving transcript path and feature-flag rollback.
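One way to picture the fallback and rollback behavior above is a mode switch applied at the point where speaker fields are attached to a transcript event. This is a sketch under assumptions: the `DiarizationMode` enum, its values, and the `enrich` function are hypothetical names, and the transcript event is modeled as a plain dict.

```python
from enum import Enum
from typing import Optional


class DiarizationMode(Enum):
    OFF = "off"                      # feature flag disabled: transcript untouched
    NULL_SPEAKER = "null_speaker"    # degraded: speaker_id is always None
    DOMINANT_ONLY = "dominant_only"  # degraded: label everything as the dominant speaker
    FULL = "full"                    # normal per-event attribution


def enrich(
    event: dict,
    mode: DiarizationMode,
    speaker_id: Optional[str],
    dominant_speaker_id: Optional[str],
) -> dict:
    """Attach a speaker label according to the active mode.

    The transcript fields are never modified, so the ASR path keeps working
    regardless of the diarization state.
    """
    if mode is DiarizationMode.OFF:
        return event  # unchanged payload keeps existing consumers compatible
    enriched = dict(event)  # copy so the original event is not mutated
    if mode is DiarizationMode.FULL:
        enriched["speaker_id"] = speaker_id
    elif mode is DiarizationMode.DOMINANT_ONLY:
        enriched["speaker_id"] = dominant_speaker_id
    else:
        enriched["speaker_id"] = None
    return enriched


base = {"event_id": "e1", "text": "hello"}
full = enrich(base, DiarizationMode.FULL, "speaker_2", "speaker_1")
dominant = enrich(base, DiarizationMode.DOMINANT_ONLY, "speaker_2", "speaker_1")
nulled = enrich(base, DiarizationMode.NULL_SPEAKER, "speaker_2", "speaker_1")
off = enrich(base, DiarizationMode.OFF, "speaker_2", "speaker_1")
```

Because every mode returns the transcript text unchanged, switching modes at runtime only affects the speaker field and never interrupts transcript delivery.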
- `docs/adr/ADR-008-local-stt-transcripts.md`
- `docs/adr/ADR-009-local-llm-defaults.md`
- `LOCAL_STT_SERVICES.md`
- `lct_python_backend/stt_api.py`
- `lct_python_backend/services/stt_session.py`
- `lct_python_backend/services/transcript_processing.py`