
feat(stt): Streaming Speech-to-Text with Whisper #1

Closed
beshkenadze wants to merge 103 commits into main from feat/streaming-stt

@beshkenadze

Owner

Summary

Complete STT (Speech-to-Text) module using Whisper models on Apple Silicon with MLX.

Features

  • WhisperSession — streaming transcription for short audio (<30s)
  • LongAudioProcessor — chunked processing for long audio with multiple strategies
  • Chunking strategies: Sequential, VAD (Voice Activity Detection), Sliding Window
  • Deduplication: Levenshtein, Timestamp, Composite strategies for chunk overlap
  • Hallucination detection — stops repetitive token loops
  • Fast loading — int4 quantization option

Models

Model       Size   Speed
tiny        39M    Fastest
base        74M    Fast
small       244M   Medium
largeTurbo  809M   Fast + Best quality
largeV3     1.5G   Slow

Usage

let session = try await WhisperSession.fromPretrained(model: .largeTurbo)
for try await result in session.transcribe(audio, sampleRate: 16000) {
    print(result.text)
}

Structure

MLXAudioSTT/
├── Sources/    # Library
├── Tests/      # Unit tests
├── STTDemo/    # CLI demo
└── README.md

Test Plan

  • Unit tests pass
  • Build succeeds
  • Demo works with test audio files

- Create detailed design for native Swift STT with AlignAtt streaming
- Define public API: STTSession protocol + WhisperSession implementation
- Specify AsyncThrowingStream<StreamingResult, Error> output format
- Document module structure under MLXAudio/STT/Whisper/
- Include AlignAtt algorithm details and alignment heads by model
- Add implementation phases (~3 weeks total)
- Define testing strategy with unit, integration, and benchmark tests

Design spec updates from expert panel review:
- Add TranscriptionOptions with language/task/timeout
- Add Performance Requirements (NFRs) section
- Document thread-safety contract
- Add memory estimation API
- Update alignment heads with provisional values
- Add executable Given/When/Then examples
- Expand error handling with recovery strategies
- Add edge case tests to testing section

Implementation plan covers 4 phases:
1. Project setup & audio processing
2. Whisper model architecture
3. AlignAtt streaming logic
4. Integration & testing

Add mel spectrogram computation using Accelerate/vDSP for efficient
FFT-based audio feature extraction. The implementation computes STFT
with Hann windowing, applies triangular mel filterbank, and normalizes
output to Whisper's expected range [-4, 4].

Includes comprehensive tests for shape validation and normalization.
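
As a rough illustration of the windowing step described above, the sketch below builds a periodic Hann window and counts STFT frames in plain Swift. The FFT size of 400 and the hop of 160 samples are assumptions borrowed from the reference Whisper front end, not values stated in this PR, and `hannWindow` and `frameCount` are hypothetical names rather than the module's API.

```swift
import Foundation

/// Periodic Hann window of the given length (hypothetical helper).
func hannWindow(length: Int) -> [Float] {
    (0..<length).map { n in
        Float(0.5 * (1.0 - cos(2.0 * Double.pi * Double(n) / Double(length))))
    }
}

/// Number of hop-spaced frames in a signal, ignoring any final partial hop.
func frameCount(sampleCount: Int, hop: Int) -> Int {
    sampleCount / hop
}

let window = hannWindow(length: 400)                      // assumed n_fft = 400
let frames = frameCount(sampleCount: 480_000, hop: 160)   // assumed hop = 160
print(window.count, frames)
```

Under these assumptions a 30-second buffer yields 3000 frames, each multiplied by the window before the FFT.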

Implements pad_or_trim function that ensures audio is exactly 30 seconds
(480000 samples) for Whisper processing. Short audio is zero-padded,
long audio is trimmed from the beginning.
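
A minimal sketch of the padding rule follows. It assumes that trimming keeps the first 480000 samples, as the reference Whisper `pad_or_trim` does; the commit message does not spell this out, and the Swift name `padOrTrim` is hypothetical.

```swift
/// Pads with zeros or truncates so the result has exactly `length` samples.
/// Assumption: truncation keeps the first `length` samples.
func padOrTrim(_ samples: [Float], length: Int = 480_000) -> [Float] {
    if samples.count >= length {
        return Array(samples.prefix(length))
    }
    return samples + [Float](repeating: 0, count: length - samples.count)
}

let short = padOrTrim([0.1, 0.2, 0.3], length: 5)     // zero-padded to 5 samples
let long = padOrTrim([1, 2, 3, 4, 5, 6], length: 4)   // truncated to 4 samples
print(short.count, long.count)
```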

Add WhisperConfiguration struct defining model architecture parameters
for Whisper models. Includes preset configurations for large-v3-turbo
and large-v3 with their respective alignment heads for AlignAtt streaming.
…apture

- Add WhisperMultiHeadAttention class for Whisper STT model
- Return attention weights only for cross-attention (for AlignAtt streaming)
- Key projection has no bias (matching OpenAI Whisper architecture)
- Add KVCache class for incremental decoding support
- Add causalMask helper for autoregressive attention
- Add MLXNN dependency to Package.swift
- Add comprehensive tests (require Metal-capable environment)
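
For readers unfamiliar with the causal mask mentioned above, here is a plain-Swift sketch using nested arrays instead of MLX arrays. The function shape is an assumption for illustration and is not the PR's `causalMask` helper.

```swift
/// Additive causal mask: 0 on and below the diagonal, -infinity above it,
/// so position i can only attend to positions 0...i.
func causalMask(size: Int) -> [[Float]] {
    (0..<size).map { row -> [Float] in
        (0..<size).map { col -> Float in
            col <= row ? 0 : -Float.infinity
        }
    }
}

let mask = causalMask(size: 3)
for row in mask { print(row) }
```

The mask is added to the attention scores before the softmax, which drives the weights for future positions to zero.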

Add ResidualAttentionBlock with:
- Pre-norm architecture (LayerNorm before attention/MLP)
- Self-attention for both encoder and decoder
- Optional cross-attention for decoder blocks
- MLP with 4x expansion factor and GELU activation
- Residual connections around each sub-layer
- Cross-attention weight return for AlignAtt streaming

Tests verify output shapes, attention weight returns, and layer configuration.

Add WhisperModelLoader enum that downloads Whisper models from HuggingFace
and loads safetensors weights into AudioEncoder and TextDecoder components.

- Add swift-transformers dependency for Hub API access
- Support all Whisper model variants (tiny, base, small, medium, large-v3, large-turbo)
- Implement weight key sanitization for Swift property naming conventions
- Support both HuggingFace download and local directory loading

Replace placeholder transcription with real pipeline:
- Pad/trim audio to 30 seconds
- Compute mel spectrogram and encode audio
- Autoregressive decoding with KV cache
- AlignAtt streaming to emit stable tokens based on attention
- Greedy sampling with EOT token detection
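
The greedy sampling and EOT detection steps in the list above can be sketched as follows. The decoder forward pass is replaced by a closure, and every name here (`greedyDecode`, `nextLogits`, the toy token IDs) is hypothetical; the KV cache and AlignAtt logic are deliberately left out.

```swift
/// Index of the largest logit (greedy sampling).
func argmax(_ logits: [Float]) -> Int {
    var best = 0
    for i in logits.indices where logits[i] > logits[best] { best = i }
    return best
}

/// Greedy autoregressive loop. `nextLogits` stands in for the decoder forward
/// pass; decoding stops at `eotToken` or after `maxTokens` generated tokens.
func greedyDecode(prompt: [Int], eotToken: Int, maxTokens: Int,
                  nextLogits: ([Int]) -> [Float]) -> [Int] {
    var tokens = prompt
    var generated: [Int] = []
    while generated.count < maxTokens {
        let next = argmax(nextLogits(tokens))
        if next == eotToken { break }
        tokens.append(next)
        generated.append(next)
    }
    return generated
}

// Toy "model" over a 4-token vocabulary: emits 1, then 2, then EOT (3).
let output = greedyDecode(prompt: [0], eotToken: 3, maxTokens: 10) { tokens in
    var logits = [Float](repeating: 0, count: 4)
    logits[min(tokens.count, 3)] = 1
    return logits
}
print(output)
```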

- Add test_speech.wav generated with macOS `say` CLI
- Add loadWAV helper to parse 16kHz PCM WAV files
- Add transcribe_realSpeech_returnsExpectedText test (disabled for CI)
- Fix WhisperConfiguration to make alignment_heads optional
- Add test resources to Package.swift

Fix shape mismatch: AudioConstants.nMels is fixed at 80, but largeTurbo expects 128 mel bins.
MelSpectrogram.compute() now accepts an nMels parameter taken from the model config.

beshkenadze and others added 18 commits January 8, 2026 19:33

Root cause: SlidingWindowConfig.deduplicationStrategy was configured
but never used. LongAudioProcessor used MergeConfig.deduplicationStrategy
which defaulted to nil, bypassing the smart deduplication entirely.

Changes:
- MergeConfig.default now uses CompositeDeduplicationStrategy()
- Add effectiveMergeConfig() helper to select proper strategy per type
- Auto/slidingWindow: uses CompositeDeduplicationStrategy with overlapEnd
- Sequential: uses LevenshteinDeduplicationStrategy
- VAD: uses NoOpDeduplicationStrategy (non-overlapping chunks)
- Update integration test to reflect new default behavior

All 43 deduplication tests pass.
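
To make the Levenshtein-based idea concrete, the sketch below removes words at the start of a chunk that approximately repeat the end of the previous chunk. It is an illustration only and not the PR's LevenshteinDeduplicationStrategy; `dropOverlap`, the `minWords` floor, and the 25% edit tolerance are assumptions.

```swift
/// Levenshtein edit distance between two sequences, using two rolling rows.
func levenshtein<T: Equatable>(_ a: [T], _ b: [T]) -> Int {
    if a.isEmpty { return b.count }
    if b.isEmpty { return a.count }
    var previous = Array(0...b.count)
    for i in 1...a.count {
        var current = [i] + [Int](repeating: 0, count: b.count)
        for j in 1...b.count {
            let cost = a[i - 1] == b[j - 1] ? 0 : 1
            current[j] = min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost)
        }
        previous = current
    }
    return previous[b.count]
}

/// Drops the longest prefix of `next` (at least `minWords` long) whose word-level
/// edit distance to the tail of `previous` is at most 25% of its length.
func dropOverlap(previous: [String], next: [String], minWords: Int = 2) -> [String] {
    let limit = min(previous.count, next.count)
    for length in stride(from: limit, through: minWords, by: -1) {
        let tail = Array(previous.suffix(length))
        let head = Array(next.prefix(length))
        if levenshtein(tail, head) * 4 <= length {
            return Array(next.dropFirst(length))
        }
    }
    return next
}

let first = ["the", "quick", "brown", "fox", "jumps"]
let second = ["brown", "fox", "jumps", "over", "the", "lazy", "dog"]
let merged = first + dropOverlap(previous: first, next: second)
print(merged.joined(separator: " "))
```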

- Replace VAD fallback stub with actual VADChunkingStrategy
- Support both EnergyVADProvider and SileroVADProvider
- Map StrategyType.VADConfig to provider configs
- VAD produces non-overlapping chunks, uses NoOpDeduplicationStrategy
… debug logging

- Add options parameter to ChunkTranscriber protocol methods
- Update all ChunkingStrategy implementations to pass options through
- Fix --language flag not working for long audio transcription
- Remove fputs debug statements from WhisperSession, WhisperModelLoader,
  LongAudioProcessor, SlidingWindowChunkingStrategy, and STTDemo
- Add audioDuration calculation for proper timestamp handling

The --language en flag now correctly sets the language token, fixing the
issue where Whisper would output "..." for chunked audio with silence padding.

Remove print("[DEBUG]...") statements that were missed in previous
cleanup. The codex review identified these remaining debug prints:
- LongAudioProcessor.swift: deduplication debug output (4 lines)
- WhisperModelLoader.swift: model loading debug output (13 lines)

This completes the debug logging removal for production-ready output.

- Add console-kit dependency for terminal styling
- Integrate ConsoleKitTerminal for colored output (info, success, error)
- Fix streaming text overwrite using \r\u{1B}[K ANSI codes
- Add getTerminalWidth() to truncate verbose output to terminal width
- Add Taskfile.yml for simplified build commands (task build, task stt:short)
- Add BuildInfo.swift for version display
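
The streaming overwrite fix relies on two control sequences: `\r` returns the cursor to column 0 and `ESC[K` erases from the cursor to the end of the line. A small sketch, with the hypothetical helper name `overwriteLine` and a fixed width standing in for getTerminalWidth():

```swift
import Foundation

/// Builds a line that returns the cursor to column 0, erases to the end of
/// the line, and writes `text` truncated to `width` characters.
func overwriteLine(_ text: String, width: Int) -> String {
    "\r\u{1B}[K" + String(text.prefix(max(0, width)))
}

for partial in ["Hel", "Hello wor", "Hello world"] {
    print(overwriteLine(partial, width: 80), terminator: "")
    fflush(stdout)   // make the partial line visible immediately
}
print()
```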

AudioUtils:
- Change minContentRatioForRepeat threshold from 0.33 to 0.50
- Add repeatPadWithNoiseFill for very short audio (<3 reps)
- Limit max repetitions to 3 to avoid Whisper loop detection

VADChunkingStrategy:
- Add segment packing to combine short VAD segments
- Add timestamp remapping after Whisper inference
- Improve chunk boundary detection
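
One plausible reading of segment packing is a greedy merge of consecutive speech spans while the merged span still fits in a single Whisper window. The sketch below assumes that reading; the `Segment` type, the `packSegments` function, and the 30-second limit are hypothetical and may differ from the PR's implementation.

```swift
/// A speech span in seconds (hypothetical type for illustration).
struct Segment: Equatable {
    var start: Double
    var end: Double
}

/// Greedily merges consecutive segments while the merged span, measured from
/// the start of the current chunk, stays within `maxDuration`.
func packSegments(_ segments: [Segment], maxDuration: Double) -> [Segment] {
    var packed: [Segment] = []
    for segment in segments {
        if let last = packed.last, segment.end - last.start <= maxDuration {
            packed[packed.count - 1].end = segment.end
        } else {
            packed.append(segment)
        }
    }
    return packed
}

let speech = [Segment(start: 0, end: 4), Segment(start: 5, end: 12),
              Segment(start: 20, end: 28), Segment(start: 29, end: 55)]
let chunks = packSegments(speech, maxDuration: 30)
print(chunks)
```

Because each packed chunk keeps its absolute start time, chunk-relative timestamps produced during inference can be mapped back by adding that start offset.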

MultiHeadAttention:
- Add cross-attention weight capture for DTW timestamp extraction
- Add AudioUtilsTests for padding strategies
- Add test audio files: short_2s.wav, short_5s.wav, medium_15s.wav, near_30s.wav, mlk_50s.wav
- Add whisper-short-audio-padding-research.md with root cause analysis
- Document why padding fails and production solutions (VAD-based chunking)
- Add Swift logger log levels reference

Detect repetitive token patterns (e.g., "nda nda nda") that indicate
model failure and terminate early. Implementation optimized for hot path:

- Check every 5 tokens to minimize overhead
- O(windowSize=12) per check with packed UInt64 bigram keys
- Early termination emits accumulated text and stops decoding

Prevents infinite loops on difficult audio without significant
performance impact.
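
A hedged sketch of such a detector is shown below. The check interval of 5 and the window of 12 come from the description above; the `minRepeats` threshold, the function names, and the use of a dictionary for counting are assumptions, so the real implementation may differ.

```swift
/// Packs two token IDs into one UInt64 key (assumes non-negative 32-bit IDs).
func bigramKey(_ a: Int, _ b: Int) -> UInt64 {
    (UInt64(UInt32(a)) << 32) | UInt64(UInt32(b))
}

/// True when some bigram occurs at least `minRepeats` times in the last
/// `windowSize` tokens. `minRepeats` is a hypothetical threshold.
func looksRepetitive(_ tokens: [Int], windowSize: Int = 12, minRepeats: Int = 4) -> Bool {
    guard tokens.count >= windowSize else { return false }
    let window = Array(tokens.suffix(windowSize))
    var counts: [UInt64: Int] = [:]
    for i in 1..<window.count {
        let key = bigramKey(window[i - 1], window[i])
        counts[key, default: 0] += 1
        if counts[key]! >= minRepeats { return true }
    }
    return false
}

/// Runs the check only on every `interval`-th token to limit overhead.
func shouldStop(after tokens: [Int], interval: Int = 5) -> Bool {
    tokens.count % interval == 0 && looksRepetitive(tokens)
}

let looping = [1, 2, 3] + (0..<12).map { $0 % 2 == 0 ? 7 : 8 }   // 7 8 7 8 ...
print(shouldStop(after: looping), shouldStop(after: Array(1...15)))
```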

Move STT module from snake_case Python-style path to PascalCase
Swift-style path to match project conventions (MLXAudio/).

Structure:
- MLXAudioSTT/Sources/  - library code
- MLXAudioSTT/Tests/    - unit tests
- MLXAudioSTT/STTDemo/  - demo CLI app

Keep these directories locally but exclude from version control.
Also exclude .DS_Store files.

- Add STTSessionProtocol with generate() and generateStream() methods
- Add STTTypes: STTOutput, STTGeneration, STTGenerationInfo, STTError
- Implement SDK v1 API on WhisperSession with streaming token events
- Remove legacy STTSession conformance and backward compatibility
- Update STTDemo to use new API with metrics display (tokens, speed, time)
- Add SDK v1 API tests for generate, generateStream, temperature sampling
- Fix test compilation: add missing options parameter to process() calls

API pattern aligned with MLX-Audio Swift SDK v1:
- generate(audio:) -> STTOutput (blocking)
- generateStream(audio:) -> AsyncThrowingStream<STTGeneration, Error>
- STTGeneration: .token(String), .info(STTGenerationInfo), .result(STTOutput)
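
The event pattern above could be consumed roughly as follows. The types are simplified stand-ins that reuse the names from the list; their fields (`text`, `tokenCount`) and the `mockStream` and `collect` helpers are assumptions for illustration, not the module's actual definitions.

```swift
// Hypothetical mirrors of the types named above; field names are assumptions.
struct STTOutput: Equatable { var text: String }
struct STTGenerationInfo: Equatable { var tokenCount: Int }

enum STTGeneration: Equatable {
    case token(String)
    case info(STTGenerationInfo)
    case result(STTOutput)
}

/// Emits a canned event sequence: tokens first, then info, then the result.
func mockStream(tokens: [String]) -> AsyncThrowingStream<STTGeneration, Error> {
    AsyncThrowingStream { continuation in
        for token in tokens { continuation.yield(.token(token)) }
        continuation.yield(.info(STTGenerationInfo(tokenCount: tokens.count)))
        continuation.yield(.result(STTOutput(text: tokens.joined())))
        continuation.finish()
    }
}

/// Consumes a stream, printing tokens as they arrive and returning the
/// final output if one was delivered.
func collect(_ stream: AsyncThrowingStream<STTGeneration, Error>) async throws -> STTOutput? {
    var finalOutput: STTOutput?
    for try await event in stream {
        switch event {
        case .token(let piece): print(piece, terminator: "")
        case .info(let info): print("\n[\(info.tokenCount) tokens]")
        case .result(let output): finalOutput = output
        }
    }
    return finalOutput
}

// Usage, from an async context:
// let output = try await collect(mockStream(tokens: ["Hel", "lo"]))
```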

- EnergyVADProvider: use threshold 1.0 in detectSpeech since
  speechProbabilities already returns normalized values
- SequentialChunkingStrategyTests: align mock timeRange with
  maxChunkDuration and adjust audio lengths to match seek patterns
- VADChunkingStrategyTests: add packSegments: false to prevent
  segment consolidation that was masking expected chunk counts
Copilot AI review requested due to automatic review settings January 26, 2026 15:17

Copilot AI left a comment


Pull request overview

This PR removes a large number of language data files and frameworks from the Kokoro TTS system while adding a new Speech-to-Text (STT) feature using Whisper models. The changes include deleting ESpeakNG language configuration files, removing various Kokoro decoder/encoder components, and removing the main ContentView.swift along with associated UI files.

Changes:

  • Deletion of ESpeakNG language data files for multiple languages (Slovak, Polish, Czech, Serbian, etc.)
  • Removal of Kokoro TTS decoder components (Generator, MLXSTFT, SineGen, etc.)
  • Removal of Albert model components and building blocks
  • Addition of VoicesApp example application with voice cloning capabilities
  • Addition of GitHub Actions workflow for automated testing

Reviewed changes

Copilot reviewed 200 out of 3636 changed files in this pull request and generated no comments.

File                                               Description
MLXAudio/Kokoro/Frameworks/ESpeakNG.xcframework/*  Removed language data files and framework headers
MLXAudio/Kokoro/Decoder/*                          Removed decoder components including Generator, MLXSTFT, and audio processing modules
MLXAudio/Kokoro/BuildingBlocks/*                   Removed neural network building blocks like AdaIN layers, convolution modules
MLXAudio/Kokoro/Albert/*                           Removed Albert model implementation files
MLXAudio/ContentView.swift                         Removed main application UI
Examples/VoicesApp/*                               Added new voice synthesis example app with cloning support
.github/workflows/tests.yaml                       Added CI/CD pipeline for automated testing


@beshkenadze beshkenadze closed this Mar 7, 2026
beshkenadze added a commit that referenced this pull request Apr 30, 2026
P2 #1: thread repetitionPenalty + repetitionContextSize through the
streaming path (generateStream adapter and inner generateStream),
mirroring the sync generate path. Apply sign-aware penalty in the
streaming AR loop.

P2 #2: gate the hard repeat guard on repetitionPenalty == 1.0 in both
sync and streaming paths. When the caller opts into repetition penalty
(the proper fix), the heuristic backstop is disabled so legitimately
repetitive speech (e.g., 30x 'da') is not truncated. When the caller
runs greedy (penalty == 1.0), the backstop still prevents pathological
KV-cache OOM loops.
beshkenadze added a commit that referenced this pull request May 7, 2026
…ard (Blaizzy#174)

* Qwen3-ASR: chunked prefill + asyncEval + repetition penalty + loop guard

Fixes runaway repetition loop on long-form audio that consumed up to
22 GB RAM and stalled for minutes (related to QwenLM/Qwen3-ASR#129).

Measured on Apple M1 Max with mlx-community/Qwen3-ASR-1.7B-4bit:
- RU 57min: upstream 71-72s -> this branch 65-66s (-8% wall)
- RU 5min: 33s -> 32s (within noise)
- EN 5min: 21s -> 24s (within noise)
- Output bytes parity within +/-0.9% across all benchmarks

Cross-implementation reference (M1 Max, RU 57min):
- mlx-audio-swift Qwen3 1.7B (this branch): 65s, 52x realtime
- FluidAudio Qwen3 CoreML/ANE: 1151s, 3.0x realtime (17x slower)
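
The realtime factors quoted above follow from dividing audio duration by wall-clock time. A short check of that arithmetic, using only the figures from the list (the helper name is hypothetical):

```swift
/// Realtime factor: seconds of audio processed per second of wall-clock time.
func realtimeFactor(audioSeconds: Double, wallSeconds: Double) -> Double {
    audioSeconds / wallSeconds
}

let audio = 57.0 * 60.0   // 57-minute recording, in seconds
let mlx = realtimeFactor(audioSeconds: audio, wallSeconds: 65)
let coreML = realtimeFactor(audioSeconds: audio, wallSeconds: 1151)
let slowdown = 1151.0 / 65.0
print(mlx, coreML, slowdown)
```

This gives roughly 52.6x and 2.97x realtime, with a wall-time ratio of about 17.7.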

Changes:
- STTGenerateParameters gains repetitionPenalty (default 1.0 = off,
  backward compat) and repetitionContextSize (default 32).
- generateSingleChunk: chunked prefill (windowSize=2048) with
  eval+clearCache between chunks, asyncEval pipelining for the AR
  loop matching mlx_lm.generate.generate_step pattern, periodic
  Memory.clearCache() every 256 generated tokens.
- Apply mlx-lm sign-aware repetition penalty before argmax.
- Heuristic fail-safe: stop if last 24 generated tokens contain
  <=3 unique IDs (degenerate loop detector).

Reproduction:
- /tmp/ru_10min.wav (10-min Russian slice) hit repetition loop in
  greedy mode (output ended with 'davai, davai, davai...' x100,
  127s wall, KV cache up to 22 GB).
- With repetitionPenalty=1.15 + heuristic guard: 51s wall, clean
  transcript tail, stable memory.
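
The sign-aware penalty and the loop guard described in this commit can be sketched in plain Swift as below. The penalty rule (divide positive logits, multiply negative ones) follows the mlx-lm convention; the function names and the array-based representation are assumptions, and the real code operates on MLX arrays.

```swift
/// Sign-aware repetition penalty: logits of recently generated tokens are
/// divided by `penalty` when non-negative and multiplied by it when negative.
func applyRepetitionPenalty(_ logits: [Float], recent: [Int], penalty: Float) -> [Float] {
    guard penalty != 1.0 else { return logits }   // 1.0 means "off"
    var adjusted = logits
    for token in Set(recent) where logits.indices.contains(token) {
        adjusted[token] = logits[token] < 0 ? logits[token] * penalty
                                            : logits[token] / penalty
    }
    return adjusted
}

/// Degenerate-loop check: true when the last `window` generated tokens
/// contain at most `maxUnique` distinct IDs.
func isDegenerateLoop(_ generated: [Int], window: Int = 24, maxUnique: Int = 3) -> Bool {
    guard generated.count >= window else { return false }
    return Set(generated.suffix(window)).count <= maxUnique
}

let logits: [Float] = [2.0, -1.0, 0.5]
let penalized = applyRepetitionPenalty(logits, recent: [0, 1], penalty: 2.0)
print(penalized, isDegenerateLoop([Int](repeating: 5, count: 24)))
```

In a decoding loop, `recent` would be the last repetitionContextSize generated tokens, and the penalty would be applied before taking the argmax.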

Recommended call site:
  let params = STTGenerateParameters(
      language: "Russian",
      repetitionPenalty: 1.15,
      repetitionContextSize: 32
  )
  let output = model.generate(audio: samples, generationParameters: params)

Backward compatibility: All new parameters default to neutral
values. Existing callers see identical greedy argmax behavior.

Build: xcodebuild SUCCEEDED. Tests: 113/113 pass.

* Qwen3-ASR: address Codex review P2 issues

P2 #1: thread repetitionPenalty + repetitionContextSize through the
streaming path (generateStream adapter and inner generateStream),
mirroring the sync generate path. Apply sign-aware penalty in the
streaming AR loop.

P2 #2: gate the hard repeat guard on repetitionPenalty == 1.0 in both
sync and streaming paths. When the caller opts into repetition penalty
(the proper fix), the heuristic backstop is disabled so legitimately
repetitive speech (e.g., 30x 'da') is not truncated. When the caller
runs greedy (penalty == 1.0), the backstop still prevents pathological
KV-cache OOM loops.

* Qwen3-ASR CLI: expose repetition_penalty / repetition_context_size

Adds two new --gen-kwargs JSON keys so the slim PR's repetition penalty
feature is reachable from mlx-audio-swift-stt without library callers.

Usage:
  mlx-audio-swift-stt --gen-kwargs '{"repetition_penalty":1.15}'

Defaults match STTGenerateParameters: 1.0 / 32 (no behavior change).
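
As an illustration of how the two JSON keys might map onto the stated defaults, here is a small Foundation-only sketch. `parseGenKwargs` is a hypothetical helper; the CLI's actual parsing code is not shown in this PR page.

```swift
import Foundation

/// Reads repetition_penalty and repetition_context_size from a JSON string,
/// falling back to the documented defaults (1.0 and 32) when a key is absent
/// or the input is not valid JSON.
func parseGenKwargs(_ json: String) -> (repetitionPenalty: Double, repetitionContextSize: Int) {
    let parsed = try? JSONSerialization.jsonObject(with: Data(json.utf8))
    let object = parsed as? [String: Any] ?? [:]
    let penalty = (object["repetition_penalty"] as? NSNumber)?.doubleValue ?? 1.0
    let context = (object["repetition_context_size"] as? NSNumber)?.intValue ?? 32
    return (penalty, context)
}

let kwargs = parseGenKwargs("{\"repetition_penalty\":1.15}")
print(kwargs.repetitionPenalty, kwargs.repetitionContextSize)
```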
3 participants