feat(stt): Streaming Speech-to-Text with Whisper #1
Closed
beshkenadze wants to merge 103 commits into
- Create detailed design for native Swift STT with AlignAtt streaming
- Define public API: STTSession protocol + WhisperSession implementation
- Specify AsyncThrowingStream<StreamingResult, Error> output format
- Document module structure under MLXAudio/STT/Whisper/
- Include AlignAtt algorithm details and alignment heads by model
- Add implementation phases (~3 weeks total)
- Define testing strategy with unit, integration, and benchmark tests
Design spec updates from expert panel review:
- Add TranscriptionOptions with language/task/timeout
- Add Performance Requirements (NFRs) section
- Document thread-safety contract
- Add memory estimation API
- Update alignment heads with provisional values
- Add executable Given/When/Then examples
- Expand error handling with recovery strategies
- Add edge case tests to testing section

Implementation plan covers 4 phases:
1. Project setup & audio processing
2. Whisper model architecture
3. AlignAtt streaming logic
4. Integration & testing
Add mel spectrogram computation using Accelerate/vDSP for efficient FFT-based audio feature extraction. The implementation computes STFT with Hann windowing, applies triangular mel filterbank, and normalizes output to Whisper's expected range [-4, 4]. Includes comprehensive tests for shape validation and normalization.
Implements pad_or_trim function that ensures audio is exactly 30 seconds (480000 samples) for Whisper processing. Short audio is zero-padded, long audio is trimmed from the beginning.
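A minimal sketch of the pad-or-trim step, written against plain `[Float]` arrays. The name `padOrTrim` and the constant `whisperSampleCount` are illustrative, not the module's actual API. The commit message does not say unambiguously which end of long audio is kept; this sketch assumes the first 480000 samples are kept, as in the reference OpenAI implementation.

```swift
/// Number of samples Whisper expects: 30 seconds at 16 kHz.
let whisperSampleCount = 30 * 16_000  // 480000

/// Hypothetical helper: returns exactly `length` samples.
/// Shorter input is zero-padded at the end; longer input keeps its first
/// `length` samples (an assumption, see the note above).
func padOrTrim(_ samples: [Float], length: Int = whisperSampleCount) -> [Float] {
    if samples.count >= length {
        return Array(samples.prefix(length))
    }
    return samples + [Float](repeating: 0, count: length - samples.count)
}
```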
Add WhisperConfiguration struct defining model architecture parameters for Whisper models. Includes preset configurations for large-v3-turbo and large-v3 with their respective alignment heads for AlignAtt streaming.
…apture
- Add WhisperMultiHeadAttention class for Whisper STT model
- Return attention weights only for cross-attention (for AlignAtt streaming)
- Key projection has no bias (matching OpenAI Whisper architecture)
- Add KVCache class for incremental decoding support
- Add causalMask helper for autoregressive attention
- Add MLXNN dependency to Package.swift
- Add comprehensive tests (require Metal-capable environment)
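To illustrate what a causal mask for autoregressive attention looks like, here is a sketch using nested Swift arrays rather than MLX arrays. The signature is an assumption; the PR's actual `causalMask` helper may differ in shape, dtype, and arguments.

```swift
/// Hypothetical additive causal mask: entry (row, col) is 0 when position
/// `row` may attend to position `col` (col <= row) and -infinity otherwise.
/// Adding it to the attention scores before softmax blocks future positions.
func causalMask(size: Int) -> [[Float]] {
    return (0..<size).map { row in
        (0..<size).map { col in col <= row ? Float(0) : -Float.infinity }
    }
}
```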
Add ResidualAttentionBlock with:
- Pre-norm architecture (LayerNorm before attention/MLP)
- Self-attention for both encoder and decoder
- Optional cross-attention for decoder blocks
- MLP with 4x expansion factor and GELU activation
- Residual connections around each sub-layer
- Cross-attention weight return for AlignAtt streaming

Tests verify output shapes, attention weight returns, and layer configuration.
Add WhisperModelLoader enum that downloads Whisper models from HuggingFace and loads safetensors weights into AudioEncoder and TextDecoder components.
- Add swift-transformers dependency for Hub API access
- Support all Whisper model variants (tiny, base, small, medium, large-v3, large-turbo)
- Implement weight key sanitization for Swift property naming conventions
- Support both HuggingFace download and local directory loading
Replace placeholder transcription with real pipeline:
- Pad/trim audio to 30 seconds
- Compute mel spectrogram and encode audio
- Autoregressive decoding with KV cache
- AlignAtt streaming to emit stable tokens based on attention
- Greedy sampling with EOT token detection
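The greedy sampling and EOT detection steps above can be sketched as a small loop. Everything here is illustrative: `greedyDecode` and the `nextLogits` closure are hypothetical stand-ins for the decoder call, and the sketch omits the KV cache and AlignAtt logic entirely.

```swift
/// Hypothetical greedy decoding loop. `nextLogits` stands in for one decoder
/// step and returns one score per vocabulary entry for the next token.
func greedyDecode(
    prompt: [Int],
    eotToken: Int,
    maxTokens: Int,
    nextLogits: ([Int]) -> [Float]
) -> [Int] {
    var context = prompt
    var generated: [Int] = []
    while generated.count < maxTokens {
        let logits = nextLogits(context)
        // Greedy sampling: pick the index of the largest logit.
        guard let best = logits.indices.max(by: { logits[$0] < logits[$1] }) else { break }
        // Stop as soon as the end-of-transcript token is predicted.
        if best == eotToken { break }
        context.append(best)
        generated.append(best)
    }
    return generated
}
```

The `maxTokens` cap is a safety limit for the case where the model never predicts EOT.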
- Add test_speech.wav generated with macOS `say` CLI
- Add loadWAV helper to parse 16kHz PCM WAV files
- Add transcribe_realSpeech_returnsExpectedText test (disabled for CI)
- Fix WhisperConfiguration to make alignment_heads optional
- Add test resources to Package.swift
Fixes shape mismatch: AudioConstants.nMels=80 but largeTurbo expects 128. Now MelSpectrogram.compute() accepts nMels parameter from model config.
Root cause: SlidingWindowConfig.deduplicationStrategy was configured but never used. LongAudioProcessor used MergeConfig.deduplicationStrategy which defaulted to nil, bypassing the smart deduplication entirely.

Changes:
- MergeConfig.default now uses CompositeDeduplicationStrategy()
- Add effectiveMergeConfig() helper to select proper strategy per type
- Auto/slidingWindow: uses CompositeDeduplicationStrategy with overlapEnd
- Sequential: uses LevenshteinDeduplicationStrategy
- VAD: uses NoOpDeduplicationStrategy (non-overlapping chunks)
- Update integration test to reflect new default behavior

All 43 deduplication tests pass.
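For readers unfamiliar with edit-distance deduplication at chunk boundaries, the following sketch shows the general idea on word arrays. It is not the PR's LevenshteinDeduplicationStrategy; the function names, the `maxOverlap` limit, and the `tolerance` ratio are all assumptions chosen for illustration.

```swift
/// Classic Levenshtein edit distance over any equatable elements.
func levenshtein<T: Equatable>(_ a: [T], _ b: [T]) -> Int {
    if a.isEmpty { return b.count }
    if b.isEmpty { return a.count }
    var previous = Array(0...b.count)
    for i in 1...a.count {
        var current = [i] + [Int](repeating: 0, count: b.count)
        for j in 1...b.count {
            let cost = a[i - 1] == b[j - 1] ? 0 : 1
            current[j] = min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost)
        }
        previous = current
    }
    return previous[b.count]
}

/// Hypothetical merge step: drops from `next` the longest leading run of
/// words that approximately repeats the tail of `previous`.
func dropDuplicatedPrefix(
    previous: [String],
    next: [String],
    maxOverlap: Int = 8,
    tolerance: Double = 0.2
) -> [String] {
    let limit = min(maxOverlap, previous.count, next.count)
    guard limit > 0 else { return next }
    for n in stride(from: limit, through: 1, by: -1) {
        let tail = Array(previous.suffix(n))
        let head = Array(next.prefix(n))
        // Accept the overlap when the edit distance is small relative to its length.
        if Double(levenshtein(tail, head)) <= tolerance * Double(n) {
            return Array(next.dropFirst(n))
        }
    }
    return next
}
```

A production strategy would likely also use timestamps (such as the overlapEnd mentioned above) so that a single coincidentally repeated word is not dropped.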
- Replace VAD fallback stub with actual VADChunkingStrategy
- Support both EnergyVADProvider and SileroVADProvider
- Map StrategyType.VADConfig to provider configs
- VAD produces non-overlapping chunks, uses NoOpDeduplicationStrategy
… debug logging
- Add options parameter to ChunkTranscriber protocol methods
- Update all ChunkingStrategy implementations to pass options through
- Fix --language flag not working for long audio transcription
- Remove fputs debug statements from WhisperSession, WhisperModelLoader, LongAudioProcessor, SlidingWindowChunkingStrategy, and STTDemo
- Add audioDuration calculation for proper timestamp handling

The --language en flag now correctly sets the language token, fixing the issue where Whisper would output "..." for chunked audio with silence padding.
Remove print("[DEBUG]...") statements that were missed in previous
cleanup. The codex review identified these remaining debug prints:
- LongAudioProcessor.swift: deduplication debug output (4 lines)
- WhisperModelLoader.swift: model loading debug output (13 lines)
This completes the debug logging removal for production-ready output.
- Add console-kit dependency for terminal styling
- Integrate ConsoleKitTerminal for colored output (info, success, error)
- Fix streaming text overwrite using \r\u{1B}[K ANSI codes
- Add getTerminalWidth() to truncate verbose output to terminal width
- Add Taskfile.yml for simplified build commands (task build, task stt:short)
- Add BuildInfo.swift for version display
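The streaming overwrite fix relies on two terminal control sequences: `\r` returns the cursor to column 0, and `ESC[K` erases from the cursor to the end of the line. A small sketch, where `redrawLine` is a hypothetical helper and the width is passed in rather than queried from the terminal as getTerminalWidth() does:

```swift
/// Hypothetical helper: builds the string that redraws one status line in
/// place, truncating the text to `width` characters so it never wraps.
func redrawLine(_ text: String, width: Int) -> String {
    let visible = text.count > width ? String(text.prefix(max(width, 0))) : text
    return "\r\u{1B}[K" + visible
}

// Usage: each partial transcript overwrites the previous one on the same line.
for partial in ["Hel", "Hello wor", "Hello world"] {
    print(redrawLine(partial, width: 80), terminator: "")
}
print()  // Move to a fresh line once streaming is done.
```

Truncation matters because a line that wraps can no longer be erased with a single `\r`.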
AudioUtils:
- Change minContentRatioForRepeat threshold from 0.33 to 0.50
- Add repeatPadWithNoiseFill for very short audio (<3 reps)
- Limit max repetitions to 3 to avoid Whisper loop detection

VADChunkingStrategy:
- Add segment packing to combine short VAD segments
- Add timestamp remapping after Whisper inference
- Improve chunk boundary detection

MultiHeadAttention:
- Add cross-attention weight capture for DTW timestamp extraction
- Add AudioUtilsTests for padding strategies
- Add test audio files: short_2s.wav, short_5s.wav, medium_15s.wav, near_30s.wav, mlk_50s.wav
- Add whisper-short-audio-padding-research.md with root cause analysis
- Document why padding fails and production solutions (VAD-based chunking)
- Add Swift logger log levels reference
Detect repetitive token patterns (e.g., "nda nda nda") that indicate model failure and terminate early.

Implementation optimized for hot path:
- Check every 5 tokens to minimize overhead
- O(windowSize=12) per check with packed UInt64 bigram keys
- Early termination emits accumulated text and stops decoding

Prevents infinite loops on difficult audio without significant performance impact.
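A sketch of how a bigram-based repetition check with packed UInt64 keys might look. The check interval (5) and window size (12) come from the commit message; the type name, the `repeatThreshold` value, and the exact stopping rule are assumptions.

```swift
/// Hypothetical repetition detector over the most recent tokens.
struct RepetitionDetector {
    let windowSize = 12       // from the commit message
    let checkInterval = 5     // from the commit message
    let repeatThreshold = 4   // assumed: same bigram seen this often means a loop

    /// Packs two 32-bit token IDs into one 64-bit dictionary key.
    func packBigram(_ first: Int32, _ second: Int32) -> UInt64 {
        return (UInt64(UInt32(bitPattern: first)) << 32) | UInt64(UInt32(bitPattern: second))
    }

    /// Returns true when the last `windowSize` tokens contain a bigram repeated
    /// at least `repeatThreshold` times. Only runs every `checkInterval` tokens.
    func isLooping(_ tokens: [Int32]) -> Bool {
        guard tokens.count >= windowSize, tokens.count % checkInterval == 0 else {
            return false
        }
        let window = Array(tokens.suffix(windowSize))
        var counts: [UInt64: Int] = [:]
        for index in 0..<(window.count - 1) {
            let key = packBigram(window[index], window[index + 1])
            counts[key, default: 0] += 1
            if counts[key, default: 0] >= repeatThreshold { return true }
        }
        return false
    }
}
```

Packing the pair into one integer avoids hashing a tuple or string on every check, which is presumably why the commit calls it out as a hot-path optimization.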
Move STT module from snake_case Python-style path to PascalCase Swift-style path to match project conventions (MLXAudio/).

Structure:
- MLXAudioSTT/Sources/ - library code
- MLXAudioSTT/Tests/ - unit tests
- MLXAudioSTT/STTDemo/ - demo CLI app
Keep these directories locally but exclude from version control. Also exclude .DS_Store files.
- Add STTSessionProtocol with generate() and generateStream() methods
- Add STTTypes: STTOutput, STTGeneration, STTGenerationInfo, STTError
- Implement SDK v1 API on WhisperSession with streaming token events
- Remove legacy STTSession conformance and backward compatibility
- Update STTDemo to use new API with metrics display (tokens, speed, time)
- Add SDK v1 API tests for generate, generateStream, temperature sampling
- Fix test compilation: add missing options parameter to process() calls

API pattern aligned with MLX-Audio Swift SDK v1:
- generate(audio:) -> STTOutput (blocking)
- generateStream(audio:) -> AsyncThrowingStream<STTGeneration, Error>
- STTGeneration: .token(String), .info(STTGenerationInfo), .result(STTOutput)
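To show how a caller might consume the three event cases, here is a self-contained sketch. The enum cases mirror the commit message, but the fields of `STTOutput` and `STTGenerationInfo`, the `TranscriptState` type, and the `mockStream` function are assumptions made up for this example; the real types in the module may look different.

```swift
// Payload fields are illustrative assumptions, not the module's real layout.
struct STTOutput: Equatable { var text: String }
struct STTGenerationInfo: Equatable { var tokenCount: Int }

enum STTGeneration: Equatable {
    case token(String)
    case info(STTGenerationInfo)
    case result(STTOutput)
}

/// Hypothetical accumulator for what a UI would display.
struct TranscriptState: Equatable {
    var partialText = ""
    var stats: STTGenerationInfo?
    var finalOutput: STTOutput?
}

/// Folds one streamed event into the running state.
func reduce(_ state: inout TranscriptState, _ event: STTGeneration) {
    switch event {
    case .token(let piece): state.partialText += piece
    case .info(let stats): state.stats = stats
    case .result(let output): state.finalOutput = output
    }
}

/// Drains a stream shaped like the one generateStream(audio:) returns.
func consume(_ stream: AsyncThrowingStream<STTGeneration, Error>) async throws -> TranscriptState {
    var state = TranscriptState()
    for try await event in stream {
        reduce(&state, event)
    }
    return state
}

/// Stand-in producer so the sketch runs without a model.
func mockStream() -> AsyncThrowingStream<STTGeneration, Error> {
    return AsyncThrowingStream<STTGeneration, Error> { continuation in
        continuation.yield(.token("Hello"))
        continuation.yield(.token(" world"))
        continuation.yield(.info(STTGenerationInfo(tokenCount: 2)))
        continuation.yield(.result(STTOutput(text: "Hello world")))
        continuation.finish()
    }
}
```

From an async context, `try await consume(mockStream())` would return a state whose `partialText` and `finalOutput?.text` are both "Hello world".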
- EnergyVADProvider: use threshold 1.0 in detectSpeech since speechProbabilities already returns normalized values
- SequentialChunkingStrategyTests: align mock timeRange with maxChunkDuration and adjust audio lengths to match seek patterns
- VADChunkingStrategyTests: add packSegments: false to prevent segment consolidation that was masking expected chunk counts
Pull request overview
This PR removes a large number of language data files and frameworks from the Kokoro TTS system while adding a new Speech-to-Text (STT) feature using Whisper models. The changes include deleting ESpeakNG language configuration files, removing various Kokoro decoder/encoder components, and removing the main ContentView.swift along with associated UI files.
Changes:
- Deletion of ESpeakNG language data files for multiple languages (Slovak, Polish, Czech, Serbian, etc.)
- Removal of Kokoro TTS decoder components (Generator, MLXSTFT, SineGen, etc.)
- Removal of Albert model components and building blocks
- Addition of VoicesApp example application with voice cloning capabilities
- Addition of GitHub Actions workflow for automated testing
Reviewed changes
Copilot reviewed 200 out of 3636 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| MLXAudio/Kokoro/Frameworks/ESpeakNG.xcframework/* | Removed language data files and framework headers |
| MLXAudio/Kokoro/Decoder/* | Removed decoder components including Generator, MLXSTFT, and audio processing modules |
| MLXAudio/Kokoro/BuildingBlocks/* | Removed neural network building blocks like AdaIN layers, convolution modules |
| MLXAudio/Kokoro/Albert/* | Removed Albert model implementation files |
| MLXAudio/ContentView.swift | Removed main application UI |
| Examples/VoicesApp/* | Added new voice synthesis example app with cloning support |
| .github/workflows/tests.yaml | Added CI/CD pipeline for automated testing |
beshkenadze added a commit that referenced this pull request on Apr 30, 2026
P2 #1: thread repetitionPenalty + repetitionContextSize through the streaming path (generateStream adapter and inner generateStream), mirroring the sync generate path. Apply sign-aware penalty in the streaming AR loop.

P2 #2: gate the hard repeat guard on repetitionPenalty == 1.0 in both sync and streaming paths. When the caller opts into repetition penalty (the proper fix), the heuristic backstop is disabled so legitimately repetitive speech (e.g., 30x 'da') is not truncated. When the caller runs greedy (penalty == 1.0), the backstop still prevents pathological KV-cache OOM loops.
beshkenadze added a commit that referenced this pull request on May 7, 2026
…ard (Blaizzy#174)

* Qwen3-ASR: chunked prefill + asyncEval + repetition penalty + loop guard

Fixes runaway repetition loop on long-form audio that consumed up to 22 GB RAM and stalled for minutes (related to QwenLM/Qwen3-ASR#129).

Measured on Apple M1 Max with mlx-community/Qwen3-ASR-1.7B-4bit:
- RU 57min: upstream 71-72s -> this branch 65-66s (-8% wall)
- RU 5min: 33s -> 32s (within noise)
- EN 5min: 21s -> 24s (within noise)
- Output bytes parity within +/-0.9% across all benchmarks

Cross-implementation reference (M1 Max, RU 57min):
- mlx-audio-swift Qwen3 1.7B (this branch): 65s, 52x realtime
- FluidAudio Qwen3 CoreML/ANE: 1151s, 3.0x realtime (17x slower)

Changes:
- STTGenerateParameters gains repetitionPenalty (default 1.0 = off, backward compat) and repetitionContextSize (default 32).
- generateSingleChunk: chunked prefill (windowSize=2048) with eval+clearCache between chunks, asyncEval pipelining for the AR loop matching mlx_lm.generate.generate_step pattern, periodic Memory.clearCache() every 256 generated tokens.
- Apply mlx-lm sign-aware repetition penalty before argmax.
- Heuristic fail-safe: stop if last 24 generated tokens contain <=3 unique IDs (degenerate loop detector).

Reproduction:
- /tmp/ru_10min.wav (10-min Russian slice) hit repetition loop in greedy mode (output ended with 'davai, davai, davai...' x100, 127s wall, KV cache up to 22 GB).
- With repetitionPenalty=1.15 + heuristic guard: 51s wall, clean transcript tail, stable memory.

Recommended call site:

    let params = STTGenerateParameters(
        language: "Russian",
        repetitionPenalty: 1.15,
        repetitionContextSize: 32
    )
    let output = model.generate(audio: samples, generationParameters: params)

Backward compatibility: All new parameters default to neutral values. Existing callers see identical greedy argmax behavior.

Build: xcodebuild SUCCEEDED. Tests: 113/113 pass.
* Qwen3-ASR: address Codex review P2 issues

P2 #1: thread repetitionPenalty + repetitionContextSize through the streaming path (generateStream adapter and inner generateStream), mirroring the sync generate path. Apply sign-aware penalty in the streaming AR loop.

P2 #2: gate the hard repeat guard on repetitionPenalty == 1.0 in both sync and streaming paths. When the caller opts into repetition penalty (the proper fix), the heuristic backstop is disabled so legitimately repetitive speech (e.g., 30x 'da') is not truncated. When the caller runs greedy (penalty == 1.0), the backstop still prevents pathological KV-cache OOM loops.

* Qwen3-ASR CLI: expose repetition_penalty / repetition_context_size

Adds two new --gen-kwargs JSON keys so the slim PR's repetition penalty feature is reachable from mlx-audio-swift-stt without library callers.

Usage:

    mlx-audio-swift-stt --gen-kwargs '{"repetition_penalty":1.15}'

Defaults match STTGenerateParameters: 1.0 / 32 (no behavior change).
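The sign-aware penalty, the degenerate loop detector, and the gating rule described in these commits can be sketched on plain arrays as follows. The function names are hypothetical and this is not the code from the commit; the constants (context size 32, window 24, at most 3 unique IDs) are taken from the messages above. "Sign-aware" here means positive logits are divided by the penalty and negative logits are multiplied by it, so a penalized token always becomes less likely.

```swift
/// Hypothetical sign-aware repetition penalty over the last `contextSize` tokens.
func applyRepetitionPenalty(
    logits: [Float],
    recentTokens: [Int],
    penalty: Float,
    contextSize: Int = 32
) -> [Float] {
    guard penalty != 1.0 else { return logits }  // 1.0 means "off"
    var adjusted = logits
    // A Set ensures each token is penalized once, however often it occurred.
    for token in Set(recentTokens.suffix(contextSize)) where adjusted.indices.contains(token) {
        let value = adjusted[token]
        adjusted[token] = value < 0 ? value * penalty : value / penalty
    }
    return adjusted
}

/// Heuristic fail-safe: the last `window` tokens contain at most `maxUnique` IDs.
func isDegenerateLoop(_ generated: [Int], window: Int = 24, maxUnique: Int = 3) -> Bool {
    guard generated.count >= window else { return false }
    return Set(generated.suffix(window)).count <= maxUnique
}

/// Gating rule from P2 #2: the backstop only applies in greedy mode
/// (penalty == 1.0), so penalized runs keep legitimately repetitive speech.
func shouldStopEarly(generated: [Int], penalty: Float) -> Bool {
    return penalty == 1.0 && isDegenerateLoop(generated)
}
```

In a decoding loop, the penalty would be applied to the logits before the argmax, and `shouldStopEarly` would be checked after each appended token.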
Summary
Complete STT (Speech-to-Text) module using Whisper models on Apple Silicon with MLX.
Features
Models
Usage
Structure
Test Plan