feat(stt): Streaming Speech-to-Text with Whisper by beshkenadze · Pull Request #1 · beshkenadze/mlx-audio-swift

beshkenadze · 2026-01-09T15:04:03Z

Summary

Complete STT (Speech-to-Text) module using Whisper models on Apple Silicon with MLX.

Features

WhisperSession — streaming transcription for short audio (<30s)
LongAudioProcessor — chunked processing for long audio with multiple strategies
Chunking strategies: Sequential, VAD (Voice Activity Detection), Sliding Window
Deduplication: Levenshtein, Timestamp, Composite strategies for chunk overlap
Hallucination detection — stops repetitive token loops
Fast loading — int4 quantization option
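The description does not show how the Levenshtein strategy trims chunk overlap. As a rough sketch only (the function names, the word-level comparison, and the `maxOverlap`/`maxDistance` defaults are assumptions, not the PR's `LevenshteinDeduplicationStrategy`), overlap between two adjacent chunk transcripts could be removed like this:

```swift
// Hypothetical helpers for illustration; not taken from the PR.

/// Classic two-row Levenshtein edit distance over word arrays.
func levenshtein(_ a: [String], _ b: [String]) -> Int {
    if a.isEmpty { return b.count }
    if b.isEmpty { return a.count }
    var previous = Array(0...b.count)
    var current = [Int](repeating: 0, count: b.count + 1)
    for i in 1...a.count {
        current[0] = i
        for j in 1...b.count {
            let substitution = previous[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1)
            current[j] = min(previous[j] + 1, current[j - 1] + 1, substitution)
        }
        swap(&previous, &current)
    }
    return previous[b.count]
}

/// Merges two chunk transcripts, dropping the prefix of `next` that best
/// matches the tail of `previous` (within `maxDistance` word edits).
func mergeChunks(_ previous: String, _ next: String,
                 maxOverlap: Int = 8, maxDistance: Int = 1) -> String {
    let left = previous.split(separator: " ").map(String.init)
    let right = next.split(separator: " ").map(String.init)
    var bestLength = 0
    let limit = min(maxOverlap, left.count, right.count)
    if limit > 0 {
        for length in 1...limit {
            let tail = Array(left.suffix(length))
            let head = Array(right.prefix(length))
            // Allow fuzzy matches only for overlaps of three or more words,
            // to limit false positives on very short overlaps.
            let allowed = length >= 3 ? maxDistance : 0
            if levenshtein(tail, head) <= allowed { bestLength = length }
        }
    }
    return (left + right.dropFirst(bestLength)).joined(separator: " ")
}
```

Comparing words rather than characters keeps the cost small for typical overlaps; a character-level variant would tolerate spelling differences between chunks at a higher cost.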

Models

Model	Size	Speed
tiny	39M	Fastest
base	74M	Fast
small	244M	Medium
largeTurbo	809M	Fast + Best quality
largeV3	1.5G	Slow

Usage

let session = try await WhisperSession.fromPretrained(model: .largeTurbo)
for try await result in session.transcribe(audio, sampleRate: 16000) {
    print(result.text)
}

Structure

MLXAudioSTT/
├── Sources/    # Library
├── Tests/      # Unit tests
├── STTDemo/    # CLI demo
└── README.md

Test Plan

Unit tests pass
Build succeeds
Demo works with test audio files

- Create detailed design for native Swift STT with AlignAtt streaming
- Define public API: STTSession protocol + WhisperSession implementation
- Specify AsyncThrowingStream<StreamingResult, Error> output format
- Document module structure under MLXAudio/STT/Whisper/
- Include AlignAtt algorithm details and alignment heads by model
- Add implementation phases (~3 weeks total)
- Define testing strategy with unit, integration, and benchmark tests

Design spec updates from expert panel review:
- Add TranscriptionOptions with language/task/timeout
- Add Performance Requirements (NFRs) section
- Document thread-safety contract
- Add memory estimation API
- Update alignment heads with provisional values
- Add executable Given/When/Then examples
- Expand error handling with recovery strategies
- Add edge case tests to testing section

Implementation plan covers 4 phases:
1. Project setup & audio processing
2. Whisper model architecture
3. AlignAtt streaming logic
4. Integration & testing

Add mel spectrogram computation using Accelerate/vDSP for efficient FFT-based audio feature extraction. The implementation computes STFT with Hann windowing, applies triangular mel filterbank, and normalizes output to Whisper's expected range [-4, 4]. Includes comprehensive tests for shape validation and normalization.

Implements pad_or_trim function that ensures audio is exactly 30 seconds (480000 samples) for Whisper processing. Short audio is zero-padded, long audio is trimmed from the beginning.
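A minimal sketch of the pad-or-trim step follows. It is not the PR's implementation: the function signature is hypothetical, and it assumes that over-long input keeps its first 480000 samples (as OpenAI's reference `pad_or_trim` does), which the commit message does not state unambiguously.

```swift
/// Whisper operates on fixed 30 s windows: 30 s x 16 000 Hz = 480 000 samples.
let whisperWindowSamples = 30 * 16_000

/// Hypothetical stand-in for the PR's padOrTrim utility.
/// Assumption: when the input is too long, the first `length` samples are kept;
/// shorter input is zero-padded at the end.
func padOrTrim(_ samples: [Float], to length: Int = whisperWindowSamples) -> [Float] {
    if samples.count >= length {
        return Array(samples.prefix(length))
    }
    return samples + [Float](repeating: 0, count: length - samples.count)
}
```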

Add WhisperConfiguration struct defining model architecture parameters for Whisper models. Includes preset configurations for large-v3-turbo and large-v3 with their respective alignment heads for AlignAtt streaming.

…apture
- Add WhisperMultiHeadAttention class for Whisper STT model
- Return attention weights only for cross-attention (for AlignAtt streaming)
- Key projection has no bias (matching OpenAI Whisper architecture)
- Add KVCache class for incremental decoding support
- Add causalMask helper for autoregressive attention
- Add MLXNN dependency to Package.swift
- Add comprehensive tests (require Metal-capable environment)
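To illustrate what a causal mask for autoregressive attention contains, here is a sketch in plain Swift arrays. The PR's `causalMask` helper presumably returns an MLX array; the signature and the array-of-arrays representation below are assumptions made so the example runs without MLX.

```swift
/// Builds an n x n additive attention mask: 0 where a query position may attend
/// (key index <= query index) and -infinity for future positions, so that
/// softmax assigns those positions zero weight.
func causalMask(size n: Int) -> [[Float]] {
    (0..<n).map { row -> [Float] in
        (0..<n).map { col -> Float in col <= row ? 0 : -Float.infinity }
    }
}
```

The mask is added to the attention scores before softmax; cross-attention over the encoder output does not need it because the whole audio window is already available.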

Add ResidualAttentionBlock with:
- Pre-norm architecture (LayerNorm before attention/MLP)
- Self-attention for both encoder and decoder
- Optional cross-attention for decoder blocks
- MLP with 4x expansion factor and GELU activation
- Residual connections around each sub-layer
- Cross-attention weight return for AlignAtt streaming

Tests verify output shapes, attention weight returns, and layer configuration.

Add WhisperModelLoader enum that downloads Whisper models from HuggingFace and loads safetensors weights into AudioEncoder and TextDecoder components.
- Add swift-transformers dependency for Hub API access
- Support all Whisper model variants (tiny, base, small, medium, large-v3, large-turbo)
- Implement weight key sanitization for Swift property naming conventions
- Support both HuggingFace download and local directory loading

Replace placeholder transcription with real pipeline:
- Pad/trim audio to 30 seconds
- Compute mel spectrogram and encode audio
- Autoregressive decoding with KV cache
- AlignAtt streaming to emit stable tokens based on attention
- Greedy sampling with EOT token detection

- Add test_speech.wav generated with macOS `say` CLI
- Add loadWAV helper to parse 16kHz PCM WAV files
- Add transcribe_realSpeech_returnsExpectedText test (disabled for CI)
- Fix WhisperConfiguration to make alignment_heads optional
- Add test resources to Package.swift

Fixes shape mismatch: AudioConstants.nMels=80 but largeTurbo expects 128. Now MelSpectrogram.compute() accepts nMels parameter from model config.

Root cause: SlidingWindowConfig.deduplicationStrategy was configured but never used. LongAudioProcessor used MergeConfig.deduplicationStrategy which defaulted to nil, bypassing the smart deduplication entirely.

Changes:
- MergeConfig.default now uses CompositeDeduplicationStrategy()
- Add effectiveMergeConfig() helper to select proper strategy per type
- Auto/slidingWindow: uses CompositeDeduplicationStrategy with overlapEnd
- Sequential: uses LevenshteinDeduplicationStrategy
- VAD: uses NoOpDeduplicationStrategy (non-overlapping chunks)
- Update integration test to reflect new default behavior

All 43 deduplication tests pass.
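The per-strategy selection listed in this commit can be pictured as a single exhaustive switch. The enums and the function below are hypothetical stand-ins, not the PR's `effectiveMergeConfig()`; they only mirror the mapping stated in the commit message.

```swift
// Hypothetical enums standing in for the PR's chunking and deduplication types.
enum ChunkingKind { case auto, slidingWindow, sequential, vad }
enum DeduplicationKind: Equatable { case composite, levenshtein, noOp }

/// Mirrors the mapping in the commit message: overlapping strategies get real
/// deduplication, while non-overlapping VAD chunks need none.
func effectiveDeduplication(for kind: ChunkingKind) -> DeduplicationKind {
    switch kind {
    case .auto, .slidingWindow: return .composite
    case .sequential: return .levenshtein
    case .vad: return .noOp
    }
}
```

Because the switch has no `default` case, adding a new chunking kind forces the compiler to ask for a deduplication choice, which guards against the "configured but never used" class of bug described here.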

- Replace VAD fallback stub with actual VADChunkingStrategy
- Support both EnergyVADProvider and SileroVADProvider
- Map StrategyType.VADConfig to provider configs
- VAD produces non-overlapping chunks, uses NoOpDeduplicationStrategy

… debug logging
- Add options parameter to ChunkTranscriber protocol methods
- Update all ChunkingStrategy implementations to pass options through
- Fix --language flag not working for long audio transcription
- Remove fputs debug statements from WhisperSession, WhisperModelLoader, LongAudioProcessor, SlidingWindowChunkingStrategy, and STTDemo
- Add audioDuration calculation for proper timestamp handling

The --language en flag now correctly sets the language token, fixing the issue where Whisper would output "..." for chunked audio with silence padding.

Remove print("[DEBUG]...") statements that were missed in previous cleanup. The codex review identified these remaining debug prints:
- LongAudioProcessor.swift: deduplication debug output (4 lines)
- WhisperModelLoader.swift: model loading debug output (13 lines)

This completes the debug logging removal for production-ready output.

- Add console-kit dependency for terminal styling
- Integrate ConsoleKitTerminal for colored output (info, success, error)
- Fix streaming text overwrite using \r\u{1B}[K ANSI codes
- Add getTerminalWidth() to truncate verbose output to terminal width
- Add Taskfile.yml for simplified build commands (task build, task stt:short)
- Add BuildInfo.swift for version display
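The `\r\u{1B}[K` sequence mentioned above returns the cursor to column 0 and erases to the end of the line, which is what lets partial transcripts overwrite each other in place. A small sketch follows; `overwriteLine` and its fixed `maxWidth` parameter are hypothetical and stand in for whatever the demo does with `getTerminalWidth()`.

```swift
import Foundation

/// Returns the string to write so that `text` replaces the current terminal line:
/// "\r" moves the cursor to column 0 and ESC[K erases to the end of the line.
/// Text longer than `maxWidth` characters is truncated so it cannot wrap.
func overwriteLine(_ text: String, maxWidth: Int) -> String {
    let visible = text.count > maxWidth ? String(text.prefix(maxWidth)) : text
    return "\r\u{1B}[K" + visible
}

// Usage: write each partial result without a trailing newline.
for partial in ["Hel", "Hello, wor", "Hello, world"] {
    FileHandle.standardOutput.write(Data(overwriteLine(partial, maxWidth: 80).utf8))
}
print()  // finish the line once streaming ends
```

Truncating to the terminal width matters because a wrapped line cannot be erased by a single `ESC[K`.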

AudioUtils:
- Change minContentRatioForRepeat threshold from 0.33 to 0.50
- Add repeatPadWithNoiseFill for very short audio (<3 reps)
- Limit max repetitions to 3 to avoid Whisper loop detection

VADChunkingStrategy:
- Add segment packing to combine short VAD segments
- Add timestamp remapping after Whisper inference
- Improve chunk boundary detection

MultiHeadAttention:
- Add cross-attention weight capture for DTW timestamp extraction

- Add AudioUtilsTests for padding strategies
- Add test audio files: short_2s.wav, short_5s.wav, medium_15s.wav, near_30s.wav, mlk_50s.wav

- Add whisper-short-audio-padding-research.md with root cause analysis
- Document why padding fails and production solutions (VAD-based chunking)
- Add Swift logger log levels reference

Detect repetitive token patterns (e.g., "nda nda nda") that indicate model failure and terminate early.

Implementation optimized for hot path:
- Check every 5 tokens to minimize overhead
- O(windowSize=12) per check with packed UInt64 bigram keys
- Early termination emits accumulated text and stops decoding

Prevents infinite loops on difficult audio without significant performance impact.
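One way such a detector could look is sketched below. Only the check interval (5) and window size (12) come from the commit message; the struct name, the `maxBigramRepeats` threshold, and the exact stopping rule are assumptions for illustration.

```swift
/// Hypothetical repetition detector; not the PR's implementation.
struct RepetitionDetector {
    let windowSize = 12
    let checkInterval = 5
    let maxBigramRepeats = 4   // assumed threshold, not from the PR
    private(set) var tokens: [Int32] = []

    /// Appends a token and returns true when decoding should stop.
    mutating func shouldStop(afterAppending token: Int32) -> Bool {
        tokens.append(token)
        // Only inspect the history every `checkInterval` tokens.
        guard tokens.count >= windowSize, tokens.count % checkInterval == 0 else {
            return false
        }
        let window = Array(tokens.suffix(windowSize))
        var counts: [UInt64: Int] = [:]
        for i in 0..<(window.count - 1) {
            // Pack two 32-bit token IDs into one 64-bit dictionary key.
            let key = (UInt64(UInt32(bitPattern: window[i])) << 32)
                    | UInt64(UInt32(bitPattern: window[i + 1]))
            counts[key, default: 0] += 1
            if counts[key]! >= maxBigramRepeats { return true }
        }
        return false
    }
}
```

Packing each bigram into a single `UInt64` avoids allocating tuple or array keys on the decoding hot path, and the fixed window keeps each check at constant cost.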

Move STT module from snake_case Python-style path to PascalCase Swift-style path to match project conventions (MLXAudio/).

Structure:
- MLXAudioSTT/Sources/ - library code
- MLXAudioSTT/Tests/ - unit tests
- MLXAudioSTT/STTDemo/ - demo CLI app

Keep these directories locally but exclude from version control. Also exclude .DS_Store files.

- Add STTSessionProtocol with generate() and generateStream() methods
- Add STTTypes: STTOutput, STTGeneration, STTGenerationInfo, STTError
- Implement SDK v1 API on WhisperSession with streaming token events
- Remove legacy STTSession conformance and backward compatibility
- Update STTDemo to use new API with metrics display (tokens, speed, time)
- Add SDK v1 API tests for generate, generateStream, temperature sampling
- Fix test compilation: add missing options parameter to process() calls

API pattern aligned with MLX-Audio Swift SDK v1:
- generate(audio:) -> STTOutput (blocking)
- generateStream(audio:) -> AsyncThrowingStream<STTGeneration, Error>
- STTGeneration: .token(String), .info(STTGenerationInfo), .result(STTOutput)
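The event shape described in this commit can be consumed with a plain `for try await` loop. The sketch below is self-contained and does not use the real library: the fields inside `STTOutput` and `STTGenerationInfo`, and the `mockStream` and `collect` helpers, are assumptions made for illustration.

```swift
// Hypothetical mirror of the event types named above; field names are assumed.
struct STTOutput: Equatable { var text: String }
struct STTGenerationInfo: Equatable { var tokenCount: Int; var tokensPerSecond: Double }

enum STTGeneration {
    case token(String)
    case info(STTGenerationInfo)
    case result(STTOutput)
}

/// Builds a stream shaped like the one generateStream(audio:) is described as returning.
func mockStream(_ pieces: [String]) -> AsyncThrowingStream<STTGeneration, Error> {
    AsyncThrowingStream { continuation in
        for piece in pieces { continuation.yield(.token(piece)) }
        continuation.yield(.info(STTGenerationInfo(tokenCount: pieces.count, tokensPerSecond: 0)))
        continuation.yield(.result(STTOutput(text: pieces.joined())))
        continuation.finish()
    }
}

/// Consumes the stream: prints tokens as they arrive and returns the final output.
func collect(_ stream: AsyncThrowingStream<STTGeneration, Error>) async throws -> STTOutput? {
    var finalOutput: STTOutput?
    for try await event in stream {
        switch event {
        case .token(let text): print(text, terminator: "")
        case .info(let info): print("\n[\(info.tokenCount) tokens]")
        case .result(let output): finalOutput = output
        }
    }
    return finalOutput
}
```

Carrying tokens, metrics, and the final result in one enum lets a caller handle everything in a single loop instead of juggling separate callbacks.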

- EnergyVADProvider: use threshold 1.0 in detectSpeech since speechProbabilities already returns normalized values
- SequentialChunkingStrategyTests: align mock timeRange with maxChunkDuration and adjust audio lengths to match seek patterns
- VADChunkingStrategyTests: add packSegments: false to prevent segment consolidation that was masking expected chunk counts

Copilot

Pull request overview

This PR removes a large number of language data files and frameworks from the Kokoro TTS system while adding a new Speech-to-Text (STT) feature using Whisper models. The changes include deleting ESpeakNG language configuration files, removing various Kokoro decoder/encoder components, and removing the main ContentView.swift along with associated UI files.

Changes:

Deletion of ESpeakNG language data files for multiple languages (Slovak, Polish, Czech, Serbian, etc.)
Removal of Kokoro TTS decoder components (Generator, MLXSTFT, SineGen, etc.)
Removal of Albert model components and building blocks
Addition of VoicesApp example application with voice cloning capabilities
Addition of GitHub Actions workflow for automated testing

Reviewed changes

Copilot reviewed 200 out of 3636 changed files in this pull request and generated no comments.

File	Description
MLXAudio/Kokoro/Frameworks/ESpeakNG.xcframework/*	Removed language data files and framework headers
MLXAudio/Kokoro/Decoder/*	Removed decoder components including Generator, MLXSTFT, and audio processing modules
MLXAudio/Kokoro/BuildingBlocks/*	Removed neural network building blocks like AdaIN layers, convolution modules
MLXAudio/Kokoro/Albert/*	Removed Albert model implementation files
MLXAudio/ContentView.swift	Removed main application UI
Examples/VoicesApp/*	Added new voice synthesis example app with cloning support
.github/workflows/tests.yaml	Added CI/CD pipeline for automated testing

…ard (Blaizzy#174)

* Qwen3-ASR: chunked prefill + asyncEval + repetition penalty + loop guard

Fixes runaway repetition loop on long-form audio that consumed up to 22 GB RAM and stalled for minutes (related to QwenLM/Qwen3-ASR#129).

Measured on Apple M1 Max with mlx-community/Qwen3-ASR-1.7B-4bit:
- RU 57min: upstream 71-72s -> this branch 65-66s (-8% wall)
- RU 5min: 33s -> 32s (within noise)
- EN 5min: 21s -> 24s (within noise)
- Output bytes parity within +/-0.9% across all benchmarks

Cross-implementation reference (M1 Max, RU 57min):
- mlx-audio-swift Qwen3 1.7B (this branch): 65s, 52x realtime
- FluidAudio Qwen3 CoreML/ANE: 1151s, 3.0x realtime (17x slower)

Changes:
- STTGenerateParameters gains repetitionPenalty (default 1.0 = off, backward compat) and repetitionContextSize (default 32).
- generateSingleChunk: chunked prefill (windowSize=2048) with eval+clearCache between chunks, asyncEval pipelining for the AR loop matching mlx_lm.generate.generate_step pattern, periodic Memory.clearCache() every 256 generated tokens.
- Apply mlx-lm sign-aware repetition penalty before argmax.
- Heuristic fail-safe: stop if last 24 generated tokens contain <=3 unique IDs (degenerate loop detector).

Reproduction:
- /tmp/ru_10min.wav (10-min Russian slice) hit repetition loop in greedy mode (output ended with 'davai, davai, davai...' x100, 127s wall, KV cache up to 22 GB).
- With repetitionPenalty=1.15 + heuristic guard: 51s wall, clean transcript tail, stable memory.

Recommended call site:

let params = STTGenerateParameters(
    language: "Russian",
    repetitionPenalty: 1.15,
    repetitionContextSize: 32
)
let output = model.generate(audio: samples, generationParameters: params)

Backward compatibility: All new parameters default to neutral values. Existing callers see identical greedy argmax behavior.

Build: xcodebuild SUCCEEDED. Tests: 113/113 pass.

* Qwen3-ASR: address Codex review P2 issues

P2 #1: thread repetitionPenalty + repetitionContextSize through the streaming path (generateStream adapter and inner generateStream), mirroring the sync generate path. Apply sign-aware penalty in the streaming AR loop.

P2 #2: gate the hard repeat guard on repetitionPenalty == 1.0 in both sync and streaming paths. When the caller opts into repetition penalty (the proper fix), the heuristic backstop is disabled so legitimately repetitive speech (e.g., 30x 'da') is not truncated. When the caller runs greedy (penalty == 1.0), the backstop still prevents pathological KV-cache OOM loops.

* Qwen3-ASR CLI: expose repetition_penalty / repetition_context_size

Adds two new --gen-kwargs JSON keys so the slim PR's repetition penalty feature is reachable from mlx-audio-swift-stt without library callers.

Usage: mlx-audio-swift-stt --gen-kwargs '{"repetition_penalty":1.15}'

Defaults match STTGenerateParameters: 1.0 / 32 (no behavior change).
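The sign-aware penalty and the degenerate-loop guard described above can be sketched on plain Swift arrays. This is not the branch's code, which operates on MLX arrays inside the decoding loop; the function names are hypothetical, while the defaults (context size 32, window 24, at most 3 unique IDs, guard active only when the penalty is 1.0) follow the commit messages.

```swift
/// Sign-aware repetition penalty in the style of mlx-lm: for each recently
/// generated token, positive logits are divided by the penalty and negative
/// logits are multiplied by it, so the token becomes less likely either way.
func applyRepetitionPenalty(_ logits: [Float], recent: [Int],
                            penalty: Float, contextSize: Int = 32) -> [Float] {
    guard penalty != 1.0 else { return logits }   // 1.0 means "off"
    var out = logits
    for id in Set(recent.suffix(contextSize)) where out.indices.contains(id) {
        out[id] = out[id] > 0 ? out[id] / penalty : out[id] * penalty
    }
    return out
}

/// Degenerate-loop guard: true when the last `window` tokens contain at most
/// `maxUnique` distinct IDs.
func looksDegenerate(_ generated: [Int], window: Int = 24, maxUnique: Int = 3) -> Bool {
    guard generated.count >= window else { return false }
    return Set(generated.suffix(window)).count <= maxUnique
}

/// Per the review fix, the hard guard applies only when the penalty is off (greedy).
func shouldStop(_ generated: [Int], penalty: Float) -> Bool {
    penalty == 1.0 && looksDegenerate(generated)
}
```

Dividing a negative logit by a penalty above 1 would move it toward zero and make the token more likely, which is why the sign check is needed.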

beshkenadze added 30 commits January 6, 2026 17:37

feat(stt): add Tiktoken dependency for Whisper tokenization

1d71782

feat(stt): create STT module directory structure

2547e38

feat(stt): add AudioConstants with Whisper audio parameters

a69005d

feat(stt): add padOrTrim audio utility

c2aefa4

feat(stt): add WhisperConfiguration with model presets

0e9a64b

feat(stt): implement AudioEncoder with Conv1d and transformer blocks

4d1a09e

feat(stt): implement TextDecoder with cross-attention weight capture

b56c651

feat(stt): add StreamingConfig

27f6efe

feat(stt): add alignment heads for all Whisper models

2865f3e

feat(stt): implement StreamingDecoder with AlignAtt core logic

f192e19

feat(stt): add WhisperSession with STTSession protocol and supporting…

e2b6beb

… types

test(stt): add WhisperSession integration tests

a65ef93

docs(stt): add Phase 2 implementation plan for model loading and pipe…

d133d6e

…line

fix(stt): use power-of-two FFT size for vDSP compatibility

a1d61f1

fix(stt): update tests and comments for FFT size change

4072971

feat(stt): add WhisperTokenizer wrapper for swift-tiktoken

20b7450

feat(stt): add model components to WhisperSession

85a1010

fix(stt): add onTermination handler and cleanup currentTask

e6eb321

test(stt): add end-to-end transcription test

4e229c3

fix(stt): pass nMels from config to MelSpectrogram

28ab2b1

refactor(stt): make nMels required parameter (no legacy defaults)

5f42c17

fix(stt): add thread safety to KVCache with NSLock

0dbae87

beshkenadze and others added 18 commits January 8, 2026 19:33

docs(stt): add deduplication strategies documentation

82e8b17

test(stt): add deduplication integration tests

65a09f4

test(stt): add AudioUtils tests and test audio files

68b502c

docs: add Whisper short audio padding research

3bd7191

docs(stt): add simple README with usage examples

8fd173d

chore: remove docs/ from repo, keep locally

9e5b576

chore: remove .serena and claudedocs from repo

e2f0649

Merge branch 'pc/refactor-core' into feat/streaming-stt

249e7bd

Copilot AI review requested due to automatic review settings January 26, 2026 15:17

Copilot AI reviewed Jan 26, 2026

beshkenadze added 5 commits February 2, 2026 14:57

Merge branch 'pc/refactor-core' into feat/streaming-stt

087ca70

whisper: add custom repo loading and CLI support

990fb43

whisper: resolve tokenizer-specific token ids

ad98ddd

tests: add coverage and clean artifacts

5891015

Merge upstream/main into feat/streaming-stt and resolve conflicts

7597672

beshkenadze closed this Mar 7, 2026

beshkenadze mentioned this pull request Apr 23, 2026

Parakeet: batch generation, hybrid TDT, bf16 API, perf fixes #6

Closed

5 tasks

beshkenadze mentioned this pull request Apr 30, 2026

Qwen3-ASR: chunked prefill + asyncEval + repetition penalty + loop guard #10

Open

beshkenadze wants to merge 103 commits into main from feat/streaming-stt

3 participants
