perf(voxtral): O(1)-per-chunk conv-stem for streaming (no more O(n²) prefix recompute) by beshkenadze · Pull Request #17 · beshkenadze/mlx-audio-swift

beshkenadze · 2026-06-11T12:27:02Z

Summary

Follow-up to #16. VoxtralRealtimeStreamSession recomputed mel + conv stem over the entire accumulated audio on every chunk (O(n²) per stream, plus a CPU reflect-pad round-trip of the whole buffer each step). Mid-stream chunks now derive padded-audio geometry arithmetically and compute only the new conv rows from a bounded raw-sample window — O(1) per chunk regardless of stream length.
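To make the complexity claim concrete, here is a rough cost model rather than code from this PR. It counts how many chunk-lengths of audio pass through the stem over a whole stream. The function names and the `windowInChunks` parameter are hypothetical, introduced only for illustration.

```swift
/// Work done by the old path, measured in chunk-lengths of audio.
/// Step k reprocesses all k chunks received so far, so the total is
/// 1 + 2 + ... + n = n(n + 1) / 2, which grows quadratically.
func fullRecomputeWork(chunks: Int) -> Int {
    return chunks * (chunks + 1) / 2
}

/// Work done by an incremental path that touches a fixed-size window per step.
/// `windowInChunks` is a hypothetical bound on that window, in chunk-lengths.
func incrementalWork(chunks: Int, windowInChunks: Int) -> Int {
    return chunks * windowInChunks
}

// The 1520-chunk stream from the Verification section, under this model:
let oldWork = fullRecomputeWork(chunks: 1520)
let newWork = incrementalWork(chunks: 1520, windowInChunks: 2)
print(oldWork, newWork)
```

Under this model the old path handles 1,155,960 chunk-lengths over 1520 chunks, while the incremental path handles a number linear in the chunk count (3040 for an assumed two-chunk window).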

Why it's bit-exact

conv row f needs mel frames [2f-3, 2f+1]; a mel frame covers [m·hop − win/2, m·hop + win/2) of the left-padded audio — bounded lookback/lookahead.
The STFT reflect pad resolves to zeros (the streaming left pad is ≥ win/2 zeros), and frozen frames never reach the right zero-pad (the dropped-partial-frame token formula guarantees ≥ 40 samples of slack for any chunk size).
First chunk and finish() keep the full stem, so the final transcript stays equal to offline generate().
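The bounded-window argument above can be sketched as plain index arithmetic. This is an illustration under stated assumptions, not the session's actual implementation: `SampleWindow` and `sampleWindow(forConvRows:_:hop:win:)` are hypothetical names, and the `hop` and `win` values in the usage lines are illustrative only, since the PR text does not state them. It targets mid-stream rows only, because the first chunk keeps the full stem.

```swift
/// Half-open range [start, end) of samples in the left-padded audio.
struct SampleWindow: Equatable {
    let start: Int
    let end: Int
    var count: Int { end - start }
}

/// Samples needed to compute conv rows [firstRow, endRow), using the two
/// facts stated above: conv row f reads mel frames [2f-3, 2f+1], and mel
/// frame m covers samples [m*hop - win/2, m*hop + win/2).
func sampleWindow(forConvRows firstRow: Int, _ endRow: Int,
                  hop: Int, win: Int) -> SampleWindow {
    precondition(endRow > firstRow, "need at least one conv row")
    let firstMel = 2 * firstRow - 3        // earliest frame read by firstRow
    let lastMel = 2 * (endRow - 1) + 1     // latest frame read by the last row
    return SampleWindow(start: firstMel * hop - win / 2,
                        end: lastMel * hop + win / 2)
}

// Illustrative values (assumptions): hop = 160, win = 400.
let early = sampleWindow(forConvRows: 100, 106, hop: 160, win: 400)
let late = sampleWindow(forConvRows: 10_000, 10_006, hop: 160, win: 400)
print(early.count, late.count)
```

The window length works out to `(2 * rows + 2) * hop + win`, which depends only on how many new rows are requested and not on where in the stream they fall. That is why the per-chunk cost stays constant as the stream grows.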

Verification

WER(stream vs offline) = 0.0000 on:

conversational_a at 90 / 333 / 480 / 730 ms chunk sizes (incl. non-token-aligned)
intention (1.5s), conversational_fr
a 12-minute stream (1520 chunks): median step 0.193s, p90 0.334s @ 480ms budget, end-to-end realtime — step time no longer grows with stream length
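For readers unfamiliar with the metric, WER is conventionally the word-level edit distance between a hypothesis and a reference, divided by the reference word count. The sketch below is a minimal version of that definition, assuming whitespace tokenization and no text normalization. It is not the evaluation script used for the numbers above, and `wordErrorRate` is a hypothetical name.

```swift
/// Word error rate: Levenshtein distance over words / reference word count.
/// Assumes whitespace tokenization and no case or punctuation normalization.
func wordErrorRate(reference: String, hypothesis: String) -> Double {
    let ref = reference.split(whereSeparator: { $0.isWhitespace })
    let hyp = hypothesis.split(whereSeparator: { $0.isWhitespace })
    guard !ref.isEmpty else { return hyp.isEmpty ? 0 : 1 }

    // previous[j] = edit distance between the ref prefix seen so far
    // and the first j hypothesis words.
    var previous = Array(0...hyp.count)
    for (i, refWord) in ref.enumerated() {
        var current = [i + 1]
        current.reserveCapacity(hyp.count + 1)
        for (j, hypWord) in hyp.enumerated() {
            let substitution = previous[j] + (refWord == hypWord ? 0 : 1)
            let deletion = previous[j + 1] + 1
            let insertion = current[j] + 1
            current.append(min(substitution, deletion, insertion))
        }
        previous = current
    }
    return Double(previous[hyp.count]) / Double(ref.count)
}

// Here the offline transcript plays the reference, the streamed one the hypothesis.
print(wordErrorRate(reference: "the cat sat", hypothesis: "the cat sat"))
```

A result of 0 means the streamed transcript matches the offline one word for word, which is the property the checks above assert.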

swift build + swift build --build-tests clean.

The stream session recomputed mel + conv stem over the ENTIRE accumulated audio on every chunk (O(n) per chunk, O(n^2) per stream), including a CPU reflect-pad round-trip of the whole buffer. Mid-stream chunks now derive the padded-audio geometry arithmetically and compute only the new conv rows from a bounded raw-sample window.

Bit-exactness: conv row f needs mel frames [2f-3, 2f+1]; a mel frame covers [m*hop - win/2, m*hop + win/2) of the left-padded audio; the STFT reflect pad resolves to zeros (left pad >= win/2 zeros) and frozen frames never reach the right pad (covered by frozenGuardTokens: the dropped-partial-frame token count guarantees >= 40 samples of slack for any chunk size). First chunk and finish() keep the full stem so the final transcript stays equal to offline generate().

WER(stream vs offline) = 0.0000 on conversational_a at 90/333/480/730ms chunks, intention, conversational_fr, and a 12-min stream (1520 chunks, median step 0.193s, end-to-end realtime).

beshkenadze merged commit 7062889 into main Jun 11, 2026
1 check passed

beshkenadze merged 1 commit into main from perf/voxtral-incremental-convstem
