perf(voxtral): O(1)-per-chunk conv-stem for streaming (no more O(n²) prefix recompute)#17

Merged
beshkenadze merged 1 commit into main from perf/voxtral-incremental-convstem
Jun 11, 2026
@beshkenadze

Owner

Summary

Follow-up to #16. VoxtralRealtimeStreamSession recomputed mel + conv stem over the entire accumulated audio on every chunk (O(n²) per stream, plus a CPU reflect-pad round-trip of the whole buffer each step). Mid-stream chunks now derive padded-audio geometry arithmetically and compute only the new conv rows from a bounded raw-sample window — O(1) per chunk regardless of stream length.
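To make the complexity claim concrete, here is a rough cost model that counts raw samples touched per stream under each strategy. It is a sketch only: the function names, the 16 kHz sample rate behind `chunkSamples`, and the `contextSamples` constant are assumptions for illustration, not values taken from the session code.

```swift
// Hypothetical cost model: counts samples touched, not wall-clock time.

/// Samples processed when every chunk recomputes the whole accumulated buffer.
func fullRecomputeCost(chunks: Int, chunkSamples: Int) -> Int {
    // Chunk k (1-based) sees k * chunkSamples samples, so the total is
    // chunkSamples * (1 + 2 + ... + chunks).
    return chunkSamples * chunks * (chunks + 1) / 2
}

/// Samples processed when every chunk reads only its new samples plus a
/// fixed-size context window (contextSamples is an assumed constant).
func incrementalCost(chunks: Int, chunkSamples: Int, contextSamples: Int) -> Int {
    return chunks * (chunkSamples + contextSamples)
}

let chunks = 1520          // chunk count of the 12-minute run reported below
let chunkSamples = 7680    // assumed: 480 ms at 16 kHz
let contextSamples = 1200  // assumed lookback/lookahead, illustration only

let full = fullRecomputeCost(chunks: chunks, chunkSamples: chunkSamples)
let incremental = incrementalCost(chunks: chunks,
                                  chunkSamples: chunkSamples,
                                  contextSamples: contextSamples)
print("full: \(full), incremental: \(incremental)")
```

The full-recompute total grows quadratically with the chunk count, while the incremental total grows linearly, which is the per-chunk O(n) versus O(1) difference described above.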

Why it's bit-exact

  • conv row f needs mel frames [2f-3, 2f+1]; a mel frame covers [m·hop − win/2, m·hop + win/2) of the left-padded audio — bounded lookback/lookahead.
  • The STFT reflect pad resolves to zeros because the streaming left pad is at least win/2 zeros, and frozen frames never reach the right zero-pad: the dropped-partial-frame token formula guarantees ≥ 40 samples of slack for any chunk size.
  • First chunk and finish() keep the full stem, so the final transcript stays equal to offline generate().
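The index arithmetic in the first bullet can be sketched as below. The helper names are hypothetical, and `hop = 160` and `win = 400` are assumed values for illustration; only the two interval formulas come from this description.

```swift
/// Closed range of mel frames feeding conv row f: [2f-3, 2f+1].
func melFrames(forConvRow f: Int) -> ClosedRange<Int> {
    return (2 * f - 3)...(2 * f + 1)
}

/// Half-open sample range of the left-padded audio covered by mel frame m:
/// [m*hop - win/2, m*hop + win/2).
func sampleRange(forMelFrame m: Int, hop: Int, win: Int) -> Range<Int> {
    return (m * hop - win / 2)..<(m * hop + win / 2)
}

/// Raw-sample window needed to compute a contiguous run of conv rows.
func sampleWindow(forConvRows rows: ClosedRange<Int>, hop: Int, win: Int) -> Range<Int> {
    let firstFrame = melFrames(forConvRow: rows.lowerBound).lowerBound
    let lastFrame = melFrames(forConvRow: rows.upperBound).upperBound
    let lo = sampleRange(forMelFrame: firstFrame, hop: hop, win: win).lowerBound
    let hi = sampleRange(forMelFrame: lastFrame, hop: hop, win: win).upperBound
    return lo..<hi
}

let hop = 160, win = 400  // assumed values, not read from the model config

// Three new conv rows early in a stream and three much later in it.
let early = sampleWindow(forConvRows: 100...102, hop: hop, win: win)
let late = sampleWindow(forConvRows: 10_000...10_002, hop: hop, win: win)
print(early.count, late.count)
```

The window length depends only on how many rows are requested, not on where they sit in the stream, which is why the per-chunk work stays bounded.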

Verification

WER(stream vs offline) = 0.0000 on:

  • conversational_a at 90 / 333 / 480 / 730 ms chunk sizes (incl. non-token-aligned)
  • intention (1.5s), conversational_fr
  • a 12-minute stream (1520 chunks): median step 0.193s, p90 0.334s @ 480ms budget, end-to-end realtime — step time no longer grows with stream length
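As a quick sanity check on the 12-minute numbers, the real-time factor (step time divided by chunk duration) can be computed as below. This assumes the "480ms budget" is the audio duration of one chunk; `realTimeFactor` is a hypothetical helper, not part of the package.

```swift
/// Real-time factor: processing time divided by the audio duration it covers.
/// A value below 1.0 means the step finishes before the next chunk arrives.
func realTimeFactor(stepSeconds: Double, chunkSeconds: Double) -> Double {
    return stepSeconds / chunkSeconds
}

let chunkSeconds = 0.480  // assumed to be the per-chunk audio duration
let medianRTF = realTimeFactor(stepSeconds: 0.193, chunkSeconds: chunkSeconds)
let p90RTF = realTimeFactor(stepSeconds: 0.334, chunkSeconds: chunkSeconds)

print("median RTF below 1: \(medianRTF < 1.0), p90 RTF below 1: \(p90RTF < 1.0)")
```

Under that assumption both the median and p90 steps fit inside the chunk duration.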

swift build + swift build --build-tests clean.

The stream session recomputed mel + conv stem over the ENTIRE
accumulated audio on every chunk — O(n) per chunk, O(n^2) per stream —
including a CPU reflect-pad round-trip of the whole buffer. Mid-stream
chunks now derive the padded-audio geometry arithmetically and compute
only the new conv rows from a bounded raw-sample window.

Bit-exactness: conv row f needs mel frames [2f-3, 2f+1]; a mel frame
covers [m*hop - win/2, m*hop + win/2) of the left-padded audio; the
STFT reflect pad resolves to zeros (left pad >= win/2 zeros) and frozen
frames never reach the right pad (covered by frozenGuardTokens — the
dropped-partial-frame token count guarantees >= 40 samples of slack for
any chunk size). First chunk and finish() keep the full stem so the
final transcript stays equal to offline generate().

WER(stream vs offline) = 0.0000 on conversational_a at 90/333/480/730ms
chunks, intention, conversational_fr, and a 12-min stream (1520 chunks,
median step 0.193s, end-to-end realtime).
@beshkenadze beshkenadze merged commit 7062889 into main Jun 11, 2026
1 check passed