feat(nemotron_asr): incremental NemotronASRStreamSession for live mic #21
Closed
beshkenadze wants to merge 3 commits into
…zy#205)

VoxtralRealtimeModel ran the "Realtime" model offline: generateStream consumed the whole buffer up front, then only yielded the finished transcript. Add a genuine online path.

- VoxtralRealtimeStreamSession: stateful step(samples)/finish() that ingests audio as it arrives, feeds only newly-frozen conv frames through the transformer encoder with a persistent per-layer KV-cache (O(1) per chunk), maintains the decoder KV-cache, and emits tokens at the model's native transcription delay.
- Encoder: encodeIncremental + sliding-window-aligned block feeding with a cache reset at each boundary. RoPE is relative-position invariant, so this is bit-exact to encodeFull (<=sw) and encodeChunked (>sw) in one path. Refactor a convStemForAudio/encodeAudio seam shared with offline.
- mlx-audio-swift-stt: route Voxtral --stream through the session (feed 480 ms chunks, print deltas live).
- Fixture-based parity test (compile-guarded; swift test can't load the MLX metallib, so MLX runtime verification goes through an executable).

Validated (4-bit, mem-capped 18 GB): WER 0.0000 vs offline on intention (1.5 s) and conversational_a (13.26 s); the 13 s clip runs at RTF 0.649 with median 0.285 s/chunk (budget 0.480 s), i.e. steady-state and end-to-end realtime.

(cherry picked from commit 9b46bca)
Co-authored-by: Lucas Newman <lucas@future.fit>
generateStream computes the whole-utterance mel up front, so a live caller only
gets text after the entire buffer is in. NemotronASRStreamSession ingests audio
as it arrives (step([Float]) -> Delta / finish()) and emits text per frozen chunk
with the model's native delay.
Bit-identical to generateStream(wholeAudio): normalize==NA makes mel frames
independent; the centered STFT freezes frame m once m*hop+nFft/2<=bufferLen, so
the session only feeds the encoder frozen whole chunks (tail flushed in finish()).
Encoder + greedy RNN-T state moved into NemotronASRStreamEncoderState /
NemotronASRStreamRNNTState; the chunk loop and the RNN-T decode are now shared by
generateStream and the session (SSOT, no duplicated loops).
CLI --stream drives the session for Nemotron (mirrors Voxtral). Tests assert
session(chunked) == generateStream(whole) and feed-granularity invariance on a
tiny synthetic model.
Verified on mlx-community/nemotron-3.5-asr-streaming-0.6b-8bit (FLEURS-ru): 31/31
session==generateStream@{80ms,480ms feed}; TTFT 2.49s -> 0.09s on a 47s clip.
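
The freezing rule in the message above (frame m is final once m*hop + nFft/2 <= bufferLen) can be turned into a small counting helper. This is only a sketch of the arithmetic, not the PR's code: the function names, the framesPerChunk parameter, and the example values hop = 160 and nFft = 512 are assumptions made for illustration.

```swift
/// Number of centered-STFT frames that are already final for a buffer of
/// `bufferLen` samples: frames m >= 0 with m * hop + nFft / 2 <= bufferLen.
/// Such a frame no longer depends on samples that have not arrived yet.
func frozenFrameCount(bufferLen: Int, hop: Int, nFft: Int) -> Int {
    let half = nFft / 2
    guard bufferLen >= half else { return 0 }
    // Largest m with m * hop <= bufferLen - half, plus one for frame 0.
    return (bufferLen - half) / hop + 1
}

/// Frames that could be handed to the encoder right now if it only accepts
/// whole chunks of `framesPerChunk` frozen frames (hypothetical parameter).
func feedableFrames(bufferLen: Int, hop: Int, nFft: Int, framesPerChunk: Int) -> Int {
    let frozen = frozenFrameCount(bufferLen: bufferLen, hop: hop, nFft: nFft)
    return (frozen / framesPerChunk) * framesPerChunk
}

// Example call with assumed parameters: 480 ms of 16 kHz audio is 7680 samples.
let frozenAfterOneFeed = frozenFrameCount(bufferLen: 7_680, hop: 160, nFft: 512)
```

Frames past the last whole chunk stay buffered until more audio arrives or the tail is flushed, which is the role the message gives to finish().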
Opened upstream as Blaizzy#208.
What

Adds NemotronASRStreamSession, a true online streaming session for Nemotron 3.5 ASR.

The existing generateStream computes the whole-utterance mel up front, so a live caller only receives text once the entire buffer is available. NemotronASRStreamSession ingests audio incrementally as it arrives:

- step(_ samples: [Float]) -> Delta: feed a chunk, get the newly decoded text
- finish() -> Delta: flush the tail

This enables low-latency live transcription (e.g. microphone input) without waiting for the full utterance.
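
The step/finish contract and the feed-granularity invariance the tests assert can be illustrated with a toy stand-in. Everything below is hypothetical: ToyStreamSession, its chunkSize, and the one-character "decoder" are invented for the sketch and say nothing about how NemotronASRStreamSession is implemented; only the step(_:) -> Delta / finish() -> Delta shape is taken from the description.

```swift
// Hypothetical stand-in for the session's shape; not the PR's implementation.
struct Delta: Equatable {
    var text: String
}

struct ToyStreamSession {
    private let chunkSize: Int          // samples per complete chunk (assumed)
    private var pending: [Float] = []   // samples not yet forming a whole chunk

    init(chunkSize: Int) {
        self.chunkSize = chunkSize
    }

    // Feed samples; emit text only for chunks that are now complete.
    mutating func step(_ samples: [Float]) -> Delta {
        pending.append(contentsOf: samples)
        var text = ""
        while pending.count >= chunkSize {
            let chunk = Array(pending.prefix(chunkSize))
            pending.removeFirst(chunkSize)
            text += Self.decode(chunk)
        }
        return Delta(text: text)
    }

    // Flush whatever is left as a final, possibly short, chunk.
    mutating func finish() -> Delta {
        defer { pending.removeAll() }
        return Delta(text: pending.isEmpty ? "" : Self.decode(pending))
    }

    // Toy "decoder": one character per chunk, derived from the chunk's samples.
    private static func decode(_ chunk: [Float]) -> String {
        let positives = chunk.filter { $0 > 0 }.count
        return positives * 2 >= chunk.count ? "+" : "-"
    }
}

// Drive a session with `feed` samples per step() call and collect all text.
func transcribe(_ audio: [Float], feed: Int, chunkSize: Int) -> String {
    var session = ToyStreamSession(chunkSize: chunkSize)
    var out = ""
    var start = 0
    while start < audio.count {
        let end = min(start + feed, audio.count)
        out += session.step(Array(audio[start..<end])).text
        start = end
    }
    out += session.finish().text
    return out
}
```

Because the toy only ever decodes whole chunks cut at fixed offsets from the start of the stream, the concatenated text does not depend on how the caller slices the audio into step() calls. That is the property the PR's tests check for the real session against generateStream.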
Changes (4 files)

- NemotronASRStreamSession.swift (new): the online streaming session
- NemotronASRModel.swift, NemotronASRStreaming.swift: extract the streaming primitives so they can be driven incrementally; no change to the existing offline / generateStream behaviour
- Tests/MLXAudioSTTTests.swift: coverage for the session

Validation

- swift build passes against current main.
- step/finish yields a transcript consistent with the offline path.