feat(nemotron_asr): incremental NemotronASRStreamSession for live mic #20
Closed
beshkenadze wants to merge 1 commit into
generateStream computes the whole-utterance mel up front, so a live caller only
gets text after the entire buffer is in. NemotronASRStreamSession ingests audio
as it arrives (step([Float]) -> Delta / finish()) and emits text per frozen chunk
with the model's native delay.
Bit-identical to generateStream(wholeAudio): normalize==NA makes mel frames
independent; the centered STFT freezes frame m once m*hop+nFft/2<=bufferLen, so
the session only feeds the encoder frozen whole chunks (tail flushed in finish()).
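
The freeze condition above can be turned into a frame count. This is a minimal sketch, not the PR's code: the function name `frozenFrameCount` and its parameter labels are hypothetical, and it assumes frames are indexed from m = 0 and that nFft is even.

```swift
/// Number of mel frames that can no longer change, given `bufferLen`
/// samples received so far. Frame m is frozen once
/// m * hop + nFft / 2 <= bufferLen (centered STFT).
/// Hypothetical helper for illustration; not taken from the PR.
func frozenFrameCount(bufferLen: Int, hop: Int, nFft: Int) -> Int {
    precondition(hop > 0 && nFft > 0 && bufferLen >= 0)
    let half = nFft / 2
    // Not even frame 0 has its full right context yet.
    guard bufferLen >= half else { return 0 }
    // Largest m with m * hop + half <= bufferLen, plus one for frame 0.
    return (bufferLen - half) / hop + 1
}
```

With assumed values hop = 160 and nFft = 512 (illustrative only), a buffer of 1280 samples yields 7 frozen frames: frame 6 needs 6·160 + 256 = 1216 samples, frame 7 would need 1376.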
Encoder + greedy RNN-T state moved into NemotronASRStreamEncoderState /
NemotronASRStreamRNNTState; the chunk loop and the RNN-T decode are now shared by
generateStream and the session (SSOT, no duplicated loops).
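
The state-lifting idea can be sketched with a toy loop. Everything here is hypothetical (`ToyStreamState`, `processChunks`, `offline`, `ToySession`); a running sum stands in for the real encoder cache and RNN-T state, only to show one loop serving both the offline and the incremental path.

```swift
// Hypothetical stand-in for state carried across chunks.
struct ToyStreamState {
    var carry: Float = 0     // information carried over from earlier chunks
    var chunksSeen: Int = 0  // how many chunks have been processed
}

// The single shared chunk loop: processes `chunks`, resuming from `state`.
func processChunks(_ chunks: [[Float]], state: inout ToyStreamState) -> [Float] {
    var outputs: [Float] = []
    for chunk in chunks {
        let sum = chunk.reduce(state.carry, +)
        state.carry = sum
        state.chunksSeen += 1
        outputs.append(sum)
    }
    return outputs
}

// Offline path: fresh state, one call over every chunk.
func offline(_ chunks: [[Float]]) -> [Float] {
    var state = ToyStreamState()
    return processChunks(chunks, state: &state)
}

// Incremental path: the same loop, one chunk per call, state kept between calls.
final class ToySession {
    private var state = ToyStreamState()

    func step(_ chunk: [Float]) -> [Float] {
        processChunks([chunk], state: &state)
    }
}
```

Because both paths run the same loop over the same state type, the session's concatenated outputs match the offline result by construction rather than by keeping two loops in sync.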
CLI --stream drives the session for Nemotron (mirrors Voxtral). Tests assert
session(chunked) == generateStream(whole) and feed-granularity invariance on a
tiny synthetic model.
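
The property those tests assert can be illustrated without a model. This is a sketch under stated assumptions, not the PR's test code: `ToyChunkSession` and `run` are hypothetical names, and the "encoder" is replaced by simply returning each whole chunk, with the tail zero-padded in `finish()`.

```swift
// Hypothetical toy session: buffers samples, emits only whole chunks,
// and flushes the zero-padded tail in finish().
final class ToyChunkSession {
    private let chunkSize: Int
    private var pending: [Float] = []

    init(chunkSize: Int) {
        precondition(chunkSize > 0)
        self.chunkSize = chunkSize
    }

    // Accepts any number of samples; returns the whole chunks now available.
    func step(_ samples: [Float]) -> [[Float]] {
        pending.append(contentsOf: samples)
        var out: [[Float]] = []
        while pending.count >= chunkSize {
            out.append(Array(pending.prefix(chunkSize)))
            pending.removeFirst(chunkSize)
        }
        return out
    }

    // Right-pads the trailing partial chunk with zeros and emits it.
    func finish() -> [[Float]] {
        guard !pending.isEmpty else { return [] }
        let padded = pending + [Float](repeating: 0, count: chunkSize - pending.count)
        pending.removeAll()
        return [padded]
    }
}

// Feeds `audio` in pieces of `feed` samples and collects every emitted chunk.
func run(_ audio: [Float], feed: Int, chunkSize: Int) -> [[Float]] {
    precondition(feed > 0)
    let session = ToyChunkSession(chunkSize: chunkSize)
    var out: [[Float]] = []
    var i = 0
    while i < audio.count {
        let end = min(i + feed, audio.count)
        out += session.step(Array(audio[i..<end]))
        i = end
    }
    return out + session.finish()
}
```

Feeding the same audio one sample at a time, three at a time, or all at once produces the same chunk sequence, because chunk boundaries depend only on the total samples buffered, never on how they arrived.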
Verified on mlx-community/nemotron-3.5-asr-streaming-0.6b-8bit (FLEURS-ru): 31/31
session==generateStream@{80ms,480ms feed}; TTFT 2.49s -> 0.09s on a 47s clip.
cb488bd to d506509
Superseded by #21 (clean branch off current main, App.swift demo factored out).
What
Adds NemotronASRStreamSession, a true incremental (online) streaming session for Nemotron 3.5 ASR, so a live caller (e.g. a mic feeding 80–480 ms chunks) gets text as audio arrives instead of only after the whole buffer is in.

Why
generateStream(...) computes the mel of the whole utterance up front, then walks the cache‑aware encoder, so nothing comes out until the end. On a 47 s clip that's TTFT ≈ 2.49 s ≈ wall time. The session emits per frozen chunk: TTFT 2.49 s → 0.09 s (×27), RTF unchanged (~0.048). This is what makes Nemotron usable for live mic / two‑tier STT.

How it stays bit‑identical to generateStream(wholeAudio)
- normalize: "NA" → each mel frame is an independent function of a fixed sample window (no per‑utterance mean/std that shifts as audio grows); preemph is causal.
- The STFT is centered with an nFft/2 zero‑pad, so mel frame m covers samples [m·hop − nFft/2, m·hop + nFft/2). Frame m is frozen (unaffected by future audio) once m·hop + nFft/2 ≤ bufferLen. The session feeds the encoder only frozen, whole chunks; the trailing partial chunk is flushed in finish(), reproducing the offline right‑pad exactly.
- Encoder + greedy RNN‑T state moved into NemotronASRStreamEncoderState / NemotronASRStreamRNNTState; the chunk loop and the RNN‑T decode are now shared by generateStream and the session (SSOT: no duplicated loops; generateStream's observable behavior is unchanged).

Verification
- Verified on mlx-community/nemotron-3.5-asr-streaming-0.6b-8bit (FLEURS‑ru): 31/31, session(fed @80 ms) == session(fed @480 ms) == generateStream(wholeAudio), byte‑identical transcript.
- Tests on a tiny synthetic model assert session(chunked) == generateStream(whole) and feed‑granularity invariance (gated behind MLXAUDIO_ENABLE_MLX_RUNTIME_TESTS=1, like the other Nemotron MLX tests).

Files to review (5)
- Sources/MLXAudioSTT/Models/NemotronASR/NemotronASRStreamSession.swift: new session + shared RNN‑T decode + makeStreamSession
- Sources/MLXAudioSTT/Models/NemotronASR/NemotronASRStreaming.swift: resumable streamEncodeChunks (state lifted out of the loop)
- Sources/MLXAudioSTT/Models/NemotronASR/NemotronASRModel.swift: generateStream now reuses the shared loops
- Sources/Tools/mlx-audio-swift-stt/App.swift: --stream drives the session for Nemotron (mirrors Voxtral)
- Tests/MLXAudioSTTTests.swift: parity + granularity‑invariance tests

This branch is based on upstream/main (which has the NemotronASR base from Blaizzy#195–Blaizzy#198). Your fork's main predates that, so merging here also carries the upstream additions your main lacks (Whisper/Kokoro/Irodori/Nemotron). For a clean session‑only diff, Sync fork → upstream first; this PR then re‑computes to just the 5 files above. Your fork‑only Voxtral incremental‑streaming work is untouched by this branch.

Note: the dev parity/TTFT exe (nemo-parity) is kept off this PR (absolute‑path manifest); it lives on local branch dev/nemo-parity-harness.