feat(nemotron_asr): incremental NemotronASRStreamSession for live mic #20

Closed
beshkenadze wants to merge 1 commit into main from pr/nemotron-stream-session

@beshkenadze

Owner

What

Adds NemotronASRStreamSession — a true incremental (online) streaming session for Nemotron 3.5 ASR, so a live caller (e.g. a mic feeding 80–480 ms chunks) gets text as audio arrives instead of only after the whole buffer is in.

let session = model.makeStreamSession(language: "ru")   // or chunkMs: 320
let delta = session.step(samples)   // text decoded so far this call
let tail  = session.finish()        // flush trailing partial chunk

Why

generateStream(...) computes the mel of the whole utterance up front, then walks the cache‑aware encoder, so nothing comes out until the end. On a 47 s clip the time to first token (TTFT) is ≈ 2.49 s, which is essentially the whole wall‑clock decode time. The session emits text per frozen chunk instead: TTFT drops from 2.49 s to 0.09 s (about ×27), while the real‑time factor (RTF) is unchanged (~0.048). This is what makes Nemotron usable for live mic / two‑tier STT.

How it stays bit‑identical to generateStream(wholeAudio)

  • The preprocessor uses normalize: "NA", so each mel frame is an independent function of a fixed sample window (there is no per‑utterance mean/std that shifts as audio grows); preemph is causal.
  • The STFT centers with nFft/2 zero‑pad, so mel frame m covers samples [m·hop − nFft/2, m·hop + nFft/2). Frame m is frozen (unaffected by future audio) once m·hop + nFft/2 ≤ bufferLen. The session feeds the encoder only frozen, whole chunks; the trailing partial chunk is flushed in finish(), reproducing the offline right‑pad exactly.
  • Encoder + greedy RNN‑T state are lifted into NemotronASRStreamEncoderState / NemotronASRStreamRNNTState; the chunk loop and the RNN‑T decode are now shared by generateStream and the session (SSOT — no duplicated loops; generateStream's observable behavior is unchanged).
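
The freezing rule in the second bullet can be sketched as plain integer arithmetic. This is a minimal illustration, not code from the PR: the function names are hypothetical, and the values of hop, nFft and framesPerChunk (along with the 16 kHz sample rate they imply) are assumptions chosen for the example rather than the model's actual configuration.

```swift
// Illustrative values only (assumed): 16 kHz audio, 10 ms hop, 512-point FFT,
// 8 mel frames per encoder chunk. The real model config may differ.
let hop = 160
let nFft = 512
let framesPerChunk = 8

/// Number of mel frames m whose window [m*hop - nFft/2, m*hop + nFft/2)
/// already ends inside the buffered audio, i.e. m*hop + nFft/2 <= bufferLen.
func frozenFrameCount(bufferLen: Int, hop: Int, nFft: Int) -> Int {
    let half = nFft / 2
    guard bufferLen >= half else { return 0 }
    // The largest frozen index is (bufferLen - half) / hop; frames are 0...that index.
    return (bufferLen - half) / hop + 1
}

/// Whole chunks made only of frozen frames; these can go to the encoder
/// without ever being revised by later audio.
func frozenChunkCount(bufferLen: Int, hop: Int, nFft: Int, framesPerChunk: Int) -> Int {
    return frozenFrameCount(bufferLen: bufferLen, hop: hop, nFft: nFft) / framesPerChunk
}

// Feed 80 ms (1280 samples at the assumed 16 kHz) at a time and report progress.
var bufferLen = 0
for _ in 0..<6 {
    bufferLen += 1_280
    let frames = frozenFrameCount(bufferLen: bufferLen, hop: hop, nFft: nFft)
    let chunks = frozenChunkCount(bufferLen: bufferLen, hop: hop, nFft: nFft,
                                  framesPerChunk: framesPerChunk)
    print("buffered \(bufferLen) samples: \(frames) frozen frames, \(chunks) whole chunks")
}
```

Because the count depends only on the total buffered length, not on how the samples were split across step calls, any feed granularity reaches the same chunk boundaries.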

Verification

  • Parity (real model, mlx-community/nemotron-3.5-asr-streaming-0.6b-8bit, FLEURS‑ru): 31/31: session(fed @80 ms) == session(fed @480 ms) == generateStream(wholeAudio), byte‑identical transcript.
  • TTFT 2.49 s → 0.09 s on a 47 s meeting clip.
  • Unit tests on a tiny synthetic model assert session(chunked) == generateStream(whole) and feed‑granularity invariance (gated behind MLXAUDIO_ENABLE_MLX_RUNTIME_TESTS=1, like the other Nemotron MLX tests).
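
The shape of the feed‑granularity check can be illustrated with a toy stand‑in. Nothing below is the real NemotronASR API: ToyStreamSession, its windowed‑sum "mel frame", the hashed "token" per chunk, and the hop/nFft/framesPerChunk values are all assumptions made up for this sketch. Only the step/finish shape and the frozen‑frame rule follow the description above, and feeding the whole buffer in one step stands in for the offline path.

```swift
// Toy stand-in for the parity test. All names and constants are hypothetical.
struct ToyStreamSession {
    let hop = 160, nFft = 512, framesPerChunk = 8   // assumed, illustrative values
    var buffer: [Float] = []
    var nextFrame = 0        // first frame not yet consumed
    var state: UInt64 = 0    // stands in for the carried encoder + RNN-T state

    // Frame m depends only on samples [m*hop - nFft/2, m*hop + nFft/2), zero-padded.
    func frame(_ m: Int) -> Float {
        let start = m * hop - nFft / 2
        var sum: Float = 0
        for i in start..<(start + nFft) where i >= 0 && i < buffer.count {
            sum += buffer[i]
        }
        return sum
    }

    // Fold a run of frames into the carried state and return it as one "token".
    mutating func consume(_ range: Range<Int>) -> UInt64 {
        for m in range {
            state = state &* 1_099_511_628_211 &+ UInt64(frame(m).bitPattern)
        }
        nextFrame = range.upperBound
        return state
    }

    /// Feed more audio; returns one token per newly frozen whole chunk.
    mutating func step(_ samples: [Float]) -> [UInt64] {
        buffer += samples
        let half = nFft / 2
        let frozen = buffer.count >= half ? (buffer.count - half) / hop + 1 : 0
        var out: [UInt64] = []
        while nextFrame + framesPerChunk <= frozen {
            out.append(consume(nextFrame..<(nextFrame + framesPerChunk)))
        }
        return out
    }

    /// Flush the remaining frames (right zero-padded), ending with the partial chunk.
    mutating func finish() -> [UInt64] {
        let total = buffer.count / hop + 1
        var out: [UInt64] = []
        while nextFrame < total {
            out.append(consume(nextFrame..<min(nextFrame + framesPerChunk, total)))
        }
        return out
    }
}

// Run a session over `audio`, feeding `feed` samples per step call.
func transcript(_ audio: [Float], feed: Int) -> [UInt64] {
    var session = ToyStreamSession()
    var out: [UInt64] = []
    var i = 0
    while i < audio.count {
        let j = min(i + feed, audio.count)
        out += session.step(Array(audio[i..<j]))
        i = j
    }
    return out + session.finish()
}

// Deterministic pseudo-audio: 1 s at an assumed 16 kHz sample rate.
let audio: [Float] = (0..<16_000).map { Float(($0 * 7919) % 1000) / 1000 - 0.5 }
let whole = transcript(audio, feed: audio.count)   // whole-buffer analogue
let at80  = transcript(audio, feed: 1_280)         // 80 ms feeds
let at480 = transcript(audio, feed: 7_680)         // 480 ms feeds
print(at80 == whole && at480 == whole)
```

The comparison uses exact equality on the Float bit patterns rather than a tolerance, mirroring the byte‑identical claim: a frozen frame sums the same samples in the same order no matter how the audio was fed.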

Files to review (5)

  • Sources/MLXAudioSTT/Models/NemotronASR/NemotronASRStreamSession.swift: new session + shared RNN‑T decode + makeStreamSession
  • Sources/MLXAudioSTT/Models/NemotronASR/NemotronASRStreaming.swift: resumable streamEncodeChunks (state lifted out of the loop)
  • Sources/MLXAudioSTT/Models/NemotronASR/NemotronASRModel.swift: generateStream now reuses the shared loops
  • Sources/Tools/mlx-audio-swift-stt/App.swift: --stream drives the session for Nemotron (mirrors Voxtral)
  • Tests/MLXAudioSTTTests.swift: parity + granularity‑invariance tests

⚠️ Branch base note

This branch is based on upstream/main (which has the NemotronASR base from Blaizzy#195, Blaizzy#198). Your fork's main predates that, so merging here also carries the upstream additions your main lacks (Whisper/Kokoro/Irodori/Nemotron). For a clean session‑only diff, Sync fork → upstream first, then this PR re‑computes to just the 5 files above. Your fork‑only Voxtral incremental‑streaming work is untouched by this branch.

Note: the dev parity/TTFT exe (nemo-parity) is kept off this PR (absolute‑path manifest); it lives on local branch dev/nemo-parity-harness.

generateStream computes the whole-utterance mel up front, so a live caller only
gets text after the entire buffer is in. NemotronASRStreamSession ingests audio
as it arrives (step([Float]) -> Delta / finish()) and emits text per frozen chunk
with the model's native delay.

Bit-identical to generateStream(wholeAudio): normalize==NA makes mel frames
independent; the centered STFT freezes frame m once m*hop+nFft/2<=bufferLen, so
the session only feeds the encoder frozen whole chunks (tail flushed in finish()).
Encoder + greedy RNN-T state moved into NemotronASRStreamEncoderState /
NemotronASRStreamRNNTState; the chunk loop and the RNN-T decode are now shared by
generateStream and the session (SSOT, no duplicated loops).

CLI --stream drives the session for Nemotron (mirrors Voxtral). Tests assert
session(chunked) == generateStream(whole) and feed-granularity invariance on a
tiny synthetic model.

Verified on mlx-community/nemotron-3.5-asr-streaming-0.6b-8bit (FLEURS-ru): 31/31
session==generateStream@{80ms,480ms feed}; TTFT 2.49s -> 0.09s on a 47s clip.

@beshkenadze

Owner Author

Superseded by #21 (clean branch off current main, App.swift demo factored out).
