feat(nemotron_asr): incremental NemotronASRStreamSession for live mic by beshkenadze · Pull Request #20 · beshkenadze/mlx-audio-swift

beshkenadze · 2026-06-13T10:14:55Z

What

Adds NemotronASRStreamSession — a true incremental (online) streaming session for Nemotron 3.5 ASR, so a live caller (e.g. a mic feeding 80–480 ms chunks) gets text as audio arrives instead of only after the whole buffer is in.

let session = model.makeStreamSession(language: "ru")   // or chunkMs: 320
let delta = session.step(samples)   // text decoded so far this call
let tail  = session.finish()        // flush trailing partial chunk

Why

generateStream(...) computes the mel spectrogram of the whole utterance up front, then walks the cache‑aware encoder, so nothing comes out until the end. On a 47 s clip that gives TTFT ≈ 2.49 s, which is essentially the full wall-clock time of the run. The session emits text per frozen chunk instead: TTFT 2.49 s → 0.09 s (×27), RTF unchanged (~0.048). This is what makes Nemotron usable for live mic / two‑tier STT.

How it stays bit‑identical to `generateStream(wholeAudio)`

The preprocessor uses normalize: "NA", so each mel frame is an independent function of a fixed sample window (there is no per‑utterance mean/std that shifts as audio grows), and the pre-emphasis filter is causal.
The STFT centers with nFft/2 zero‑pad, so mel frame m covers samples [m·hop − nFft/2, m·hop + nFft/2). Frame m is frozen (unaffected by future audio) once m·hop + nFft/2 ≤ bufferLen. The session feeds the encoder only frozen, whole chunks; the trailing partial chunk is flushed in finish(), reproducing the offline right‑pad exactly.
Encoder and greedy RNN‑T state are lifted into NemotronASRStreamEncoderState / NemotronASRStreamRNNTState; the chunk loop and the RNN‑T decode are now shared by generateStream and the session (single source of truth: no duplicated loops, and generateStream's observable behavior is unchanged).
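
The freezing rule above can be sketched as plain integer arithmetic. This is a minimal illustration, not the code in this PR: the FrameGeometry type, its method names, and the values hop = 160, nFft = 512, framesPerChunk = 8 are assumptions chosen for the example; the real values come from the model's preprocessor and encoder config.

```swift
// Hypothetical helper, not part of the PR: counts frames that future audio cannot change.
struct FrameGeometry {
    let hop: Int            // samples between consecutive mel frames (assumed value below)
    let nFft: Int           // STFT window length in samples (assumed value below)
    let framesPerChunk: Int // mel frames consumed per encoder chunk (assumed value below)

    /// Number of frames m for which m*hop + nFft/2 <= bufferLen holds.
    func frozenFrames(bufferLen: Int) -> Int {
        let half = nFft / 2
        guard bufferLen >= half else { return 0 }
        // Largest valid m is (bufferLen - half) / hop; add one to count frame 0.
        return (bufferLen - half) / hop + 1
    }

    /// Number of whole chunks made only of frozen frames.
    func frozenChunks(bufferLen: Int) -> Int {
        frozenFrames(bufferLen: bufferLen) / framesPerChunk
    }
}

let geometry = FrameGeometry(hop: 160, nFft: 512, framesPerChunk: 8)
print(geometry.frozenFrames(bufferLen: 255))   // 0: frame 0 still needs sample 255
print(geometry.frozenFrames(bufferLen: 256))   // 1
print(geometry.frozenFrames(bufferLen: 1280))  // 7
print(geometry.frozenChunks(bufferLen: 1375))  // 0: only 7 frozen frames
print(geometry.frozenChunks(bufferLen: 1376))  // 1: frame 7 just froze
```

With these assumed numbers, the first whole chunk becomes available after 1376 buffered samples; anything short of a whole chunk waits for more audio or for finish().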

Verification

Parity (real model, mlx-community/nemotron-3.5-asr-streaming-0.6b-8bit, FLEURS‑ru): 31/31 — session(fed @80 ms) == session(fed @480 ms) == generateStream(wholeAudio), byte‑identical transcript.
TTFT 2.49 s → 0.09 s on a 47 s meeting clip.
Unit tests on a tiny synthetic model assert session(chunked) == generateStream(whole) and feed‑granularity invariance (gated behind MLXAUDIO_ENABLE_MLX_RUNTIME_TESTS=1, like the other Nemotron MLX tests).
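
The feed‑granularity invariance property can be illustrated on a toy stand-in that has no model at all. Everything here is hypothetical (ToySession, run, the window sum used as a "feature", and the offline frame count of 1 + bufferLen / hop); it only mirrors the structure of the check, where step emits whole chunks of frozen frames and finish flushes the zero-padded tail.

```swift
// Toy stand-in for the session: one "feature" per centered frame (the window sum).
struct ToySession {
    let hop = 4, nFft = 8, framesPerChunk = 3   // illustrative values only
    var buffer: [Float] = []
    var emittedFrames = 0

    // Sum over [m*hop - nFft/2, m*hop + nFft/2); samples outside the buffer count as zero.
    func feature(_ m: Int) -> Float {
        let start = m * hop - nFft / 2
        var sum: Float = 0
        for i in start..<(start + nFft) where i >= 0 && i < buffer.count {
            sum += buffer[i]
        }
        return sum
    }

    func frozenFrames() -> Int {
        let half = nFft / 2
        return buffer.count >= half ? (buffer.count - half) / hop + 1 : 0
    }

    // Emit features only for whole chunks made of frozen frames.
    mutating func step(_ samples: [Float]) -> [Float] {
        buffer += samples
        let target = (frozenFrames() / framesPerChunk) * framesPerChunk
        guard target > emittedFrames else { return [] }
        let out = (emittedFrames..<target).map { feature($0) }
        emittedFrames = target
        return out
    }

    // Flush the remaining frames; the right edge is implicitly zero-padded.
    mutating func finish() -> [Float] {
        let total = buffer.isEmpty ? 0 : 1 + buffer.count / hop  // assumed offline frame count
        guard total > emittedFrames else { return [] }
        let out = (emittedFrames..<total).map { feature($0) }
        emittedFrames = total
        return out
    }
}

// Feed `audio` in pieces of `feed` samples and collect everything emitted.
func run(_ audio: [Float], feed: Int) -> [Float] {
    var session = ToySession()
    var out: [Float] = []
    var i = 0
    while i < audio.count {
        let end = min(i + feed, audio.count)
        out += session.step(Array(audio[i..<end]))
        i = end
    }
    return out + session.finish()
}

let audio = (0..<103).map { Float($0 % 7) }
let whole = run(audio, feed: audio.count)
let small = run(audio, feed: 5)
let large = run(audio, feed: 30)
print(whole == small && whole == large)  // true
```

The invariance holds because a frame is emitted from step only once its whole window is inside the buffer, so its value cannot depend on how the samples were split across calls.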

Files to review (5)

Sources/MLXAudioSTT/Models/NemotronASR/NemotronASRStreamSession.swift — new session + shared RNN‑T decode + makeStreamSession
Sources/MLXAudioSTT/Models/NemotronASR/NemotronASRStreaming.swift — resumable streamEncodeChunks (state lifted out of the loop)
Sources/MLXAudioSTT/Models/NemotronASR/NemotronASRModel.swift — generateStream now reuses the shared loops
Sources/Tools/mlx-audio-swift-stt/App.swift — --stream drives the session for Nemotron (mirrors Voxtral)
Tests/MLXAudioSTTTests.swift — parity + granularity‑invariance tests

⚠️ Branch base note

This branch is based on upstream/main (which has the NemotronASR base from Blaizzy#195–Blaizzy#198). Your fork's main predates that, so merging here also carries the upstream additions your main lacks (Whisper/Kokoro/Irodori/Nemotron). For a clean session‑only diff, Sync fork → upstream first, then this PR re‑computes to just the 5 files above. Your fork‑only Voxtral incremental‑streaming work is untouched by this branch.

Note: the dev parity/TTFT exe (nemo-parity) is kept off this PR (absolute‑path manifest); it lives on local branch dev/nemo-parity-harness.

generateStream computes the whole-utterance mel up front, so a live caller only gets text after the entire buffer is in. NemotronASRStreamSession ingests audio as it arrives (step([Float]) -> Delta / finish()) and emits text per frozen chunk with the model's native delay.

Bit-identical to generateStream(wholeAudio): normalize==NA makes mel frames independent; the centered STFT freezes frame m once m*hop+nFft/2<=bufferLen, so the session only feeds the encoder frozen whole chunks (tail flushed in finish()).

Encoder + greedy RNN-T state moved into NemotronASRStreamEncoderState / NemotronASRStreamRNNTState; the chunk loop and the RNN-T decode are now shared by generateStream and the session (SSOT, no duplicated loops). CLI --stream drives the session for Nemotron (mirrors Voxtral).

Tests assert session(chunked) == generateStream(whole) and feed-granularity invariance on a tiny synthetic model. Verified on mlx-community/nemotron-3.5-asr-streaming-0.6b-8bit (FLEURS-ru): 31/31 session==generateStream@{80ms,480ms feed}; TTFT 2.49s -> 0.09s on a 47s clip.

beshkenadze · 2026-06-14T13:35:41Z

Superseded by #21 (clean branch off current main, App.swift demo factored out).

beshkenadze force-pushed the pr/nemotron-stream-session branch from cb488bd to d506509 Compare June 13, 2026 10:26

beshkenadze mentioned this pull request Jun 14, 2026

feat(nemotron_asr): incremental NemotronASRStreamSession for live mic #21

Closed

beshkenadze closed this Jun 14, 2026

beshkenadze wants to merge 1 commit into main from pr/nemotron-stream-session
