feat(nemotron_asr): incremental NemotronASRStreamSession for live mic by beshkenadze · Pull Request #21 · beshkenadze/mlx-audio-swift

beshkenadze · 2026-06-14T12:52:41Z

What

Adds NemotronASRStreamSession — a true online streaming session for Nemotron 3.5 ASR.

The existing generateStream computes the whole-utterance mel up front, so a live caller only receives text once the entire buffer is available. NemotronASRStreamSession ingests audio incrementally as it arrives:

step(_ samples: [Float]) -> Delta — feed a chunk, get the newly decoded text
finish() -> Delta — flush the tail

This enables low-latency live transcription (e.g. microphone input) without waiting for the full utterance.
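The calling pattern can be sketched as follows. This is a minimal illustration, not the PR's implementation: `MockStreamSession`, the `Delta` struct, and its `text` field are hypothetical stand-ins, and the "decoder" simply emits one canned word per 4 samples so the flow is observable. Only the step/finish shape is taken from the description above.

```swift
// Hypothetical stand-in for the session's result type; the real Delta may differ.
struct Delta {
    var text: String
}

// Hypothetical mock with the step/finish shape described in the PR.
// It emits one canned word per 4 buffered samples; no real ASR happens here.
struct MockStreamSession {
    var pending: [Float] = []
    var wordIndex = 0
    let words = ["hello", "live", "mic"]

    // Feed a chunk and return only the text that became available.
    mutating func step(_ samples: [Float]) -> Delta {
        pending.append(contentsOf: samples)
        var out: [String] = []
        while pending.count >= 4, wordIndex < words.count {
            pending.removeFirst(4)
            out.append(words[wordIndex])
            wordIndex += 1
        }
        return Delta(text: out.joined(separator: " "))
    }

    // Flush the tail, even if it is shorter than a full chunk.
    mutating func finish() -> Delta {
        defer { pending.removeAll() }
        guard !pending.isEmpty, wordIndex < words.count else {
            return Delta(text: "")
        }
        let word = words[wordIndex]
        wordIndex += 1
        return Delta(text: word)
    }
}

// Drive the session chunk by chunk and collect the deltas.
func transcribe(chunks: [[Float]]) -> String {
    var session = MockStreamSession()
    var pieces: [String] = []
    for chunk in chunks {
        let delta = session.step(chunk)
        if !delta.text.isEmpty { pieces.append(delta.text) }
    }
    let tail = session.finish()
    if !tail.text.isEmpty { pieces.append(tail.text) }
    return pieces.joined(separator: " ")
}
```

In this mock, feeding the same 11 samples as one chunk or as three uneven chunks produces the same final string, which is the kind of feed-granularity invariance a streaming session would be expected to preserve.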

Changes (4 files)

NemotronASRStreamSession.swift (new) — the online streaming session
NemotronASRModel.swift, NemotronASRStreaming.swift — extract the streaming primitives so they can be driven incrementally; no change to the existing offline / generateStream behaviour
Tests/MLXAudioSTTTests.swift — coverage for the session

Validation

swift build passes against current main.
Functional: feeding a ~1 min English clip chunk-by-chunk through step/finish yields a transcript consistent with the offline path.

…zy#205)

VoxtralRealtimeModel ran the "Realtime" model offline: generateStream consumed the whole buffer up front, then only yielded the finished transcript. Add a genuine online path.

- VoxtralRealtimeStreamSession: stateful step(samples)/finish() that ingests audio as it arrives, feeds only newly-frozen conv frames through the transformer encoder with a persistent per-layer KV-cache (O(1) per chunk), maintains the decoder KV-cache, and emits tokens at the model's native transcription delay.
- Encoder: encodeIncremental + sliding-window-aligned block feeding with a cache reset at each boundary. RoPE is relative-position invariant, so this is bit-exact to encodeFull (<=sw) and encodeChunked (>sw) in one path. Refactor a convStemForAudio/encodeAudio seam shared with offline.
- mlx-audio-swift-stt: route Voxtral --stream through the session (feed 480 ms chunks, print deltas live).
- Fixture-based parity test (compile-guarded; swift test can't load the MLX metallib, so MLX runtime verification goes through an executable).

Validated (4-bit, mem-capped 18 GB): WER 0.0000 vs offline on intention (1.5 s) and conversational_a (13.26 s); the 13 s clip runs at RTF 0.649 with median 0.285 s/chunk (budget 0.480 s), i.e. steady-state and end-to-end realtime.

(cherry picked from commit 9b46bca)
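To make the 480 ms feed size and the per-chunk budget concrete, here is a small sketch of the arithmetic. The 16 kHz sample rate is an assumption (the commit message does not state it), and the helper names are hypothetical rather than part of the CLI.

```swift
// Number of samples in one feed chunk.
// Assumption: the caller supplies the sample rate; 16 kHz is used in the
// usage note below only as an illustrative value.
func samplesPerChunk(sampleRate: Int, chunkMilliseconds: Int) -> Int {
    return sampleRate * chunkMilliseconds / 1000
}

// Split a buffer into fixed-size chunks; the last chunk may be shorter.
func splitIntoChunks(_ samples: [Float], chunkSize: Int) -> [[Float]] {
    precondition(chunkSize > 0, "chunkSize must be positive")
    return stride(from: 0, to: samples.count, by: chunkSize).map { start in
        Array(samples[start..<min(start + chunkSize, samples.count)])
    }
}

// Real-time factor: processing time divided by audio duration.
// Values below 1.0 mean the pipeline keeps up with the incoming audio.
func realTimeFactor(processingSeconds: Double, audioSeconds: Double) -> Double {
    return processingSeconds / audioSeconds
}
```

Under the 16 kHz assumption, `samplesPerChunk(sampleRate: 16_000, chunkMilliseconds: 480)` is 7680 samples. Applying `realTimeFactor` to the reported median of 0.285 s of work per 0.480 s chunk gives roughly 0.59, which is below 1.0 and consistent with the steady-state realtime claim.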

Co-authored-by: Lucas Newman <lucas@future.fit>

generateStream computes the whole-utterance mel up front, so a live caller only gets text after the entire buffer is in. NemotronASRStreamSession ingests audio as it arrives (step([Float]) -> Delta / finish()) and emits text per frozen chunk with the model's native delay.

Bit-identical to generateStream(wholeAudio): normalize==NA makes mel frames independent; the centered STFT freezes frame m once m*hop+nFft/2<=bufferLen, so the session only feeds the encoder frozen whole chunks (tail flushed in finish()).

Encoder + greedy RNN-T state moved into NemotronASRStreamEncoderState / NemotronASRStreamRNNTState; the chunk loop and the RNN-T decode are now shared by generateStream and the session (SSOT, no duplicated loops). CLI --stream drives the session for Nemotron (mirrors Voxtral).

Tests assert session(chunked) == generateStream(whole) and feed-granularity invariance on a tiny synthetic model.

Verified on mlx-community/nemotron-3.5-asr-streaming-0.6b-8bit (FLEURS-ru): 31/31 session==generateStream@{80ms,480ms feed}; TTFT 2.49s -> 0.09s on a 47s clip.
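The freezing rule stated in this commit message (frame m is frozen once m*hop + nFft/2 <= bufferLen) can be turned into a frame count. The sketch below is an illustration under stated assumptions, not the session's code: the function names are hypothetical, and `framesPerChunk` plus the example values nFft = 512 and hop = 160 are placeholders, not the model's actual configuration.

```swift
// Count the frames m >= 0 that satisfy m * hop + nFft / 2 <= bufferLen.
// A frozen frame can no longer change when more audio arrives.
func frozenFrameCount(bufferLen: Int, hop: Int, nFft: Int) -> Int {
    precondition(hop > 0 && nFft > 0, "hop and nFft must be positive")
    let available = bufferLen - nFft / 2
    // Not even frame 0 is frozen until half a window of audio has arrived.
    guard available >= 0 else { return 0 }
    return available / hop + 1
}

// Only whole chunks of frozen frames go to the encoder; the remainder
// waits for more audio (or for the final flush).
// `framesPerChunk` is a hypothetical parameter for this sketch.
func framesReadyForEncoder(frozenFrames: Int, framesPerChunk: Int) -> Int {
    precondition(framesPerChunk > 0, "framesPerChunk must be positive")
    return (frozenFrames / framesPerChunk) * framesPerChunk
}
```

With the placeholder values nFft = 512 and hop = 160, a 1600-sample buffer freezes 9 frames (frame 8 ends at 1280 + 256 = 1536 <= 1600, while frame 9 would need 1696 samples). If a chunk were 4 frames, 8 of those 9 frames would be ready for the encoder and 1 would be held back.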

beshkenadze · 2026-06-14T21:14:54Z

Opened upstream as Blaizzy#208.

beshkenadze force-pushed the feat/nemotron-asr-stream-session branch 2 times, most recently from 276e027 to 3c8841a on June 14, 2026, 13:35

beshkenadze mentioned this pull request Jun 14, 2026

feat(nemotron_asr): incremental NemotronASRStreamSession for live mic #20

Closed

beshkenadze force-pushed the feat/nemotron-asr-stream-session branch from 3c8841a to d092360 on June 14, 2026, 18:12

beshkenadze and others added 3 commits June 14, 2026 12:02

Fix Qwen3-TTS CustomVoice voice parsing (Blaizzy#186)

3f6b055

Co-authored-by: Lucas Newman <lucas@future.fit>

beshkenadze force-pushed the feat/nemotron-asr-stream-session branch from d092360 to 3db113b on June 14, 2026, 21:14

beshkenadze closed this Jun 14, 2026

beshkenadze wants to merge 3 commits into main from feat/nemotron-asr-stream-session

2 participants