
feat(nemotron_asr): incremental NemotronASRStreamSession for live mic #21

Closed
beshkenadze wants to merge 3 commits into main from feat/nemotron-asr-stream-session

Conversation


@beshkenadze beshkenadze commented Jun 14, 2026

Owner

What

Adds NemotronASRStreamSession — a true online streaming session for Nemotron 3.5 ASR.

The existing generateStream computes the whole-utterance mel up front, so a live caller only receives text once the entire buffer is available. NemotronASRStreamSession ingests audio incrementally as it arrives:

  • step(_ samples: [Float]) -> Delta — feed a chunk, get the newly decoded text
  • finish() -> Delta — flush the tail

This enables low-latency live transcription (e.g. microphone input) without waiting for the full utterance.
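The step/finish call pattern described above can be sketched as follows. This is only an illustration of the calling convention: `ToySession`, its `blockSize` parameter, the `StreamSession` protocol, and the `transcribe` helper are hypothetical stand-ins, and the `Delta` here carries just a `text` field, which may not match the real type. The toy "decoder" emits one `#` per complete block of samples and a `.` for a flushed tail.

```swift
// Hypothetical stand-ins; not the real NemotronASRStreamSession or its Delta.
struct Delta: Equatable {
    var text: String
}

protocol StreamSession {
    mutating func step(_ samples: [Float]) -> Delta
    mutating func finish() -> Delta
}

/// Toy session: "decodes" one "#" per complete block of `blockSize` samples.
struct ToySession: StreamSession {
    let blockSize: Int
    private var pending: [Float] = []

    init(blockSize: Int) {
        precondition(blockSize > 0, "blockSize must be positive")
        self.blockSize = blockSize
    }

    /// Buffers the new samples and returns text only for blocks that are now complete.
    mutating func step(_ samples: [Float]) -> Delta {
        pending.append(contentsOf: samples)
        let complete = pending.count / blockSize
        pending.removeFirst(complete * blockSize)
        return Delta(text: String(repeating: "#", count: complete))
    }

    /// Flushes whatever partial block is left at end of input.
    mutating func finish() -> Delta {
        let hadTail = !pending.isEmpty
        pending.removeAll()
        return Delta(text: hadTail ? "." : "")
    }
}

/// Feeds `audio` in pieces of `feedSize` samples and concatenates the deltas.
func transcribe(_ audio: [Float], feedSize: Int, blockSize: Int) -> String {
    precondition(feedSize > 0, "feedSize must be positive")
    var session = ToySession(blockSize: blockSize)
    var text = ""
    var start = 0
    while start < audio.count {
        let end = min(start + feedSize, audio.count)
        text += session.step(Array(audio[start..<end])).text
        start = end
    }
    text += session.finish().text
    return text
}
```

A live caller would replace the `while` loop with whatever delivers microphone buffers, calling `step` per buffer and `finish` once when capture stops.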

Changes (4 files)

  • NemotronASRStreamSession.swift (new) — the online streaming session
  • NemotronASRModel.swift, NemotronASRStreaming.swift — extract the streaming primitives so they can be driven incrementally; no change to the existing offline / generateStream behaviour
  • Tests/MLXAudioSTTTests.swift — coverage for the session

Validation

  • swift build passes against current main.
  • Functional: feeding a ~1 min English clip chunk-by-chunk through step/finish yields a transcript consistent with the offline path.

@beshkenadze beshkenadze force-pushed the feat/nemotron-asr-stream-session branch 2 times, most recently from 276e027 to 3c8841a Compare June 14, 2026 13:35
@beshkenadze beshkenadze force-pushed the feat/nemotron-asr-stream-session branch from 3c8841a to d092360 Compare June 14, 2026 18:12
beshkenadze and others added 3 commits June 14, 2026 12:02
…zy#205)

VoxtralRealtimeModel ran the "Realtime" model offline: generateStream
consumed the whole buffer up front, then only yielded the finished
transcript. Add a genuine online path.

- VoxtralRealtimeStreamSession: stateful step(samples)/finish() that
  ingests audio as it arrives, feeds only newly-frozen conv frames
  through the transformer encoder with a persistent per-layer KV-cache
  (O(1) per chunk), maintains the decoder KV-cache, and emits tokens at
  the model's native transcription delay.
- Encoder: encodeIncremental + sliding-window-aligned block feeding with
  a cache reset at each boundary. RoPE is relative-position invariant, so
  this is bit-exact to encodeFull (<=sw) and encodeChunked (>sw) in one
  path. Refactor a convStemForAudio/encodeAudio seam shared with offline.
- mlx-audio-swift-stt: route Voxtral --stream through the session (feed
  480 ms chunks, print deltas live).
- Fixture-based parity test (compile-guarded; swift test can't load the
  MLX metallib, so MLX runtime verification goes through an executable).

Validated (4-bit, mem-capped 18 GB): WER 0.0000 vs offline on
intention (1.5 s) and conversational_a (13.26 s); the 13 s clip runs at
RTF 0.649 with median 0.285 s/chunk (budget 0.480 s) — steady-state and
end-to-end realtime.

(cherry picked from commit 9b46bca)
Co-authored-by: Lucas Newman <lucas@future.fit>
generateStream computes the whole-utterance mel up front, so a live caller only
gets text after the entire buffer is in. NemotronASRStreamSession ingests audio
as it arrives (step([Float]) -> Delta / finish()) and emits text per frozen chunk
with the model's native delay.

Bit-identical to generateStream(wholeAudio): normalize==NA makes mel frames
independent; the centered STFT freezes frame m once m*hop+nFft/2<=bufferLen, so
the session only feeds the encoder frozen whole chunks (tail flushed in finish()).
Encoder + greedy RNN-T state moved into NemotronASRStreamEncoderState /
NemotronASRStreamRNNTState; the chunk loop and the RNN-T decode are now shared by
generateStream and the session (SSOT, no duplicated loops).
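The freeze condition quoted above, m*hop+nFft/2<=bufferLen, can be turned into a frame count directly. The sketch below is an illustration of that arithmetic only; the function name `frozenFrameCount` is hypothetical, and the hop and FFT sizes used in the usage lines are assumed example values, not figures taken from this PR.

```swift
/// Number of centered-STFT frames whose right window edge already lies inside
/// the buffer, i.e. the count of m >= 0 with m * hop + nFft / 2 <= bufferLen.
/// Hypothetical helper; parameter names mirror the symbols in the commit message.
func frozenFrameCount(bufferLen: Int, hop: Int, nFft: Int) -> Int {
    precondition(hop > 0 && nFft > 0, "hop and nFft must be positive")
    let half = nFft / 2
    // Frame 0 needs nFft / 2 real samples to its right before it is final.
    guard bufferLen >= half else { return 0 }
    // Largest m with m * hop <= bufferLen - half, plus one for frame 0.
    return (bufferLen - half) / hop + 1
}

// Assumed example values: hop = 160, nFft = 512.
let afterFirstFeed = frozenFrameCount(bufferLen: 1280, hop: 160, nFft: 512)
let afterSecondFeed = frozenFrameCount(bufferLen: 2560, hop: 160, nFft: 512)
// Only frames in afterFirstFeed..<afterSecondFeed are new after the second feed.
let newlyFrozen = afterSecondFeed - afterFirstFeed
```

Because the count depends only on the total buffered length, it grows monotonically as audio arrives, so a session can track how many frames it has already consumed and pass on only the difference.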

CLI --stream drives the session for Nemotron (mirrors Voxtral). Tests assert
session(chunked) == generateStream(whole) and feed-granularity invariance on a
tiny synthetic model.
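The feed-granularity invariance mentioned above can be illustrated with a toy check. This is a sketch of the testing idea, not the PR's actual test: `makeToySession`, `feeds`, and `transcript` are hypothetical helpers, and the toy "model" just emits `1` or `0` per fixed-size frame depending on the sign of the frame's sum.

```swift
/// Splits `samples` into consecutive feeds of at most `size` elements.
func feeds(of samples: [Float], size: Int) -> [[Float]] {
    precondition(size > 0, "size must be positive")
    return stride(from: 0, to: samples.count, by: size).map {
        Array(samples[$0..<min($0 + size, samples.count)])
    }
}

/// Toy streaming "decoder" (hypothetical): one character per `frame` samples.
func makeToySession(frame: Int) -> (step: ([Float]) -> String, finish: () -> String) {
    precondition(frame > 0, "frame must be positive")
    var pending: [Float] = []

    // Consumes complete frames; with `flush` it also consumes a partial tail.
    func drain(flush: Bool) -> String {
        var out = ""
        while pending.count >= frame || (flush && !pending.isEmpty) {
            let n = min(frame, pending.count)
            let sum = pending.prefix(n).reduce(0, +)
            out += sum > 0 ? "1" : "0"
            pending.removeFirst(n)
        }
        return out
    }

    return (
        step: { samples in
            pending.append(contentsOf: samples)
            return drain(flush: false)
        },
        finish: { drain(flush: true) }
    )
}

/// Runs one full pass at the given feed size and concatenates all deltas.
func transcript(_ samples: [Float], feedSize: Int, frame: Int) -> String {
    let session = makeToySession(frame: frame)
    var text = ""
    for feed in feeds(of: samples, size: feedSize) {
        text += session.step(feed)
    }
    return text + session.finish()
}

let samples: [Float] = [1, 1, 1, 1, -1, -1, -1, -1, 1, 1]
let whole = transcript(samples, feedSize: samples.count, frame: 4)
// Every feed size should reproduce the whole-buffer result.
let invariant = [1, 3, 4, 7, 100].allSatisfy {
    transcript(samples, feedSize: $0, frame: 4) == whole
}
```

The property holds here because output is only ever produced from complete frames (or the final flush), which is the same reason the session restricts the encoder to frozen whole chunks.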

Verified on mlx-community/nemotron-3.5-asr-streaming-0.6b-8bit (FLEURS-ru): 31/31
session==generateStream@{80ms,480ms feed}; TTFT 2.49s -> 0.09s on a 47s clip.
@beshkenadze beshkenadze force-pushed the feat/nemotron-asr-stream-session branch from d092360 to 3db113b Compare June 14, 2026 21:14
@beshkenadze

Owner Author

Opened upstream as Blaizzy#208.


2 participants