[speechlm2] Add streaming inference pipeline for NemotronVoiceChat #15571
erastorgueva-nv wants to merge 61 commits into NVIDIA-NeMo:main
Conversation
Commits:

- …model.py modification for function_head
- …with patches
- …le, optional torch.compile & subword cache
- …g - adjusted infer_one_step code so operations will match offline
- …nce wrapper loading
- …s which will be ignored anyway
- …_history_size parameter
- …StepResult etc
- …ogit comparison
- …ep, add docs
- … for parity
- …tic parity
- …kens_to_str_raw
- …ering

erastorgueva-nv force-pushed from 150aab1 to 81a752e
pzelasko left a comment:
Finalized my first review pass :)
```python
system_prompt: str | None = None
```

```python
top_p: float | None = None  # (0, 1]
```
Is it possible to support different top_p/temperature/repetition_penalty in different examples in a batch without using a for loop over them?
Is there a strong motivation to support that? Or could we expect the user to define session-level decoding parameters and have easier batching?
Yes, I think it will be possible; we can implement it once we support B > 1.
I think "session-level decoding parameters" would introduce an uncomfortable constraint - I would rather let each stream have its own sampling parameters. It should be vectorizable along the lines of the sketch below.
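Not the PR's code - just a minimal sketch of how per-stream sampling could be vectorized once B > 1 lands: `temperature` and `top_p` become per-example tensors applied to the whole batch of logits at once (names and shapes here are illustrative).

```python
import torch

def sample_batched(logits: torch.Tensor, temperature: torch.Tensor, top_p: torch.Tensor) -> torch.Tensor:
    """logits: [B, V]; temperature, top_p: [B], one value per stream (temperature > 0)."""
    # Per-example temperature via a broadcast divide - no loop over the batch.
    probs = torch.softmax(logits / temperature.unsqueeze(-1), dim=-1)
    # Per-example nucleus (top-p) filtering, computed for all examples at once.
    sorted_probs, sorted_idx = torch.sort(probs, dim=-1, descending=True)
    cum = torch.cumsum(sorted_probs, dim=-1)
    # Keep each token whose preceding cumulative mass is below that example's top_p;
    # the highest-probability token is always kept.
    keep = (cum - sorted_probs) < top_p.unsqueeze(-1)
    keep[:, 0] = True
    sorted_probs = sorted_probs * keep
    sorted_probs = sorted_probs / sorted_probs.sum(dim=-1, keepdim=True)
    choice = torch.multinomial(sorted_probs, num_samples=1)      # [B, 1]
    return sorted_idx.gather(-1, choice).squeeze(-1)             # [B]

# e.g. two streams with different settings, sampled in one call:
tokens = sample_batched(torch.randn(2, 32000), torch.tensor([0.7, 1.0]), torch.tensor([0.9, 0.95]))
```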
```
pretrained_weights: Whether to load pretrained weights (True) or random init (False)
dtype: Data type for the model
trust_remote_code: Whether to trust remote code when loading model (needed for some models like Nemotron)
use_meta_device: If True, create the model on the meta device (no memory allocation).
```
Is this compatible with transformers v5 or v4? Or both? We already bumped NeMo to v5.
The meta device initialization is handled by PyTorch's torch.device('meta') context manager, which is independent of transformers. The transformers APIs used here (AutoConfig.from_pretrained, AutoModelForCausalLM.from_config/from_pretrained) are stable across v4 and v5.
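A minimal sketch of that pattern - the meta allocation is pure PyTorch, so it behaves the same under transformers v4 and v5 (TinyLlama is used here only because it appears in the tests below):

```python
import torch
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("TinyLlama/TinyLlama_v1.1")
with torch.device("meta"):
    # Parameters are created on the meta device: shapes/dtypes only,
    # no memory is allocated and no weights are materialized.
    model = AutoModelForCausalLM.from_config(config)
assert next(model.parameters()).is_meta
```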
```python
from nemo.collections.speechlm2.models import NemotronVoiceChat

_pretrained_llm = "TinyLlama/TinyLlama_v1.1"
```
Since so much logic is dedicated to Nemotron v2 in VoiceChat, shouldn't we test against Nemotron v2 as well? Or are you concerned it will take a very long time to load in CI?
Exactly - I used TinyLlama to keep model loading relatively quick.
Note that some tests in this PR use the full VoiceChat checkpoint. They are only run if you specify the path to a local checkpoint. Once the checkpoint is public, I plan to make the tests use that public checkpoint, so they can easily run on CI, not just locally. I believe that will satisfy the requirement of making sure we do test with Nemotron v2.
OK makes sense. If you want to add proper regression/functional tests, follow the pattern in tests/functional_tests and also please add a nightly E2E test in tests/e2e_tests
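A hedged sketch of the gating pattern described above - the heavy test runs only when a local checkpoint path is supplied (the environment variable name and the `from_pretrained` call shape are assumptions, not the PR's actual code):

```python
import os
import pytest

CHECKPOINT = os.environ.get("NEMOTRON_VOICECHAT_CHECKPOINT")  # hypothetical variable name

@pytest.mark.skipif(CHECKPOINT is None, reason="full VoiceChat checkpoint not available")
def test_streaming_matches_offline():
    from nemo.collections.speechlm2.models import NemotronVoiceChat
    # Call shape assumed; the changelog only says from_pretrained supports
    # loading HF-format checkpoints with llm_artifacts/.
    model = NemotronVoiceChat.from_pretrained(CHECKPOINT)
    ...
```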
Commits:

- …e classes
- … logs to debug
- …s; test padding
- …ake ctm token-level
- …neOutput, move token finalization into output object
- …ug print
- … channel from inference
- …red between duplexstt and inference pipeline model wrapper; add tests
pzelasko left a comment:
Thanks, great work, it's greatly improved!
Left a few comments; my main concern remains the hardcoded batch_size==1 asserts, while the entire pipeline looks capable of processing batch_size>1, at least for offline evaluation (online inference is a separate story).
```
(With ``s2s.decode_audio=false``, the model still predicts text/ASR tokens but
skips EarTTS generation and codec decoding.)

Quick Start
```
Quick Start should live at the top of this page
```
- (required)
- Sliding-window size passed to the perception encoder.
* - ``batch_size``
  - ``1``
```
flagging - is this still true?
```
- ``8192``
- Maximum number of frames per stream.

Padding Settings (top-level)
```
It should explain that this isn't the typical padding used for batching, but extra silence following the user's recording in order to allow the model to generate a response.
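To illustrate the distinction (names are illustrative, not the PR's config keys): the padding here is trailing silence appended after the user's audio so the model has frames left in which to respond.

```python
import torch

def pad_with_response_silence(audio: torch.Tensor, sample_rate: int, padding_secs: float) -> torch.Tensor:
    """audio: [T] mono waveform; appends padding_secs of silence for the model's reply."""
    silence = torch.zeros(int(padding_secs * sample_rate), dtype=audio.dtype)
    return torch.cat([audio, silence])
```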
```
@@ -0,0 +1,226 @@
# Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
```
why is it called model.py and not llm.py like in vllm/ directory?
```python
# With cache: we get exactly num_frames_per_chunk output frames
base_frame_index = 0
else:
# Without cache: use the second-to-last encoded frame as the
```
Can you explain where the 10ms is coming from? I thought cache-aware has only "regular" chunks of 80ms
```python
self.output_sample_rate = getattr(self.streaming_cfg, "output_sample_rate", 22050)
self.batch_size = getattr(self.streaming_cfg, "batch_size", 1)
self.max_len = getattr(self.streaming_cfg, "max_len", 8192)
if self.batch_size != 1:
```
Is this still a limitation? The code seems capable of bs>1 inference
```
@@ -0,0 +1,278 @@
# Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
```
In that case, could we split this up into a library module (here) and a script (the best place seems to be examples/speechlm2/voicechat/to_vllm.py)?
This repo doesn't follow the scripts-as-importable-modules-living-in-the-library pattern (I'm sure the pattern has a proper name). Something along the lines of the sketch below.
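A sketch of the suggested layout, assuming a hypothetical `export_to_vllm` helper: the conversion logic lives in an importable module, and the example script is only a thin argparse wrapper around it.

```python
# nemo/collections/speechlm2/... (hypothetical library module)
def export_to_vllm(checkpoint_path: str, output_dir: str) -> None:
    """Convert a VoiceChat checkpoint into a vLLM-loadable layout."""
    ...

# examples/speechlm2/voicechat/to_vllm.py (thin CLI wrapper)
if __name__ == "__main__":
    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument("checkpoint_path")
    parser.add_argument("output_dir")
    args = parser.parse_args()
    export_to_vllm(args.checkpoint_path, args.output_dir)
```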
```
@@ -0,0 +1,244 @@
# Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
```
Can we split up this script in the same way as the one above (ideally making to_vllm.py just work with both types of checkpoints)?
```python
# decoding.
# ---------------------------------------------------------------------------

SECONDS_PER_FRAME = 0.08
```
This seems like a dangerous constant to hardcode - can't we pass model.token_equivalent_duration directly? (Not sure if this property has been exposed on the VoiceChat class, but it's in AudioPerceptionModule and some SALM and S2S classes IIRC.)
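A possible shape for that fix, assuming the property is reachable via the perception module (attribute path unverified): read the frame duration from the model, keeping today's 80 ms value only as a fallback.

```python
# Fall back to the hardcoded 80 ms value only when the model doesn't expose it.
seconds_per_frame = getattr(
    model.perception, "token_equivalent_duration", 0.08  # attribute path assumed
)
```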
Important

The Update branch button must only be pressed on very rare occasions. An outdated branch is never blocking the merge of a PR.
Please reach out to the automation team before pressing that button.
What does this PR do?
Add a streaming (real-time, chunk-by-chunk) inference pipeline for NemotronVoiceChat,
following the same architecture as the NeMo ASR Inference Pipelines.
Collection: speechlm2
Changelog
- `StreamingS2SPipeline` with `generate_step()` API for both batch file processing and server integration
- `NemotronVoicechatInferenceWrapper` with `infer_one_step()` implementing perception → LLM → TTS → codec decode
- `S2SPipelineBuilder` factory and Hydra config (`s2s_streaming.yaml`) for easy setup
- `S2SContextManager` for decode state lifecycle, `S2SStreamingState` for output accumulation
- `s2s_streaming_infer.py` entry script for batch inference on files/manifests
- `DuplexSTTModel`: KV cache support for Nemotron hybrid Mamba/Attention (with monkey-patches for upstream HF bugs), `save_pretrained` with tokenizer export, function head, ASR logit boosts, `cache_position` forwarding
- `conftest.py` fixtures, offline-vs-streaming parity test, no-crash config sweep
- `streaming_inference.rst` with architecture, config reference, and server integration guide

Modifications to more general code - FYI @kevinhu-nv @Edresson
- `NemotronVoiceChat`: `from_pretrained` supports loading from an HF-format checkpoint with `llm_artifacts/`
- `EarTTSModel`: vectorized depth-sum (illustrated below), precomputed RVQ schedule, optional `torch.compile`, subword cache
- The `_patch_nemotron_cache_bugs` and `_patch_nemotron_block_forward` methods in `DuplexSTTModel` patch bugs in the HF Nemotron model code so we can get the KV caching to work. The patches seem to work for me, though I wonder if we can use more up-to-date code that doesn't need the patches.
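As an illustration of the vectorized depth-sum mentioned above (tensor names and shapes are assumptions, not `EarTTSModel`'s actual internals): the per-level RVQ codebook embeddings are gathered and summed over the depth axis in one batched op instead of a Python loop.

```python
import torch

B, T, D, H, V = 2, 10, 8, 256, 1024        # batch, time, RVQ depth, hidden, vocab
codes = torch.randint(0, V, (B, T, D))     # one code per depth level per frame
codebooks = torch.randn(D, V, H)           # a separate embedding table per level

# Gather each level's embedding, then sum over the depth axis in one shot:
# depth_sum[b, t] = sum_d codebooks[d, codes[b, t, d]]
embedded = codebooks[torch.arange(D), codes]   # [B, T, D, H] via advanced indexing
depth_sum = embedded.sum(dim=2)                # [B, T, H]
```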
Usage

```bash
python examples/speechlm2/nemo_inference_pipelines/s2s_streaming_infer.py \
    audio_file=/path/to/audio.wav \
    s2s.model_path=/path/to/checkpoint \
    s2s.speaker_name="<name>" \
    s2s.engine_type="native" \
    streaming.chunk_size_in_secs=0.08 \
    streaming.buffer_size_in_secs=1.68
```
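For server integration, the flow could look roughly like this sketch - the exact constructor, `generate_step()` signature, and result fields are assumptions; `streaming_inference.rst` in this PR documents the real API.

```python
import torch

SAMPLE_RATE = 16000                          # assumed input rate
CHUNK = int(0.08 * SAMPLE_RATE)              # 80 ms, matching streaming.chunk_size_in_secs

audio = torch.zeros(SAMPLE_RATE * 4)         # stand-in for 4 s of live input
for start in range(0, audio.numel(), CHUNK):
    chunk = audio[start : start + CHUNK]
    result = pipeline.generate_step(chunk)   # `pipeline`: a StreamingS2SPipeline; call shape assumed
    if result.text:                          # hypothetical incremental-output fields
        send_to_client(result.text)          # `send_to_client`: your server's transport
    if result.audio is not None:
        send_to_client(result.audio)
```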
GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.
The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI, remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".
Before your PR is "Ready for review"
Pre checks:
PR Type:
If you haven't finished some of the above items, you can still open a "Draft" PR.
Who can review?
Anyone in the community is free to review the PR once the checks have passed.
The contributor guidelines list specific people who can review PRs to various areas.
Additional Information