[speechlm2] Add streaming inference pipeline for NemotronVoiceChat #15571
erastorgueva-nv wants to merge 61 commits into NVIDIA-NeMo:main
Conversation
Commits:

- …model.py modification for function_head
- …with patches
- …le, optional torch.compile & subword cache
- …g - adjusted infer_one_step code so operations will match offline
- …nce wrapper loading
- …s which will be ignored anyway
- …_history_size parameter
- …StepResult etc
- …ogit comparison
- …ep, add docs
- … for parity
- …tic parity
- …kens_to_str_raw
- …ering

erastorgueva-nv force-pushed from 150aab1 to 81a752e
pzelasko left a comment:
Finalized my first review pass :)
```python
system_prompt: str | None = None
```

```python
top_p: float | None = None  # (0, 1]
```
Is it possible to support different top_p/temperature/repetition_penalty in different examples in a batch without using a for loop over them?
Is there a strong motivation to support that? Or could we expect the user to define session-level decoding parameters and have easier batching?
Yes, I think it will be possible; we can implement it once we support B > 1.
I think "session-level decoding parameters" would introduce an uncomfortable constraint - I would rather let each stream have its own sampling parameters. It should be vectorizable along the lines of the sketch below.
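Not the PR's code - just a minimal sketch of how per-stream sampling could be vectorized once B > 1 lands: `temperature` and `top_p` become per-example tensors applied to the whole batch of logits at once (names and shapes here are illustrative).

```python
import torch

def sample_batched(logits: torch.Tensor, temperature: torch.Tensor, top_p: torch.Tensor) -> torch.Tensor:
    """logits: [B, V]; temperature, top_p: [B], one value per stream (temperature > 0)."""
    # Per-example temperature via a broadcast divide - no loop over the batch.
    probs = torch.softmax(logits / temperature.unsqueeze(-1), dim=-1)
    # Per-example nucleus (top-p) filtering, computed for all examples at once.
    sorted_probs, sorted_idx = torch.sort(probs, dim=-1, descending=True)
    cum = torch.cumsum(sorted_probs, dim=-1)
    # Keep each token whose preceding cumulative mass is below that example's top_p;
    # the highest-probability token is always kept.
    keep = (cum - sorted_probs) < top_p.unsqueeze(-1)
    keep[:, 0] = True
    sorted_probs = sorted_probs * keep
    sorted_probs = sorted_probs / sorted_probs.sum(dim=-1, keepdim=True)
    choice = torch.multinomial(sorted_probs, num_samples=1)      # [B, 1]
    return sorted_idx.gather(-1, choice).squeeze(-1)             # [B]

# e.g. two streams with different settings, sampled in one call:
tokens = sample_batched(torch.randn(2, 32000), torch.tensor([0.7, 1.0]), torch.tensor([0.9, 0.95]))
```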
```
pretrained_weights: Whether to load pretrained weights (True) or random init (False)
dtype: Data type for the model
trust_remote_code: Whether to trust remote code when loading model (needed for some models like Nemotron)
use_meta_device: If True, create the model on the meta device (no memory allocation).
```
Is this compatible with transformers v5 or v4? Or both? We already bumped NeMo to v5.
The meta device initialization is handled by PyTorch's torch.device('meta') context manager, which is independent of transformers. The transformers APIs used here (AutoConfig.from_pretrained, AutoModelForCausalLM.from_config/from_pretrained) are stable across v4 and v5.
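A minimal sketch of that pattern - the meta allocation is pure PyTorch, so it behaves the same under transformers v4 and v5 (TinyLlama is used here only because it appears in the tests below):

```python
import torch
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("TinyLlama/TinyLlama_v1.1")
with torch.device("meta"):
    # Parameters are created on the meta device: shapes/dtypes only,
    # no memory is allocated and no weights are materialized.
    model = AutoModelForCausalLM.from_config(config)
assert next(model.parameters()).is_meta
```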
```python
from nemo.collections.speechlm2.models import NemotronVoiceChat

_pretrained_llm = "TinyLlama/TinyLlama_v1.1"
```
Since so much logic is dedicated to Nemotron v2 in VoiceChat, shouldn't we test against Nemotron v2 as well? Or are you concerned it will take a very long time to load in CI?
Exactly - I used TinyLlama to keep model loading relatively quick.
Note that some tests in this PR use the full VoiceChat checkpoint. They are only run if you specify the path to a local checkpoint. Once the checkpoint is public, I plan to make the tests use that public checkpoint, so they can easily run on CI, not just locally. I believe that will satisfy the requirement of making sure we do test with Nemotron v2.
OK makes sense. If you want to add proper regression/functional tests, follow the pattern in tests/functional_tests and also please add a nightly E2E test in tests/e2e_tests
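A hedged sketch of the gating pattern described above - the heavy test runs only when a local checkpoint path is supplied (the environment variable name and the `from_pretrained` call shape are assumptions, not the PR's actual code):

```python
import os
import pytest

CHECKPOINT = os.environ.get("NEMOTRON_VOICECHAT_CHECKPOINT")  # hypothetical variable name

@pytest.mark.skipif(CHECKPOINT is None, reason="full VoiceChat checkpoint not available")
def test_streaming_matches_offline():
    from nemo.collections.speechlm2.models import NemotronVoiceChat
    # Call shape assumed; the changelog only says from_pretrained supports
    # loading HF-format checkpoints with llm_artifacts/.
    model = NemotronVoiceChat.from_pretrained(CHECKPOINT)
    ...
```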
Commits:

- …e classes
- … logs to debug
- …s; test padding
- …ake ctm token-level
- …neOutput, move token finalization into output object
- …ug print
- … channel from inference
- …red between duplexstt and inference pipeline model wrapper; add tests
pzelasko left a comment:
Thanks, great work, it's greatly improved!
Left a few comments; my main concern remains the hardcoded batch_size==1 asserts, while the entire pipeline looks capable of processing batch_size>1, at least for offline evaluation (online inference is a separate story).
```
(With ``s2s.decode_audio=false``, the model still predicts text/ASR tokens but
skips EarTTS generation and codec decoding.)

Quick Start
```
Quick Start should live at the top of this page
```
- (required)
- Sliding-window size passed to the perception encoder.
* - ``batch_size``
  - ``1``
```
flagging - is this still true?
```
- ``8192``
- Maximum number of frames per stream.

Padding Settings (top-level)
```
It should explain that this isn't the typical padding used for batching, but extra silence following the user's recording in order to allow the model to generate a response.
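To illustrate the distinction (names are illustrative, not the PR's config keys): the padding here is trailing silence appended after the user's audio so the model has frames left in which to respond.

```python
import torch

def pad_with_response_silence(audio: torch.Tensor, sample_rate: int, padding_secs: float) -> torch.Tensor:
    """audio: [T] mono waveform; appends padding_secs of silence for the model's reply."""
    silence = torch.zeros(int(padding_secs * sample_rate), dtype=audio.dtype)
    return torch.cat([audio, silence])
```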
```
@@ -0,0 +1,226 @@
# Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
```
why is it called model.py and not llm.py like in vllm/ directory?
```python
# With cache: we get exactly num_frames_per_chunk output frames
base_frame_index = 0
else:
# Without cache: use the second-to-last encoded frame as the
```
Can you explain where the 10ms is coming from? I thought cache-aware has only "regular" chunks of 80ms
```python
self.output_sample_rate = getattr(self.streaming_cfg, "output_sample_rate", 22050)
self.batch_size = getattr(self.streaming_cfg, "batch_size", 1)
self.max_len = getattr(self.streaming_cfg, "max_len", 8192)
if self.batch_size != 1:
```
Is this still a limitation? The code seems capable of bs>1 inference
```
@@ -0,0 +1,278 @@
# Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
```
In that case, could we split this up into a library module (here) and a script (the best place seems to be examples/speechlm2/voicechat/to_vllm.py)?
This repo doesn't follow the scripts-as-importable-modules-living-in-the-library pattern (I'm sure the pattern has a proper name). Something along the lines of the sketch below.
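A sketch of the suggested layout, assuming a hypothetical `export_to_vllm` helper: the conversion logic lives in an importable module, and the example script is only a thin argparse wrapper around it.

```python
# nemo/collections/speechlm2/... (hypothetical library module)
def export_to_vllm(checkpoint_path: str, output_dir: str) -> None:
    """Convert a VoiceChat checkpoint into a vLLM-loadable layout."""
    ...

# examples/speechlm2/voicechat/to_vllm.py (thin CLI wrapper)
if __name__ == "__main__":
    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument("checkpoint_path")
    parser.add_argument("output_dir")
    args = parser.parse_args()
    export_to_vllm(args.checkpoint_path, args.output_dir)
```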
```
@@ -0,0 +1,244 @@
# Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
```
Can we split up this script in the same way as the one above (ideally making to_vllm.py just work with both types of checkpoints)?
```python
# decoding.
# ---------------------------------------------------------------------------

SECONDS_PER_FRAME = 0.08
```
This seems like a dangerous constant to hardcode - can't we pass model.token_equivalent_duration directly? (Not sure if this property has been exposed on the VoiceChat class, but it's in AudioPerceptionModule and some SALM and S2S classes IIRC.)
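A possible shape for that fix, assuming the property is reachable via the perception module (attribute path unverified): read the frame duration from the model, keeping today's 80 ms value only as a fallback.

```python
# Fall back to the hardcoded 80 ms value only when the model doesn't expose it.
seconds_per_frame = getattr(
    model.perception, "token_equivalent_duration", 0.08  # attribute path assumed
)
```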
Important

The Update branch button must only be pressed on very rare occasions. An outdated branch is never blocking the merge of a PR.
Please reach out to the automation team before pressing that button.
What does this PR do?
Add a streaming (real-time, chunk-by-chunk) inference pipeline for NemotronVoiceChat,
following the same architecture as the NeMo ASR Inference Pipelines.
Collection: speechlm2
Changelog
- `StreamingS2SPipeline` with `generate_step()` API for both batch file processing and server integration
- `NemotronVoicechatInferenceWrapper` with `infer_one_step()` implementing perception → LLM → TTS → codec decode
- `S2SPipelineBuilder` factory and Hydra config (`s2s_streaming.yaml`) for easy setup
- `S2SContextManager` for decode state lifecycle, `S2SStreamingState` for output accumulation
- `s2s_streaming_infer.py` entry script for batch inference on files/manifests
- `DuplexSTTModel`: KV cache support for Nemotron hybrid Mamba/Attention (with monkey-patches for upstream HF bugs), `save_pretrained` with tokenizer export, function head, ASR logit boosts, `cache_position` forwarding
- `conftest.py` fixtures, offline-vs-streaming parity test, no-crash config sweep
- `streaming_inference.rst` with architecture, config reference, and server integration guide

Modifications to more general code - FYI @kevinhu-nv @Edresson
- `NemotronVoiceChat`: `from_pretrained` supports loading from an HF-format checkpoint with `llm_artifacts/`
- `EarTTSModel`: vectorized depth-sum (illustrated below), precomputed RVQ schedule, optional `torch.compile`, subword cache
- The `_patch_nemotron_cache_bugs` and `_patch_nemotron_block_forward` methods in `DuplexSTTModel` patch bugs in the HF Nemotron model code so we can get the KV caching to work. The patches seem to work for me, though I wonder if we can use more up-to-date code that doesn't need the patches.
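As an illustration of the vectorized depth-sum mentioned above (tensor names and shapes are assumptions, not `EarTTSModel`'s actual internals): the per-level RVQ codebook embeddings are gathered and summed over the depth axis in one batched op instead of a Python loop.

```python
import torch

B, T, D, H, V = 2, 10, 8, 256, 1024        # batch, time, RVQ depth, hidden, vocab
codes = torch.randint(0, V, (B, T, D))     # one code per depth level per frame
codebooks = torch.randn(D, V, H)           # a separate embedding table per level

# Gather each level's embedding, then sum over the depth axis in one shot:
# depth_sum[b, t] = sum_d codebooks[d, codes[b, t, d]]
embedded = codebooks[torch.arange(D), codes]   # [B, T, D, H] via advanced indexing
depth_sum = embedded.sum(dim=2)                # [B, T, H]
```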
Usage

```bash
python examples/speechlm2/nemo_inference_pipelines/s2s_streaming_infer.py \
    audio_file=/path/to/audio.wav \
    s2s.model_path=/path/to/checkpoint \
    s2s.speaker_name="<name>" \
    s2s.engine_type="native" \
    streaming.chunk_size_in_secs=0.08 \
    streaming.buffer_size_in_secs=1.68
```
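For server integration, the flow could look roughly like this sketch - the exact constructor, `generate_step()` signature, and result fields are assumptions; `streaming_inference.rst` in this PR documents the real API.

```python
import torch

SAMPLE_RATE = 16000                          # assumed input rate
CHUNK = int(0.08 * SAMPLE_RATE)              # 80 ms, matching streaming.chunk_size_in_secs

audio = torch.zeros(SAMPLE_RATE * 4)         # stand-in for 4 s of live input
for start in range(0, audio.numel(), CHUNK):
    chunk = audio[start : start + CHUNK]
    result = pipeline.generate_step(chunk)   # `pipeline`: a StreamingS2SPipeline; call shape assumed
    if result.text:                          # hypothetical incremental-output fields
        send_to_client(result.text)          # `send_to_client`: your server's transport
    if result.audio is not None:
        send_to_client(result.audio)
```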
GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.
The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI, remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".
Before your PR is "Ready for review"
Pre checks:
PR Type:
If you haven't finished some of the above items, you can still open a "Draft" PR.
Who can review?
Anyone in the community is free to review the PR once the checks have passed.
The contributor guidelines list specific people who can review PRs to various areas.
Additional Information