[speechlm2] Add streaming inference pipeline for NemotronVoiceChat #15571
@@ -246,7 +246,35 @@ You can evaluate and run full-duplex inference using the `NemotronVoiceChat` pip
    print(f"Agent response: {generated_text}")
    # generated_speech can now be saved or played (sampled at model.target_sample_rate)

NemotronVoiceChat Streaming Inference
*************************************

For real-time, chunk-by-chunk inference (as opposed to the offline mode shown
above), use the Streaming S2S Pipeline:

.. code-block:: python

   from nemo.collections.speechlm2.inference import S2SPipelineBuilder

   pipeline = S2SPipelineBuilder.build_pipeline(cfg)
   output = pipeline.run(audio_filepaths, options=options)

Or from the command line:

.. code-block:: bash

   python examples/speechlm2/nemo_inference_pipelines/s2s_streaming_infer.py \
       audio_file=/path/to/audio \
       s2s.model_path=/path/to/checkpoint \
       s2s.speaker_name="<speaker>" \
       s2s.engine_type=native \
       s2s.system_prompt="You are a helpful assistant." \
       streaming.chunk_size_in_secs=0.24 \
       streaming.buffer_size_in_secs=1.68

See :doc:`streaming_inference` for full details on configuration, architecture,
and server integration.

Training a Model
----------------
@@ -341,3 +369,4 @@ For more information, see additional sections in the SpeechLM2 docs:

   datasets
   configs
   training_and_scaling
   streaming_inference

@@ -0,0 +1,304 @@
Streaming Inference
===================

The speechlm2 collection provides a streaming inference pipeline for
NemotronVoiceChat that processes audio in real time, chunk by chunk, and
produces both text and speech output incrementally. The pipeline follows the
same methodology as the NeMo ASR Inference Pipelines (see
``nemo.collections.asr.inference``).

Overview
--------

The streaming inference stack has four layers:

.. code-block:: text

   Entry Script    s2s_streaming_infer.py (Hydra)
        │
        ▼
   Pipeline        StreamingS2SPipeline
        │            - audio buffering
        │            - state management
        │            - file I/O
        ▼
   Model Wrapper   NemotronVoicechatInferenceWrapper
        │            - infer_one_step()
        │            - perception / LLM / TTS / codec
        ▼
   Model           NemotronVoiceChat
                     - DuplexSTTModel + DuplexEARTTS
Quick Start
-----------

Batch Inference from a Script
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The simplest way to run streaming inference is with the provided Hydra script:

.. code-block:: bash

   python examples/speechlm2/nemo_inference_pipelines/s2s_streaming_infer.py \
       --config-path=examples/speechlm2/nemo_inference_pipelines/conf \
       --config-name=s2s_streaming \
       audio_file=/path/to/audio_or_directory_or_manifest.json \
       output_dir=./generated \
       s2s.model_path=/path/to/checkpoint \
       s2s.speaker_name="<speaker>" \
       s2s.engine_type=native \
       s2s.system_prompt="You are a helpful assistant." \
       streaming.chunk_size_in_secs=0.24 \
       streaming.buffer_size_in_secs=1.68

This will:

1. Load the NemotronVoiceChat checkpoint.
2. Stream each audio file through the pipeline in chunks.
3. Save generated ``.wav``, stereo (input+output), and ``.txt`` files under
   ``output_dir``.
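When ``audio_file`` points to a manifest, a JSON-lines file is expected. A minimal sketch of writing one, assuming the NeMo-style convention with an ``audio_filepath`` key (the exact schema this pipeline accepts is defined by its manifest loader, so treat the key names here as an assumption):

```python
import json

# Hypothetical two-entry manifest; key names follow NeMo's usual
# JSON-lines convention and may differ from what the loader expects.
entries = [
    {"audio_filepath": "/data/user_turn_1.wav", "duration": 4.2},
    {"audio_filepath": "/data/user_turn_2.wav", "duration": 7.8},
]
with open("manifest.json", "w") as f:
    for entry in entries:
        f.write(json.dumps(entry) + "\n")
```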
Programmatic Usage
^^^^^^^^^^^^^^^^^^

.. code-block:: python

   from nemo.collections.speechlm2.inference import S2SPipelineBuilder

   pipeline = S2SPipelineBuilder.build_pipeline(cfg)
   output = pipeline.run(audio_filepaths, options=options)

   # output.texts -- generated agent text per file
   # output.asr_texts -- recognized user text per file
   # output.audio_filepaths -- paths to generated .wav files

[Reviewer comment] Is there another entry-point with a streaming input connector (mic)? We should mention.
Architecture
------------

The Core Loop
^^^^^^^^^^^^^

Like the ASR pipeline's ``BasePipeline.run()``, the S2S pipeline iterates
over chunks and calls a single step method:

.. code-block:: python

   pipeline.open_session()
   for frames in streamer:
       pipeline.generate_step(frames)
   pipeline.close_session()
   return PipelineOutput(...)

``generate_step()`` is the unified entry point used by **both** the batch
``run()`` method and server deployments.
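The ``streamer`` above yields chunks of audio per step. A toy chunker (a hypothetical helper, not the actual streamer class) illustrates how a waveform could be split into fixed-size chunks with first/last flags:

```python
import numpy as np

def iter_chunks(samples: np.ndarray, chunk_size: int):
    """Yield (chunk, is_first, is_last) triples over a waveform.

    Toy sketch of what a streamer provides per step; the real streamer
    also handles buffering and batching.
    """
    total = len(samples)
    for start in range(0, total, chunk_size):
        chunk = samples[start:start + chunk_size]
        yield chunk, start == 0, start + chunk_size >= total

# 0.24 s chunks at an assumed 16 kHz input rate = 3840 samples per step
audio = np.zeros(16000)  # 1 s of silence
steps = list(iter_chunks(audio, chunk_size=3840))
print(len(steps))  # 5 steps; the last chunk is shorter (640 samples)
```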
What Happens Inside One Step
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Each call to ``generate_step(frames)`` performs:

1. **Prefill detection** -- A zero-length first frame with a system prompt
   triggers ``prefill_for_new_stream()``, which initializes the LLM KV cache
   with the system prompt and the TTS speaker embedding.

2. **Audio buffering** -- ``BatchedAudioBufferer`` (reused from ASR
   infrastructure) maintains a sliding window of ``buffer_size_in_secs``.

3. **Model inference** via ``infer_one_step(audio_buffer, state)``:

   a. **Perception** -- The audio buffer is encoded by the streaming
      FastConformer encoder into frame embeddings.
   b. **Per-frame LLM loop** -- For each of the ``num_frames_per_chunk``
      frames, the pipeline builds an input embedding (user audio +
      previous-step text/ASR tokens), runs it through the LLM, and obtains
      predicted text and ASR tokens.
   c. **Per-frame TTS** -- Each predicted text token is fed into the EarTTS
      model to produce audio codec codes.
   d. **Codec decode** -- The accumulated codes are decoded into a waveform.

4. **State updates** -- The context manager advances ``frame_idx`` and
   updates the subword mask.

5. **Output accumulation** -- Decoded audio and text are appended to the
   per-stream ``S2SStreamingState``.
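The sliding-window buffering in step 2 can be illustrated with a toy bufferer (a hypothetical simplified stand-in for ``BatchedAudioBufferer``, assuming 16 kHz input audio and the default 0.24 s / 1.68 s settings):

```python
import numpy as np

SAMPLE_RATE = 16000  # assumed input rate

class SlidingAudioBuffer:
    """Toy sliding-window buffer: keeps the last `buffer_secs` of audio.

    Illustrative only -- the real BatchedAudioBufferer also handles
    batching and padding.
    """

    def __init__(self, buffer_secs: float):
        self.size = int(buffer_secs * SAMPLE_RATE)
        self.buffer = np.zeros(self.size, dtype=np.float32)

    def push_chunk(self, chunk: np.ndarray) -> np.ndarray:
        # Shift old samples left and append the new chunk on the right.
        n = len(chunk)
        self.buffer = np.concatenate([self.buffer[n:], chunk])
        return self.buffer

buf = SlidingAudioBuffer(buffer_secs=1.68)
chunk = np.ones(int(0.24 * SAMPLE_RATE), dtype=np.float32)
window = buf.push_chunk(chunk)
print(len(window))  # 26880 samples = 1.68 s window at 16 kHz
```

Each step thus sees the newest chunk plus 1.44 s of preceding context, which is what the perception encoder consumes.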
Two Kinds of State
^^^^^^^^^^^^^^^^^^

The pipeline maintains two separate state objects per stream:

**StreamingDecodeState** (model level)
   Lives in ``S2SContextManager`` slots. Contains LLM KV cache, TTS KV
   cache, perception cache, codec cache, token workspaces (``gen_text``,
   ``gen_asr_text``), and ``frame_idx``. Created by the wrapper, mutated
   in-place by ``infer_one_step()``, destroyed at end-of-stream.

**S2SStreamingState** (pipeline level)
   Lives in the pipeline's ``_state_pool``. Accumulates generated audio
   chunks, text strings, and word timings across steps. Kept alive until
   ``close_session()`` so the final ``PipelineOutput`` can be assembled.
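The split can be sketched with two toy dataclasses; field names follow the description above, not the actual class definitions:

```python
from dataclasses import dataclass, field

@dataclass
class ToyDecodeState:
    """Model-level sketch: lives in an S2SContextManager slot."""
    llm_kv_cache: object = None
    tts_kv_cache: object = None
    perception_cache: object = None
    codec_cache: object = None
    gen_text: list = field(default_factory=list)      # token workspace
    gen_asr_text: list = field(default_factory=list)  # token workspace
    frame_idx: int = 0

@dataclass
class ToyStreamingState:
    """Pipeline-level sketch: lives in the pipeline's _state_pool."""
    audio_chunks: list = field(default_factory=list)
    texts: list = field(default_factory=list)
    word_timings: list = field(default_factory=list)

# Per step: the model mutates the decode state in place,
# while the pipeline accumulates output for the final PipelineOutput.
decode, stream = ToyDecodeState(), ToyStreamingState()
decode.frame_idx += 3          # e.g. three frames processed this chunk
stream.texts.append("hello")
```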
Configuration
-------------

The streaming inference configuration is defined in
``examples/speechlm2/nemo_inference_pipelines/conf/s2s_streaming.yaml``.

Key configuration groups:

S2S Model Settings (``s2s``)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. list-table::
   :header-rows: 1
   :widths: 30 15 55

   * - Parameter
     - Default
     - Description
   * - ``model_path``
     - (required)
     - Path to the NemotronVoiceChat HuggingFace checkpoint.
   * - ``engine_type``
     - (required)
     - ``native``, ``vllm_llm``, ``vllm_eartts``, or ``vllm_llm_vllm_eartts``.
   * - ``speaker_name``
     - ``null``
     - Registered speaker name (must match a speaker in the checkpoint).
   * - ``system_prompt``
     - (required)
     - Text injected into the LLM KV cache before audio streaming begins.
   * - ``compute_dtype``
     - ``bfloat16``
     - Precision for LLM/embedding layers.
   * - ``use_perception_cache``
     - ``true``
     - Cache-aware streaming for the perception encoder.
   * - ``use_llm_cache``
     - ``true``
     - Use KV cache for incremental LLM decoding.
   * - ``top_p``
     - ``0.5``
     - Top-p sampling threshold.
   * - ``temperature``
     - ``0.3``
     - Sampling temperature.
   * - ``deterministic``
     - ``false``
     - Force deterministic mode (native engine only).
   * - ``profile_timing``
     - ``false``
     - Insert ``torch.cuda.synchronize()`` around each stage for accurate
       per-stage timing. Disabled by default to avoid GPU stalls.
Streaming Settings (``streaming``)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. list-table::
   :header-rows: 1
   :widths: 30 15 55

   * - Parameter
     - Default
     - Description
   * - ``chunk_size_in_secs``
     - (required)
     - Audio processed per inference step. Must be a multiple of 0.08 s.
   * - ``buffer_size_in_secs``
     - (required)
     - Sliding-window size passed to the perception encoder.
   * - ``batch_size``
     - ``1``
     - Number of concurrent streams (currently only 1 is supported).
   * - ``max_len``
     - ``8192``
     - Maximum number of frames per stream.
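The 0.08 s constraint corresponds to one model frame. A small sketch (a hypothetical helper, assuming only the 0.08 s frame duration stated above) validating the setting and deriving the number of frames per chunk:

```python
FRAME_SECS = 0.08  # one model frame, per the constraint above

def frames_per_chunk(chunk_size_in_secs: float) -> int:
    """Validate chunk size and return the number of frames per step."""
    n = round(chunk_size_in_secs / FRAME_SECS)
    # Tolerance absorbs float rounding (0.24 is not exact in binary).
    if abs(n * FRAME_SECS - chunk_size_in_secs) > 1e-9:
        raise ValueError(
            f"chunk_size_in_secs={chunk_size_in_secs} must be "
            f"a multiple of {FRAME_SECS}"
        )
    return n

print(frames_per_chunk(0.24))  # 3 frames per step for the default config
```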
Padding Settings (top-level)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

At most one of these may be set:

.. list-table::
   :header-rows: 1
   :widths: 30 15 55

   * - Parameter
     - Default
     - Description
   * - ``pad_audio_to_sec``
     - ``null``
     - Pad each input to a fixed duration.
   * - ``pad_silence_ratio``
     - ``null``
     - Append silence equal to this fraction of the original duration.
   * - ``pad_audio_by_sec``
     - ``null``
     - Append a fixed number of extra seconds of silence.
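The three mutually exclusive options can be summarized as a small duration calculator (a hypothetical sketch; the actual padding is applied to the waveform inside the pipeline, and we assume ``pad_audio_to_sec`` never truncates inputs already longer than the target):

```python
def padded_duration(orig_secs, pad_audio_to_sec=None,
                    pad_silence_ratio=None, pad_audio_by_sec=None):
    """Return the input duration after padding; at most one option may be set."""
    opts = [pad_audio_to_sec, pad_silence_ratio, pad_audio_by_sec]
    if sum(o is not None for o in opts) > 1:
        raise ValueError("at most one padding option may be set")
    if pad_audio_to_sec is not None:
        return max(orig_secs, pad_audio_to_sec)   # assumed: no truncation
    if pad_silence_ratio is not None:
        return orig_secs * (1 + pad_silence_ratio)
    if pad_audio_by_sec is not None:
        return orig_secs + pad_audio_by_sec
    return orig_secs

print(padded_duration(10.0, pad_audio_by_sec=2.0))   # 12.0
print(padded_duration(10.0, pad_silence_ratio=0.5))  # 15.0
```

Trailing silence gives the agent time to finish speaking after the user's audio ends.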
Server Integration
------------------

The same ``generate_step()`` method used by ``run()`` can be called directly
from a custom server. The zero-length Frame protocol handles prefill:

.. code-block:: python

   # 1. Prefill system prompt (zero-length frame)
   prefill_frame = Frame(
       samples=torch.empty(0),
       stream_id=stream_id,
       is_first=True, is_last=False,
       options=S2SRequestOptions(system_prompt=prompt),
   )
   pipeline.generate_step([prefill_frame])

   # 2. Stream audio chunks
   chunks = list(audio_source)
   for i, chunk in enumerate(chunks):
       frame = Frame(
           samples=chunk,
           stream_id=stream_id,
           is_first=(i == 0), is_last=(i == len(chunks) - 1),
       )
       pipeline.generate_step([frame])
Batch Size
----------

The pipeline currently supports ``batch_size=1`` (one stream at a time).

File Layout
-----------

.. code-block:: text

   nemo/collections/speechlm2/inference/
   ├── __init__.py                              # Public exports
   ├── factory/
   │   └── s2s_pipeline_builder.py              # S2SPipelineBuilder
   ├── pipelines/
   │   ├── s2s_pipeline_interface.py            # Base: _state_pool, sessions
   │   └── streaming_s2s_pipeline.py            # StreamingS2SPipeline
   ├── model_wrappers/
   │   ├── decode_state.py                      # StreamingDecodeState, InferenceStepResult
   │   ├── nemotron_voicechat_inference_wrapper.py
   │   ├── model_factory.py                     # Native / vLLM model interfaces
   │   └── perception_cache.py                  # Perception cache + CUDA graphs
   ├── streaming/
   │   ├── framing/
   │   │   └── s2s_request_options.py           # S2SRequestOptions
   │   └── state/
   │       ├── s2s_state.py                     # S2SStreamingState
   │       └── s2s_context_manager.py           # Slot-based decode-state lifecycle
   ├── utils/
   │   ├── pipeline_utils.py                    # PipelineOutput, text helpers
   │   └── audio_data.py                        # Manifest / folder loading
   └── vllm/                                    # Optional vLLM engine backend
[Reviewer comment] Does this assume a single-turn evaluation? Or can the audio file have multiple turns that the agent is expected to handle correctly? Let's clarify this in the docs.

[Author reply] Not sure what you mean -- it's full-duplex, so it just generates one frame of output for every frame of audio input. The audio input can contain a single turn, multiple turns, whatever.

Or, if you're asking about "evaluation": the code doesn't support detailed evaluation. We just generate text and audio for the full audio file (plus an option to add silence padding at the end, so the agent can finish speaking). The one bit of evaluation we have is WER.

[Reviewer comment] Let's write this in here -- it's not obvious to an outside reader what characterizes the inputs and outputs of this API.