[speechlm2] Add streaming inference pipeline for NemotronVoiceChat #15571
@@ -246,7 +246,35 @@ You can evaluate and run full-duplex inference using the `NemotronVoiceChat` pip
    print(f"Agent response: {generated_text}")
    # generated_speech can now be saved or played (sampled at model.target_sample_rate)

NemotronVoiceChat Streaming Inference
*************************************

For real-time, chunk-by-chunk inference (as opposed to the offline mode shown
above), use the Streaming S2S Pipeline:

.. code-block:: python

   from nemo.collections.speechlm2.inference import S2SPipelineBuilder

   pipeline = S2SPipelineBuilder.build_pipeline(cfg)
   output = pipeline.run(audio_filepaths, options=options)

Or from the command line:

.. code-block:: bash

   python examples/speechlm2/nemo_inference_pipelines/s2s_streaming_infer.py \
       audio_file=/path/to/audio \
       s2s.model_path=/path/to/checkpoint \
       s2s.speaker_name="<speaker>" \
       s2s.engine_type=native \
       s2s.system_prompt="You are a helpful assistant." \
       streaming.chunk_size_in_secs=0.24 \
       streaming.buffer_size_in_secs=1.68

See :doc:`streaming_inference` for full details on configuration, architecture,
and server integration.

Training a Model
----------------
@@ -341,3 +369,4 @@ For more information, see additional sections in the SpeechLM2 docs:

   datasets
   configs
   training_and_scaling
   streaming_inference

@@ -0,0 +1,304 @@
Streaming Inference
===================

The speechlm2 collection provides a streaming inference pipeline for
NemotronVoiceChat that processes audio in real time, chunk by chunk, and
produces both text and speech output incrementally. The pipeline follows the
same methodology as the NeMo ASR Inference Pipelines (see
``nemo.collections.asr.inference``).

Overview
--------

The streaming inference stack has four layers:

.. code-block:: text

   Entry Script    s2s_streaming_infer.py (Hydra)
        │
        ▼
   Pipeline        StreamingS2SPipeline
        │            - audio buffering
        │            - state management
        │            - file I/O
        ▼
   Model Wrapper   NemotronVoicechatInferenceWrapper
        │            - infer_one_step()
        │            - perception / LLM / TTS / codec
        ▼
   Model           NemotronVoiceChat
                     - DuplexSTTModel + DuplexEARTTS
Quick Start
-----------

Batch Inference from a Script
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The simplest way to run streaming inference is with the provided Hydra script:

.. code-block:: bash

   python examples/speechlm2/nemo_inference_pipelines/s2s_streaming_infer.py \
       --config-path=examples/speechlm2/nemo_inference_pipelines/conf \
       --config-name=s2s_streaming \
       audio_file=/path/to/audio_or_directory_or_manifest.json \
       output_dir=./generated \
       s2s.model_path=/path/to/checkpoint \
       s2s.speaker_name="<speaker>" \
       s2s.engine_type=native \
       s2s.system_prompt="You are a helpful assistant." \
       streaming.chunk_size_in_secs=0.24 \
       streaming.buffer_size_in_secs=1.68

This will:

1. Load the NemotronVoiceChat checkpoint.
2. Stream each audio file through the pipeline in chunks.
3. Save generated ``.wav``, stereo (input+output), and ``.txt`` files under
   ``output_dir``.
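When ``audio_file`` points to a manifest, a JSON-lines file is expected. A minimal sketch of writing one, assuming the NeMo-style convention with an ``audio_filepath`` key (the exact schema this pipeline accepts is defined by its manifest loader, so treat the key names here as an assumption):

```python
import json

# Hypothetical two-entry manifest; key names follow NeMo's usual
# JSON-lines convention and may differ from what the loader expects.
entries = [
    {"audio_filepath": "/data/user_turn_1.wav", "duration": 4.2},
    {"audio_filepath": "/data/user_turn_2.wav", "duration": 7.8},
]
with open("manifest.json", "w") as f:
    for entry in entries:
        f.write(json.dumps(entry) + "\n")
```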
Programmatic Usage
^^^^^^^^^^^^^^^^^^

.. code-block:: python

   from nemo.collections.speechlm2.inference import S2SPipelineBuilder

   pipeline = S2SPipelineBuilder.build_pipeline(cfg)
   output = pipeline.run(audio_filepaths, options=options)

   # output.texts -- generated agent text per file
   # output.asr_texts -- recognized user text per file
   # output.audio_filepaths -- paths to generated .wav files

[Reviewer comment] Is there another entry-point with a streaming input connector (mic)? We should mention.
Architecture
------------

The Core Loop
^^^^^^^^^^^^^

Like the ASR pipeline's ``BasePipeline.run()``, the S2S pipeline iterates
over chunks and calls a single step method:

.. code-block:: python

   pipeline.open_session()
   for frames in streamer:
       pipeline.generate_step(frames)
   pipeline.close_session()
   return PipelineOutput(...)

``generate_step()`` is the unified entry point used by **both** the batch
``run()`` method and server deployments.
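The ``streamer`` above yields chunks of audio per step. A toy chunker (a hypothetical helper, not the actual streamer class) illustrates how a waveform could be split into fixed-size chunks with first/last flags:

```python
import numpy as np

def iter_chunks(samples: np.ndarray, chunk_size: int):
    """Yield (chunk, is_first, is_last) triples over a waveform.

    Toy sketch of what a streamer provides per step; the real streamer
    also handles buffering and batching.
    """
    total = len(samples)
    for start in range(0, total, chunk_size):
        chunk = samples[start:start + chunk_size]
        yield chunk, start == 0, start + chunk_size >= total

# 0.24 s chunks at an assumed 16 kHz input rate = 3840 samples per step
audio = np.zeros(16000)  # 1 s of silence
steps = list(iter_chunks(audio, chunk_size=3840))
print(len(steps))  # 5 steps; the last chunk is shorter (640 samples)
```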
What Happens Inside One Step
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Each call to ``generate_step(frames)`` performs:

1. **Prefill detection** -- A zero-length first frame with a system prompt
   triggers ``prefill_for_new_stream()``, which initializes the LLM KV cache
   with the system prompt and the TTS speaker embedding.

2. **Audio buffering** -- ``BatchedAudioBufferer`` (reused from ASR
   infrastructure) maintains a sliding window of ``buffer_size_in_secs``.

3. **Model inference** via ``infer_one_step(audio_buffer, state)``:

   a. **Perception** -- The audio buffer is encoded by the streaming
      FastConformer encoder into frame embeddings.
   b. **Per-frame LLM loop** -- For each of the ``num_frames_per_chunk``
      frames, the pipeline builds an input embedding (user audio +
      previous-step text/ASR tokens), runs it through the LLM, and obtains
      predicted text and ASR tokens.
   c. **Per-frame TTS** -- Each predicted text token is fed into the EarTTS
      model to produce audio codec codes.
   d. **Codec decode** -- The accumulated codes are decoded into a waveform.

4. **State updates** -- The context manager advances ``frame_idx`` and
   updates the subword mask.

5. **Output accumulation** -- Decoded audio and text are appended to the
   per-stream ``S2SStreamingState``.
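The sliding-window buffering in step 2 can be illustrated with a toy bufferer (a hypothetical simplified stand-in for ``BatchedAudioBufferer``, assuming 16 kHz input audio and the default 0.24 s / 1.68 s settings):

```python
import numpy as np

SAMPLE_RATE = 16000  # assumed input rate

class SlidingAudioBuffer:
    """Toy sliding-window buffer: keeps the last `buffer_secs` of audio.

    Illustrative only -- the real BatchedAudioBufferer also handles
    batching and padding.
    """

    def __init__(self, buffer_secs: float):
        self.size = int(buffer_secs * SAMPLE_RATE)
        self.buffer = np.zeros(self.size, dtype=np.float32)

    def push_chunk(self, chunk: np.ndarray) -> np.ndarray:
        # Shift old samples left and append the new chunk on the right.
        n = len(chunk)
        self.buffer = np.concatenate([self.buffer[n:], chunk])
        return self.buffer

buf = SlidingAudioBuffer(buffer_secs=1.68)
chunk = np.ones(int(0.24 * SAMPLE_RATE), dtype=np.float32)
window = buf.push_chunk(chunk)
print(len(window))  # 26880 samples = 1.68 s window at 16 kHz
```

Each step thus sees the newest chunk plus 1.44 s of preceding context, which is what the perception encoder consumes.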
Two Kinds of State
^^^^^^^^^^^^^^^^^^

The pipeline maintains two separate state objects per stream:

**StreamingDecodeState** (model level)
   Lives in ``S2SContextManager`` slots. Contains LLM KV cache, TTS KV
   cache, perception cache, codec cache, token workspaces (``gen_text``,
   ``gen_asr_text``), and ``frame_idx``. Created by the wrapper, mutated
   in-place by ``infer_one_step()``, destroyed at end-of-stream.

**S2SStreamingState** (pipeline level)
   Lives in the pipeline's ``_state_pool``. Accumulates generated audio
   chunks, text strings, and word timings across steps. Kept alive until
   ``close_session()`` so the final ``PipelineOutput`` can be assembled.
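The split can be sketched with two toy dataclasses; field names follow the description above, not the actual class definitions:

```python
from dataclasses import dataclass, field

@dataclass
class ToyDecodeState:
    """Model-level sketch: lives in an S2SContextManager slot."""
    llm_kv_cache: object = None
    tts_kv_cache: object = None
    perception_cache: object = None
    codec_cache: object = None
    gen_text: list = field(default_factory=list)      # token workspace
    gen_asr_text: list = field(default_factory=list)  # token workspace
    frame_idx: int = 0

@dataclass
class ToyStreamingState:
    """Pipeline-level sketch: lives in the pipeline's _state_pool."""
    audio_chunks: list = field(default_factory=list)
    texts: list = field(default_factory=list)
    word_timings: list = field(default_factory=list)

# Per step: the model mutates the decode state in place,
# while the pipeline accumulates output for the final PipelineOutput.
decode, stream = ToyDecodeState(), ToyStreamingState()
decode.frame_idx += 3          # e.g. three frames processed this chunk
stream.texts.append("hello")
```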
Configuration
-------------

The streaming inference configuration is defined in
``examples/speechlm2/nemo_inference_pipelines/conf/s2s_streaming.yaml``.

Key configuration groups:

S2S Model Settings (``s2s``)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. list-table::
   :header-rows: 1
   :widths: 30 15 55

   * - Parameter
     - Default
     - Description
   * - ``model_path``
     - (required)
     - Path to the NemotronVoiceChat HuggingFace checkpoint.
   * - ``engine_type``
     - (required)
     - ``native``, ``vllm_llm``, ``vllm_eartts``, or ``vllm_llm_vllm_eartts``.
   * - ``speaker_name``
     - ``null``
     - Registered speaker name (must match a speaker in the checkpoint).
   * - ``system_prompt``
     - (required)
     - Text injected into the LLM KV cache before audio streaming begins.
   * - ``compute_dtype``
     - ``bfloat16``
     - Precision for LLM/embedding layers.
   * - ``use_perception_cache``
     - ``true``
     - Cache-aware streaming for the perception encoder.
   * - ``use_llm_cache``
     - ``true``
     - Use KV cache for incremental LLM decoding.
   * - ``top_p``
     - ``0.5``
     - Top-p sampling threshold.
   * - ``temperature``
     - ``0.3``
     - Sampling temperature.
   * - ``deterministic``
     - ``false``
     - Force deterministic mode (native engine only).
   * - ``profile_timing``
     - ``false``
     - Insert ``torch.cuda.synchronize()`` around each stage for accurate
       per-stage timing. Disabled by default to avoid GPU stalls.
Streaming Settings (``streaming``)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. list-table::
   :header-rows: 1
   :widths: 30 15 55

   * - Parameter
     - Default
     - Description
   * - ``chunk_size_in_secs``
     - (required)
     - Audio processed per inference step. Must be a multiple of 0.08 s.
   * - ``buffer_size_in_secs``
     - (required)
     - Sliding-window size passed to the perception encoder.
   * - ``batch_size``
     - ``1``
     - Number of concurrent streams (currently only 1 is supported).
   * - ``max_len``
     - ``8192``
     - Maximum number of frames per stream.
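The 0.08 s constraint corresponds to one model frame. A small sketch (a hypothetical helper, assuming only the 0.08 s frame duration stated above) validating the setting and deriving the number of frames per chunk:

```python
FRAME_SECS = 0.08  # one model frame, per the constraint above

def frames_per_chunk(chunk_size_in_secs: float) -> int:
    """Validate chunk size and return the number of frames per step."""
    n = round(chunk_size_in_secs / FRAME_SECS)
    # Tolerance absorbs float rounding (0.24 is not exact in binary).
    if abs(n * FRAME_SECS - chunk_size_in_secs) > 1e-9:
        raise ValueError(
            f"chunk_size_in_secs={chunk_size_in_secs} must be "
            f"a multiple of {FRAME_SECS}"
        )
    return n

print(frames_per_chunk(0.24))  # 3 frames per step for the default config
```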
Padding Settings (top-level)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

At most one of these may be set:

.. list-table::
   :header-rows: 1
   :widths: 30 15 55

   * - Parameter
     - Default
     - Description
   * - ``pad_audio_to_sec``
     - ``null``
     - Pad each input to a fixed duration.
   * - ``pad_silence_ratio``
     - ``null``
     - Append silence equal to this fraction of the original duration.
   * - ``pad_audio_by_sec``
     - ``null``
     - Append a fixed number of extra seconds of silence.
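The three mutually exclusive options can be summarized as a small duration calculator (a hypothetical sketch; the actual padding is applied to the waveform inside the pipeline, and we assume ``pad_audio_to_sec`` never truncates inputs already longer than the target):

```python
def padded_duration(orig_secs, pad_audio_to_sec=None,
                    pad_silence_ratio=None, pad_audio_by_sec=None):
    """Return the input duration after padding; at most one option may be set."""
    opts = [pad_audio_to_sec, pad_silence_ratio, pad_audio_by_sec]
    if sum(o is not None for o in opts) > 1:
        raise ValueError("at most one padding option may be set")
    if pad_audio_to_sec is not None:
        return max(orig_secs, pad_audio_to_sec)   # assumed: no truncation
    if pad_silence_ratio is not None:
        return orig_secs * (1 + pad_silence_ratio)
    if pad_audio_by_sec is not None:
        return orig_secs + pad_audio_by_sec
    return orig_secs

print(padded_duration(10.0, pad_audio_by_sec=2.0))   # 12.0
print(padded_duration(10.0, pad_silence_ratio=0.5))  # 15.0
```

Trailing silence gives the agent time to finish speaking after the user's audio ends.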
Server Integration
------------------

The same ``generate_step()`` method used by ``run()`` can be called directly
from a custom server. The zero-length Frame protocol handles prefill:

.. code-block:: python

   # 1. Prefill system prompt (zero-length frame)
   prefill_frame = Frame(
       samples=torch.empty(0),
       stream_id=stream_id,
       is_first=True, is_last=False,
       options=S2SRequestOptions(system_prompt=prompt),
   )
   pipeline.generate_step([prefill_frame])

   # 2. Stream audio chunks
   chunks = list(audio_source)
   for i, chunk in enumerate(chunks):
       frame = Frame(
           samples=chunk,
           stream_id=stream_id,
           is_first=(i == 0), is_last=(i == len(chunks) - 1),
       )
       pipeline.generate_step([frame])
Batch Size
----------

The pipeline currently supports ``batch_size=1`` (one stream at a time).

File Layout
-----------

.. code-block:: text

   nemo/collections/speechlm2/inference/
   ├── __init__.py                              # Public exports
   ├── factory/
   │   └── s2s_pipeline_builder.py              # S2SPipelineBuilder
   ├── pipelines/
   │   ├── s2s_pipeline_interface.py            # Base: _state_pool, sessions
   │   └── streaming_s2s_pipeline.py            # StreamingS2SPipeline
   ├── model_wrappers/
   │   ├── decode_state.py                      # StreamingDecodeState, InferenceStepResult
   │   ├── nemotron_voicechat_inference_wrapper.py
   │   ├── model_factory.py                     # Native / vLLM model interfaces
   │   └── perception_cache.py                  # Perception cache + CUDA graphs
   ├── streaming/
   │   ├── framing/
   │   │   └── s2s_request_options.py           # S2SRequestOptions
   │   └── state/
   │       ├── s2s_state.py                     # S2SStreamingState
   │       └── s2s_context_manager.py           # Slot-based decode-state lifecycle
   ├── utils/
   │   ├── pipeline_utils.py                    # PipelineOutput, text helpers
   │   └── audio_data.py                        # Manifest / folder loading
   └── vllm/                                    # Optional vLLM engine backend
[Reviewer comment] Does this assume a single-turn evaluation? Or can the audio file have multiple turns that the agent is expected to handle correctly? Let's clarify this in the docs.

[Author reply] Not sure what you mean -- it's full-duplex, so it just generates one frame of output for every frame of audio input. The audio input can contain a single turn, multiple turns, whatever.

Or, if you're asking about "evaluation": the code doesn't support detailed evaluation. We just generate text and audio for the full audio file (plus an option to add silence padding at the end, so the agent can finish speaking). The one bit of evaluation we have is WER.

[Reviewer comment] Let's write this in here -- it's not obvious to an outside reader what characterizes the inputs and outputs of this API.