The speechlm2 collection provides a streaming inference pipeline for
NemotronVoiceChat that processes audio in real time, chunk by chunk, and
produces both text and speech output incrementally. The pipeline follows the
same methodology as the NeMo ASR Inference Pipelines (see
nemo.collections.asr.inference).
The streaming inference stack has four layers:

```
Entry Script    s2s_streaming_infer.py (Hydra)
      │
      ▼
Pipeline        StreamingS2SPipeline
      │           - audio buffering
      │           - state management
      │           - file I/O
      ▼
Model Wrapper   NemotronVoicechatInferenceWrapper
      │           - infer_one_step()
      │           - perception / LLM / TTS / codec
      ▼
Model           NemotronVoiceChat
                  - DuplexSTTModel + DuplexEARTTS
```
The simplest way to run streaming inference is with the provided Hydra script:

```shell
python examples/speechlm2/nemo_inference_pipelines/s2s_streaming_infer.py \
    --config-path=examples/speechlm2/nemo_inference_pipelines/conf \
    --config-name=s2s_streaming \
    audio_file=/path/to/audio_or_directory_or_manifest.json \
    output_dir=./generated \
    s2s.model_path=/path/to/checkpoint \
    s2s.speaker_name="<speaker>" \
    s2s.engine_type=native \
    s2s.system_prompt="You are a helpful assistant." \
    streaming.chunk_size_in_secs=0.24 \
    streaming.buffer_size_in_secs=1.68
```

This will:
- Load the NemotronVoiceChat checkpoint.
- Stream each audio file through the pipeline in chunks.
- Save generated `.wav`, stereo (input + output) `.wav`, and `.txt` files under `output_dir`.
The pipeline can also be built and run programmatically:

```python
from nemo.collections.speechlm2.inference import S2SPipelineBuilder

pipeline = S2SPipelineBuilder.build_pipeline(cfg)
output = pipeline.run(audio_filepaths, options=options)

# output.texts            -- generated agent text per file
# output.asr_texts        -- recognized user text per file
# output.audio_filepaths  -- paths to generated .wav files
```

Like the ASR pipeline's `BasePipeline.run()`, the S2S pipeline iterates over chunks and calls a single step method:
```python
pipeline.open_session()
for frames in streamer:
    pipeline.generate_step(frames)
pipeline.close_session()
return PipelineOutput(...)
```

`generate_step()` is the unified entry point used by both the batch `run()` method and server deployments.
Each call to `generate_step(frames)` performs:

- Prefill detection -- A zero-length first frame with a system prompt triggers `prefill_for_new_stream()`, which initializes the LLM KV cache with the system prompt and the TTS speaker embedding.
- Audio buffering -- `BatchedAudioBufferer` (reused from ASR infrastructure) maintains a sliding window of `buffer_size_in_secs`.
- Model inference via `infer_one_step(audio_buffer, state)`:
  - Perception -- The audio buffer is encoded by the streaming FastConformer encoder into frame embeddings.
  - Per-frame LLM loop -- For each of the `num_frames_per_chunk` frames, the pipeline builds an input embedding (user audio + previous-step text/ASR tokens), runs it through the LLM, and obtains predicted text and ASR tokens.
  - Per-frame TTS -- Each predicted text token is fed into the EarTTS model to produce audio codec codes.
  - Codec decode -- The accumulated codes are decoded into a waveform.
  - State updates -- The context manager advances `frame_idx` and updates the subword mask.
- Output accumulation -- Decoded audio and text are appended to the per-stream `S2SStreamingState`.
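The audio-buffering step above can be sketched with a toy sliding window (a hypothetical stand-in for `BatchedAudioBufferer`, using the chunk and buffer sizes from the example command; the real class is batched and tensor-based):

```python
from collections import deque

SAMPLE_RATE = 16_000                     # assumed model sample rate
CHUNK_SECS, BUFFER_SECS = 0.24, 1.68     # values from the example command above


class SlidingAudioBuffer:
    """Toy stand-in for BatchedAudioBufferer: keeps only the most recent
    buffer_size_in_secs of audio as new chunks arrive each step."""

    def __init__(self, buffer_secs=BUFFER_SECS, sample_rate=SAMPLE_RATE):
        self.max_samples = int(buffer_secs * sample_rate)
        self.buffer = deque(maxlen=self.max_samples)

    def push_chunk(self, chunk):
        self.buffer.extend(chunk)        # oldest samples fall off the left edge
        return list(self.buffer)         # window handed to infer_one_step()


buf = SlidingAudioBuffer()
chunk_samples = int(CHUNK_SECS * SAMPLE_RATE)    # 3840 samples per step
for step in range(10):
    # mark every sample with its step index so the window contents are visible
    window = buf.push_chunk([float(step)] * chunk_samples)
# After 10 chunks (2.4 s of audio) only the last 1.68 s (26880 samples) remain.
```

Each inference step therefore re-encodes a window that overlaps the previous one, which is why the perception encoder supports cache-aware streaming (`use_perception_cache`).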
The pipeline maintains two separate state objects per stream:

- `StreamingDecodeState` (model level) -- Lives in `S2SContextManager` slots. Contains the LLM KV cache, TTS KV cache, perception cache, codec cache, token workspaces (`gen_text`, `gen_asr_text`), and `frame_idx`. Created by the wrapper, mutated in-place by `infer_one_step()`, destroyed at end-of-stream.
- `S2SStreamingState` (pipeline level) -- Lives in the pipeline's `_state_pool`. Accumulates generated audio chunks, text strings, and word timings across steps. Kept alive until `close_session()` so the final `PipelineOutput` can be assembled.
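A minimal sketch of the pipeline-level accumulator (hypothetical fields modeled on the description above, not the actual `S2SStreamingState` definition):

```python
from dataclasses import dataclass, field


@dataclass
class ToyS2SStreamingState:
    """Pipeline-level state: accumulates outputs across generate_step() calls."""
    audio_chunks: list = field(default_factory=list)  # decoded waveform pieces
    text_parts: list = field(default_factory=list)    # generated agent text
    asr_parts: list = field(default_factory=list)     # recognized user text

    def append_step(self, audio, text, asr_text):
        # One call per generate_step(); empty strings are skipped.
        self.audio_chunks.append(audio)
        if text:
            self.text_parts.append(text)
        if asr_text:
            self.asr_parts.append(asr_text)

    def finalize(self):
        # Invoked at close_session() time to assemble PipelineOutput fields.
        return "".join(self.text_parts), "".join(self.asr_parts)


state = ToyS2SStreamingState()
state.append_step([0.0] * 3840, "Hel", "hi")
state.append_step([0.0] * 3840, "lo.", "")
text, asr = state.finalize()   # -> ("Hello.", "hi")
```

The model-level `StreamingDecodeState`, by contrast, holds only the caches and workspaces needed for the *next* step, so it can be freed as soon as the stream ends.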
The streaming inference configuration is defined in
examples/speechlm2/nemo_inference_pipelines/conf/s2s_streaming.yaml.
Key configuration groups:
`s2s` options:

| Parameter | Default | Description |
|---|---|---|
| `model_path` | (required) | Path to the NemotronVoiceChat HuggingFace checkpoint. |
| `engine_type` | (required) | `native`, `vllm_llm`, `vllm_eartts`, or `vllm_llm_vllm_eartts`. |
| `speaker_name` | `null` | Registered speaker name (must match a speaker in the checkpoint). |
| `system_prompt` | (required) | Text injected into the LLM KV cache before audio streaming begins. |
| `compute_dtype` | `bfloat16` | Precision for LLM/embedding layers. |
| `use_perception_cache` | `true` | Cache-aware streaming for the perception encoder. |
| `use_llm_cache` | `true` | Use KV cache for incremental LLM decoding. |
| `top_p` | `0.5` | Top-p sampling threshold. |
| `temperature` | `0.3` | Sampling temperature. |
| `deterministic` | `false` | Force deterministic mode (native engine only). |
| `profile_timing` | `false` | Insert `torch.cuda.synchronize()` around each stage for accurate per-stage timing. Disabled by default to avoid GPU stalls. |
`streaming` options:

| Parameter | Default | Description |
|---|---|---|
| `chunk_size_in_secs` | (required) | Audio processed per inference step. Must be a multiple of 0.08 s. |
| `buffer_size_in_secs` | (required) | Sliding-window size passed to the perception encoder. |
| `batch_size` | `1` | Number of concurrent streams (currently only 1 is supported). |
| `max_len` | `8192` | Maximum number of frames per stream. |
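Since `chunk_size_in_secs` must be a multiple of 0.08 s (one encoder frame, per the table above), a config sanity check might look like this sketch (the helper name is hypothetical; a floating-point tolerance is needed because values like 0.24 are not exact in binary):

```python
FRAME_SECS = 0.08   # encoder frame duration implied by the constraint above


def is_valid_chunk_size(chunk_secs: float, frame_secs: float = FRAME_SECS) -> bool:
    """True if chunk_secs is (within tolerance) a whole number of frames."""
    n_frames = round(chunk_secs / frame_secs)
    return n_frames >= 1 and abs(n_frames * frame_secs - chunk_secs) < 1e-9


assert is_valid_chunk_size(0.24)       # 3 frames per step, as in the example
assert not is_valid_chunk_size(0.25)   # not frame-aligned
```

A naive `chunk_secs % frame_secs == 0` check would wrongly reject 0.24 due to floating-point rounding, hence the round-and-compare form.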
At most one of the following padding options may be set:

| Parameter | Default | Description |
|---|---|---|
| `pad_audio_to_sec` | `null` | Pad each input to a fixed duration. |
| `pad_silence_ratio` | `null` | Append silence equal to this fraction of the original duration. |
| `pad_audio_by_sec` | `null` | Append a fixed number of extra seconds of silence. |
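The "at most one" constraint can be enforced with a simple check (a sketch only; the pipeline's actual validation may differ):

```python
def validate_padding_options(pad_audio_to_sec=None,
                             pad_silence_ratio=None,
                             pad_audio_by_sec=None):
    """Raise if more than one padding option is set (all default to null/None)."""
    set_options = [
        name for name, value in [
            ("pad_audio_to_sec", pad_audio_to_sec),
            ("pad_silence_ratio", pad_silence_ratio),
            ("pad_audio_by_sec", pad_audio_by_sec),
        ] if value is not None
    ]
    if len(set_options) > 1:
        raise ValueError(f"At most one padding option may be set, got: {set_options}")


validate_padding_options(pad_audio_by_sec=2.0)   # OK: a single option is fine
```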
The same `generate_step()` method used by `run()` can be called directly from a custom server. The zero-length `Frame` protocol handles prefill:

```python
# 1. Prefill the system prompt (zero-length frame)
prefill_frame = Frame(
    samples=torch.empty(0),
    stream_id=stream_id,
    is_first=True,
    is_last=False,
    options=S2SRequestOptions(system_prompt=prompt),
)
pipeline.generate_step([prefill_frame])

# 2. Stream audio chunks
num_chunks = len(audio_chunks)
for i, chunk in enumerate(audio_chunks):
    frame = Frame(
        samples=chunk,
        stream_id=stream_id,
        is_first=(i == 0),
        is_last=(i == num_chunks - 1),
    )
    pipeline.generate_step([frame])
```

The pipeline currently supports `batch_size=1` (one stream at a time).
```
nemo/collections/speechlm2/inference/
├── __init__.py                             # Public exports
├── factory/
│   └── s2s_pipeline_builder.py             # S2SPipelineBuilder
├── pipelines/
│   ├── s2s_pipeline_interface.py           # Base: _state_pool, sessions
│   └── streaming_s2s_pipeline.py           # StreamingS2SPipeline
├── model_wrappers/
│   ├── decode_state.py                     # StreamingDecodeState, InferenceStepResult
│   ├── nemotron_voicechat_inference_wrapper.py
│   ├── model_factory.py                    # Native / vLLM model interfaces
│   └── perception_cache.py                 # Perception cache + CUDA graphs
├── streaming/
│   ├── framing/
│   │   └── s2s_request_options.py          # S2SRequestOptions
│   └── state/
│       ├── s2s_state.py                    # S2SStreamingState
│       └── s2s_context_manager.py          # Slot-based decode-state lifecycle
├── utils/
│   ├── pipeline_utils.py                   # PipelineOutput, text helpers
│   └── audio_data.py                       # Manifest / folder loading
└── vllm/                                   # Optional vLLM engine backend
```