The speechlm2 collection provides a streaming inference pipeline for
NemotronVoiceChat that processes audio in real time, chunk by chunk, and
produces both text and speech output incrementally. The pipeline follows the
same methodology as the NeMo ASR Inference Pipelines (see
nemo.collections.asr.inference).
The streaming inference stack has four layers:

```
Entry Script    s2s_streaming_infer.py (Hydra)
      │
      ▼
Pipeline        StreamingS2SPipeline
      │           - audio buffering
      │           - state management
      │           - file I/O
      ▼
Model Wrapper   NemotronVoicechatInferenceWrapper
      │           - infer_one_step()
      │           - perception / LLM / TTS / codec
      ▼
Model           NemotronVoiceChat
                  - DuplexSTTModel + DuplexEARTTS
```
The simplest way to run streaming inference is with the provided Hydra script:

```shell
python examples/speechlm2/nemo_inference_pipelines/s2s_streaming_infer.py \
    --config-path=examples/speechlm2/nemo_inference_pipelines/conf \
    --config-name=s2s_streaming \
    audio_file=/path/to/audio_or_directory_or_manifest.json \
    output_dir=./generated \
    s2s.model_path=/path/to/checkpoint \
    s2s.speaker_name="<speaker>" \
    s2s.engine_type=native \
    s2s.system_prompt="You are a helpful assistant." \
    streaming.chunk_size_in_secs=0.24 \
    streaming.buffer_size_in_secs=1.68
```

This will:
- Load the NemotronVoiceChat checkpoint.
- Stream each audio file through the pipeline in chunks.
- Save generated `.wav`, stereo (input + output) `.wav`, and `.txt` files under `output_dir`.
The pipeline can also be built and run programmatically:

```python
from nemo.collections.speechlm2.inference import S2SPipelineBuilder

pipeline = S2SPipelineBuilder.build_pipeline(cfg)
output = pipeline.run(audio_filepaths, options=options)

# output.texts            -- generated agent text per file
# output.asr_texts        -- recognized user text per file
# output.audio_filepaths  -- paths to generated .wav files
```

Like the ASR pipeline's `BasePipeline.run()`, the S2S pipeline iterates over chunks and calls a single step method:
```python
pipeline.open_session()
for frames in streamer:
    pipeline.generate_step(frames)
pipeline.close_session()
return PipelineOutput(...)
```

`generate_step()` is the unified entry point used by both the batch `run()` method and server deployments.
Each call to `generate_step(frames)` performs:

- Prefill detection -- A zero-length first frame with a system prompt triggers `prefill_for_new_stream()`, which initializes the LLM KV cache with the system prompt and the TTS speaker embedding.
- Audio buffering -- `BatchedAudioBufferer` (reused from ASR infrastructure) maintains a sliding window of `buffer_size_in_secs`.
- Model inference via `infer_one_step(audio_buffer, state)`:
  - Perception -- The audio buffer is encoded by the streaming FastConformer encoder into frame embeddings.
  - Per-frame LLM loop -- For each of the `num_frames_per_chunk` frames, the pipeline builds an input embedding (user audio + previous-step text/ASR tokens), runs it through the LLM, and obtains predicted text and ASR tokens.
  - Per-frame TTS -- Each predicted text token is fed into the EarTTS model to produce audio codec codes.
  - Codec decode -- The accumulated codes are decoded into a waveform.
  - State updates -- The context manager advances `frame_idx` and updates the subword mask.
- Output accumulation -- Decoded audio and text are appended to the per-stream `S2SStreamingState`.
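The audio-buffering step above can be sketched with a toy sliding window (a hypothetical stand-in for `BatchedAudioBufferer`, using the chunk and buffer sizes from the example command; the real class is batched and tensor-based):

```python
from collections import deque

SAMPLE_RATE = 16_000                     # assumed model sample rate
CHUNK_SECS, BUFFER_SECS = 0.24, 1.68     # values from the example command above


class SlidingAudioBuffer:
    """Toy stand-in for BatchedAudioBufferer: keeps only the most recent
    buffer_size_in_secs of audio as new chunks arrive each step."""

    def __init__(self, buffer_secs=BUFFER_SECS, sample_rate=SAMPLE_RATE):
        self.max_samples = int(buffer_secs * sample_rate)
        self.buffer = deque(maxlen=self.max_samples)

    def push_chunk(self, chunk):
        self.buffer.extend(chunk)        # oldest samples fall off the left edge
        return list(self.buffer)         # window handed to infer_one_step()


buf = SlidingAudioBuffer()
chunk_samples = int(CHUNK_SECS * SAMPLE_RATE)    # 3840 samples per step
for step in range(10):
    # mark every sample with its step index so the window contents are visible
    window = buf.push_chunk([float(step)] * chunk_samples)
# After 10 chunks (2.4 s of audio) only the last 1.68 s (26880 samples) remain.
```

Each inference step therefore re-encodes a window that overlaps the previous one, which is why the perception encoder supports cache-aware streaming (`use_perception_cache`).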
The pipeline maintains two separate state objects per stream:

- `StreamingDecodeState` (model level) -- Lives in `S2SContextManager` slots. Contains the LLM KV cache, TTS KV cache, perception cache, codec cache, token workspaces (`gen_text`, `gen_asr_text`), and `frame_idx`. Created by the wrapper, mutated in-place by `infer_one_step()`, destroyed at end-of-stream.
- `S2SStreamingState` (pipeline level) -- Lives in the pipeline's `_state_pool`. Accumulates generated audio chunks, text strings, and word timings across steps. Kept alive until `close_session()` so the final `PipelineOutput` can be assembled.
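A minimal sketch of the pipeline-level accumulator (hypothetical fields modeled on the description above, not the actual `S2SStreamingState` definition):

```python
from dataclasses import dataclass, field


@dataclass
class ToyS2SStreamingState:
    """Pipeline-level state: accumulates outputs across generate_step() calls."""
    audio_chunks: list = field(default_factory=list)  # decoded waveform pieces
    text_parts: list = field(default_factory=list)    # generated agent text
    asr_parts: list = field(default_factory=list)     # recognized user text

    def append_step(self, audio, text, asr_text):
        # One call per generate_step(); empty strings are skipped.
        self.audio_chunks.append(audio)
        if text:
            self.text_parts.append(text)
        if asr_text:
            self.asr_parts.append(asr_text)

    def finalize(self):
        # Invoked at close_session() time to assemble PipelineOutput fields.
        return "".join(self.text_parts), "".join(self.asr_parts)


state = ToyS2SStreamingState()
state.append_step([0.0] * 3840, "Hel", "hi")
state.append_step([0.0] * 3840, "lo.", "")
text, asr = state.finalize()   # -> ("Hello.", "hi")
```

The model-level `StreamingDecodeState`, by contrast, holds only the caches and workspaces needed for the *next* step, so it can be freed as soon as the stream ends.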
The streaming inference configuration is defined in
examples/speechlm2/nemo_inference_pipelines/conf/s2s_streaming.yaml.
Key configuration groups:
`s2s` options:

| Parameter | Default | Description |
|---|---|---|
| `model_path` | (required) | Path to the NemotronVoiceChat HuggingFace checkpoint. |
| `engine_type` | (required) | `native`, `vllm_llm`, `vllm_eartts`, or `vllm_llm_vllm_eartts`. |
| `speaker_name` | `null` | Registered speaker name (must match a speaker in the checkpoint). |
| `system_prompt` | (required) | Text injected into the LLM KV cache before audio streaming begins. |
| `compute_dtype` | `bfloat16` | Precision for LLM/embedding layers. |
| `use_perception_cache` | `true` | Cache-aware streaming for the perception encoder. |
| `use_llm_cache` | `true` | Use KV cache for incremental LLM decoding. |
| `top_p` | `0.5` | Top-p sampling threshold. |
| `temperature` | `0.3` | Sampling temperature. |
| `deterministic` | `false` | Force deterministic mode (native engine only). |
| `profile_timing` | `false` | Insert `torch.cuda.synchronize()` around each stage for accurate per-stage timing. Disabled by default to avoid GPU stalls. |
`streaming` options:

| Parameter | Default | Description |
|---|---|---|
| `chunk_size_in_secs` | (required) | Audio processed per inference step. Must be a multiple of 0.08 s. |
| `buffer_size_in_secs` | (required) | Sliding-window size passed to the perception encoder. |
| `batch_size` | `1` | Number of concurrent streams (currently only 1 is supported). |
| `max_len` | `8192` | Maximum number of frames per stream. |
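Since `chunk_size_in_secs` must be a multiple of 0.08 s (one encoder frame, per the table above), a config sanity check might look like this sketch (the helper name is hypothetical; a floating-point tolerance is needed because values like 0.24 are not exact in binary):

```python
FRAME_SECS = 0.08   # encoder frame duration implied by the constraint above


def is_valid_chunk_size(chunk_secs: float, frame_secs: float = FRAME_SECS) -> bool:
    """True if chunk_secs is (within tolerance) a whole number of frames."""
    n_frames = round(chunk_secs / frame_secs)
    return n_frames >= 1 and abs(n_frames * frame_secs - chunk_secs) < 1e-9


assert is_valid_chunk_size(0.24)       # 3 frames per step, as in the example
assert not is_valid_chunk_size(0.25)   # not frame-aligned
```

A naive `chunk_secs % frame_secs == 0` check would wrongly reject 0.24 due to floating-point rounding, hence the round-and-compare form.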
At most one of the following padding options may be set:

| Parameter | Default | Description |
|---|---|---|
| `pad_audio_to_sec` | `null` | Pad each input to a fixed duration. |
| `pad_silence_ratio` | `null` | Append silence equal to this fraction of the original duration. |
| `pad_audio_by_sec` | `null` | Append a fixed number of extra seconds of silence. |
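The "at most one" constraint can be enforced with a simple check (a sketch only; the pipeline's actual validation may differ):

```python
def validate_padding_options(pad_audio_to_sec=None,
                             pad_silence_ratio=None,
                             pad_audio_by_sec=None):
    """Raise if more than one padding option is set (all default to null/None)."""
    set_options = [
        name for name, value in [
            ("pad_audio_to_sec", pad_audio_to_sec),
            ("pad_silence_ratio", pad_silence_ratio),
            ("pad_audio_by_sec", pad_audio_by_sec),
        ] if value is not None
    ]
    if len(set_options) > 1:
        raise ValueError(f"At most one padding option may be set, got: {set_options}")


validate_padding_options(pad_audio_by_sec=2.0)   # OK: a single option is fine
```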
The same `generate_step()` method used by `run()` can be called directly from a custom server. The zero-length `Frame` protocol handles prefill:

```python
# 1. Prefill the system prompt (zero-length frame)
prefill_frame = Frame(
    samples=torch.empty(0),
    stream_id=stream_id,
    is_first=True,
    is_last=False,
    options=S2SRequestOptions(system_prompt=prompt),
)
pipeline.generate_step([prefill_frame])

# 2. Stream audio chunks
num_chunks = len(audio_chunks)
for i, chunk in enumerate(audio_chunks):
    frame = Frame(
        samples=chunk,
        stream_id=stream_id,
        is_first=(i == 0),
        is_last=(i == num_chunks - 1),
    )
    pipeline.generate_step([frame])
```

The pipeline currently supports `batch_size=1` (one stream at a time).
```
nemo/collections/speechlm2/inference/
├── __init__.py                             # Public exports
├── factory/
│   └── s2s_pipeline_builder.py             # S2SPipelineBuilder
├── pipelines/
│   ├── s2s_pipeline_interface.py           # Base: _state_pool, sessions
│   └── streaming_s2s_pipeline.py           # StreamingS2SPipeline
├── model_wrappers/
│   ├── decode_state.py                     # StreamingDecodeState, InferenceStepResult
│   ├── nemotron_voicechat_inference_wrapper.py
│   ├── model_factory.py                    # Native / vLLM model interfaces
│   └── perception_cache.py                 # Perception cache + CUDA graphs
├── streaming/
│   ├── framing/
│   │   └── s2s_request_options.py          # S2SRequestOptions
│   └── state/
│       ├── s2s_state.py                    # S2SStreamingState
│       └── s2s_context_manager.py          # Slot-based decode-state lifecycle
├── utils/
│   ├── pipeline_utils.py                   # PipelineOutput, text helpers
│   └── audio_data.py                       # Manifest / folder loading
└── vllm/                                   # Optional vLLM engine backend
```