
Streaming Inference

The speechlm2 collection provides a streaming inference pipeline for NemotronVoiceChat that processes audio in real time, chunk by chunk, and produces both text and speech output incrementally. The pipeline follows the same methodology as the NeMo ASR Inference Pipelines (see nemo.collections.asr.inference).

Overview

The streaming inference stack has four layers:

Entry Script          s2s_streaming_infer.py (Hydra)
     │
     ▼
Pipeline              StreamingS2SPipeline
     │                  - audio buffering
     │                  - state management
     │                  - file I/O
     ▼
Model Wrapper         NemotronVoicechatInferenceWrapper
     │                  - infer_one_step()
     │                  - perception / LLM / TTS / codec
     ▼
Model                 NemotronVoiceChat
                        - DuplexSTTModel + DuplexEARTTS

Quick Start

Batch Inference from a Script

The simplest way to run streaming inference is with the provided Hydra script:

python examples/speechlm2/nemo_inference_pipelines/s2s_streaming_infer.py \
    --config-path=examples/speechlm2/nemo_inference_pipelines/conf \
    --config-name=s2s_streaming \
    audio_file=/path/to/audio_or_directory_or_manifest.json \
    output_dir=./generated \
    s2s.model_path=/path/to/checkpoint \
    s2s.speaker_name="<speaker>" \
    s2s.engine_type=native \
    s2s.system_prompt="You are a helpful assistant." \
    streaming.chunk_size_in_secs=0.24 \
    streaming.buffer_size_in_secs=1.68

This will:

  1. Load the NemotronVoiceChat checkpoint.
  2. Stream each audio file through the pipeline in chunks.
  3. Save the generated .wav files, stereo (input+output) mixes, and .txt transcripts under output_dir.
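
When audio_file points at a manifest, the expected format is JSON Lines, one object per utterance. A minimal sketch of building one (the field names follow the common NeMo manifest convention and are an assumption here, not read from this pipeline):

```python
import json

# Hypothetical manifest entries; one JSON object per line, per audio file.
entries = [
    {"audio_filepath": "/data/utt1.wav", "duration": 4.2},
    {"audio_filepath": "/data/utt2.wav", "duration": 7.9},
]
manifest = "\n".join(json.dumps(e) for e in entries)
```

Writing `manifest` to a `.json` file and passing its path as `audio_file` would then stream each listed file in turn.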

Programmatic Usage

from nemo.collections.speechlm2.inference import S2SPipelineBuilder

pipeline = S2SPipelineBuilder.build_pipeline(cfg)
output = pipeline.run(audio_filepaths, options=options)

# output.texts            -- generated agent text per file
# output.asr_texts        -- recognized user text per file
# output.audio_filepaths  -- paths to generated .wav files
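
The cfg passed to build_pipeline() mirrors s2s_streaming.yaml. A plain-dict sketch of the relevant keys (in practice the entry script assembles this with Hydra, typically as an OmegaConf config; only key names shown in this document are used):

```python
# Minimal config sketch mirroring conf/s2s_streaming.yaml; illustrative only.
cfg = {
    "s2s": {
        "model_path": "/path/to/checkpoint",
        "engine_type": "native",
        "speaker_name": None,
        "system_prompt": "You are a helpful assistant.",
    },
    "streaming": {
        "chunk_size_in_secs": 0.24,
        "buffer_size_in_secs": 1.68,
        "batch_size": 1,
    },
    "output_dir": "./generated",
}
```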

Architecture

The Core Loop

Like the ASR pipeline's BasePipeline.run(), the S2S pipeline iterates over chunks and calls a single step method:

pipeline.open_session()
for frames in streamer:
    pipeline.generate_step(frames)
pipeline.close_session()
return PipelineOutput(...)

generate_step() is the unified entry point used by both the batch run() method and server deployments.
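
To make the chunk iteration concrete, here is an illustrative chunker in plain Python (function name and sizes are assumptions; the real streamer wraps audio I/O and emits Frame objects rather than raw slices):

```python
# Cut a waveform into chunk_size_in_secs pieces, as the streamer loop implies.
def chunk_waveform(samples, sample_rate=16000, chunk_size_in_secs=0.24):
    step = round(sample_rate * chunk_size_in_secs)  # 3840 samples at 16 kHz
    for start in range(0, len(samples), step):
        yield samples[start:start + step]

chunks = list(chunk_waveform([0.0] * 16000))  # 1 s of silence
```

One second of 16 kHz audio yields four full 0.24 s chunks plus a short tail, each of which becomes one generate_step() call.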

What Happens Inside One Step

Each call to generate_step(frames) performs:

  1. Prefill detection -- A zero-length first frame with a system prompt triggers prefill_for_new_stream(), which initializes the LLM KV cache with the system prompt and the TTS speaker embedding.
  2. Audio buffering -- BatchedAudioBufferer (reused from ASR infrastructure) maintains a sliding window of buffer_size_in_secs.
  3. Model inference via infer_one_step(audio_buffer, state):
    1. Perception -- The audio buffer is encoded by the streaming FastConformer encoder into frame embeddings.
    2. Per-frame LLM loop -- For each of the num_frames_per_chunk frames, the pipeline builds an input embedding (user audio + previous-step text/ASR tokens), runs it through the LLM, and obtains predicted text and ASR tokens.
    3. Per-frame TTS -- Each predicted text token is fed into the EarTTS model to produce audio codec codes.
    4. Codec decode -- The accumulated codes are decoded into a waveform.
  4. State updates -- The context manager advances frame_idx and updates the subword mask.
  5. Output accumulation -- Decoded audio and text are appended to the per-stream S2SStreamingState.
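
Step 3.2's num_frames_per_chunk follows directly from the chunk size and the model's 0.08 s frame granularity (the same granularity that constrains chunk_size_in_secs in the streaming settings); a sketch of that arithmetic, with FRAME_SEC inferred rather than read from the config:

```python
FRAME_SEC = 0.08  # assumed model frame duration, implied by the 0.08 s constraint

def num_frames_per_chunk(chunk_size_in_secs: float) -> int:
    # Work in integer milliseconds to avoid float-division surprises.
    n, rem = divmod(round(chunk_size_in_secs * 1000), round(FRAME_SEC * 1000))
    if rem:
        raise ValueError("chunk_size_in_secs must be a multiple of 0.08 s")
    return n
```

With the default 0.24 s chunk, each step therefore runs three per-frame LLM/TTS iterations.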

Two Kinds of State

The pipeline maintains two separate state objects per stream:

StreamingDecodeState (model level)
Lives in S2SContextManager slots. Contains LLM KV cache, TTS KV cache, perception cache, codec cache, token workspaces (gen_text, gen_asr_text), and frame_idx. Created by the wrapper, mutated in-place by infer_one_step(), destroyed at end-of-stream.
S2SStreamingState (pipeline level)
Lives in the pipeline's _state_pool. Accumulates generated audio chunks, text strings, and word timings across steps. Kept alive until close_session() so the final PipelineOutput can be assembled.
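
The pipeline-level state can be pictured as a small accumulator; an illustrative sketch (field and method names are assumed from the description above, not the actual S2SStreamingState definition):

```python
from dataclasses import dataclass, field

@dataclass
class StreamStateSketch:
    # Accumulated across steps; kept alive until close_session().
    audio_chunks: list = field(default_factory=list)
    texts: list = field(default_factory=list)
    word_timings: list = field(default_factory=list)

    def accumulate(self, audio: list, text: str) -> None:
        # Called once per generate_step() with that step's decoded output.
        self.audio_chunks.append(audio)
        self.texts.append(text)

state = StreamStateSketch()
state.accumulate([0.0, 0.1], "hello")
```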

Configuration

The streaming inference configuration is defined in examples/speechlm2/nemo_inference_pipelines/conf/s2s_streaming.yaml.

Key configuration groups:

S2S Model Settings (s2s)

| Parameter | Default | Description |
| --- | --- | --- |
| `model_path` | (required) | Path to the NemotronVoiceChat HuggingFace checkpoint. |
| `engine_type` | (required) | `native`, `vllm_llm`, `vllm_eartts`, or `vllm_llm_vllm_eartts`. |
| `speaker_name` | `null` | Registered speaker name (must match a speaker in the checkpoint). |
| `system_prompt` | (required) | Text injected into the LLM KV cache before audio streaming begins. |
| `compute_dtype` | `bfloat16` | Precision for LLM/embedding layers. |
| `use_perception_cache` | `true` | Cache-aware streaming for the perception encoder. |
| `use_llm_cache` | `true` | Use KV cache for incremental LLM decoding. |
| `top_p` | `0.5` | Top-p sampling threshold. |
| `temperature` | `0.3` | Sampling temperature. |
| `deterministic` | `false` | Force deterministic mode (native engine only). |
| `profile_timing` | `false` | Insert `torch.cuda.synchronize()` around each stage for accurate per-stage timing. Disabled by default to avoid GPU stalls. |
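
To illustrate what top_p and temperature control, here is a minimal nucleus-filter sketch in plain Python (the real sampling happens inside the LLM engine; this is not NeMo code):

```python
import math

def top_p_candidates(logits, top_p=0.5, temperature=0.3):
    # Temperature-scale the logits, then softmax (numerically stabilized).
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    # Rank token ids by probability; keep the smallest set whose mass >= top_p.
    ranked = sorted(((e / z, i) for i, e in enumerate(exps)), reverse=True)
    kept, cum = [], 0.0
    for p, i in ranked:
        kept.append(i)
        cum += p
        if cum >= top_p:
            break
    return kept  # token ids eligible for sampling
```

A low temperature sharpens the distribution, so with the defaults a single dominant token often exhausts the 0.5 probability mass by itself.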

Streaming Settings (streaming)

| Parameter | Default | Description |
| --- | --- | --- |
| `chunk_size_in_secs` | (required) | Audio processed per inference step. Must be a multiple of 0.08 s. |
| `buffer_size_in_secs` | (required) | Sliding-window size passed to the perception encoder. |
| `batch_size` | `1` | Number of concurrent streams (currently only 1 is supported). |
| `max_len` | `8192` | Maximum number of frames per stream. |
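
The relationship between chunk_size_in_secs and buffer_size_in_secs can be sketched as a simple sliding window (assumed behavior of the bufferer, not its actual implementation):

```python
# Append each new chunk, then keep only the newest buffer_size_in_secs of audio.
def update_buffer(buffer, new_chunk, sample_rate=16000, buffer_size_in_secs=1.68):
    cap = round(sample_rate * buffer_size_in_secs)  # 26880 samples at 16 kHz
    buffer = buffer + new_chunk
    return buffer[-cap:]

buf = []
for _ in range(8):                      # eight 0.24 s chunks = 1.92 s of audio
    buf = update_buffer(buf, [0.0] * 3840)
```

After the window fills, every step the encoder sees the latest 1.68 s of audio, of which only the newest 0.24 s chunk is new.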

Padding Settings (top-level)

At most one of these may be set:

| Parameter | Default | Description |
| --- | --- | --- |
| `pad_audio_to_sec` | `null` | Pad each input to a fixed duration. |
| `pad_silence_ratio` | `null` | Append silence equal to this fraction of the original duration. |
| `pad_audio_by_sec` | `null` | Append a fixed number of extra seconds of silence. |
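
The mutual exclusion and the assumed append-silence semantics can be sketched as follows (behavior inferred from the parameter names and descriptions, not from the pipeline source):

```python
def pad_audio(samples, sample_rate=16000, pad_audio_to_sec=None,
              pad_silence_ratio=None, pad_audio_by_sec=None):
    opts = [o for o in (pad_audio_to_sec, pad_silence_ratio, pad_audio_by_sec)
            if o is not None]
    if len(opts) > 1:
        raise ValueError("at most one padding option may be set")
    n = len(samples)
    if pad_audio_to_sec is not None:      # pad up to a fixed total duration
        return samples + [0.0] * max(0, round(sample_rate * pad_audio_to_sec) - n)
    if pad_silence_ratio is not None:     # append a fraction of the original length
        return samples + [0.0] * round(n * pad_silence_ratio)
    if pad_audio_by_sec is not None:      # append a fixed number of seconds
        return samples + [0.0] * round(sample_rate * pad_audio_by_sec)
    return samples
```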

Server Integration

The same generate_step() method used by run() can be called directly from a custom server. The zero-length Frame protocol handles prefill:

import torch  # Frame and S2SRequestOptions come from the pipeline's streaming module

# 1. Prefill system prompt (zero-length frame)
prefill_frame = Frame(
    samples=torch.empty(0),
    stream_id=stream_id,
    is_first=True, is_last=False,
    options=S2SRequestOptions(system_prompt=prompt),
)
pipeline.generate_step([prefill_frame])

# 2. Stream audio chunks (materialize the source so the last chunk is known)
chunks = list(audio_source)
for i, chunk in enumerate(chunks):
    frame = Frame(
        samples=chunk,
        stream_id=stream_id,
        is_first=(i == 0), is_last=(i == len(chunks) - 1),
    )
    pipeline.generate_step([frame])

Batch Size

The pipeline currently supports batch_size=1 (one stream at a time).

File Layout

nemo/collections/speechlm2/inference/
├── __init__.py                          # Public exports
├── factory/
│   └── s2s_pipeline_builder.py          # S2SPipelineBuilder
├── pipelines/
│   ├── s2s_pipeline_interface.py        # Base: _state_pool, sessions
│   └── streaming_s2s_pipeline.py        # StreamingS2SPipeline
├── model_wrappers/
│   ├── decode_state.py                  # StreamingDecodeState, InferenceStepResult
│   ├── nemotron_voicechat_inference_wrapper.py
│   ├── model_factory.py                 # Native / vLLM model interfaces
│   └── perception_cache.py              # Perception cache + CUDA graphs
├── streaming/
│   ├── framing/
│   │   └── s2s_request_options.py       # S2SRequestOptions
│   └── state/
│       ├── s2s_state.py                 # S2SStreamingState
│       └── s2s_context_manager.py       # Slot-based decode-state lifecycle
├── utils/
│   ├── pipeline_utils.py                # PipelineOutput, text helpers
│   └── audio_data.py                    # Manifest / folder loading
└── vllm/                                # Optional vLLM engine backend