Commit c57902d

docs(realtime): document pipeline streaming + disable_thinking
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
1 parent 7cb2df9 commit c57902d

1 file changed

Lines changed: 35 additions & 0 deletions

docs/content/features/openai-realtime.md

@@ -31,6 +31,41 @@ This configuration links the following components:
Make sure all referenced models (`silero-vad-ggml`, `whisper-large-turbo`, `qwen3-4b`, `tts-1`) are also installed or defined in your LocalAI instance.

### Streaming the pipeline

By default, each stage runs to completion before the next begins: the whole utterance is transcribed, the full LLM reply is generated, then it is synthesized. Each stage can instead be streamed incrementally, which lowers the time-to-first-audio of a turn:

```yaml
name: gpt-realtime
pipeline:
  vad: silero-vad-ggml
  transcription: whisper-large-turbo
  llm: qwen3-4b
  tts: tts-1
  streaming:
    llm: true            # stream LLM tokens as transcript deltas
    tts: true            # emit audio deltas per synthesized chunk
    transcription: true  # stream transcript text deltas of the user's speech
```

- **streaming.tts**: emit a `response.output_audio.delta` per audio chunk the TTS backend produces, instead of one delta for the whole utterance.
- **streaming.transcription**: stream `conversation.item.input_audio_transcription.delta` events as the transcript is produced (requires a transcription backend that supports streaming).
- **streaming.llm**: stream the LLM reply token-by-token as `response.output_audio_transcript.delta` events and, when `streaming.tts` is also enabled, synthesize each completed sentence as soon as it is ready, so that generation, synthesis and playback overlap. Streaming is used only for turns that cannot produce a tool call; turns with tools fall back to the buffered path so partial tool-call output is never spoken.

All streaming flags are off by default, so existing pipelines are unaffected.
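
From the client's side, enabling these flags mainly changes how many delta events arrive per turn. The following is a minimal Python sketch of accumulating the three delta event types named above. It assumes each event is a JSON object whose payload sits in a `delta` field and that audio deltas are base64-encoded, as in the OpenAI Realtime event schema; the `TurnBuffer` class and the sample events are hypothetical and were not captured from a LocalAI server.

```python
import base64
import json
from dataclasses import dataclass, field

# Event type names as listed in the streaming flags above. The "delta" payload
# field is an assumption borrowed from the OpenAI Realtime event schema.
USER_TRANSCRIPT = "conversation.item.input_audio_transcription.delta"
REPLY_TRANSCRIPT = "response.output_audio_transcript.delta"
REPLY_AUDIO = "response.output_audio.delta"


@dataclass
class TurnBuffer:
    """Hypothetical helper that accumulates the incremental pieces of one turn."""

    user_text: str = ""
    reply_text: str = ""
    reply_audio: bytearray = field(default_factory=bytearray)

    def feed(self, raw_event: str) -> None:
        """Apply one JSON-encoded server event; other event types are ignored."""
        event = json.loads(raw_event)
        kind = event.get("type")
        if kind == USER_TRANSCRIPT:
            self.user_text += event["delta"]
        elif kind == REPLY_TRANSCRIPT:
            self.reply_text += event["delta"]
        elif kind == REPLY_AUDIO:
            # Assumed to be a base64-encoded audio chunk.
            self.reply_audio += base64.b64decode(event["delta"])


def demo() -> TurnBuffer:
    """Feed a hand-written event sequence through the buffer."""
    events = [
        {"type": USER_TRANSCRIPT, "delta": "What time "},
        {"type": USER_TRANSCRIPT, "delta": "is it?"},
        {"type": REPLY_TRANSCRIPT, "delta": "It is "},
        {"type": REPLY_AUDIO, "delta": base64.b64encode(b"\x00\x01").decode()},
        {"type": REPLY_TRANSCRIPT, "delta": "noon."},
        {"type": REPLY_AUDIO, "delta": base64.b64encode(b"\x02\x03").decode()},
        {"type": "example.unhandled"},  # made-up type, shows that unknown events are skipped
    ]
    buffer = TurnBuffer()
    for event in events:
        buffer.feed(json.dumps(event))
    return buffer
```

Because the handler only appends, the same code should work whether a stage sends one large delta (flag off) or many small ones (flag on); a real client would start playback as soon as the first audio chunk is decoded rather than waiting for the turn to end.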

### Disabling thinking

For reasoning models, you can force the pipeline LLM's thinking off without editing the LLM model config:

```yaml
pipeline:
  llm: qwen3-4b
  disable_thinking: true  # maps to enable_thinking=false for the realtime LLM
```

This is applied only to the realtime session's copy of the LLM config, so it does not affect other users of the same model. Leave it unset to use the LLM model config's own reasoning settings.
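
The copy-then-override behaviour can be illustrated with a short Python sketch. This is not LocalAI's implementation: the function name `session_llm_config`, the dictionary-shaped config and the top-level `enable_thinking` key are all assumptions made for illustration, and only an explicit `True` is treated as an override.

```python
import copy
from typing import Optional


def session_llm_config(model_config: dict, disable_thinking: Optional[bool]) -> dict:
    """Return a per-session copy of the LLM config (hypothetical helper).

    The shared model_config is never mutated, so other users of the same
    model keep their own reasoning settings.
    """
    session_copy = copy.deepcopy(model_config)
    if disable_thinking:
        # Mirrors "disable_thinking: true maps to enable_thinking=false".
        session_copy["enable_thinking"] = False
    # When the option is unset, the copy keeps the model config's own value.
    return session_copy


# Illustrative shared config; the key names are assumptions.
shared = {"name": "qwen3-4b", "enable_thinking": True}
realtime = session_llm_config(shared, disable_thinking=True)
```

Here `realtime["enable_thinking"]` is `False` while `shared["enable_thinking"]` stays `True`, which is the isolation property the paragraph above describes.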

## Transports

The Realtime API supports two transports: **WebSocket** and **WebRTC**.
