Draft
24 commits
397a2dc
Add WebRTC VAD realtime prototype
CTKnight Apr 12, 2026
1ab2dfc
Add realtime WebRTC mock demo prototype
CTKnight Apr 12, 2026
576a3b1
Fix realtime audio ingest and add diagnostics
CTKnight Apr 13, 2026
fc74db9
Remove temporary audio dump checkpoints
CTKnight Apr 13, 2026
783a40e
Trim frontend audio debug UI
CTKnight Apr 13, 2026
4876745
Add auto VAD and push-to-talk input modes
CTKnight Apr 13, 2026
8c02040
Add server-side barge-in for realtime audio
CTKnight Apr 13, 2026
25d600b
Add realtime text turns and conversation transcript
CTKnight Apr 13, 2026
cffc114
use existing script to run realtime demo
CTKnight Apr 13, 2026
a7e4206
ice server info gathering
CTKnight Apr 13, 2026
e10cd59
public ip turn
CTKnight Apr 13, 2026
83a4704
websocket impl
CTKnight Apr 13, 2026
fadb5a3
ws launcher backend entry
CTKnight Apr 13, 2026
0b91b90
fix websocket realtime audio payload
CTKnight Apr 13, 2026
2c0d427
remove webrtc transport
CTKnight Apr 13, 2026
e15a200
fix realtime text delta normalization
CTKnight Apr 13, 2026
1f22ead
fix stream abort handling resulting in orphaned request leaks
rycerzes Apr 17, 2026
d41b7c8
Merge pull request #1 from rycerzes/webrtc-vad
CTKnight Apr 19, 2026
f00a783
Merge remote-tracking branch 'upstream/main' into prototype/webrtc-vad
CTKnight Apr 19, 2026
734f5d2
Merge remote-tracking branch 'fork/prototype/webrtc-vad' into prototy…
CTKnight Apr 19, 2026
b2cd6c0
Package realtime websocket deps in base install
CTKnight Apr 20, 2026
5c4d504
Replay captured audio in mock realtime backend
CTKnight Apr 20, 2026
4d5387e
Fix realtime speech branch completion
CTKnight Apr 20, 2026
7f00ed1
Revert "Fix realtime speech branch completion"
CTKnight Apr 20, 2026
13 changes: 13 additions & 0 deletions examples/run_qwen3_omni_speech_server.py
@@ -60,6 +60,12 @@ def parse_args() -> argparse.Namespace:
    parser.add_argument(
        "--relay-backend", type=str, default="shm", choices=["nixl", "shm"]
    )
    parser.add_argument(
        "--mem-fraction-static",
        type=float,
        default=0.7,
        help="Static memory fraction for SGLang-backed AR stages.",
    )

    # Server
    parser.add_argument("--host", type=str, default="0.0.0.0")
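
The new `--mem-fraction-static` flag is parsed as a plain `float`, so any number is accepted at the command line. As a hedged sketch only, the following shows how a range check could be attached through an argparse `type` callable. The `fraction` helper, the `build_parser` function, and the open interval (0, 1) are assumptions made for illustration and are not part of this PR.

```python
import argparse


def fraction(value: str) -> float:
    """Hypothetical argparse type: accept only floats strictly between 0 and 1."""
    parsed = float(value)  # a ValueError here is reported by argparse as a usage error
    if not 0.0 < parsed < 1.0:
        raise argparse.ArgumentTypeError(f"expected a value in (0, 1), got {value}")
    return parsed


def build_parser() -> argparse.ArgumentParser:
    """Build a parser holding only the flag discussed here (illustrative subset)."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--mem-fraction-static",
        type=fraction,
        default=0.7,
        help="Static memory fraction for SGLang-backed AR stages.",
    )
    return parser


args = build_parser().parse_args(["--mem-fraction-static", "0.6"])
print(args.mem_fraction_static)
```

With this variant, an out-of-range value such as `1.5` would make argparse exit with a usage error instead of reaching the pipeline config.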
@@ -91,6 +97,13 @@ async def main_async(args: argparse.Namespace) -> None:
        gpu_placement=gpu_placement,
    )

    server_args_overrides = {"mem_fraction_static": args.mem_fraction_static}
    for stage in config.stages:
        if stage.name in {"thinker", "talker_ar"}:
            stage.executor.args.setdefault("server_args_overrides", {}).update(
                server_args_overrides
            )

    runner = MultiProcessPipelineRunner(config)
    logger.info("Starting 9-stage speech pipeline (multiprocess)...")
    await runner.start(timeout=600)
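
The hunk above merges the override into only the `thinker` and `talker_ar` stages, using `setdefault(...).update(...)` so that any overrides already present are preserved. The sketch below reproduces that merge pattern in isolation. The `Executor` and `Stage` dataclasses, the `apply_overrides` helper, the `other_stage` name, and the `existing_key` entry are hypothetical stand-ins, not the project's real config types.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Set


@dataclass
class Executor:
    # Hypothetical stand-in for the real executor config object.
    args: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Stage:
    # Hypothetical stand-in for one pipeline stage config.
    name: str
    executor: Executor = field(default_factory=Executor)


def apply_overrides(stages: List[Stage], overrides: Dict[str, Any], targets: Set[str]) -> None:
    """Merge overrides into server_args_overrides for the targeted stages only."""
    for stage in stages:
        if stage.name in targets:
            # setdefault keeps an existing dict; update adds or replaces individual keys.
            stage.executor.args.setdefault("server_args_overrides", {}).update(overrides)


stages = [
    Stage("thinker", Executor({"server_args_overrides": {"existing_key": 1}})),
    Stage("talker_ar"),
    Stage("other_stage"),  # illustrative name for a stage that is not targeted
]
apply_overrides(stages, {"mem_fraction_static": 0.7}, {"thinker", "talker_ar"})
for stage in stages:
    print(stage.name, stage.executor.args)
```

In this sketch the first stage keeps `existing_key` alongside the new value, the second gains a fresh overrides dict, and the untargeted stage is left untouched.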
84 changes: 84 additions & 0 deletions playground/README.md
@@ -6,6 +6,7 @@ This directory contains multiple playground interfaces for SGLang-Omni.
|---|---|
| `web/` | Full-featured HTML/CSS/JS UI served directly by the sglang-omni server. Supports text, audio, image, video inputs and a built-in file browser. |
| `gradio/` | Lightweight Gradio app that connects to a running server via HTTP. Text chat with streaming, model selector, and generation parameter controls. |
| `realtime-ws/` | Standalone websocket realtime app with server-side VAD, text input, microphone streaming, and streamed assistant audio playback. |
| `tts/` | S2 Pro TTS Gradio app with shared controls for voice cloning plus separate streaming and non-streaming playback modes. |

## Web Playground
@@ -20,6 +21,88 @@ uv pip install -v -e .

Then open `http://localhost:8000` in your browser.

## Realtime WebSocket Playground

Install the project before launching:

```bash
uv pip install -v -e .
```

Launch the backend plus standalone frontend app with one command:

```bash
./playground/realtime-ws/start.sh [--mock] [realtime-options] [backend-options...]
```

Minimal usable commands:

```bash
# local smoke test
./playground/realtime-ws/start.sh --mock

# real model
./playground/realtime-ws/start.sh --model-path Qwen/Qwen3-Omni-30B-A3B-Instruct
```

In normal backend mode, pass the usual speech server flags such as `--model-path`:

```bash
./playground/realtime-ws/start.sh \
--model-path Qwen/Qwen3-Omni-30B-A3B-Instruct
```

Then open `http://localhost:7862`.

For a browser smoke test without loading any model, launch the mock realtime API:

```bash
./playground/realtime-ws/start.sh --mock
```

That path exercises:

- browser microphone capture over websocket PCM streaming
- server-side VAD turn detection
- automatic response start after speech stop
- streamed assistant audio playback in the browser
- text prompts over the same websocket session

The mock backend returns canned text plus playback of the captured client audio
(falling back to a synthetic tone when there is no input audio) instead of
calling the inference pipeline.
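
As a rough illustration of the fallback described above, the following sketch echoes captured audio when there is any and otherwise generates a short sine tone. The function names, the 24 kHz sample rate, and the 16-bit mono PCM format are assumptions for the example. The mock backend in this PR may be structured differently.

```python
import math
import struct

SAMPLE_RATE = 24_000  # assumed output rate in Hz; the README does not state one


def synthetic_tone(duration_s: float = 0.5, freq_hz: float = 440.0) -> bytes:
    """Return a sine tone encoded as little-endian 16-bit mono PCM."""
    n = int(SAMPLE_RATE * duration_s)
    samples = (
        int(0.3 * 32767 * math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE))
        for i in range(n)
    )
    return struct.pack(f"<{n}h", *samples)


def mock_reply_audio(captured_pcm: bytes) -> bytes:
    """Replay the captured client audio, or fall back to a tone if none exists."""
    return captured_pcm if captured_pcm else synthetic_tone()


print(len(mock_reply_audio(b"")), "bytes of fallback audio")
```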

### Remote browser over SSH port forwarding

Because the transport is plain HTTP + WebSocket, standard SSH forwarding is
enough for remote browser testing.

For example, start the launcher on the remote machine:

```bash
./playground/realtime-ws/start.sh --mock
```

From your local machine, forward the backend port (8000) and the frontend port (7862) to the remote host:

```bash
ssh -L 8000:localhost:8000 -L 7862:localhost:7862 user@host
```
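
Before opening the browser, it can help to confirm that both forwarded ports accept connections. This is a minimal sketch using only the standard library. The `port_is_open` helper is hypothetical, and the port numbers are taken from the SSH command above.

```python
import socket


def port_is_open(host: str, port: int, timeout_s: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    for name, port in (("backend", 8000), ("frontend", 7862)):
        state = "reachable" if port_is_open("localhost", port) else "not reachable"
        print(f"{name} port {port}: {state}")
```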

For the full launcher help, run:

```bash
./playground/realtime-ws/start.sh --help
```

The websocket playground:

- streams microphone PCM to the backend over `/v1/realtime/ws`
- runs server-side VAD to auto-trigger one inference turn per utterance
- supports manual push-to-talk and text prompts in the same session
- streams assistant audio back over the websocket and auto-plays it in the browser
- keeps the frontend separate from the inference API server
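
To make "one inference turn per utterance" concrete, here is a simplified turn detector. It is a sketch under stated assumptions: it uses a plain RMS energy threshold rather than the WebRTC VAD mentioned in the commit titles, and the `TurnDetector` class, the threshold, the hangover length, and the event names are all hypothetical.

```python
import struct
from dataclasses import dataclass
from typing import Optional


def frame_rms(frame: bytes) -> float:
    """Root-mean-square amplitude of a little-endian 16-bit mono PCM frame."""
    count = len(frame) // 2
    if count == 0:
        return 0.0
    samples = struct.unpack(f"<{count}h", frame[: count * 2])
    return (sum(s * s for s in samples) / count) ** 0.5


@dataclass
class TurnDetector:
    threshold: float = 500.0   # assumed RMS level separating speech from silence
    hangover_frames: int = 10  # assumed count of silent frames that ends an utterance
    _in_speech: bool = False
    _silent: int = 0

    def push(self, frame: bytes) -> Optional[str]:
        """Return "speech_start", "speech_stop", or None for this frame."""
        if frame_rms(frame) >= self.threshold:
            self._silent = 0
            if not self._in_speech:
                self._in_speech = True
                return "speech_start"
            return None
        if self._in_speech:
            self._silent += 1
            if self._silent >= self.hangover_frames:
                self._in_speech = False
                self._silent = 0
                return "speech_stop"  # a server would start one response here
        return None


loud = struct.pack("<160h", *([4000] * 160))
quiet = bytes(320)
detector = TurnDetector()
events = [e for e in (detector.push(f) for f in [quiet] * 3 + [loud] * 5 + [quiet] * 12) if e]
print(events)
```

The hangover counter prevents brief pauses inside a sentence from ending the turn early, which is why a single utterance yields exactly one stop event in this sketch.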

### Custom port

```bash
@@ -95,6 +178,7 @@ ssh -L 8000:localhost:8000 -L 7860:localhost:7860 user@host
| `/` | Web playground UI (index.html, app.js, styles.css) |
| `/v1/chat/completions` | Chat completions (text + audio, streaming) |
| `/v1/audio/speech` | Text-to-speech |
| `/v1/realtime/ws` | Realtime websocket session transport |
| `/v1/models` | List available models |
| `/v1/fs/list` | Browse server filesystem |
| `/v1/fs/file` | Download a server file |