[V1]: Isolate IPC endpoints per server run #402

Open
Ratish1 wants to merge 4 commits into sgl-project:main from Ratish1:fix/omni-v1-ipc-runtime-dir

Ratish1 (Collaborator) commented May 6, 2026

Motivation

Omni V1 currently derives default IPC endpoints from endpoints.base_path / config.name. When two V1 servers are launched with the same model name and default IPC settings, they can bind/connect to the same ZMQ socket paths and leak control-plane messages across server instances.

This applies the same per-server IPC namespace invariant used for the V0 runtime isolation fix in #263, adapted to the V1 compiler and StageProcessSpec startup path.

Root Cause

PipelineConfig.name defaults to the model path. Two Qwen3-Omni V1 servers launched with the same model therefore allocate identical stable endpoints under the same model-name directory.

Because V1 uses ZMQ for Coordinator-to-Stage control-plane traffic, sharing these paths can let one server receive another server's stage messages or completions.
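
The collision described above can be reproduced without ZMQ at all, since it is purely a naming problem. The following is a minimal sketch, not the project's code: `slugify` and `stable_endpoint` are hypothetical helpers, and the model name and base path are assumed values chosen to resemble the paths reported in this PR.

```python
from pathlib import PurePosixPath


def slugify(name: str) -> str:
    """Hypothetical helper: make a model path safe to use as a directory name."""
    return "".join(c.lower() if c.isalnum() else "-" for c in name).strip("-")


def stable_endpoint(base_path: str, config_name: str, stage: str) -> str:
    """Pre-fix layout as described above: base_path / config.name / <socket>."""
    sock = PurePosixPath(base_path) / slugify(config_name) / f"stage_{stage}.sock"
    return f"ipc://{sock}"


# Assumed values for illustration; both servers use the same model and defaults.
MODEL = "Qwen/Qwen3-Omni-30B-A3B-Instruct"
BASE = "/tmp/sglang_omni_v1"

server_a = stable_endpoint(BASE, MODEL, "thinker")
server_b = stable_endpoint(BASE, MODEL, "thinker")

print(server_a)
print("collision:", server_a == server_b)
```

Because nothing in the path depends on the server instance, both launches resolve to the same socket file.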

Modifications

  • Add a managed IpcRuntimeDir lifecycle in sglang_omni_v1.config.compiler, matching the V0 per-run IPC directory ownership model from #263 ([Server]: Isolate IPC Endpoints Per Server Run).
  • Allocate IPC endpoints under a unique runtime directory created with tempfile.mkdtemp(...) instead of the stable base_path / config.name path.
  • Split V1 compilation into prepare_pipeline_runtime(...) and compile_pipeline_core(...) so callers that own server lifecycle also own IPC cleanup.
  • Keep V1 multi-process startup V1-native: MultiProcessPipelineRunner prepares one endpoint map in the main process and passes it into StageProcessSpec construction, so subprocesses still do not recompile endpoint state.
  • Use the managed compile path in the V1 single-process server launcher and close the runtime directory during shutdown/failure cleanup.
  • Add V1 IPC regression coverage for same-model namespace uniqueness, non-IPC no-op behavior, unmanaged IPC rejection, compile failure cleanup, caller-owned cleanup, successful compile ownership, multi-process cleanup, and single-process launcher cleanup.
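
The ownership model in the list above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: `RuntimeDirSketch`, `prepare_runtime_sketch`, and `compile_core_sketch` are hypothetical stand-ins for IpcRuntimeDir, prepare_pipeline_runtime(...), and compile_pipeline_core(...), whose real signatures are not shown here. Only `tempfile.mkdtemp` and `shutil.rmtree` are real standard-library calls.

```python
import shutil
import tempfile
from pathlib import Path


class RuntimeDirSketch:
    """Hypothetical stand-in for a managed per-run IPC directory."""

    def __init__(self, base_path: str, name: str) -> None:
        Path(base_path).mkdir(parents=True, exist_ok=True)
        # mkdtemp creates a uniquely named directory under base_path.
        self.path = Path(tempfile.mkdtemp(prefix=f"{name}-", dir=base_path))
        self._closed = False

    def endpoint(self, socket_name: str) -> str:
        return f"ipc://{self.path / (socket_name + '.sock')}"

    def close(self) -> None:
        # Idempotent so shutdown and failure paths can both call it.
        if not self._closed:
            shutil.rmtree(self.path, ignore_errors=True)
            self._closed = True


def prepare_runtime_sketch(base_path, name, stages):
    """Phase 1 (hypothetical): allocate the directory and one endpoint map."""
    runtime_dir = RuntimeDirSketch(base_path, name)
    endpoints = {f"stage_{s}": runtime_dir.endpoint(f"stage_{s}") for s in stages}
    endpoints["completion"] = runtime_dir.endpoint("completion")
    return runtime_dir, endpoints


def compile_core_sketch(endpoints):
    """Phase 2 (hypothetical): consume the prepared map, never allocate."""
    return [
        {"stage": key, "endpoint": value}
        for key, value in endpoints.items()
        if key.startswith("stage_")
    ]


base = tempfile.mkdtemp()  # throwaway parent so the demo avoids shared paths
runtime_dir, endpoints = prepare_runtime_sketch(
    base, "demo-model", ["preprocessing", "thinker"]
)
try:
    specs = compile_core_sketch(endpoints)
    existed_while_running = runtime_dir.path.is_dir()
finally:
    runtime_dir.close()  # the caller that owns the server also owns cleanup
removed_after_close = not runtime_dir.path.exists()
shutil.rmtree(base, ignore_errors=True)
```

Splitting allocation from compilation means the endpoint map is built once and handed to whatever constructs the stage specs, so a failure in the second phase still leaves the caller holding the handle it needs to clean up.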

Accuracy Tests

  • Unit tests: PASS
  • V1 runtime endpoint preflight: PASS
    • observed ipc:///tmp/sglang_omni_v1/qwen-qwen3-omni-30b-a3b-instruct-9ta1b5i2/stage_preprocessing.sock
    • observed ipc:///tmp/sglang_omni_v1/qwen-qwen3-omni-30b-a3b-instruct-9ta1b5i2/stage_thinker.sock
    • observed ipc:///tmp/sglang_omni_v1/qwen-qwen3-omni-30b-a3b-instruct-9ta1b5i2/completion.sock
  • Two Qwen3-Omni V1 text servers launched side by side with default IPC settings: PASS
    • server on port 8101 submitted to ipc:///tmp/sglang_omni_v1/qwen-qwen3-omni-30b-a3b-instruct-y3dtxwhb/stage_preprocessing.sock
    • server on port 8102 submitted to ipc:///tmp/sglang_omni_v1/qwen-qwen3-omni-30b-a3b-instruct-1w6lk5as/stage_preprocessing.sock
  • Sent concurrent simple curl chat-completion requests to both servers and received HTTP 200 responses from both: PASS
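
The side-by-side result above corresponds to a property that can be checked in isolation. The sketch below is not taken from tests/test_v1_ipc_runtime_dir.py; `launch_namespace` is a hypothetical stand-in for one server launch, and "demo-model" is an assumed name.

```python
import shutil
import tempfile
from pathlib import Path


def launch_namespace(base_path: Path, model_slug: str) -> Path:
    """Hypothetical stand-in for one launch: returns its preprocessing socket path."""
    base_path.mkdir(parents=True, exist_ok=True)
    run_dir = Path(tempfile.mkdtemp(prefix=f"{model_slug}-", dir=base_path))
    return run_dir / "stage_preprocessing.sock"


def same_model_endpoints_are_unique() -> bool:
    base = Path(tempfile.mkdtemp())
    try:
        first = launch_namespace(base, "demo-model")
        second = launch_namespace(base, "demo-model")
        # Same socket name and same base directory, but different run directories.
        same_layout = (
            first.name == second.name
            and first.parent.parent == second.parent.parent
        )
        return same_layout and first.parent != second.parent
    finally:
        shutil.rmtree(base, ignore_errors=True)


unique = same_model_endpoints_are_unique()
print("unique per-run endpoints:", unique)
```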

Speed Tests and Profiling

Not run. This changes server startup endpoint allocation and shutdown cleanup only; it does not touch request scheduling, model execution, relay tensor transfer, or decode hot paths.

Ratish1 marked this pull request as ready for review May 6, 2026 18:58
Ratish1 added the run-ci (Triggers GPU CI workflows) label May 6, 2026
Ratish1 force-pushed the fix/omni-v1-ipc-runtime-dir branch from d074add to ee450f3 May 6, 2026 20:08
Comment thread tests/test_v1_ipc_runtime_dir.py
Comment thread sglang_omni_v1/pipeline/mp_runner.py Outdated
Comment thread sglang_omni_v1/serve/launcher.py
Comment thread sglang_omni_v1/config/compiler.py
Comment thread sglang_omni_v1/config/compiler.py
Comment thread sglang_omni_v1/config/compiler.py
Ratish1 force-pushed the fix/omni-v1-ipc-runtime-dir branch from ee450f3 to d35685c May 6, 2026 20:11
Comment thread sglang_omni_v1/config/compiler.py

zhaochenyang20 (Collaborator) commented

Please rebase and rerun CI after #403 is merged

Labels

run-ci Triggers GPU CI workflows

2 participants