Add SGLang-Omni Router for V1 #401
Draft
Ratish1 wants to merge 13 commits into sgl-project:main
Motivation
This PR adds the SGLang-Omni Router for Omni V1. The router is an external HTTP process that sits in front of complete Omni V1 server replicas and routes OpenAI-compatible traffic across worker URLs.
The immediate use case is the router side of the colocation plan in #376: one full Omni V1 replica per server/GPU, then client traffic enters through one router endpoint. This PR intentionally implements only the router. Colocated Qwen3-Omni server placement and the H20 end-to-end CI lane remain separate integration work.
RFC: Router Contract
Scope
The router owns replica-level HTTP routing:
Each worker is an opaque base URL. The router does not inspect the Omni V1 pipeline graph, does not route individual stages, and does not mutate request JSON for DP rank or topology-aware behavior.
Public Interface
- `sgl-omni-router`.
- `python -m sglang_omni_router.serve` usable for direct module execution.
- `--worker-urls`, `--policy`, `--request-timeout-secs`, `--max-payload-size`, health thresholds, and health probe timing.
- `round_robin`, `least_request`, and `random`.
- `sgl-omni router` command.

Worker Model
Workers are complete Omni V1 HTTP server replicas. Each worker tracks:
- `unknown`, `healthy`, `unhealthy`, or `dead`
- `disabled` state

Routing eligibility is:
Dead workers are quarantined from routing. Recovery is explicit through the worker update API, followed by a health probe before the worker becomes routable again.
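The worker lifecycle described here can be sketched as a small state machine. This is a minimal illustration, not the PR's implementation: the `Worker` class, `record_probe`, `recover`, and the `failure_threshold` default are hypothetical names, and the rule that only healthy, non-disabled workers are routable is an assumption made for the sketch.

```python
from dataclasses import dataclass
from enum import Enum


class Health(str, Enum):
    UNKNOWN = "unknown"
    HEALTHY = "healthy"
    UNHEALTHY = "unhealthy"
    DEAD = "dead"


@dataclass
class Worker:
    """Hypothetical worker record: an opaque base URL plus routing state."""

    url: str
    health: Health = Health.UNKNOWN
    disabled: bool = False
    consecutive_failures: int = 0

    @property
    def routable(self) -> bool:
        # Assumption: only healthy, non-disabled workers receive traffic.
        return self.health is Health.HEALTHY and not self.disabled

    def record_probe(self, ok: bool, failure_threshold: int = 3) -> None:
        """Apply one health probe result."""
        if self.health is Health.DEAD:
            # Quarantined: probe results are ignored until recover() is called.
            return
        if ok:
            self.consecutive_failures = 0
            self.health = Health.HEALTHY
            return
        self.consecutive_failures += 1
        if self.consecutive_failures >= failure_threshold:
            self.health = Health.DEAD
        else:
            self.health = Health.UNHEALTHY

    def recover(self) -> None:
        """Explicit recovery: leave quarantine, but stay unroutable until a probe succeeds."""
        self.consecutive_failures = 0
        self.health = Health.UNKNOWN
```

Resetting to `unknown` on recovery means a recovered worker must pass a fresh probe before it is selected again, which matches the "health probe before the worker becomes routable" requirement.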
Request Lifecycle
For model requests, the router:
- `Content-Length` against `max_payload_size`

Streaming responses use `httpx.AsyncClient.send(..., stream=True)` and relay `upstream.aiter_bytes()` without parsing, buffering, or synthesizing SSE frames.
Supported Routes
The first router surface is intentionally explicit:
- `GET /v1/models`
- `POST /v1/chat/completions`
- `POST /v1/audio/speech`
- `GET /live`
- `GET /ready`
- `GET /health`
- `GET /workers`
- `GET /workers/{worker_id}`
- `POST /workers`
- `PUT /workers/{worker_id}`
- `DELETE /workers/{worker_id}`

There is no catch-all proxy in this PR. New proxied routes should be added only when they correspond to validated Omni V1 backend endpoints.
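To illustrate what "no catch-all proxy" means in practice, the sketch below matches a request against an explicit allowlist and returns nothing for any other method or path. The route set is taken from the list above; the `match_route` helper and its return convention are hypothetical and are not the router's actual FastAPI wiring.

```python
import re
from typing import Optional

# Exact (method, path) pairs from the supported route list.
EXACT_ROUTES = {
    ("GET", "/v1/models"),
    ("POST", "/v1/chat/completions"),
    ("POST", "/v1/audio/speech"),
    ("GET", "/live"),
    ("GET", "/ready"),
    ("GET", "/health"),
    ("GET", "/workers"),
    ("POST", "/workers"),
}

# /workers/{worker_id} accepts exactly one path segment after /workers/.
WORKER_ITEM = re.compile(r"^/workers/[^/]+$")
WORKER_ITEM_METHODS = {"GET", "PUT", "DELETE"}


def match_route(method: str, path: str) -> Optional[str]:
    """Return the matched route template, or None when the route is not supported."""
    method = method.upper()
    if (method, path) in EXACT_ROUTES:
        return path
    if method in WORKER_ITEM_METHODS and WORKER_ITEM.match(path):
        return "/workers/{worker_id}"
    return None  # no catch-all: unknown routes are never forwarded
```

With this shape, supporting a new backend endpoint is an explicit one-line change to the allowlist rather than a side effect of a wildcard.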
Capability Routing
Default workers advertise the complete Omni V1 replica capability set:
- `chat`
- `speech`
- `streaming`
- `image_input`
- `audio_input`
- `video_input`
- `audio_output`

The router infers required capabilities from the current V1 request shape. Chat requests always require `chat`, add `streaming` when `stream=true`, and add modality capabilities for image, audio, video, and chat audio-output fields. Speech requests require `speech` and add `streaming` for streaming speech.
Routing Policies
- `round_robin`: default policy and best first CI policy because it deterministically exercises every eligible worker.
- `least_request`: selects the minimum active-request group and round-robins among tied workers.
- `random`: diagnostic policy.

Active request counts include streaming requests until the stream generator exits.
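A rough sketch of how capability inference and `least_request` selection could fit together is shown below. It is illustrative only: the content-part types (`image_url`, `input_audio`, `video_url`), the `modalities` and `audio` fields, and the `Worker` and `LeastRequest` classes are assumptions for this example, not the request fields or classes the PR actually uses.

```python
import itertools
from dataclasses import dataclass
from typing import List, Optional, Set

ALL_CAPABILITIES = frozenset(
    {"chat", "speech", "streaming", "image_input", "audio_input", "video_input", "audio_output"}
)

# Assumed mapping from OpenAI-style content-part types to capabilities.
PART_CAPABILITY = {
    "image_url": "image_input",
    "input_audio": "audio_input",
    "video_url": "video_input",
}


def required_for_chat(body: dict) -> Set[str]:
    """Infer required capabilities from a chat request body (assumed field names)."""
    caps = {"chat"}
    if body.get("stream") is True:
        caps.add("streaming")
    for message in body.get("messages") or []:
        content = message.get("content")
        if isinstance(content, list):
            for part in content:
                if isinstance(part, dict) and part.get("type") in PART_CAPABILITY:
                    caps.add(PART_CAPABILITY[part["type"]])
    # Assumption: audio output is requested through `modalities` or an `audio` field.
    if "audio" in (body.get("modalities") or []) or body.get("audio"):
        caps.add("audio_output")
    return caps


@dataclass
class Worker:
    url: str
    capabilities: frozenset = ALL_CAPABILITIES
    active: int = 0  # in-flight requests, including open streams


class LeastRequest:
    """Pick from the minimum active-request group, rotating among ties."""

    def __init__(self) -> None:
        self._turn = itertools.count()

    def select(self, workers: List[Worker], required: Set[str]) -> Optional[Worker]:
        eligible = [w for w in workers if required <= w.capabilities]
        if not eligible:
            return None
        low = min(w.active for w in eligible)
        tied = [w for w in eligible if w.active == low]
        return tied[next(self._turn) % len(tied)]
```

For streaming requests, the caller would increment `active` before forwarding and decrement it only when the stream generator exits, so long-lived streams keep counting against the worker.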
Health And Worker Management
Health uses active probes against the configured health endpoint. Consecutive failures mark a worker `dead` after the configured threshold. Dead workers are skipped by later health probes until explicitly recovered.
Worker CRUD is available for trusted internal deployments:
Observability
The router exposes pool health through `/health` and detailed worker state through `/workers`. Route diagnostics are optional and include selected worker URL/id, policy, required capabilities, worker health state, disabled state, routability, request id, status code, byte counts, duration, and streaming completion state.
Route logging is best-effort. A route-log write failure is logged and does not fail the proxied request.
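The best-effort behavior can be sketched as a writer that swallows and logs its own failures. The `write_route_log` function, the `sink` callable, and the record field names here are hypothetical; the sketch only shows that a failed write is reported and never propagates to the request path.

```python
import json
import logging
from typing import Callable

logger = logging.getLogger("route_log_sketch")


def write_route_log(record: dict, sink: Callable[[str], None]) -> bool:
    """Write one route diagnostic as a JSON line; never raise to the caller."""
    try:
        sink(json.dumps(record, sort_keys=True) + "\n")
        return True
    except Exception:
        # Best-effort: report the failure and let the proxied request continue.
        logger.exception("route log write failed")
        return False
```

A caller would invoke this after the response has been relayed, so a full disk or closed file handle costs one diagnostic line rather than a failed client request.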
Modifications
- `sglang_omni_router` package with focused modules for config validation, worker state, active health probing, worker selection, proxying, FastAPI app wiring, and serve entrypoint.
- `sgl-omni-router` as the dedicated public console command.
- `sgl-omni` CLI surface so router help does not depend on unrelated Omni client imports.
- `/v1/models` aggregation across routable workers with query/header preservation and per-worker failure details when all eligible reads fail.

Related Issues
Related to #376.
Accuracy Test
In progress.
Benchmark & Profiling
In progress.
CI
In progress.