Skip to content

Add SGLang-Omni Router for V1#401

Draft
Ratish1 wants to merge 13 commits intosgl-project:mainfrom
Ratish1:feat/omni-router-v1
Draft

Add SGLang-Omni Router for V1#401
Ratish1 wants to merge 13 commits intosgl-project:mainfrom
Ratish1:feat/omni-router-v1

Conversation

@Ratish1
Copy link
Copy Markdown
Collaborator

@Ratish1 Ratish1 commented May 6, 2026

Motivation

This PR adds the SGLang-Omni Router for Omni V1. The router is an external HTTP process that sits in front of complete Omni V1 server replicas and routes OpenAI-compatible traffic across worker URLs.

The immediate use case is the router side of the colocation plan in #376: one full Omni V1 replica per server/GPU, then client traffic enters through one router endpoint. This PR intentionally implements only the router. Colocated Qwen3-Omni server placement and the H20 end-to-end CI lane remain separate integration work.

RFC: Router Contract

Scope

The router owns replica-level HTTP routing:

client
  -> sgl-omni-router
    -> complete Omni V1 server replica A
    -> complete Omni V1 server replica B
    -> complete Omni V1 server replica N

Each worker is an opaque base URL. The router does not inspect the Omni V1 pipeline graph, does not route individual stages, and does not mutate request JSON for DP rank or topology-aware behavior.

Public Interface

  • Adds the dedicated console command sgl-omni-router.
  • Keeps python -m sglang_omni_router.serve usable for direct module execution.
  • Uses canonical router arguments such as --worker-urls, --policy, --request-timeout-secs, --max-payload-size, health thresholds, and health probe timing.
  • Supports underscore-only policy names: round_robin, least_request, and random.
  • Does not add a secondary sgl-omni router command.

Worker Model

Workers are complete Omni V1 HTTP server replicas. Each worker tracks:

  • normalized URL and stable URL-encoded worker id
  • optional model name
  • declared capability set
  • active request count
  • health state: unknown, healthy, unhealthy, or dead
  • manual disabled state
  • consecutive health success and failure counters
  • last health status, error, and check timestamp

Routing eligibility is:

worker.health_state == healthy and worker.disabled == false

Dead workers are quarantined from routing. Recovery is explicit through the worker update API, followed by a health probe before the worker becomes routable again.

Request Lifecycle

For model requests, the router:

  1. receives the FastAPI request
  2. checks Content-Length against max_payload_size
  3. reads the body once as bytes
  4. rejects oversized bodies
  5. parses small JSON bodies only for route metadata
  6. preserves the original request body bytes for upstream forwarding
  7. infers required capabilities from endpoint and metadata
  8. filters routable workers by required capabilities
  9. applies the selected policy
  10. increments the selected worker active request count
  11. forwards the request with hop-by-hop headers stripped
  12. relays the upstream status, headers, and body
  13. adds router diagnostic response headers
  14. decrements active request count during cleanup
  15. emits route diagnostics when route logging is configured

Streaming responses use httpx.AsyncClient.send(..., stream=True) and relay upstream.aiter_bytes() without parsing, buffering, or synthesizing SSE frames.

Supported Routes

The first router surface is intentionally explicit:

  • GET /v1/models
  • POST /v1/chat/completions
  • POST /v1/audio/speech
  • GET /live
  • GET /ready
  • GET /health
  • GET /workers
  • GET /workers/{worker_id}
  • POST /workers
  • PUT /workers/{worker_id}
  • DELETE /workers/{worker_id}

There is no catch-all proxy in this PR. New proxied routes should be added only when they correspond to validated Omni V1 backend endpoints.

Capability Routing

Default workers advertise the complete Omni V1 replica capability set:

  • chat
  • speech
  • streaming
  • image_input
  • audio_input
  • video_input
  • audio_output

The router infers required capabilities from the current V1 request shape. Chat requests always require chat, add streaming when stream=true, and add modality capabilities for image, audio, video, and chat audio-output fields. Speech requests require speech and add streaming for streaming speech.

Routing Policies

  • round_robin: default policy and best first CI policy because it deterministically exercises every eligible worker.
  • least_request: selects the minimum active-request group and round-robins among tied workers.
  • random: diagnostic policy.

Active request counts include streaming requests until the stream generator exits.

Health And Worker Management

Health uses active probes against the configured health endpoint. Consecutive failures mark a worker dead after the configured threshold. Dead workers are skipped by later health probes until explicitly recovered.

Worker CRUD is available for trusted internal deployments:

  • add new workers at runtime
  • inspect the worker pool
  • disable workers without losing health state
  • mark workers dead
  • clear dead state and immediately reprobe
  • delete workers from the routing pool

Observability

The router exposes pool health through /health and detailed worker state through /workers. Route diagnostics are optional and include selected worker URL/id, policy, required capabilities, worker health state, disabled state, routability, request id, status code, byte counts, duration, and streaming completion state.

Route logging is best-effort. A route-log write failure is logged and does not fail the proxied request.

Modifications

  • Adds the top-level sglang_omni_router package with focused modules for config validation, worker state, active health probing, worker selection, proxying, FastAPI app wiring, and serve entrypoint.
  • Adds sgl-omni-router as the dedicated public console command.
  • Removes the router path from the existing sgl-omni CLI surface so router help does not depend on unrelated Omni client imports.
  • Adds strict worker URL normalization and validation for HTTP(S) base URLs.
  • Adds worker health state, dead-worker quarantine, manual disable, active request accounting, and runtime worker CRUD.
  • Adds modality-aware candidate filtering while forwarding original request bytes unchanged.
  • Adds exact streaming byte relay for chat and speech streaming responses.
  • Adds /v1/models aggregation across routable workers with query/header preservation and per-worker failure details when all eligible reads fail.
  • Adds route diagnostics and non-fatal route-log writing for operator and CI visibility.
  • Adds router unit and app tests covering config validation, health lifecycle, worker CRUD, policies, modality routing, raw body forwarding, streaming relay, model aggregation, route diagnostics, and cleanup on upstream failures.

Related Issues

Related to #376.

Accuracy Test

In progress.

Benchmark & Profiling

In progress.

CI

In progress.

@Ratish1 Ratish1 changed the title Add SGLang-Omni Router for V1 replicas Add SGLang-Omni Router for V1 May 6, 2026
@Ratish1 Ratish1 force-pushed the feat/omni-router-v1 branch from 8f063d6 to c430eec Compare May 6, 2026 18:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant