Splitting this off from #89 / the encoder-TP thread in #188 — @ischencheng asked the right question above about how encoder TP fits into the post-refactor architecture (encoders living under SimpleScheduler). #89 was closed when we folded it into the broader refactor, but the work itself didn't go away; it deserves its own tracking issue so the design conversation has a home.
Why encoder TP matters more than text-side TP
The text encoder story is essentially "tokenizer + embed lookup" — bytes in, ids out, near-zero GPU footprint. Multimodal encoders are real models: the Qwen3-Omni audio encoder and image encoder are both on the order of hundreds of millions of parameters with activation peaks that scale with input length. On the current Qwen3-Omni speech pipeline, the audio + image encoders co-located on the thinker GPU are the main reason long-video inputs OOM (see the discussion in #327 and the --encoder-mem-reserve workaround in #339). The activation footprint of the encoder grows with input length in a way the thinker AR loop simply doesn't.
So encoder TP isn't a marginal optimization. For any pipeline serving long video or high-resolution multi-image inputs, it's a production requirement. We should pick it back up as a first-class workstream, not let it fall through the cracks of the v1 refactor.
Two paths
Path A — build encoder TP inside sglang-omni. Wrap the encoder modules under models/<name>/components/ with a TP layer, manage the process group from inside SimpleScheduler (or a dedicated EncoderScheduler if SimpleScheduler stays minimal), use NCCL all-reduce inside encoder forward, share the rest of the stage machinery with the non-TP path. Self-contained, lands fast. Cost: we own a parallel TP implementation for every encoder we ever add, and we duplicate work that sglang main has already done for its own VLM serving.
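To make the Path A cost concrete, here is a minimal sketch of what "a TP layer plus all-reduce inside encoder forward" means for a single MLP block. It is a pure-Python simulation with no GPUs or NCCL, and every function name in it is illustrative rather than taken from sglang or sglang-omni. The first weight is split by columns, the second by rows, and the per-rank partial outputs are summed, which is the step an NCCL all-reduce would perform.

```python
# Pure-Python simulation of a tensor-parallel two-layer MLP (no GPUs, no NCCL).
# All names are illustrative assumptions, not sglang or sglang-omni APIs.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def relu(m):
    """Elementwise ReLU on a matrix given as lists of rows."""
    return [[max(0.0, v) for v in row] for row in m]

def split_columns(w, tp_size):
    """Column-parallel split: each rank keeps a contiguous slice of output columns."""
    step = len(w[0]) // tp_size
    return [[row[r * step:(r + 1) * step] for row in w] for r in range(tp_size)]

def split_rows(w, tp_size):
    """Row-parallel split: each rank keeps a contiguous slice of input rows."""
    step = len(w) // tp_size
    return [w[r * step:(r + 1) * step] for r in range(tp_size)]

def all_reduce_sum(partials):
    """Stand-in for an all-reduce: elementwise sum of per-rank partial outputs."""
    out = partials[0]
    for p in partials[1:]:
        out = [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(out, p)]
    return out

def mlp_reference(x, w1, w2):
    """Unsharded forward pass, used as ground truth."""
    return matmul(relu(matmul(x, w1)), w2)

def mlp_tensor_parallel(x, w1, w2, tp_size):
    """Sharded forward pass: local matmuls per rank, one reduction at the end."""
    w1_shards = split_columns(w1, tp_size)
    w2_shards = split_rows(w2, tp_size)
    partials = []
    for rank in range(tp_size):
        hidden = relu(matmul(x, w1_shards[rank]))  # local, no communication
        partials.append(matmul(hidden, w2_shards[rank]))
    return all_reduce_sum(partials)
```

Under Path A we would own the real version of this sharding logic for every linear and attention layer in every encoder we carry, which is the duplication described above.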
Path B — import encoders from sglang main and inherit native TP. sglang upstream already has the multimodal towers (Qwen2-VL, Qwen3-VL, Qwen3-Omni vision/audio encoders) registered in its model registry, and those implementations are wired into sglang's TP infrastructure — ColumnParallelLinear, RowParallelLinear, parallel state, etc. — directly. If sglang-omni imports those encoder definitions from sglang main instead of carrying its own copies under models/<name>/components/image_encoder.py, encoder TP comes for free, and we don't maintain a second implementation. This is also consistent with the broader positioning of sglang-omni as the out-of-tree omni framework: anything that's already core sglang (model definitions with native parallelism, scheduler primitives, KV cache) we re-use rather than reimplement.
Why I lean toward Path B as the long-term direction
Three reasons.
First, maintenance cost. The encoder ecosystem actually lives upstream — every new VLM that lands in sglang main gets a vision tower added there. Tracking upstream is much cheaper than re-wrapping each one in sglang-omni.
Second, correctness. sglang main's encoders are exercised by upstream VLM serving CI, real users, real workloads. Our reimplementation wouldn't have that surface area; subtly divergent attention masking or RoPE in the encoder would surface much later in production.
Third, this directly serves #188's line-count goal. Every encoder definition we delete from models/<name>/components/ and replace with from sglang.srt.models.X import VisionEncoder is a contribution to the ~33K → ~10K reduction.
What Path B actually needs (the hard part)
Path B sounds simple — "just import from upstream" — but the design surface is non-trivial:
Interface contract with SimpleScheduler. sglang main's encoder modules are designed to be invoked from inside a sglang Scheduler/ModelRunner that owns a TP process group. SimpleScheduler currently doesn't own a process group — it's a single-process inbox→fn→outbox loop. Three options to settle: either SimpleScheduler grows TP-awareness, or encoder stages route through a different scheduler type that does, or we wrap each encoder in a minimal sglang TpModelWorker and let sglang manage the parallelism while SimpleScheduler just dispatches batches into it. The third option is probably the cleanest — it keeps SimpleScheduler genuinely simple and doesn't fork sglang's TP execution path.
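A rough sketch of the third option, assuming nothing about the real TpModelWorker API: the scheduler keeps its inbox→fn→outbox shape, and the callable it dispatches into is the only thing that knows about ranks. Both class names below are hypothetical stand-ins, and the ranks are simulated with plain callables rather than processes.

```python
import queue

class SimpleSchedulerSketch:
    """Hypothetical stand-in for the single-process inbox -> fn -> outbox loop."""

    def __init__(self, fn):
        self.fn = fn
        self.inbox = queue.Queue()
        self.outbox = queue.Queue()

    def run_until_empty(self):
        """Drain the inbox, applying fn to each batch. No TP awareness here."""
        while True:
            try:
                batch = self.inbox.get_nowait()
            except queue.Empty:
                return
            self.outbox.put(self.fn(batch))

class ShardedEncoderWorker:
    """Hypothetical wrapper that owns the TP ranks; the scheduler never sees them."""

    def __init__(self, rank_fns):
        self.rank_fns = rank_fns  # one callable per simulated TP rank

    def __call__(self, batch):
        # Real ranks would run concurrently in separate processes.
        partials = [fn(batch) for fn in self.rank_fns]
        # Stand-in for the all-reduce across ranks.
        return [sum(vals) for vals in zip(*partials)]

# Two simulated ranks whose partial outputs sum to 3 * x.
worker = ShardedEncoderWorker([
    lambda batch: [1 * x for x in batch],
    lambda batch: [2 * x for x in batch],
])
scheduler = SimpleSchedulerSketch(worker)
scheduler.inbox.put([1, 2, 3])
scheduler.run_until_empty()
result = scheduler.outbox.get_nowait()
```

The point of the shape is that swapping a non-TP encoder for a sharded one changes only the callable handed to the scheduler, not the scheduler itself.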
Stable upstream import surface. Same coupling concern as the OmniScheduler ↔ sglang Scheduler discussion in #188: the more sglang internals we touch, the more we break on upgrade. Encoder modules are probably more stable than scheduler internals, but we should still treat them as a versioned contract surface (which import paths we depend on, which class signatures we promise to handle), not a free dependency.
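One way a "versioned contract surface" could be made checkable is a small test that imports each path we depend on and verifies the parameters we pass still exist. The sketch below is an assumption about how such a check might look, not an existing sglang-omni utility. The contract format (dotted path mapped to required parameter names) is made up for illustration, and the example uses a standard-library function as a stand-in because the real upstream paths are not settled here.

```python
import importlib
import inspect

def check_contract(contract):
    """Return a list of violations; an empty list means the contract holds.

    contract maps a dotted path such as "pkg.module.Name" to the parameter
    names we rely on when calling or constructing that object.
    """
    problems = []
    for dotted_path, required_params in contract.items():
        module_name, _, attr = dotted_path.rpartition(".")
        try:
            obj = getattr(importlib.import_module(module_name), attr)
        except (ImportError, AttributeError, ValueError) as exc:
            problems.append(f"{dotted_path}: not importable ({type(exc).__name__})")
            continue
        params = inspect.signature(obj).parameters
        missing = [p for p in required_params if p not in params]
        if missing:
            problems.append(f"{dotted_path}: missing parameters {missing}")
    return problems

# Stand-in contract against the standard library, for illustration only.
ok = check_contract({"json.dumps": ["obj", "indent"]})
broken = check_contract({"json.dumps": ["no_such_param"], "json.nope": []})
```

Run in CI against each pinned sglang version, a check like this would turn an upgrade break into a failing test instead of a runtime error.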
Encoders not yet upstreamed. Some encoders we use are in sglang-omni but not yet in sglang main (custom Fish audio tokenizer, future Boson audio encoder). For those, the choice is to upstream them first or to live with Path A locally. The framework needs to support both modes side by side: encoder loaded from sglang main with native TP, or encoder loaded from sglang-omni components/ without it. The pipeline config should not have to know which mode a stage is in.
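A minimal sketch of how both modes could sit behind one lookup, so that the pipeline config names only a logical encoder and never the source. The "module:ClassName" target format and the function name are assumptions for illustration. The example resolves against standard-library modules as stand-ins, since the real upstream and local import paths are not fixed in this issue.

```python
import importlib

def resolve_encoder(candidates):
    """Return (cls, source) for the first importable candidate.

    candidates is an ordered list of (source_label, "module:ClassName") pairs,
    for example an upstream entry first and a local fallback second.
    """
    for source, target in candidates:
        module_name, _, class_name = target.partition(":")
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            continue  # this source is unavailable, try the next one
        cls = getattr(module, class_name, None)
        if cls is not None:
            return cls, source
    raise LookupError(f"no importable encoder among {candidates!r}")

# Stand-in candidates: the first does not exist, so the second is chosen.
cls, source = resolve_encoder([
    ("upstream", "no_such_upstream_pkg.models:ExampleEncoder"),
    ("local", "collections:OrderedDict"),
])
```

The returned source label is what the stage would use to decide whether native TP is available, while the config that requested the encoder stays identical in both cases.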
Memory accounting under TP. sglang main's TP encoders assume ownership of an allocator budget on each TP rank. Our co-location story (encoder + thinker on the same GPU, #339) collides with this directly. The total_gpu_mem_fraction / mem_fraction_static design from Ratish's comment in #188 has to extend to encoder stages cleanly, including the case where the encoder is itself sharded across N GPUs.
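As a starting point for that discussion, here is a sketch of the accounting check under one possible model: each stage declares which GPUs it occupies and what fraction of each it claims, and a sharded encoder simply appears on several GPUs. The stage dictionary layout, the function name, and the 0.9 cap are all assumptions made for illustration. They do not describe how total_gpu_mem_fraction or mem_fraction_static currently behave.

```python
def find_overcommitted_gpus(stages, total_fraction=0.9):
    """Return {gpu_id: claimed_fraction} for GPUs whose claims exceed the cap.

    Each stage is a dict with hypothetical keys:
      "name": label for the stage
      "gpus": list of GPU ids the stage occupies (N ids for a TP=N encoder)
      "fraction_per_gpu": share of each listed GPU's memory the stage claims
    """
    claimed = {}
    for stage in stages:
        for gpu in stage["gpus"]:
            claimed[gpu] = claimed.get(gpu, 0.0) + stage["fraction_per_gpu"]
    tolerance = 1e-9  # absorb float rounding when claims sum exactly to the cap
    return {g: f for g, f in claimed.items() if f > total_fraction + tolerance}

# Thinker on GPU 0, encoder sharded across GPUs 0 and 1.
fits = find_overcommitted_gpus([
    {"name": "thinker", "gpus": [0], "fraction_per_gpu": 0.70},
    {"name": "encoder", "gpus": [0, 1], "fraction_per_gpu": 0.15},
])
too_big = find_overcommitted_gpus([
    {"name": "thinker", "gpus": [0], "fraction_per_gpu": 0.70},
    {"name": "encoder", "gpus": [0, 1], "fraction_per_gpu": 0.30},
])
```

In the second case only GPU 0 is flagged, which is the co-location collision in miniature: the encoder shard fits on GPU 1 but not next to the thinker.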