[RFC] Multimodal encoder TP — in-tree implementation vs. upstream import #375

@zhaochenyang20

Splitting this off from #89 / the encoder-TP thread in #188. @ischencheng asked the right question there about how encoder TP fits into the post-refactor architecture (encoders living under SimpleScheduler). #89 was closed when we folded it into the broader refactor, but the work itself didn't go away; it deserves its own tracking issue so the design conversation has a home.

Why encoder TP matters more than text-side TP

The text encoder story is essentially "tokenizer + embed lookup": bytes in, ids out, near-zero GPU footprint. Multimodal encoders are real models: the Qwen3-Omni audio encoder and image encoder are both on the order of hundreds of millions of parameters, and their activation footprint grows with input length in a way the thinker AR loop's simply doesn't. On the current Qwen3-Omni speech pipeline, the audio + image encoders co-located on the thinker GPU are the main reason long-video inputs OOM (see the discussion in #327 and the --encoder-mem-reserve workaround in #339).

So encoder TP isn't a marginal optimization. For any pipeline serving long video or high-resolution multi-image inputs, it's a production requirement. We should pick it back up as a first-class workstream, not let it fall through the cracks of the v1 refactor.

Two paths

Path A — build encoder TP inside sglang-omni. Wrap the encoder modules under models/<name>/components/ with a TP layer, manage the process group from inside SimpleScheduler (or a dedicated EncoderScheduler if SimpleScheduler stays minimal), use NCCL all-reduce inside encoder forward, share the rest of the stage machinery with the non-TP path. Self-contained, lands fast. Cost: we own a parallel TP implementation for every encoder we ever add, and we duplicate work that sglang main has already done for its own VLM serving.
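For readers less familiar with what "a TP layer plus an all-reduce inside encoder forward" means in practice, here is a minimal single-process sketch of the usual column-parallel then row-parallel pattern. It is an illustration only: it uses plain Python lists instead of tensors, the all-reduce is simulated by summing per-rank partial results in memory, and every helper name (matmul, split_columns, split_rows, all_reduce_sum, tp_mlp_forward) is hypothetical. A real Path A implementation would presumably use torch.distributed with NCCL.

```python
"""Single-process simulation of a tensor-parallel two-layer block.

All names are illustrative assumptions; real code would shard torch
tensors across GPUs and call an NCCL all-reduce instead of summing lists.
"""


def matmul(a, b):
    """Multiply a (m x k) by b (k x n); both are lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def relu(m):
    """Elementwise ReLU; stays local to each rank under a column split."""
    return [[max(0, v) for v in row] for row in m]


def split_columns(w, tp_size):
    """Shard a weight matrix by output columns (column-parallel layer)."""
    step = len(w[0]) // tp_size
    return [[row[r * step:(r + 1) * step] for row in w] for r in range(tp_size)]


def split_rows(w, tp_size):
    """Shard a weight matrix by input rows (row-parallel layer)."""
    step = len(w) // tp_size
    return [w[r * step:(r + 1) * step] for r in range(tp_size)]


def all_reduce_sum(partials):
    """Stand-in for an all-reduce: elementwise sum across simulated ranks."""
    rows, cols = len(partials[0]), len(partials[0][0])
    return [[sum(p[i][j] for p in partials) for j in range(cols)] for i in range(rows)]


def tp_mlp_forward(x, w1, w2, tp_size):
    """Column-parallel w1, local ReLU, row-parallel w2, one all-reduce."""
    w1_shards = split_columns(w1, tp_size)
    w2_shards = split_rows(w2, tp_size)
    partials = []
    for rank in range(tp_size):
        hidden = relu(matmul(x, w1_shards[rank]))         # slice of the hidden dim
        partials.append(matmul(hidden, w2_shards[rank]))  # partial sum of the output
    return all_reduce_sum(partials)


x = [[1, 2], [3, 4]]                      # 2 tokens, model dim 2
w1 = [[1, 0, -2, 1], [0, 1, 1, -2]]       # 2 -> 4 projection
w2 = [[1, 0], [0, 1], [1, 1], [2, 0]]     # 4 -> 2 projection

reference = matmul(relu(matmul(x, w1)), w2)      # unsharded result
sharded = tp_mlp_forward(x, w1, w2, tp_size=2)   # simulated 2-way TP
print(sharded == reference)
```

The point of the sketch is the cost structure: each rank holds only a slice of the weights and of the intermediate activation, and the block needs a single collective at the end. Owning this wrapping for every encoder is the maintenance burden Path A accepts.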

Path B — import encoders from sglang main and inherit native TP. sglang upstream already has the multimodal towers (Qwen2-VL, Qwen3-VL, Qwen3-Omni vision/audio encoders) registered in its model registry, and those implementations are wired into sglang's TP infrastructure — ColumnParallelLinear, RowParallelLinear, parallel state, etc. — directly. If sglang-omni imports those encoder definitions from sglang main instead of carrying its own copies under models/<name>/components/image_encoder.py, encoder TP comes for free, and we don't maintain a second implementation. This is also consistent with the broader positioning of sglang-omni as the out-of-tree omni framework: anything that's already core sglang (model definitions with native parallelism, scheduler primitives, KV cache) we re-use rather than reimplement.
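One way the "import from upstream, otherwise use the local copy" idea could look is a small resolver that tries a list of candidate import paths in order. This is a sketch under assumptions, not a proposal for the final API: resolve_encoder_class is a hypothetical helper, and the module paths in the demo are deliberately stand-ins (a made-up package that does not exist, then a stdlib class) so the snippet runs without sglang installed. The real upstream module and class names would need to be filled in from sglang main.

```python
import importlib


def resolve_encoder_class(candidates):
    """Return (cls, module_path) for the first importable candidate.

    candidates: iterable of (module_path, class_name) pairs tried in order,
    e.g. an upstream sglang path first and a local components/ path second.
    """
    errors = []
    for module_path, class_name in candidates:
        try:
            module = importlib.import_module(module_path)
            return getattr(module, class_name), module_path
        except (ImportError, AttributeError) as exc:
            # Record why this candidate failed so the final error is debuggable.
            errors.append(f"{module_path}.{class_name}: {exc}")
    raise ImportError("no encoder implementation found:\n" + "\n".join(errors))


# Demo with stand-in paths: the first entry plays the role of an upstream
# module that is not installed, the second plays the role of the local copy.
cls, source = resolve_encoder_class([
    ("not_a_real_upstream_pkg.models.x", "VisionEncoder"),  # hypothetical, fails
    ("collections", "OrderedDict"),                         # stand-in fallback
])
print(cls.__name__, "loaded from", source)
```

Keeping the candidate list in one place also gives a natural spot to pin which upstream import paths sglang-omni depends on.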

Why I lean toward Path B as the long-term direction

Three reasons.

First, maintenance cost. The encoder ecosystem actually lives upstream — every new VLM that lands in sglang main gets a vision tower added there. Tracking upstream is much cheaper than re-wrapping each one in sglang-omni.

Second, correctness. sglang main's encoders are exercised by upstream VLM serving CI, real users, real workloads. Our reimplementation wouldn't have that surface area; subtly divergent attention masking or RoPE in the encoder would surface much later in production.

Third, this directly serves #188's line-count goal. Every encoder definition we delete from models/<name>/components/ and replace with from sglang.srt.models.X import VisionEncoder is a contribution to the ~33K → ~10K reduction.

What Path B actually needs (the hard part)

Path B sounds simple — "just import from upstream" — but the design surface is non-trivial:

  1. Interface contract with SimpleScheduler. sglang main's encoder modules are designed to be invoked from inside a sglang Scheduler/ModelRunner that owns a TP process group. SimpleScheduler currently doesn't own a process group — it's a single-process inbox→fn→outbox loop. Three options to settle: either SimpleScheduler grows TP-awareness, or encoder stages route through a different scheduler type that does, or we wrap each encoder in a minimal sglang TpModelWorker and let sglang manage the parallelism while SimpleScheduler just dispatches batches into it. The third option is probably the cleanest — it keeps SimpleScheduler genuinely simple and doesn't fork sglang's TP execution path.

  2. Stable upstream import surface. Same coupling concern as the OmniScheduler ↔ sglang Scheduler discussion in SGLang-Omni Refactoring Proposal #188 — the more sglang internals we touch, the more we break on upgrade. Encoder modules are probably more stable than scheduler internals, but we should still treat them as a versioned contract surface (which import paths we depend on, which class signatures we promise to handle), not a free dependency.

  3. Encoders not yet upstreamed. Some encoders we use are in sglang-omni but not yet in sglang main (custom Fish audio tokenizer, future Boson audio encoder). For those, the choice is upstream them first or live with Path A locally for them. The framework needs to support both modes side by side: encoder loaded from sglang main with native TP, or encoder loaded from sglang-omni components/ without it. The pipeline config should not have to know which mode a stage is in.

  4. Memory accounting under TP. sglang main's TP encoders assume ownership of an allocator budget on each TP rank. Our co-location story (encoder + thinker on the same GPU, [Feat] Expose encoder mem reserve as --encoder-mem-reserve CLI flag #339) collides with this directly. The total_gpu_mem_fraction / mem_fraction_static design from Ratish's comment in SGLang-Omni Refactoring Proposal #188 has to extend to encoder stages cleanly, including the case where the encoder is itself sharded across N GPUs.
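To make items 1 and 3 more concrete, the sketch below shows one possible shape for the contract: the stage loop only sees an encode(batch) method, and whether the encoder runs locally or behind a TP-managing worker is hidden in the backend. Everything here is an assumption for illustration. EncoderBackend, LocalEncoderBackend, TpWorkerBackend and run_stage are hypothetical names, the "TP" backend just sums partial results from simulated ranks, and nothing in it uses sglang's actual TpModelWorker interface.

```python
from collections import deque
from typing import Callable, List, Protocol


class EncoderBackend(Protocol):
    """Hypothetical contract: the only thing the stage loop depends on."""

    def encode(self, batch: List[str]) -> List[int]: ...


class LocalEncoderBackend:
    """Encoder loaded from local components/, run in-process without TP."""

    def __init__(self, fn: Callable[[str], int]):
        self._fn = fn

    def encode(self, batch: List[str]) -> List[int]:
        return [self._fn(item) for item in batch]


class TpWorkerBackend:
    """Placeholder for an encoder wrapped in a worker that owns the TP group.

    Ranks are simulated as callables returning partial results that are
    summed; a real worker would launch processes and own the process group.
    """

    def __init__(self, rank_fns: List[Callable[[str], int]]):
        self._rank_fns = rank_fns

    def encode(self, batch: List[str]) -> List[int]:
        return [sum(fn(item) for fn in self._rank_fns) for item in batch]


def run_stage(inbox: deque, backend: EncoderBackend, outbox: list) -> None:
    """Single-process inbox -> fn -> outbox loop, unaware of the backend mode."""
    while inbox:
        outbox.append(backend.encode(inbox.popleft()))


# Toy "encoder": the feature is the input length. The two simulated ranks
# each handle half of the input, so their partial results sum to the same value.
local = LocalEncoderBackend(len)
sharded = TpWorkerBackend([
    lambda s: len(s[: len(s) // 2]),
    lambda s: len(s[len(s) // 2:]),
])

batches = [["frame", "clip"], ["audio"]]
local_out, tp_out = [], []
run_stage(deque(batches), local, local_out)
run_stage(deque(batches), sharded, tp_out)
print(local_out == tp_out)
```

With a contract like this, the pipeline config would name a stage and its encoder, and the choice between the two modes would be made where the backend is constructed. Memory accounting (item 4) is not modeled here.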
