[RFC] Multimodal encoder TP — in-tree implementation vs. upstream import

Splitting this off from #89 / the encoder-TP thread in #188 — @ischencheng asked the right question above about how encoder TP fits into the post-refactor architecture (encoders living under SimpleScheduler). #89 was closed when we folded it into the broader refactor, but the work itself didn't go away; it deserves its own tracking issue so the design conversation has a home.

## Why encoder TP matters more than text-side TP

The text encoder story is essentially "tokenizer + embed lookup": bytes in, ids out, near-zero GPU footprint. Multimodal encoders are real models: the Qwen3-Omni audio encoder and image encoder are both on the order of hundreds of millions of parameters, with activation peaks that grow with input length in a way the thinker AR loop's footprint simply doesn't. On the current Qwen3-Omni speech pipeline, the audio + image encoders co-located on the thinker GPU are the main reason long-video inputs OOM (see the discussion in #327 and the `--encoder-mem-reserve` workaround in #339).

So encoder TP isn't a marginal optimization. For any pipeline serving long video or high-resolution multi-image inputs, it's a production requirement. We should pick it back up as a first-class workstream, not let it fall through the cracks of the v1 refactor.

## Two paths

**Path A — build encoder TP inside sglang-omni.** Wrap the encoder modules under `models/<name>/components/` with a TP layer, manage the process group from inside `SimpleScheduler` (or a dedicated `EncoderScheduler` if SimpleScheduler stays minimal), use NCCL all-reduce inside encoder forward, share the rest of the stage machinery with the non-TP path. Self-contained, lands fast. Cost: we own a parallel TP implementation for every encoder we ever add, and we duplicate work that sglang main has already done for its own VLM serving.

**Path B — import encoders from sglang main and inherit native TP.** sglang upstream already has the multimodal towers (Qwen2-VL, Qwen3-VL, Qwen3-Omni vision/audio encoders) registered in its model registry, and those implementations are wired into sglang's TP infrastructure — `ColumnParallelLinear`, `RowParallelLinear`, parallel state, etc. — directly. If sglang-omni imports those encoder definitions from sglang main instead of carrying its own copies under `models/<name>/components/image_encoder.py`, encoder TP comes for free, and we don't maintain a second implementation. This is also consistent with the broader positioning of sglang-omni as the out-of-tree omni framework: anything that's already core sglang (model definitions with native parallelism, scheduler primitives, KV cache) we re-use rather than reimplement.
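Whichever path we pick, the sharding pattern underneath is the same one that column-parallel and row-parallel linear layers implement: split the first projection by output columns, split the second by input rows, and sum the per-rank partial outputs with one all-reduce. Below is a minimal single-process sketch of that pattern in plain Python, for discussion only. It is not sglang code: `mlp_tensor_parallel`, `all_reduce_sum`, and the ReLU activation are illustrative assumptions, and the list of "ranks" stands in for a real NCCL process group.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain dense matrix product a @ b."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def relu(m: Matrix) -> Matrix:
    """Elementwise activation (ReLU chosen only for simplicity)."""
    return [[max(0, v) for v in row] for row in m]


def split_cols(m: Matrix, n: int) -> List[Matrix]:
    """Split a matrix into n equal column shards (column-parallel weight)."""
    width = len(m[0]) // n
    return [[row[r * width:(r + 1) * width] for row in m] for r in range(n)]


def split_rows(m: Matrix, n: int) -> List[Matrix]:
    """Split a matrix into n equal row shards (row-parallel weight)."""
    height = len(m) // n
    return [m[r * height:(r + 1) * height] for r in range(n)]


def all_reduce_sum(partials: List[Matrix]) -> Matrix:
    """Stand-in for an all-reduce: elementwise sum of every rank's partial."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]


def mlp_reference(x: Matrix, w1: Matrix, w2: Matrix) -> Matrix:
    """Unsharded two-layer MLP: relu(x @ w1) @ w2."""
    return matmul(relu(matmul(x, w1)), w2)


def mlp_tensor_parallel(x: Matrix, w1: Matrix, w2: Matrix, tp_size: int) -> Matrix:
    """Same MLP with w1 sharded by columns and w2 by rows across tp_size ranks."""
    w1_shards = split_cols(w1, tp_size)
    w2_shards = split_rows(w2, tp_size)
    # Each "rank" computes its slice of the hidden layer and a partial output.
    partials = [
        matmul(relu(matmul(x, w1_r)), w2_r)
        for w1_r, w2_r in zip(w1_shards, w2_shards)
    ]
    # One reduction per MLP block recovers the full output on every rank.
    return all_reduce_sum(partials)
```

Calling `mlp_tensor_parallel(x, w1, w2, tp_size=2)` on small integer matrices returns the same result as `mlp_reference(x, w1, w2)`, provided the hidden width is divisible by `tp_size`. This works because the activation is elementwise, so each rank can apply it to its own hidden slice without communicating. Under Path A we would own this logic and its process group for every encoder; under Path B it comes with the upstream layers.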

## Why I lean toward Path B as the long-term direction

Three reasons.

First, maintenance cost. The encoder ecosystem actually lives upstream — every new VLM that lands in sglang main gets a vision tower added there. Tracking upstream is much cheaper than re-wrapping each one in sglang-omni.

Second, correctness. sglang main's encoders are exercised by upstream VLM serving CI, real users, real workloads. Our reimplementation wouldn't get that coverage; a subtly divergent attention mask or RoPE implementation in the encoder would only show up much later, in production.

Third, this directly serves #188's line-count goal. Every encoder definition we delete from `models/<name>/components/` and replace with `from sglang.srt.models.X import VisionEncoder` is a contribution to the ~33K → ~10K reduction.

## What Path B actually needs (the hard part)

Path B *sounds* simple — "just import from upstream" — but the design surface is non-trivial:

1. **Interface contract with SimpleScheduler.** sglang main's encoder modules are designed to be invoked from inside a sglang Scheduler/ModelRunner that owns a TP process group. SimpleScheduler currently doesn't own a process group; it's a single-process inbox→fn→outbox loop. There are three options to settle between: SimpleScheduler grows TP-awareness; encoder stages route through a different scheduler type that already has it; or we wrap each encoder in a minimal sglang `TpModelWorker` and let sglang manage the parallelism while SimpleScheduler just dispatches batches into it. The third option is probably the cleanest: it keeps SimpleScheduler genuinely simple and doesn't fork sglang's TP execution path.

2. **Stable upstream import surface.** Same coupling concern as the OmniScheduler ↔ sglang Scheduler discussion in #188 — the more sglang internals we touch, the more we break on upgrade. Encoder modules are probably more stable than scheduler internals, but we should still treat them as a versioned contract surface (which import paths we depend on, which class signatures we promise to handle), not a free dependency.

3. **Encoders not yet upstreamed.** Some encoders we use are in sglang-omni but not yet in sglang main (custom Fish audio tokenizer, future Boson audio encoder). For those, the choice is upstream them first or live with Path A locally for them. The framework needs to support both modes side by side: encoder loaded from sglang main with native TP, *or* encoder loaded from sglang-omni `components/` without it. The pipeline config should not have to know which mode a stage is in.

4. **Memory accounting under TP.** sglang main's TP encoders assume ownership of an allocator budget on each TP rank. Our co-location story (encoder + thinker on the same GPU, #339) collides with this directly. The `total_gpu_mem_fraction` / `mem_fraction_static` design from Ratish's comment in #188 has to extend to encoder stages cleanly, including the case where the encoder is itself sharded across N GPUs.
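To make point 3 concrete, here is one possible shape for the "both modes side by side" loader: try the upstream import first, fall back to the in-tree component, and record which mode was chosen so the stage (not the pipeline config) decides whether TP is available. This is a sketch under assumptions. `EncoderSource`, `ResolvedEncoder`, `resolve_encoder`, and the `"module:ClassName"` path format are all hypothetical names invented for illustration, not existing sglang or sglang-omni APIs.

```python
import importlib
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass(frozen=True)
class EncoderSource:
    """Where one encoder stage may be loaded from (hypothetical schema)."""
    upstream_path: Optional[str]                # "package.module:ClassName" or None
    local_factory: Optional[Callable[[], Any]]  # in-tree fallback constructor


@dataclass(frozen=True)
class ResolvedEncoder:
    """Result of resolution: what to construct and whether TP is native."""
    cls_or_factory: Callable[..., Any]
    native_tp: bool  # True only when the upstream class was importable


def resolve_encoder(src: EncoderSource) -> ResolvedEncoder:
    """Prefer the upstream definition; fall back to the in-tree component."""
    if src.upstream_path is not None:
        module_name, _, attr = src.upstream_path.partition(":")
        try:
            module = importlib.import_module(module_name)
            return ResolvedEncoder(getattr(module, attr), native_tp=True)
        except (ImportError, AttributeError):
            # Encoder not upstreamed yet, or the import path moved between
            # versions: fall through to the local copy instead of failing.
            pass
    if src.local_factory is None:
        raise LookupError(f"no loadable encoder for {src.upstream_path!r}")
    return ResolvedEncoder(src.local_factory, native_tp=False)
```

A stage would call `resolve_encoder(...)` once at startup and branch on `native_tp` when deciding whether to set up a process group. Keeping the fallback in one function also gives point 2 a single place to pin and check the import paths we depend on.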
