Motivation
sgl-omni serve currently exposes only high-level serving options. Common SGLang runtime settings such as tensor parallel size, CPU offload, and static memory fraction can only be set by manually editing a pipeline config YAML.
This affects Ming-flash-omni-2.0 immediately because the default generated config for sgl-omni serve uses cpu_offload_gb=0 and mem_fraction_static=0.7, which is often not enough to launch the model. But the underlying problem is at the CLI level: users need a stable way to pass common SGLang ServerArgs overrides through sgl-omni serve.
Proposed CLI options
--tp-size
--cpu-offload-gb
--mem-fraction-static
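As a rough illustration of how these three flags could be parsed, here is a minimal argparse sketch. The function names `build_parser` and `collect_overrides` are hypothetical and not part of sgl-omni. The override keys are assumed to mirror the SGLang ServerArgs field names. Each flag defaults to None so that an unset flag leaves the generated config untouched, while an explicit value such as --cpu-offload-gb 0 is still passed through.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser: only the three proposed flags are modeled here.
    parser = argparse.ArgumentParser(prog="sgl-omni serve")
    parser.add_argument("--tp-size", type=int, default=None,
                        help="Tensor parallel size override")
    parser.add_argument("--cpu-offload-gb", type=int, default=None,
                        help="CPU offload override, in GB")
    parser.add_argument("--mem-fraction-static", type=float, default=None,
                        help="Static memory fraction override")
    return parser


def collect_overrides(args: argparse.Namespace) -> dict:
    # Keys are assumed to match ServerArgs field names.
    candidates = {
        "tp_size": args.tp_size,
        "cpu_offload_gb": args.cpu_offload_gb,
        "mem_fraction_static": args.mem_fraction_static,
    }
    # Keep only flags the user actually set; 0 is a valid explicit value.
    return {key: value for key, value in candidates.items() if value is not None}
```

Filtering on `is not None` rather than truthiness matters here, since the TP=2 example passes --cpu-offload-gb 0 explicitly.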
Mapping Semantics
An initial implementation can apply these options to the pipeline's primary SGLang generation stage. For Qwen-style and Ming omni pipelines, this is the thinker stage.
For pipelines with multiple SGLang-backed stages, we could:
- apply only to the primary generation stage and document that behavior
- add a stage-targeting mechanism later
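The first option (apply only to the primary generation stage) could look roughly like the sketch below. The config layout is an assumption made for illustration: a dict with a "stages" list, where each stage has a "name" and a "server_args" dict. The real pipeline config schema may differ, and `apply_overrides` is a hypothetical helper.

```python
import copy


def apply_overrides(config: dict, overrides: dict,
                    primary_stage: str = "thinker") -> dict:
    # Work on a copy so the generated default config is left unmodified.
    updated = copy.deepcopy(config)
    for stage in updated["stages"]:
        if stage["name"] == primary_stage:
            # Merge CLI overrides on top of the stage's existing settings.
            stage.setdefault("server_args", {}).update(overrides)
            return updated
    # Fail loudly rather than silently dropping the user's flags.
    raise ValueError(f"no stage named {primary_stage!r} in pipeline config")


# Hypothetical config shape, using the defaults quoted in the motivation.
default_config = {
    "stages": [
        {"name": "thinker",
         "server_args": {"cpu_offload_gb": 0, "mem_fraction_static": 0.7}},
        {"name": "talker", "server_args": {}},
    ]
}

patched = apply_overrides(
    default_config, {"cpu_offload_gb": 80, "mem_fraction_static": 0.92}
)
```

The `primary_stage` parameter leaves room for the second option: a later stage-targeting flag could pass a different stage name without changing the merge logic.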
Example Target UX
TP=1:
CUDA_VISIBLE_DEVICES=0 \
sgl-omni serve \
--model-path inclusionAI/Ming-flash-omni-2.0 \
--port 8000 \
--model-name ming-omni \
--cpu-offload-gb 80 \
--mem-fraction-static 0.92
TP=2:
CUDA_VISIBLE_DEVICES=0,1 \
sgl-omni serve \
--model-path inclusionAI/Ming-flash-omni-2.0 \
--port 8000 \
--model-name ming-omni \
--tp-size 2 \
--cpu-offload-gb 0 \
--mem-fraction-static 0.80