[Feature] Add common SGLang runtime override options to sgl-omni serve #296

@edwingao28

Motivation

sgl-omni serve currently exposes only high-level serving options. Common SGLang runtime settings such as tensor parallel size, CPU offload, and static memory fraction can only be set by manually editing a pipeline config YAML.

This affects Ming-flash-omni-2.0 immediately because the default generated config for sgl-omni serve uses cpu_offload_gb=0 and mem_fraction_static=0.7, which is often not enough to launch the model. But the underlying problem is at the CLI level: users need a stable way to pass common SGLang ServerArgs overrides through sgl-omni serve.

Proposed CLI options

  • --tp-size
  • --cpu-offload-gb
  • --mem-fraction-static
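
A minimal sketch of how these flags could be registered, assuming an argparse-based CLI (the actual sgl-omni CLI framework may differ). The helper names `add_runtime_override_args` and `collect_overrides` are hypothetical. Each flag defaults to `None` so that an unset option leaves the generated config value untouched.

```python
import argparse


def add_runtime_override_args(parser: argparse.ArgumentParser) -> None:
    """Register the three proposed override flags (hypothetical helper)."""
    group = parser.add_argument_group("SGLang runtime overrides")
    # default=None means "not passed", so the pipeline config default is kept.
    group.add_argument("--tp-size", type=int, default=None,
                       help="Tensor parallel size")
    group.add_argument("--cpu-offload-gb", type=int, default=None,
                       help="GB of model weights to offload to CPU memory")
    group.add_argument("--mem-fraction-static", type=float, default=None,
                       help="Static memory fraction")


def collect_overrides(args: argparse.Namespace) -> dict:
    """Return only the overrides the user actually passed on the command line."""
    candidates = {
        "tp_size": args.tp_size,
        "cpu_offload_gb": args.cpu_offload_gb,
        "mem_fraction_static": args.mem_fraction_static,
    }
    return {key: value for key, value in candidates.items() if value is not None}
```

Keeping "unset" distinct from an explicit `0` matters here, since `--cpu-offload-gb 0` is a meaningful value in the TP=2 example.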

Mapping Semantics

An initial implementation can apply these options to the pipeline's primary SGLang generation stage. For Qwen-style and Ming omni pipelines, this is the thinker stage.

For pipelines with multiple SGLang-backed stages, we could:

  • apply only to the primary generation stage and document that behavior
  • add a stage-targeting mechanism later
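
The first option could be sketched roughly as follows. The config layout (a `stages` list whose entries have `name` and `server_args` keys), the function name `apply_overrides`, and the default stage name are all assumptions made for illustration and do not reflect the real sgl-omni pipeline config schema.

```python
import copy


def apply_overrides(config: dict, overrides: dict,
                    stage_name: str = "thinker") -> dict:
    """Merge CLI overrides into one stage's server args.

    Hypothetical layout: config["stages"] is a list of dicts, each with a
    "name" and an optional "server_args" dict. Other stages are left as-is.
    """
    updated = copy.deepcopy(config)  # do not mutate the caller's config
    for stage in updated["stages"]:
        if stage["name"] == stage_name:
            stage.setdefault("server_args", {}).update(overrides)
            return updated
    raise ValueError(f"no stage named {stage_name!r} in pipeline config")
```

The `stage_name` parameter leaves room for a later stage-targeting mechanism without changing the merge logic.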

Example Target UX

TP=1:

  CUDA_VISIBLE_DEVICES=0 \
    sgl-omni serve \
      --model-path inclusionAI/Ming-flash-omni-2.0 \
      --port 8000 \
      --model-name ming-omni \
      --cpu-offload-gb 80 \
      --mem-fraction-static 0.92

TP=2:

  CUDA_VISIBLE_DEVICES=0,1 \
    sgl-omni serve \
      --model-path inclusionAI/Ming-flash-omni-2.0 \
      --port 8000 \
      --model-name ming-omni \
      --tp-size 2 \
      --cpu-offload-gb 0 \
      --mem-fraction-static 0.80
