
[CLI] Expose --cpu-offload-gb, --tp-size, and --mem-fraction-static on sgl-omni#308

Open
edwingao28 wants to merge 1 commit into sgl-project:main from edwingao28:feat/cli-runtime-overrides-a

@edwingao28 (Collaborator) commented Apr 17, 2026

Motivation

sgl-omni serve only exposed high-level options; common SGLang runtime settings like cpu_offload_gb, mem_fraction_static, and tp_size required editing pipeline config YAML by hand. This blocks Ming-flash-omni-2.0 launches, where the generated defaults (cpu_offload_gb=0, mem_fraction_static=0.7) are often too tight, and blocks multi-GPU runs that need tp_size>1.

Update (4.22): This PR is on hold because a major codebase refactor is in progress. The changes will be reapplied once the refactored code is pushed to the main branch.

Modifications

  • sglang_omni/cli/serve.py: Add --cpu-offload-gb, --mem-fraction-static, and --tp-size; pass as server_args_overrides via ConfigManager.from_model_path or --config.
  • sglang_omni/config/schema.py: Add server_args_overrides and _apply_server_args_overrides to PipelineConfig; route via primary_sglang_stage.
  • sglang_omni/config/manager.py: Plumb server_args_overrides through ConfigManager.from_model_path.
  • sglang_omni/models/ming_omni/config.py: Set primary_sglang_stage = THINKER_STAGE; auto-inject disable_custom_all_reduce=True when tp_size>1 (Ming custom all-reduce kernel hangs under TP); remove legacy local override logic.
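
The override flow described in the list above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this PR: the plain-dict stage model, the function name `apply_server_args_overrides`, the `inject_disable_custom_all_reduce` parameter, and the stage key `"thinker"` are all hypothetical stand-ins for `PipelineConfig`, `_apply_server_args_overrides`, and `THINKER_STAGE`, whose real definitions are not shown here.

```python
from copy import deepcopy
from typing import Any, Dict, Optional

# Generated defaults quoted in the Motivation section; tp_size=1 is an assumption.
GENERATED_DEFAULTS = {"cpu_offload_gb": 0, "mem_fraction_static": 0.7, "tp_size": 1}


def apply_server_args_overrides(
    stages: Dict[str, Dict[str, Any]],
    primary_sglang_stage: str,
    overrides: Optional[Dict[str, Any]],
    inject_disable_custom_all_reduce: bool = False,
) -> Dict[str, Dict[str, Any]]:
    """Return a copy of `stages` with overrides merged into the primary stage."""
    result = deepcopy(stages)  # leave the caller's config untouched
    server_args = result[primary_sglang_stage]

    # Skip None values so that flags the user did not pass keep their defaults.
    for key, value in (overrides or {}).items():
        if value is not None:
            server_args[key] = value

    # Model-specific rule from the Ming config: disable custom all-reduce
    # whenever tensor parallelism is enabled.
    if inject_disable_custom_all_reduce and server_args.get("tp_size", 1) > 1:
        server_args["disable_custom_all_reduce"] = True
    return result


# Hypothetical single-stage pipeline keyed by an assumed stage name.
stages = {"thinker": dict(GENERATED_DEFAULTS)}
tp2 = apply_server_args_overrides(
    stages,
    "thinker",
    {"tp_size": 2, "mem_fraction_static": 0.80, "cpu_offload_gb": None},
    inject_disable_custom_all_reduce=True,
)
print(tp2["thinker"])
```

Routing every override through one named primary stage keeps the CLI independent of the model: each model config only has to declare which stage receives the SGLang server arguments.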

Example

TP=1:

CUDA_VISIBLE_DEVICES=0 \
sgl-omni serve \
  --model-path inclusionAI/Ming-flash-omni-2.0 \
  --port 8000 \
  --model-name ming-omni \
  --cpu-offload-gb 80 \
  --mem-fraction-static 0.92

TP=2 (Ming-Omni):

CUDA_VISIBLE_DEVICES=0,1 \
sgl-omni serve \
  --model-path inclusionAI/Ming-flash-omni-2.0 \
  --port 8000 \
  --model-name ming-omni \
  --tp-size 2 \
  --cpu-offload-gb 0 \
  --mem-fraction-static 0.80
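
For reference, the flag handling behind the two commands above might look roughly like this. It is a sketch using `argparse`; the actual `sglang_omni/cli/serve.py` may use a different CLI framework, and the helper names `build_parser` and `collect_overrides`, the `OVERRIDE_KEYS` tuple, and the argument types are assumptions made for illustration. The point it shows is that the three flags default to `None`, so an explicit `--cpu-offload-gb 0` is still forwarded while an omitted flag is not.

```python
import argparse
from typing import Any, Dict, List

# Keys forwarded as server_args_overrides (names taken from the PR description).
OVERRIDE_KEYS = ("cpu_offload_gb", "mem_fraction_static", "tp_size")


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser covering only the flags used in the examples."""
    parser = argparse.ArgumentParser(prog="sgl-omni serve")
    parser.add_argument("--model-path", required=True)
    parser.add_argument("--port", type=int)
    parser.add_argument("--model-name")
    # default=None distinguishes "flag not passed" from an explicit value of 0.
    parser.add_argument("--cpu-offload-gb", type=int, default=None)
    parser.add_argument("--mem-fraction-static", type=float, default=None)
    parser.add_argument("--tp-size", type=int, default=None)
    return parser


def collect_overrides(argv: List[str]) -> Dict[str, Any]:
    """Return only the runtime overrides the user actually supplied."""
    args = build_parser().parse_args(argv)
    return {
        key: getattr(args, key)
        for key in OVERRIDE_KEYS
        if getattr(args, key) is not None
    }


common = ["--model-path", "inclusionAI/Ming-flash-omni-2.0",
          "--port", "8000", "--model-name", "ming-omni"]

tp1_overrides = collect_overrides(
    common + ["--cpu-offload-gb", "80", "--mem-fraction-static", "0.92"]
)
tp2_overrides = collect_overrides(
    common + ["--tp-size", "2", "--cpu-offload-gb", "0",
              "--mem-fraction-static", "0.80"]
)
print(tp1_overrides)
print(tp2_overrides)
```

In the TP=1 case `tp_size` is absent from the resulting dict, so whatever default the pipeline config generates stays in effect.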

Related Issues

Closes #296

Checklist

  • Format your code according to pre-commit.
  • Add unit tests.
  • Update documentation / docstrings / example tutorials as needed.
  • Provide throughput / latency benchmark results and accuracy evaluation results as needed.
  • For reviewers: If you haven't made any contributions to this PR and are only assisting with merging the main branch, please remove yourself as a co-author when merging the PR.

@edwingao28 edwingao28 changed the title [CLI] Expose --cpu-offload-gb and --mem-fraction-static on sgl-omni s… [CLI] Expose --cpu-offload-gb and --mem-fraction-static on sgl-omni Apr 17, 2026
@edwingao28 edwingao28 force-pushed the feat/cli-runtime-overrides-a branch from 16f795f to ee90710 Compare April 21, 2026 00:32
@edwingao28 edwingao28 marked this pull request as ready for review April 21, 2026 00:33
@edwingao28 edwingao28 force-pushed the feat/cli-runtime-overrides-a branch from ee90710 to 53a612b Compare April 21, 2026 00:39
@edwingao28 edwingao28 changed the title [CLI] Expose --cpu-offload-gb and --mem-fraction-static on sgl-omni [CLI] Expose --cpu-offload-gb, --tp-size, and --mem-fraction-static on sgl-omni Apr 21, 2026
@edwingao28 edwingao28 force-pushed the feat/cli-runtime-overrides-a branch from 53a612b to 6ae047c Compare April 22, 2026 01:26
@edwingao28 edwingao28 force-pushed the feat/cli-runtime-overrides-a branch from 6ae047c to f2703c2 Compare April 22, 2026 01:27
Development

Successfully merging this pull request may close these issues.

[Feature] Add common SGLang runtime override options to sgl-omni serve
