
[CI] Calibrate v1 thresholds for cuda graph at 2026.05.06 #403

Open

zhaochenyang20 wants to merge 5 commits into main from calibrate-v1-thresholds-cuda-graph-20260506

Conversation


@zhaochenyang20 zhaochenyang20 commented May 6, 2026

Qwen3 Omni V1 CUDA Graph Calibration Report

This lightweight report records the Qwen3 Omni V1 threshold calibration after
verifying CUDA Graph replay for thinker/talker decode paths.

  • Model: qwen3-omni-v1
  • Repeats: 5
  • Stages: mmmu, mmmu_talker, mmsu, mmsu_talker, tts, videoamme, videoamme_talker, videomme, videomme_talker
  • Excluded: docs smoke tests
  • Local artifact directory: .tune-runs/20260506T220900Z_qwen3-omni-v1_cuda-graph_no-docs_r5
  • Full raw logs and result JSONs are intentionally kept local under .tune-runs/ and are not included in git.
  • Runtime evidence: all final-repeat pytest logs contain cuda graph: True decode batches.
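The runtime-evidence check above can be sketched as a small scan over the retained logs. The marker string and artifact directory are the ones quoted in this report; the `*.log` naming convention is an assumption about the local layout, so adjust the pattern as needed.

```python
import glob

# Directory and marker are taken from the report above; the log naming
# convention is an assumption about the local artifact layout.
RUN_DIR = ".tune-runs/20260506T220900Z_qwen3-omni-v1_cuda-graph_no-docs_r5"
MARKER = "cuda graph: True"

def logs_missing_marker(run_dir: str = RUN_DIR, pattern: str = "*.log") -> list[str]:
    """Return the logs under run_dir that never report a CUDA Graph decode batch."""
    missing = []
    for path in sorted(glob.glob(f"{run_dir}/{pattern}")):
        with open(path, encoding="utf-8", errors="replace") as f:
            if MARKER not in f.read():
                missing.append(path)
    return missing
```

An empty return list corresponds to the claim above that every final-repeat log contains `cuda graph: True` decode batches.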

Accuracy and WER

| Stage | Worst-of-5 |
| --- | --- |
| MMMU accuracy | 56.00% |
| MMMU talker accuracy | 70.00% |
| MMMU talker WER | 19.81% corpus WER, 3 samples >50% WER |
| MMSU accuracy | 69.60% |
| MMSU talker accuracy | 60.00% |
| MMSU talker WER | 7.84% corpus WER, 3 samples >50% WER |
| TTS WER | 2.66% corpus WER, 1 sample >50% WER |
| Video-AMME accuracy | 66.67% |
| Video-AMME talker accuracy | 50.00% |
| Video-AMME talker WER | 6.37% corpus WER, 2 samples >50% WER |
| Video-MME accuracy | 53.33% |
| Video-MME talker accuracy | 50.00% |
| Video-MME talker WER | 6.28% corpus WER, 0 samples >50% WER |
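The two aggregations in the table can be sketched as follows. Worst-of-5 takes the least favourable repeat (minimum for accuracy, maximum for WER), and corpus WER weights each sample by its reference length rather than averaging per-sample WERs. Both are standard definitions; whether the calibration harness computes them exactly this way is an assumption, and the function names are illustrative.

```python
def worst_of_repeats(values: list[float], higher_is_better: bool) -> float:
    """Least favourable repeat: min for accuracy-style metrics, max for WER-style."""
    return min(values) if higher_is_better else max(values)

def corpus_wer(edits_and_ref_lens: list[tuple[int, int]]) -> float:
    """Corpus-level WER from (edit_distance, reference_word_count) per sample.

    Total edits over total reference words, so long utterances weigh more
    than short ones -- unlike a plain mean of per-sample WERs.
    """
    total_edits = sum(e for e, _ in edits_and_ref_lens)
    total_words = sum(n for _, n in edits_and_ref_lens)
    return total_edits / total_words
```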

Speed Worst-of-5

| Stage | Throughput | Tok/s | Latency | RTF |
| --- | --- | --- | --- | --- |
| MMMU | 0.677 req/s | 53.10 | 11.230s | - |
| MMMU talker | 0.159 req/s | 5.70 | 23.796s | 0.3550 |
| MMSU | 29.911 req/s | 7.70 | 0.267s | - |
| MMSU talker | 0.280 req/s | 3.80 | 16.154s | 0.3895 |
| TTS | 3.986 req/s | 7.50 | 1.950s | 0.5828 |
| Video-AMME | 0.236 req/s | 0.90 | 51.633s | - |
| Video-AMME talker | 0.126 req/s | 1.30 | 35.860s | 6.8759 |
| Video-MME | 0.219 req/s | 2.00 | 56.231s | - |
| Video-MME talker | 0.130 req/s | 1.30 | 33.213s | 3.8507 |
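For readers unfamiliar with the RTF column: real-time factor is the standard ratio of generation wall-time to the duration of the audio produced, so values below 1.0 are faster than real time. Whether the harness measures wall-time for synthesis only or end-to-end is not stated here, so treat this as the textbook definition.

```python
def real_time_factor(generation_seconds: float, audio_seconds: float) -> float:
    """Wall-clock time spent generating audio, divided by the audio's duration."""
    return generation_seconds / audio_seconds

# TTS at RTF 0.5828 produces audio faster than it plays back;
# Video-AMME talker at RTF 6.8759 needs ~6.9 s of work per 1 s of audio.
```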

Applied Threshold Policy

Smart apply was used: speed thresholds that could be tightened automatically were
applied directly, and user-selected custom or confirmed values were applied for
the remaining interactive metrics. Metrics explicitly kept at their current
thresholds are not listed below.

| Stage | Metric | New threshold |
| --- | --- | --- |
| MMMU speed | throughput / tok/s / latency | 0.70 / 53.1 / 10.6 |
| MMMU talker WER | corpus WER | 0.20 |
| MMMU talker speed | throughput / tok/s / latency / RTF | 0.159 / 5.7 / 23.796 / 0.355 |
| MMSU speed | throughput / tok/s / latency | 29.911 / 7.7 / 0.267 |
| MMSU talker accuracy | accuracy floor | 0.6 |
| MMSU talker speed | tok/s / latency / RTF | 5.0 / 10.08 / 0.3895 |
| TTS WER | corpus WER | 0.03 |
| TTS speed | throughput / tok/s / latency / RTF | 3.986 / 7.5 / 1.95 / 0.5828 |
| Video-AMME talker speed | RTF | 6.8759 |
| Video-MME speed | tok/s | 2.0 |
| Video-MME talker speed | RTF | 3.8507 |

Notes

CUDA Graph produced large gains for TTS and several talker/text paths. Video
stages remain mixed because preprocessing, long prefill, audio synthesis, and
ASR can dominate over decode replay.

zhaochenyang20 and others added 2 commits May 6, 2026 21:57
Apply the local worst-of-5 calibration observations so V1 CI thresholds match the measured H20 reproduction run, and include a lightweight report pointing to the retained raw artifacts.

Co-authored-by: Cursor <[email protected]>
@zhaochenyang20

Skill improvements made:

  • Added a reusable Qwen3-Omni V1 no-docs calibration preset for all threshold stages, so future runs do not need to manually reconstruct the stage list.

  • Added generic networking guidance for CI-reproduction hosts: proxies and HuggingFace mirrors may be needed, but real proxy hosts, ports, tokens, usernames, and personal paths must never be committed to the skill.

  • Replaced environment-specific values with placeholders such as <proxy-url>, <hf-endpoint>, <hf-cache-dir>, and <venv-python> to keep the skill shareable across users and machines.

  • Documented that pytest should not be wrapped with proxychains4, because that can proxy localhost health checks and break local server startup.

  • Added a general performance optimization verification flow: compare commits between the previous calibration and the current calibration, identify which optimizations changed, and verify each optimization with runtime evidence rather than config alone.

  • Generalized the optimization check beyond CUDA Graph, with examples such as CUDA Graph replay, torch.compile, fused kernels, batching/concurrency changes, cache changes, scheduler changes, and preprocessing/audio/video pipeline changes.

  • Added short polling and resume guidance for long calibration runs: use frequent progress checks, inspect per-test logs before declaring a hang, and resume interrupted runs with the same output directory.

  • Clarified smart-apply behavior for custom values: user-provided custom thresholds are written as raw values and are not display-scaled or re-rounded.

  • Added optional version-control guidance: only commit/push after explicit authorization, keep .tune-runs/ local, commit only lightweight reports and relevant source changes, and include calibration evidence in the PR description.
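The networking guidance above can be sketched in one snippet: route downloads through the proxy and mirror while exempting localhost, instead of wrapping pytest in proxychains4 (which would also proxy the local server health checks). The `<proxy-url>` and `<hf-endpoint>` placeholders are the same ones used in the skill text.

```python
import os

# Proxy model/dataset downloads, but never the local health-check endpoints.
# <proxy-url> and <hf-endpoint> are placeholders from the skill text, not
# real values -- substitute your environment's settings.
env = dict(
    os.environ,
    HTTPS_PROXY="<proxy-url>",
    HF_ENDPOINT="<hf-endpoint>",
    NO_PROXY="localhost,127.0.0.1",
)
# Launch pytest with this environment rather than via proxychains4, e.g.:
# subprocess.run(["pytest", "tests/"], env=env, check=True)
```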

@zhaochenyang20 changed the title from "[WIP] Calibrate v1 thresholds cuda graph 20260506" to "[WIP] Calibrate v1 thresholds for cuda graph at 2026.05.06" on May 7, 2026
@zhaochenyang20 changed the title from "[WIP] Calibrate v1 thresholds for cuda graph at 2026.05.06" to "[CI] Calibrate v1 thresholds for cuda graph at 2026.05.06" on May 7, 2026
@zhaochenyang20 added the run-ci (Triggers GPU CI workflows) label on May 7, 2026
@zhaochenyang20

Qwen3 Omni V1 CUDA Graph Calibration Report

This lightweight report records the second Qwen3 Omni V1 threshold calibration
run, after verifying that the optimized decode path was active at runtime.

  • Model: qwen3-omni-v1
  • Repeats: 5
  • Stages: mmmu, mmmu_talker, mmsu, mmsu_talker, tts, videoamme, videoamme_talker, videomme, videomme_talker
  • Excluded: docs smoke tests
  • Local artifact directory: .tune-runs/20260506T220900Z_qwen3-omni-v1_cuda-graph_no-docs_r5
  • Full raw logs and JSON results are intentionally kept local under .tune-runs/ and are not included in git.
  • Runtime evidence: final-repeat pytest logs for all calibrated stages contain cuda graph: True decode batches.

Accuracy and WER

| Stage | Worst-of-5 |
| --- | --- |
| MMMU accuracy | 56.00% |
| MMMU talker accuracy | 70.00% |
| MMMU talker WER | 19.81% corpus WER, 3 samples >50% WER |
| MMSU accuracy | 69.60% |
| MMSU talker accuracy | 60.00% |
| MMSU talker WER | 7.84% corpus WER, 3 samples >50% WER |
| TTS WER | 2.66% corpus WER, 1 sample >50% WER |
| Video-AMME accuracy | 66.67% |
| Video-AMME talker accuracy | 50.00% |
| Video-AMME talker WER | 6.37% corpus WER, 2 samples >50% WER |
| Video-MME accuracy | 53.33% |
| Video-MME talker accuracy | 50.00% |
| Video-MME talker WER | 6.28% corpus WER, 0 samples >50% WER |

Speed Worst-of-5

| Stage | Throughput | Tok/s | Latency | RTF |
| --- | --- | --- | --- | --- |
| MMMU | 0.677 req/s | 53.10 | 11.230s | - |
| MMMU talker | 0.159 req/s | 5.70 | 23.796s | 0.3550 |
| MMSU | 29.911 req/s | 7.70 | 0.267s | - |
| MMSU talker | 0.280 req/s | 3.80 | 16.154s | 0.3895 |
| TTS | 3.986 req/s | 7.50 | 1.950s | 0.5828 |
| Video-AMME | 0.236 req/s | 0.90 | 51.633s | - |
| Video-AMME talker | 0.126 req/s | 1.30 | 35.860s | 6.8759 |
| Video-MME | 0.219 req/s | 2.00 | 56.231s | - |
| Video-MME talker | 0.130 req/s | 1.30 | 33.213s | 3.8507 |

Applied Threshold Policy

Smart apply was used. Automatically tightened speed thresholds were applied,
and user-selected custom or confirmed values were applied for the remaining
interactive metrics. Metrics explicitly kept at the current threshold are not
listed below.

| Stage | Metric | New threshold |
| --- | --- | --- |
| MMMU speed | throughput / tok/s / latency | 0.70 / 53.1 / 10.6 |
| MMMU talker WER | corpus WER | 0.20 |
| MMMU talker speed | throughput / tok/s / latency / RTF | 0.159 / 5.7 / 23.796 / 0.355 |
| MMSU speed | throughput / tok/s / latency | 29.911 / 7.7 / 0.267 |
| MMSU talker accuracy | accuracy floor | 0.6 |
| MMSU talker speed | tok/s / latency / RTF | 5.0 / 10.08 / 0.3895 |
| TTS WER | corpus WER | 0.03 |
| TTS speed | throughput / tok/s / latency / RTF | 3.986 / 7.5 / 1.95 / 0.5828 |
| Video-AMME talker speed | RTF | 6.8759 |
| Video-MME speed | tok/s | 2.0 |
| Video-MME talker speed | RTF | 3.8507 |

Notes

Accuracy did not show a broad regression in this run. MMSU text-only was
slightly below the existing 70% threshold at 69.60%, while MMSU talker improved
to 60.00%.

Performance improved strongly for TTS and several talker/text paths. Video
stages were mixed, likely because preprocessing, long prefill, audio synthesis,
ASR, or video decoding can dominate over decode replay.

