[Benchmark] Add Video-MME for Qwen3-Omni #327
zhaochenyang20 merged 6 commits into sgl-project:main
Conversation
Hey @Jayon02, thanks for pushing this through. A few thoughts:

On scope: please split this into three PRs. That decouples the bug fix from full-set runs, and CI from benchmark review.

On the dataset: YouTube link-rot is real. "First N reachable" is fine for getting it running, but CI needs a frozen snapshot — please push what's currently reachable to an HF dataset. Already-delisted videos are unrecoverable; don't worry about those.

On code quality: the existing video code paths were written early, and it shows. Don't rescue all of it here — stand up the benchmark + CI, mark rough edges with TODOs, and refactor incrementally. Priority: bug fix → benchmark → CI → cleanup.

Concrete next step: split out the prompt-length assert as PR 1 and get it merged fast.
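To make the "first N reachable" policy concrete, here is a minimal sketch of the filtering step that would precede pushing a frozen snapshot. The function name and the injected `is_reachable` checker are illustrative assumptions, not the PR's actual API; the checker is a parameter so the selection logic can be exercised without touching YouTube.

```python
# Hypothetical sketch: keep the first n video URLs a reachability
# checker accepts, before freezing them into an HF dataset snapshot.
from typing import Callable, Iterable


def first_n_reachable(
    urls: Iterable[str],
    n: int,
    is_reachable: Callable[[str], bool],
) -> list[str]:
    """Return the first n URLs the checker reports as reachable."""
    kept: list[str] = []
    for url in urls:
        if len(kept) == n:
            break
        if is_reachable(url):
            kept.append(url)
    return kept
```

In CI the resulting list would be frozen once and uploaded, so later link-rot cannot change the benchmark population.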
Discussion

The full benchmark is complete, but there are some regressions. After splitting the PR into three phases, many CI data points that previously passed are now failing. Interestingly, these failures seem to be state-related: a failed data point runs successfully on a fresh server, but in the pipeline, everything starts failing after a certain number of processed items. I suspect there's a state leak or an issue in the preprocessing stage that is causing this cumulative failure.

Previous Results

Current Results

How to run

python -m sglang_omni.cli.cli serve \
--model-path Qwen/Qwen3-Omni-30B-A3B-Instruct \
--text-only \
--thinker-max-seq-len 18000 \
--port 8000
python -m benchmarks.eval.benchmark_omni_videomme \
--model qwen3-omni \
--port 8000 \
  --repo-id zhaochenyang20/Video_MME_ci
Hey @Jayon02, did you find any reason this is happening? Could you look into this in detail and see if you can work out why? Thanks.
# User-pinned mem_fraction_static bypasses this reserve.
- OMNI_ENCODER_MEM_FRACTION_STATIC_RESERVE = 0.05
+ OMNI_ENCODER_MEM_FRACTION_STATIC_RESERVE = 0.20
This is my only concern. If CI passes, we can let this be and keep it for the Video use case.
Adds a 2520-sample Video-MME benchmark for sglang-omni AR engines:
- benchmarks/dataset/videomme.py: loads zhaochenyang20/Video_MME via snapshot_download; resolves per-sample video path + A-D choices.
- benchmarks/tasks/video_understanding.py: per-sample prompt builder, answer parser (choice extraction with MC-fallback), and output-format summaries for accuracy and per-duration / per-domain breakdowns.
- benchmarks/eval/benchmark_omni_videomme.py: driver script wiring the dataset, the runner, and the scoring/speed-summary tasks together.
- benchmarks/dataset/prepare.py / benchmarks/README.md: register 'videomme' in the prepare CLI and document it in the dataset index.
The docstring at the top of the eval script documents the canonical launch (--thinker-max-seq-len 32768, --encoder-mem-reserve 0.20) and the c=4 / max-tokens=256 bench command; full-set reference numbers will land in a follow-up commit after the run completes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Force-pushed 68f8651 to 622cb96
Force-pushed 622cb96 to 2495124
Runs the 50-sample videomme-ci-50 subset at concurrency=4 with the
thinker-only server (--thinker-max-seq-len 32768 --encoder-mem-reserve
0.20) and asserts accuracy, failed-request budget, and per-concurrency
speed thresholds derived from a 5-run H200 calibration on the rebased
main with apply_slack (0.75/1.25).
Thresholds (worst-of-5, no slack on accuracy/failed):
VIDEOMME_MIN_ACCURACY 0.56
VIDEOMME_MAX_FAILED 5 (see caveat below)
_VIDEOMME_P95.throughput 0.084
_VIDEOMME_P95.toks_agg 2.5
_VIDEOMME_P95.latency_s 46.3
5-run H200 data on the rebased main (PR sgl-project#327 + PR sgl-project#339 landed; same
fixture as before):
run_1: acc=0.66 correct=33/50 failed=0 tput=0.087 toks=2.6 lat=45.47
run_2: acc=0.56 correct=28/50 failed=5 tput=0.084 toks=2.6 lat=46.27
run_3: acc=0.60 correct=30/50 failed=0 tput=0.085 toks=2.5 lat=46.33
run_4: acc=0.64 correct=32/50 failed=0 tput=0.086 toks=2.7 lat=45.56
run_5: acc=0.60 correct=30/50 failed=0 tput=0.087 toks=2.7 lat=44.95
Versus the earlier pre-rebase snapshot {0.54, 0.58, 0.58, 0.62, 0.62}
with all-0 failed, current main's accuracy band shifted up while one
of the five cold runs dropped five requests to a CUDA OOM mid-run at
the pinned mem_fraction_static=0.729. The other four runs on that
same fixture completed with 0 failures, so this reads as a ~20%
cold-run flake rather than a systematic regression. VIDEOMME_MAX_FAILED is
therefore 5 (worst-of-5) rather than 0 — a PR that breaches this gate
is one that pushes failures strictly above the worst cold-run we have
evidence of.
The server fixture is module-scoped and pins both CLI flags so that
the test is anchored to the configuration that produced the
calibration, independent of future factory-default changes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
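The worst-of-5 calibration above can be sketched as follows. The exact `apply_slack` semantics are an assumption: accuracy and failed-request gates use the raw worst run (as the message states), while speed gates are checked against a 0.75/1.25 slack band; the function and dict names are illustrative, not the test's actual code.

```python
# Sketch of deriving CI thresholds from the 5 calibration runs above.
def apply_slack(value: float, factor: float) -> float:
    return value * factor


def derive_thresholds(runs: list[dict]) -> dict:
    return {
        "min_accuracy": min(r["acc"] for r in runs),    # no slack
        "max_failed": max(r["failed"] for r in runs),   # no slack
        "min_throughput": min(r["tput"] for r in runs),
        "max_latency_s": max(r["lat"] for r in runs),
    }


def gate_ok(measured: dict, t: dict, lo: float = 0.75, hi: float = 1.25) -> bool:
    # Accuracy/failed compare raw; speed compares against the slack band.
    return (
        measured["acc"] >= t["min_accuracy"]
        and measured["failed"] <= t["max_failed"]
        and measured["tput"] >= apply_slack(t["min_throughput"], lo)
        and measured["lat"] <= apply_slack(t["max_latency_s"], hi)
    )


# The five H200 calibration runs quoted in the commit message.
RUNS = [
    {"acc": 0.66, "failed": 0, "tput": 0.087, "lat": 45.47},
    {"acc": 0.56, "failed": 5, "tput": 0.084, "lat": 46.27},
    {"acc": 0.60, "failed": 0, "tput": 0.085, "lat": 46.33},
    {"acc": 0.64, "failed": 0, "tput": 0.086, "lat": 45.56},
    {"acc": 0.60, "failed": 0, "tput": 0.087, "lat": 44.95},
]
```

Run against `RUNS`, this reproduces the quoted gates: min accuracy 0.56, max failed 5, min throughput 0.084, and max latency 46.33 (quoted rounded as 46.3).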
Motivation
Modifications
Related Issues
Fixes #253
Accuracy Test
Benchmark & Profiling
Checklist