[Benchmark] Add Video-MME for Qwen3-Omni#327

Merged
zhaochenyang20 merged 6 commits into sgl-project:main from Jayon02:issue-253
Apr 25, 2026

Conversation

@Jayon02
Collaborator

@Jayon02 Jayon02 commented Apr 20, 2026

Motivation

Modifications

Related Issues

Fixes #253

Accuracy Test

Benchmark & Profiling

Checklist

  • Format your code according to pre-commit.
  • Add unit tests.
  • Update documentation / docstrings / example tutorials as needed.
  • Provide throughput / latency benchmark results and accuracy evaluation results as needed.
  • For reviewers: If you haven't made any contributions to this PR and are only assisting with merging the main branch, please remove yourself as a co-author when merging the PR.

@Jayon02

This comment was marked as outdated.

@Jayon02 Jayon02 changed the title [WIP] Add Video-MME support for Qwen3-Omni CI Add Video-MME benchmark for Qwen3-Omni Apr 21, 2026
@zhaochenyang20
Collaborator

Hey @Jayon02, thanks for pushing this through. A few thoughts:

On thinker_max_seq_len: You diagnosed it right — hardcoded 8192 with no assert means video inputs silently crash the CUDA kernel instead of failing at the request boundary. But please don't unify the parameter-passing story here. #318 is actively reworking the override layer (apply_server_args_overrides primitive, CLI precedence, schema layering); landing a second parameter through that mechanism while it's in flight will conflict. Keep the temporary path for now, add an explicit TODO comment noting it should migrate onto the #318 primitive afterward (same for video_fps and similar runtime params).
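The request-boundary check described above could look roughly like this. This is a minimal sketch: `check_prompt_length`, `RequestValidationError`, and the parameter names are illustrative placeholders, not the actual sglang-omni API.

```python
class RequestValidationError(ValueError):
    """Raised before scheduling when a request cannot possibly fit."""


def check_prompt_length(prompt_len: int, thinker_max_seq_len: int) -> None:
    # TODO(#318): migrate this parameter onto the apply_server_args_overrides
    # primitive once the override-layer rework lands.
    if prompt_len > thinker_max_seq_len:
        raise RequestValidationError(
            f"prompt length {prompt_len} exceeds thinker_max_seq_len "
            f"{thinker_max_seq_len}; raise --thinker-max-seq-len or reduce "
            f"video_fps / frame count"
        )
```

The point is simply to fail at the request boundary with an actionable message instead of letting an oversized video prompt crash the CUDA kernel mid-inference.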

Please split this into three PRs:

  1. Bug fix — the prompt-length assert that turns the CUDA crash into a clean request-level error. Ship this fast, it's the user-visible fix.
  2. Benchmark — benchmark_omni_videomme on the full set (~2h), following the MMMU / MMSU pattern with accuracy + speed numbers in the PR description.
  3. CI — the ~10min subset imported from the benchmark script, same pattern as MMSU / MMMU.

This decouples the bug fix from full-set runs, and the CI gate from benchmark review.

On the dataset: YouTube link-rot is real. "First N reachable" is fine for getting it running, but CI needs a frozen snapshot — please push what's currently reachable to an HF dataset. Already-delisted videos are unrecoverable, don't worry about those.
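The "first N reachable" selection that would feed a frozen snapshot can be sketched as below. This is a hypothetical helper: file presence stands in for YouTube reachability, and the sample schema (`video_path`) is assumed; the actual upload to the HF dataset would go through huggingface_hub (e.g. `HfApi.upload_folder` on a `repo_type="dataset"` repo).

```python
import os


def frozen_manifest(samples, n):
    """Keep up to n samples whose video file is actually present on disk
    (local file presence stands in for YouTube reachability here)."""
    reachable = [s for s in samples if os.path.exists(s["video_path"])]
    return reachable[:n]
```

Already-delisted videos simply never make it into the manifest, so the frozen snapshot is stable regardless of future link rot.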

On code quality: The existing video code paths were written early and it shows. Don't rescue all of it here — stand up the benchmark + CI, mark rough edges with TODOs, refactor incrementally. Priority: bug fix → benchmark → CI → cleanup.

Concrete next step: split out the prompt-length assert as PR 1 and get it merged fast.

@Jayon02 Jayon02 changed the title Add Video-MME benchmark for Qwen3-Omni [Benchmark] Add Video-MME for Qwen3-Omni Apr 23, 2026
@Jayon02 Jayon02 requested a review from zhaochenyang20 April 23, 2026 06:23
@Jayon02 Jayon02 marked this pull request as ready for review April 23, 2026 06:23
@Jayon02
Collaborator Author

Jayon02 commented Apr 23, 2026

Discussion

The full benchmark is complete, but there are some regressions. After splitting the PR into three phases, many CI data points that previously passed now fail. Interestingly, the failures look state-related: a failing data point runs successfully against a fresh server, but in the pipeline everything starts failing after a certain number of processed items. I suspect a state leak or an issue in the preprocessing stage is causing this cumulative failure.
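The fresh-server triage described here can be expressed as a small classifier: a sample that fails in the pipeline but passes alone against a fresh server points at leaked state rather than bad data. This is a sketch with assumed result shapes, not code from this PR.

```python
def classify_failures(pipeline_results, fresh_run):
    """pipeline_results / fresh_run: dicts mapping sample_id -> bool (passed?).

    fresh_run holds the outcome of re-running each pipeline failure
    individually against a freshly started server.
    """
    out = {}
    for sid, passed in pipeline_results.items():
        if passed:
            out[sid] = "ok"
        elif fresh_run.get(sid, False):
            out[sid] = "state-dependent"   # passes alone -> leaked state
        else:
            out[sid] = "data-dependent"    # fails even fresh -> bad sample
    return out
```

If most failures come back "state-dependent", the bug lives in server/preprocessing state accumulation, not in the individual samples.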

Previous Results

====================================================
  Video-MME Accuracy — qwen3-omni
====================================================
  Total samples:               50
  Correct:                     29
  Accuracy:                    0.5800 (58.0%)
  Failed requests:             0
  MC parse fallback:           1
====================================================

============================================================
                      Video-MME Speed                       
============================================================
  Model:                         qwen3-omni
  Concurrency:                   1
  Completed requests:            50
  Failed requests:               0
------------------------------------------------------------
  Latency mean (s):              19.719
  Latency median (s):            18.762
  Latency p95 (s):               28.401
  Latency p99 (s):               31.016
  Tok/s (per-req mean):          6.0
  Tok/s (per-req median):        5.7
  Tok/s (aggregate):             6.0
  Gen tokens (mean):             119
  Gen tokens (total):            5964
  Prompt tokens (mean):          10724
  Prompt tokens (total):         536220
  Throughput (req/s):            0.051
============================================================

Current Results

====================================================
  Video-MME Accuracy — qwen3-omni
====================================================
  Total samples:               50
  Correct:                     14
  Accuracy:                    0.2800 (28.0%)
  Failed requests:             26
  MC parse fallback:           0
====================================================

============================================================
                      Video-MME Speed                       
============================================================
  Model:                         qwen3-omni
  Concurrency:                   1
  Completed requests:            24
  Failed requests:               26
------------------------------------------------------------
  Latency mean (s):              24.702
  Latency median (s):            19.479
  Latency p95 (s):               54.725
  Latency p99 (s):               67.266
  Tok/s (per-req mean):          4.9
  Tok/s (per-req median):        4.3
  Tok/s (aggregate):             4.3
  Gen tokens (mean):             106
  Gen tokens (total):            2555
  Prompt tokens (mean):          10442
  Prompt tokens (total):         250612
  Throughput (req/s):            0.032
============================================================

How to run

python -m sglang_omni.cli.cli serve \
  --model-path Qwen/Qwen3-Omni-30B-A3B-Instruct \
  --text-only \
  --thinker-max-seq-len 18000 \
  --port 8000

python -m benchmarks.eval.benchmark_omni_videomme \
  --model qwen3-omni \
  --port 8000 \
  --repo-id zhaochenyang20/Video_MME_ci

@Ratish1
Collaborator

Ratish1 commented Apr 23, 2026

(Quoting @Jayon02's report and results above.)

Hey @Jayon02, did you find out why this is happening? Could you look into it in detail? Thanks.

  # User-pinned mem_fraction_static bypasses this reserve.
- OMNI_ENCODER_MEM_FRACTION_STATIC_RESERVE = 0.05
+ OMNI_ENCODER_MEM_FRACTION_STATIC_RESERVE = 0.20

This is my only concern. If CI pass, we can let this be and keep this for Video use case.

Adds a 2520-sample Video-MME benchmark for sglang-omni AR engines:

- benchmarks/dataset/videomme.py: loads zhaochenyang20/Video_MME via
  snapshot_download; resolves per-sample video path + A-D choices.
- benchmarks/tasks/video_understanding.py: per-sample prompt builder,
  answer parser (choice extraction with MC-fallback), and output-format
  summaries for accuracy and per-duration / per-domain breakdowns.
- benchmarks/eval/benchmark_omni_videomme.py: driver script wiring the
  dataset, the runner, and the scoring/speed-summary tasks together.
- benchmarks/dataset/prepare.py / benchmarks/README.md: register
  'videomme' in the prepare CLI and doc it in the dataset index.

The docstring at the top of the eval script documents the canonical
launch (--thinker-max-seq-len 32768, --encoder-mem-reserve 0.20) and
the c=4 / max-tokens=256 bench command; full-set reference numbers
will land in a follow-up commit after the run completes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@zhaochenyang20 zhaochenyang20 force-pushed the issue-253 branch 4 times, most recently from 68f8651 to 622cb96 on April 24, 2026 21:48
@zhaochenyang20 zhaochenyang20 merged commit ac0e112 into sgl-project:main Apr 25, 2026
6 checks passed
zhaochenyang20 added a commit to Jayon02/sglang-omni that referenced this pull request Apr 25, 2026
Runs the 50-sample videomme-ci-50 subset at concurrency=4 with the
thinker-only server (--thinker-max-seq-len 32768 --encoder-mem-reserve
0.20) and asserts accuracy, failed-request budget, and per-concurrency
speed thresholds derived from a 5-run H200 calibration on the rebased
main with apply_slack (0.75/1.25).

Thresholds (worst-of-5, no slack on accuracy/failed):

    VIDEOMME_MIN_ACCURACY     0.56
    VIDEOMME_MAX_FAILED       5     (see caveat below)
    _VIDEOMME_P95.throughput  0.084
    _VIDEOMME_P95.toks_agg    2.5
    _VIDEOMME_P95.latency_s   46.3

5-run H200 data on the rebased main (PR sgl-project#327 + PR sgl-project#339 landed; same
fixture as before):

    run_1: acc=0.66 correct=33/50 failed=0 tput=0.087 toks=2.6 lat=45.47
    run_2: acc=0.56 correct=28/50 failed=5 tput=0.084 toks=2.6 lat=46.27
    run_3: acc=0.60 correct=30/50 failed=0 tput=0.085 toks=2.5 lat=46.33
    run_4: acc=0.64 correct=32/50 failed=0 tput=0.086 toks=2.7 lat=45.56
    run_5: acc=0.60 correct=30/50 failed=0 tput=0.087 toks=2.7 lat=44.95

Versus the earlier pre-rebase snapshot {0.54, 0.58, 0.58, 0.62, 0.62}
with all-0 failed, current main's accuracy band shifted up while one
of the five cold runs dropped five requests to a CUDA OOM mid-run at
the pinned mem_fraction_static=0.729. The other four runs on that same
fixture completed with 0 failures, so this reads as a ~20% cold-run
flake rather than a systematic regression. VIDEOMME_MAX_FAILED is
therefore 5 (worst-of-5) rather than 0 — a PR that breaches this gate
is one that pushes failures strictly above the worst cold-run we have
evidence of.

The server fixture is module-scoped and pins both CLI flags so that
the test is anchored to the configuration that produced the
calibration, independent of future factory-default changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
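The worst-of-5 threshold derivation above can be checked mechanically. This is a sketch reproducing the calibration arithmetic; `worst_of` is an illustrative helper ("worst" is min for metrics where higher is better, max for latency and failed requests), and apply_slack (0.75/1.25) would widen the speed bounds further when enabled.

```python
def worst_of(values, higher_is_better):
    """Worst-case value across calibration runs for one metric."""
    return min(values) if higher_is_better else max(values)


# Per-run metrics from the 5-run H200 calibration quoted above.
acc    = [0.66, 0.56, 0.60, 0.64, 0.60]
failed = [0, 5, 0, 0, 0]
tput   = [0.087, 0.084, 0.085, 0.086, 0.087]
toks   = [2.6, 2.6, 2.5, 2.7, 2.7]
lat    = [45.47, 46.27, 46.33, 45.56, 44.95]

assert worst_of(acc, True) == 0.56                 # VIDEOMME_MIN_ACCURACY
assert worst_of(failed, False) == 5                # VIDEOMME_MAX_FAILED
assert worst_of(tput, True) == 0.084               # _VIDEOMME_P95.throughput
assert worst_of(toks, True) == 2.5                 # _VIDEOMME_P95.toks_agg
assert round(worst_of(lat, False), 1) == 46.3      # _VIDEOMME_P95.latency_s
```

Each stated threshold is exactly the worst value observed across the five runs, with the latency bound rounded to one decimal.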

Development

Successfully merging this pull request may close these issues.

[RFC] Comprehensive CI Coverage for Qwen3 Omni: All Modality Combinations

3 participants