[Benchmark] Add Video-MME for Qwen3-Omni #327
zhaochenyang20 merged 6 commits into sgl-project:main
Conversation
Hey @Jayon02, thanks for pushing this through. A few thoughts:

On scope: please split this into three PRs. That decouples the bug fix from full-set runs, and CI from benchmark review.

On the dataset: YouTube link-rot is real. "First N reachable" is fine for getting it running, but CI needs a frozen snapshot — please push what's currently reachable to an HF dataset. Already-delisted videos are unrecoverable; don't worry about those.

On code quality: the existing video code paths were written early, and it shows. Don't rescue all of it here — stand up the benchmark + CI, mark rough edges with TODOs, and refactor incrementally. Priority: bug fix → benchmark → CI → cleanup.

Concrete next step: split out the prompt-length assert as PR 1 and get it merged fast.
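To make the "first N reachable" policy concrete, here is a minimal sketch of the filtering step that would precede pushing a frozen snapshot. The function name and the injected `is_reachable` checker are illustrative assumptions, not the PR's actual API; the checker is a parameter so the selection logic can be exercised without touching YouTube.

```python
# Hypothetical sketch: keep the first n video URLs a reachability
# checker accepts, before freezing them into an HF dataset snapshot.
from typing import Callable, Iterable


def first_n_reachable(
    urls: Iterable[str],
    n: int,
    is_reachable: Callable[[str], bool],
) -> list[str]:
    """Return the first n URLs the checker reports as reachable."""
    kept: list[str] = []
    for url in urls:
        if len(kept) == n:
            break
        if is_reachable(url):
            kept.append(url)
    return kept
```

In CI the resulting list would be frozen once and uploaded, so later link-rot cannot change the benchmark population.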
Discussion

The full benchmark is complete, but there are some regressions. After splitting the PR into three phases, many CI data points that previously passed are now failing. Interestingly, these failures seem to be state-related: a failed data point runs successfully on a fresh server, but in the pipeline, everything starts failing after a certain number of processed items. I suspect there's a state leak or an issue in the preprocessing stage that is causing this cumulative failure.

Previous Results

Current Results

How to run

python -m sglang_omni.cli.cli serve \
--model-path Qwen/Qwen3-Omni-30B-A3B-Instruct \
--text-only \
--thinker-max-seq-len 18000 \
--port 8000
python -m benchmarks.eval.benchmark_omni_videomme \
--model qwen3-omni \
--port 8000 \
  --repo-id zhaochenyang20/Video_MME_ci
Hey @Jayon02, did you find any reason this is happening? Could you look into this in detail and see if you can work out why? Thanks.
# User-pinned mem_fraction_static bypasses this reserve.
- OMNI_ENCODER_MEM_FRACTION_STATIC_RESERVE = 0.05
+ OMNI_ENCODER_MEM_FRACTION_STATIC_RESERVE = 0.20
This is my only concern. If CI passes, we can let this be and keep it for the Video use case.
Adds a 2520-sample Video-MME benchmark for sglang-omni AR engines:
- benchmarks/dataset/videomme.py: loads zhaochenyang20/Video_MME via snapshot_download; resolves per-sample video path + A-D choices.
- benchmarks/tasks/video_understanding.py: per-sample prompt builder, answer parser (choice extraction with MC-fallback), and output-format summaries for accuracy and per-duration / per-domain breakdowns.
- benchmarks/eval/benchmark_omni_videomme.py: driver script wiring the dataset, the runner, and the scoring/speed-summary tasks together.
- benchmarks/dataset/prepare.py / benchmarks/README.md: register 'videomme' in the prepare CLI and document it in the dataset index.
The docstring at the top of the eval script documents the canonical launch (--thinker-max-seq-len 32768, --encoder-mem-reserve 0.20) and the c=4 / max-tokens=256 bench command; full-set reference numbers will land in a follow-up commit after the run completes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Force-pushed 68f8651 to 622cb96
Force-pushed 622cb96 to 2495124
Runs the 50-sample videomme-ci-50 subset at concurrency=4 with the
thinker-only server (--thinker-max-seq-len 32768 --encoder-mem-reserve
0.20) and asserts accuracy, failed-request budget, and per-concurrency
speed thresholds derived from a 5-run H200 calibration on the rebased
main with apply_slack (0.75/1.25).
Thresholds (worst-of-5, no slack on accuracy/failed):
VIDEOMME_MIN_ACCURACY 0.56
VIDEOMME_MAX_FAILED 5 (see caveat below)
_VIDEOMME_P95.throughput 0.084
_VIDEOMME_P95.toks_agg 2.5
_VIDEOMME_P95.latency_s 46.3
5-run H200 data on the rebased main (PR sgl-project#327 + PR sgl-project#339 landed; same
fixture as before):
run_1: acc=0.66 correct=33/50 failed=0 tput=0.087 toks=2.6 lat=45.47
run_2: acc=0.56 correct=28/50 failed=5 tput=0.084 toks=2.6 lat=46.27
run_3: acc=0.60 correct=30/50 failed=0 tput=0.085 toks=2.5 lat=46.33
run_4: acc=0.64 correct=32/50 failed=0 tput=0.086 toks=2.7 lat=45.56
run_5: acc=0.60 correct=30/50 failed=0 tput=0.087 toks=2.7 lat=44.95
Versus the earlier pre-rebase snapshot {0.54, 0.58, 0.58, 0.62, 0.62}
with all-0 failed, current main's accuracy band shifted up while one
of the five cold runs dropped five requests to a CUDA OOM mid-run at
the pinned mem_fraction_static=0.729. The other four runs on that
same fixture completed with 0 failures, so this reads as a ~20%
cold-run flake rather than a systematic regression. VIDEOMME_MAX_FAILED is
therefore 5 (worst-of-5) rather than 0 — a PR that breaches this gate
is one that pushes failures strictly above the worst cold-run we have
evidence of.
The server fixture is module-scoped and pins both CLI flags so that
the test is anchored to the configuration that produced the
calibration, independent of future factory-default changes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
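The worst-of-5 calibration above can be sketched as follows. The exact `apply_slack` semantics are an assumption: accuracy and failed-request gates use the raw worst run (as the message states), while speed gates are checked against a 0.75/1.25 slack band; the function and dict names are illustrative, not the test's actual code.

```python
# Sketch of deriving CI thresholds from the 5 calibration runs above.
def apply_slack(value: float, factor: float) -> float:
    return value * factor


def derive_thresholds(runs: list[dict]) -> dict:
    return {
        "min_accuracy": min(r["acc"] for r in runs),    # no slack
        "max_failed": max(r["failed"] for r in runs),   # no slack
        "min_throughput": min(r["tput"] for r in runs),
        "max_latency_s": max(r["lat"] for r in runs),
    }


def gate_ok(measured: dict, t: dict, lo: float = 0.75, hi: float = 1.25) -> bool:
    # Accuracy/failed compare raw; speed compares against the slack band.
    return (
        measured["acc"] >= t["min_accuracy"]
        and measured["failed"] <= t["max_failed"]
        and measured["tput"] >= apply_slack(t["min_throughput"], lo)
        and measured["lat"] <= apply_slack(t["max_latency_s"], hi)
    )


# The five H200 calibration runs quoted in the commit message.
RUNS = [
    {"acc": 0.66, "failed": 0, "tput": 0.087, "lat": 45.47},
    {"acc": 0.56, "failed": 5, "tput": 0.084, "lat": 46.27},
    {"acc": 0.60, "failed": 0, "tput": 0.085, "lat": 46.33},
    {"acc": 0.64, "failed": 0, "tput": 0.086, "lat": 45.56},
    {"acc": 0.60, "failed": 0, "tput": 0.087, "lat": 44.95},
]
```

Run against `RUNS`, this reproduces the quoted gates: min accuracy 0.56, max failed 5, min throughput 0.084, and max latency 46.33 (quoted rounded as 46.3).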
Motivation
Modifications
Related Issues
Fixes #253
Accuracy Test
Benchmark & Profiling
Checklist