
Commit 68f8651

[Benchmark] Add Video-MME for Qwen3-Omni thinker-only
Adds a 2520-sample Video-MME benchmark for sglang-omni AR engines:

- benchmarks/dataset/videomme.py: loads zhaochenyang20/Video_MME via snapshot_download; resolves the per-sample video path and the A-D choices.
- benchmarks/tasks/video_understanding.py: per-sample prompt builder, answer parser (choice extraction with MC-fallback), and output-format summaries for accuracy plus per-duration / per-domain breakdowns.
- benchmarks/eval/benchmark_omni_videomme.py: driver script wiring the dataset, the runner, and the scoring/speed-summary tasks together.
- benchmarks/dataset/prepare.py / benchmarks/README.md: register 'videomme' in the prepare CLI and document it in the dataset index.

The docstring at the top of the eval script documents the canonical launch (--thinker-max-seq-len 32768, --encoder-mem-reserve 0.20) and the c=4 / max-tokens=256 bench command; full-set reference numbers will land in a follow-up commit after the run completes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent f5b607c commit 68f8651

5 files changed

Lines changed: 612 additions & 5 deletions


benchmarks/README.md

Lines changed: 11 additions & 4 deletions
````diff
@@ -1,13 +1,13 @@
 # SGLang Omni Benchmarks
 
 Benchmark suite for SGLang Omni, covering performance (latency, throughput, RTF)
-and accuracy (WER, MMSU, MMMU) across supported modality combinations.
+and accuracy (WER, MMSU, MMMU, Video-MME) across supported modality combinations.
 
 ## Directory Structure
 
 ```
 benchmarks/
-├── tasks/ # Per-task logic (tts, audio_understanding, visual_understand)
+├── tasks/ # Per-task logic (tts, audio_understanding, visual_understand, video_understanding)
 ├── metrics/ # Metric computation (performance, accuracy)
 ├── dataset/ # Dataset loaders + download helpers
 ├── benchmarker/ # Framework: runner, data structures, utilities
@@ -98,6 +98,10 @@ python -m benchmarks.eval.benchmark_omni_mmsu \
 # 5. Qwen3-Omni — MMMU (VLM accuracy, image input)
 python -m benchmarks.eval.benchmark_omni_mmmu \
     --model qwen3-omni --port 8000 --max-samples 50 --max-concurrency 16
+
+# 6. Qwen3-Omni — Video-MME (video understanding)
+python -m benchmarks.eval.benchmark_omni_videomme \
+    --model qwen3-omni --port 8000 --max-samples 50
 ```
 
 ## Eval Scripts
@@ -108,6 +112,7 @@ python -m benchmarks.eval.benchmark_omni_mmmu \
 | `eval/benchmark_omni_seedtts.py` | TTS speed + WER (unified) | Qwen3-Omni | `/v1/chat/completions` |
 | `eval/benchmark_omni_mmsu.py` | MMSU (audio comprehension) | Qwen3-Omni | `/v1/chat/completions` |
 | `eval/benchmark_omni_mmmu.py` | MMMU (VLM accuracy + speed) | Qwen3-Omni | `/v1/chat/completions` |
+| `eval/benchmark_omni_videomme.py` | Video-MME (video understanding) | Qwen3-Omni | `/v1/chat/completions` |
 
 The two `*_seedtts.py` scripts merge the previous `benchmark_*_tts_speed.py`
 and `voice_clone_*_wer.py` pairs into a single two-phase pipeline: phase 1
@@ -138,9 +143,11 @@ python -m benchmarks.dataset.prepare --dataset seedtts-50 # 50-sample subset
 python -m benchmarks.dataset.prepare --dataset mmmu # full MMMU (30 subjects)
 python -m benchmarks.dataset.prepare --dataset mmmu-ci-50 # MMMU CI subset
 python -m benchmarks.dataset.prepare --dataset mmsu # full MMSU (ddwang2000/MMSU)
+python -m benchmarks.dataset.prepare --dataset videomme-ci-50 # Video-MME CI subset
+python -m benchmarks.dataset.prepare --dataset videomme # full Video-MME
 ```
 
 SeedTTS datasets are materialized into `./seedtts_testset/` (override with
-`--local-dir`). MMMU/MMSU datasets are pre-warmed into the default
-HuggingFace cache and consumed via `datasets.load_dataset(repo_id)`, so
+`--local-dir`). MMMU/MMSU/Video-MME datasets are pre-warmed into the default
+HuggingFace cache and then consumed via `datasets.load_dataset(repo_id)`, so
 `--local-dir` is a no-op for them.
````
benchmarks/dataset/prepare.py

Lines changed: 5 additions & 1 deletion
```diff
@@ -7,10 +7,12 @@
 python -m benchmarks.dataset.prepare --dataset seedtts-mini
 python -m benchmarks.dataset.prepare --dataset seedtts-50
 
-# MMMU / MMSU (pre-warm the HuggingFace datasets cache)
+# MMMU / MMSU / Video-MME (pre-warm the HuggingFace datasets cache)
 python -m benchmarks.dataset.prepare --dataset mmmu
 python -m benchmarks.dataset.prepare --dataset mmmu-ci-50
 python -m benchmarks.dataset.prepare --dataset mmsu
+python -m benchmarks.dataset.prepare --dataset videomme
+python -m benchmarks.dataset.prepare --dataset videomme-ci-50
 """
 
 from __future__ import annotations
@@ -30,6 +32,8 @@
     "mmmu-ci-50": "zhaochenyang20/mmmu-ci-50",
     "mmsu": "ddwang2000/MMSU",
     "mmsu-ci-2000": "zhaochenyang20/mmsu-ci-2000",
+    "videomme": "zhaochenyang20/Video_MME",
+    "videomme-ci-50": "zhaochenyang20/Video_MME_ci",
 }
 
 _CLI_LOCAL_DIRS: dict[str, str] = {
```
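The prepare CLI presumably resolves these aliases to HuggingFace repo ids before pre-warming the cache with `snapshot_download`; a minimal sketch of that lookup under that assumption (`resolve_repo_id` is a hypothetical helper, only the mapping entries come from the diff):

```python
# Alias -> HF dataset repo id, as registered in benchmarks/dataset/prepare.py.
_CLI_DATASETS: dict[str, str] = {
    "videomme": "zhaochenyang20/Video_MME",
    "videomme-ci-50": "zhaochenyang20/Video_MME_ci",
}


def resolve_repo_id(name: str) -> str:
    """Hypothetical helper: map a CLI dataset alias to its repo id.

    Failing loudly on an unknown alias keeps typos from silently
    downloading nothing.
    """
    try:
        return _CLI_DATASETS[name]
    except KeyError:
        raise ValueError(
            f"Unknown dataset '{name}'. Known: {sorted(_CLI_DATASETS)}"
        ) from None
```

The resolved repo id would then be handed to `huggingface_hub.snapshot_download(repo_id=..., repo_type="dataset")`, matching how the loader below fetches the snapshot.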

benchmarks/dataset/videomme.py

Lines changed: 156 additions & 0 deletions
```python
# SPDX-License-Identifier: Apache-2.0
"""Video-MME dataset loader for local benchmarks."""

from __future__ import annotations

import logging
import re
from dataclasses import dataclass, field
from pathlib import Path

from datasets import load_dataset
from huggingface_hub import snapshot_download

logger = logging.getLogger(__name__)


@dataclass
class VideoMMESample:
    sample_id: str
    video_path: str
    question: str
    options: list[str]
    answer: str
    url: str = ""
    video_id: str = ""
    question_id: str = ""
    duration: str = "short"
    domain: str = "unknown"
    task_type: str = "understanding"
    sub_category: str = ""
    prompt: str = ""
    all_choices: list[str] = field(default_factory=list)
    index2ans: dict[str, str] = field(default_factory=dict)


def _strip_option_prefix(option: str) -> str:
    return re.sub(r"^[A-D]\.\s*", "", option.strip())


def format_videomme_prompt(question: str, options: list[str]) -> str:
    prompt = f"{question.strip()}\n"
    for index, option in enumerate(options):
        letter = chr(ord("A") + index)
        prompt += f"{letter}. {option}\n"
    prompt += (
        "\nAnswer the following multiple-choice question. "
        "The last line of your response should be of the "
        "following format: 'Answer: $LETTER' (without quotes) "
        "where LETTER is one of the options. "
        "Think step by step before answering."
    )
    return prompt


def _resolve_video_path(snapshot_dir: Path, row: dict, question_id: str) -> str | None:
    relative_path = row.get("video_path")
    if not relative_path:
        logger.warning(
            "Skipping Video-MME sample %s because the dataset row has no video_path",
            question_id,
        )
        return None
    absolute_path = snapshot_dir / str(relative_path)
    if not absolute_path.exists():
        logger.warning(
            "Skipping Video-MME sample %s because the video file does not exist at %s",
            question_id,
            absolute_path,
        )
        return None
    return str(absolute_path)


def _dataset_to_samples(
    dataset,
    *,
    snapshot_dir: Path,
    max_samples: int | None,
) -> list[VideoMMESample]:
    samples: list[VideoMMESample] = []
    for row_index, row in enumerate(dataset):
        duration = str(row.get("duration", "short")).strip()
        question_id = str(row.get("question_id", f"videomme:{row_index}")).strip()

        options = [_strip_option_prefix(str(option)) for option in row["options"]]
        all_choices = [chr(ord("A") + i) for i in range(len(options))]
        index2ans = {choice: option for choice, option in zip(all_choices, options)}
        video_id = str(row["video_id"]).strip()
        url = str(row["url"]).strip()
        video_path = _resolve_video_path(snapshot_dir, row, question_id)
        if not video_path:
            continue

        samples.append(
            VideoMMESample(
                sample_id=question_id,
                video_path=video_path,
                question=str(row["question"]).strip(),
                options=options,
                answer=str(row["answer"]).strip(),
                url=url,
                video_id=video_id,
                question_id=question_id,
                duration=duration,
                domain=str(row.get("domain", "unknown")).strip(),
                task_type=str(row.get("task_type", "understanding")).strip(),
                sub_category=str(row.get("sub_category", "")).strip(),
                prompt=format_videomme_prompt(str(row["question"]).strip(), options),
                all_choices=all_choices,
                index2ans=index2ans,
            )
        )
        if max_samples is not None and len(samples) >= max_samples:
            break

    return samples


def _load_metadata_dataset(snapshot_dir: Path, split: str):
    data_dir = snapshot_dir / "data"
    split_parts = sorted(data_dir.glob(f"{split}_part_*.jsonl"))
    if split_parts:
        return load_dataset(
            "json",
            data_files=[str(path) for path in split_parts],
            split="train",
        )

    split_file = data_dir / f"{split}.jsonl"
    if split_file.exists():
        return load_dataset("json", data_files=str(split_file), split="train")

    available = sorted(path.name for path in data_dir.glob("*.jsonl"))
    raise ValueError(
        f"Split '{split}' not found under {data_dir}. Available files: {available}"
    )


def load_videomme_samples(
    max_samples: int | None = None,
    *,
    repo_id: str | None = None,
    split: str = "test",
) -> list[VideoMMESample]:
    resolved_repo_id = repo_id or "zhaochenyang20/Video_MME"
    snapshot_dir = Path(
        snapshot_download(repo_id=resolved_repo_id, repo_type="dataset")
    )
    dataset = _load_metadata_dataset(snapshot_dir, split)
    samples = _dataset_to_samples(
        dataset,
        snapshot_dir=snapshot_dir,
        max_samples=max_samples,
    )
    logger.info("Loaded %d Video-MME samples", len(samples))
    return samples
```
