Add a PR-ready SocialOmni benchmark path for sglang-omni #352
Alexisxty wants to merge 1 commit into sgl-project:main from
Conversation
This wires SocialOmni into the existing benchmarks surface with a single entrypoint that supports both level1 and level2, deterministic mini data materialization, and 1-judge/3-judge evaluation flows. The implementation is flattened into benchmark-style single files so it matches the surrounding repo conventions instead of introducing a nested package layout. The integration also hardens the live path by validating judge configuration up front, preflighting OpenAI-compatible endpoints before long runs, and keeping Q2 averages faithful to returned zero scores rather than silently dropping them. README and dataset preparation guidance are updated so maintainers can prepare full or mini SocialOmni data with the same benchmark workflow used by the other benchmarks.

- Constraint: The upstream benchmark surface should stay aligned with existing single-file benchmark conventions.
- Constraint: SocialOmni must support both level1/level2 and 1-judge/3-judge flows from one entrypoint.
- Rejected: Split SocialOmni into separate level1 and level2 benchmark families | user and plan required a single benchmark surface.
- Rejected: Keep nested dataset/task package directories | did not match upstream benchmark file conventions.
- Confidence: high
- Scope-risk: moderate
- Reversibility: clean
- Directive: Keep socialomni-mini deterministic and update its manifest/tests together if dataset selection changes.
- Tested: `./.venv/bin/pytest tests/test_socialomni_benchmark.py -q`
- Tested: `./.venv/bin/python -m py_compile benchmarks/dataset/prepare.py benchmarks/dataset/socialomni.py benchmarks/tasks/socialomni.py benchmarks/eval/benchmark_omni_socialomni.py tests/test_socialomni_benchmark.py`
- Tested: `git diff --check -- benchmarks/README.md benchmarks/dataset/prepare.py benchmarks/dataset/socialomni.py benchmarks/tasks/socialomni.py benchmarks/eval/benchmark_omni_socialomni.py tests/test_socialomni_benchmark.py`
- Not-tested: Live smoke against a real sglang-omni served model (local OpenAI-compatible stubs were used previously for flow validation).
Pull request overview
Adds a PR-ready SocialOmni benchmark flow (dataset prep + loaders + eval entrypoint) so omni models can be evaluated through the existing OpenAI-compatible /v1/chat/completions serving API.
Changes:
- Introduces SocialOmni dataset preparation (full + deterministic socialomni-mini) and sample loaders.
- Adds SocialOmni benchmark task logic for level1 scoring and level2 Q1/Q2 + judge workflow.
- Adds a new user-facing eval entrypoint plus documentation and unit tests.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| benchmarks/dataset/socialomni.py | Implements SocialOmni dataset materialization (full/mini) and level1/level2 sample loaders. |
| benchmarks/tasks/socialomni.py | Adds request helpers, prompt building, parsing/scoring, level2 workflow, judge orchestration, and metrics. |
| benchmarks/eval/benchmark_omni_socialomni.py | Adds CLI entrypoint to run level1/level2 benchmarks against OpenAI-compatible endpoints (plus judge preflight). |
| benchmarks/dataset/prepare.py | Extends the dataset preparation CLI to support socialomni and socialomni-mini via a dedicated handler. |
| benchmarks/README.md | Documents SocialOmni dataset preparation and eval usage. |
| tests/test_socialomni_benchmark.py | Adds unit tests for dataset prep/loading and level1/level2 benchmark flows (including judge preflight). |
```python
completed = await asyncio.gather(*[_limited(sample) for sample in samples])

per_sample = [item[0] for item in completed]
primary_requests = [request for _, requests, _ in completed for request in requests]
judge_requests = [request for _, _, requests in completed for request in requests]
```
`run_socialomni_level2_benchmark` builds a list of coroutines for all samples and submits them to `asyncio.gather(...)` at once. For large SocialOmni runs this can create thousands of pending tasks and significantly increase memory/overhead. Consider a bounded producer/consumer pattern (e.g., `asyncio.as_completed` with a fixed task set, or chunking) so only ~`max_concurrency` tasks are live at a time.
```diff
-completed = await asyncio.gather(*[_limited(sample) for sample in samples])
-per_sample = [item[0] for item in completed]
-primary_requests = [request for _, requests, _ in completed for request in requests]
-judge_requests = [request for _, _, requests in completed for request in requests]
+completed: list[
+    tuple[dict[str, Any], list[RequestResult], list[RequestResult]] | None
+] = [None] * len(samples)
+sample_iter = iter(enumerate(samples))
+pending: dict[asyncio.Task[tuple[int, tuple[dict[str, Any], list[RequestResult], list[RequestResult]]]], int] = {}
+
+def _schedule_next() -> bool:
+    try:
+        index, sample = next(sample_iter)
+    except StopIteration:
+        return False
+
+    async def _run_one(
+        sample_index: int, sample_value: SocialOmniLevel2Sample
+    ) -> tuple[int, tuple[dict[str, Any], list[RequestResult], list[RequestResult]]]:
+        return sample_index, await _limited(sample_value)
+
+    task = asyncio.create_task(_run_one(index, sample))
+    pending[task] = index
+    return True
+
+for _ in range(min(max(1, max_concurrency), len(samples))):
+    if not _schedule_next():
+        break
+
+while pending:
+    done, _ = await asyncio.wait(pending, return_when=asyncio.FIRST_COMPLETED)
+    for task in done:
+        pending.pop(task, None)
+        index, result = await task
+        completed[index] = result
+        _schedule_next()
+
+ordered_completed = [item for item in completed if item is not None]
+per_sample = [item[0] for item in ordered_completed]
+primary_requests = [request for _, requests, _ in ordered_completed for request in requests]
+judge_requests = [request for _, _, requests in ordered_completed for request in requests]
```
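If the full producer/consumer machinery feels heavy, a smaller change that still bounds live tasks is to gather in fixed-size chunks. A minimal sketch, assuming the surrounding `_limited` coroutine, `samples` list, and `max_concurrency` value from this function:

```python
# Minimal chunking sketch (assumes the surrounding _limited coroutine,
# samples list, and max_concurrency). Only one chunk of tasks is live at a
# time; the tradeoff is that each chunk waits on its slowest sample.
completed = []
chunk_size = max(1, max_concurrency)
for start in range(0, len(samples), chunk_size):
    chunk = samples[start : start + chunk_size]
    completed.extend(await asyncio.gather(*(_limited(sample) for sample in chunk)))
```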
```python
rel_name = Path(str(row["video_path"]).strip()).name
_copy_file(
    source_root / LEVEL1_VIDEOS_DIR / rel_name, level1_videos_dir / rel_name
)
```
In `_materialize_mini_dataset`, level1 videos are copied using only `Path(row["video_path"]).name`, but the written `dataset.json` keeps the original `video_path` values unchanged. If the upstream dataset contains any subdirectory components in `video_path`, the prepared mini dataset will reference paths that were not copied (and `load_socialomni_level1_samples` will fail to find the files). Consider either (a) preserving the relative directory structure when copying, or (b) normalizing/re-writing each row's `video_path` to the basename you actually copy.
```diff
-rel_name = Path(str(row["video_path"]).strip()).name
-_copy_file(
-    source_root / LEVEL1_VIDEOS_DIR / rel_name, level1_videos_dir / rel_name
+rel_path = Path(str(row["video_path"]).strip())
+destination_path = level1_videos_dir / rel_path
+destination_path.parent.mkdir(parents=True, exist_ok=True)
+_copy_file(
+    source_root / LEVEL1_VIDEOS_DIR / rel_path, destination_path
```
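Option (b) from the comment would instead keep a flat copy and rewrite the manifest row. A minimal sketch, assuming `row` is a level1 dict later serialized to `dataset.json`:

```python
# Sketch of option (b): copy from the original relative location, flatten the
# destination to the basename, and rewrite the row so dataset.json matches
# what was actually copied. Assumes `row` is a level1 dict later serialized
# to dataset.json; basename collisions across subdirectories are not handled.
rel_path = Path(str(row["video_path"]).strip())
_copy_file(
    source_root / LEVEL1_VIDEOS_DIR / rel_path, level1_videos_dir / rel_path.name
)
row["video_path"] = rel_path.name
```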
```python
for row in filtered_level2_rows:
    rel_name = Path(str(row["video_file"]).strip()).name
    _copy_file(
        source_root / LEVEL2_VIDEOS_DIR / rel_name, level2_videos_dir / rel_name
    )
```
Same issue for level2: videos are copied using `Path(row["video_file"]).name` but the `annotations.json` rows are written without updating `video_file`. If `video_file` contains directories in the source annotations, the mini dataset will end up with broken references. Please normalize `video_file` in the emitted rows (or copy using the original relative path), e.g. as in the sketch below.
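A minimal sketch of that normalization, assuming `filtered_level2_rows` is the list serialized to `annotations.json`:

```python
# Sketch: copy from the original relative location, flatten to the basename,
# and rewrite the emitted row so annotations.json matches the copied file.
# Basename collisions across subdirectories are not handled here.
for row in filtered_level2_rows:
    rel_path = Path(str(row["video_file"]).strip())
    _copy_file(
        source_root / LEVEL2_VIDEOS_DIR / rel_path, level2_videos_dir / rel_path.name
    )
    row["video_file"] = rel_path.name
```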
```python
with tempfile.TemporaryDirectory(
    prefix=f"socialomni_{sample.sample_id}_"
) as tmpdir:
    cut_path = str(Path(tmpdir) / Path(sample.video_path).name)
    cut_video_prefix(
        sample.video_path,
        parse_level2_timestamp_to_seconds(sample.question_1.timestamp),
        cut_path,
    )
```
`cut_video_prefix()` is invoked via `subprocess.run()` inside `run_socialomni_level2_sample()` (an async function). This blocks the event loop during ffmpeg execution, so increasing `max_concurrency` won't help and other in-flight HTTP requests will stall. Consider running the prefix cut via `asyncio.to_thread(...)` / an executor, or using `asyncio.create_subprocess_exec` to avoid blocking the loop.
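A minimal sketch of the `asyncio.to_thread` variant, assuming `cut_video_prefix` keeps its current synchronous signature:

```python
# Sketch: run the blocking ffmpeg wrapper in a worker thread so the event
# loop keeps serving other in-flight requests. Assumes cut_video_prefix
# keeps its synchronous (source, seconds, destination) signature.
await asyncio.to_thread(
    cut_video_prefix,
    sample.video_path,
    parse_level2_timestamp_to_seconds(sample.question_1.timestamp),
    cut_path,
)
```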
```python
session,
judge=judge,
sample_id=sample.sample_id,
video_path=sample.video_path,
```
The judge requests are sent with `video_path=sample.video_path` (the full, uncut video), while the primary model Q1/Q2 requests use `cut_path` (prefix-cut video). This means judges can see post-timestamp context the model did not, which can skew scores and diverges from the intended "judge the interruption at that moment" protocol. Pass `cut_path` (or otherwise ensure judges evaluate the same prefix clip) to `run_socialomni_judge`.
```diff
-video_path=sample.video_path,
+video_path=cut_path,
```
Could you submit your results on the full set? @Alexisxty
Motivation
Add a PR-ready SocialOmni benchmark path for sglang-omni so users can evaluate omni models through the existing OpenAI-compatible serving API.
The benchmark follows the existing repository pattern: users start the target model service separately, then point the benchmark entrypoint to that endpoint with `--base-url`/`--model`.
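For example, a hypothetical invocation under that pattern (only `--base-url` and `--model` are named in this PR; the level flag is illustrative):

```bash
# Hypothetical usage sketch: serve the model first, then point the benchmark
# at the OpenAI-compatible endpoint. --level is an assumed flag name.
python benchmarks/eval/benchmark_omni_socialomni.py \
  --base-url http://127.0.0.1:30000/v1 \
  --model my-omni-model \
  --level level2
```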
Modifications

- Extends dataset preparation with `socialomni` and deterministic `socialomni-mini` preparation.
- Adds dataset materialization and sample loaders in `benchmarks/dataset/socialomni.py`.
- Adds `benchmarks/tasks/socialomni.py` for request helpers, prompt building, parsing/scoring, the level2 workflow, judge orchestration, and metrics.
- Adds the eval entrypoint `benchmarks/eval/benchmark_omni_socialomni.py`.
- Documents dataset preparation and eval usage in `benchmarks/README.md`.

Related Issues
N/A
Accuracy Test
N/A. This PR adds benchmark plumbing and does not modify model architecture, kernels, or inference numerics.
Benchmark & Profiling
N/A. This PR adds a benchmark entrypoint and is not expected to change model serving throughput or latency.
Local API-flow smoke testing was performed with OpenAI-compatible mock main and judge endpoints for both SocialOmni level1 and level2. The benchmark relies on the existing sglang-omni serving stack for real model startup.
Checklist
Verification: