Add a PR-ready SocialOmni benchmark path for sglang-omni #352
Alexisxty wants to merge 1 commit into sgl-project:main from
Conversation
This wires SocialOmni into the existing benchmarks surface with a single entrypoint that supports both level1 and level2, deterministic mini data materialization, and 1-judge/3-judge evaluation flows. The implementation is flattened into benchmark-style single files so it matches the surrounding repo conventions instead of introducing a nested package layout. The integration also hardens the live path by validating judge configuration up front, preflighting OpenAI-compatible endpoints before long runs, and keeping Q2 averages faithful to returned zero scores rather than silently dropping them. README and dataset preparation guidance are updated so maintainers can prepare full or mini SocialOmni data with the same benchmark workflow used by the other benchmarks.

- Constraint: The upstream benchmark surface should stay aligned with existing single-file benchmark conventions.
- Constraint: SocialOmni must support both level1/level2 and 1-judge/3-judge flows from one entrypoint.
- Rejected: Split SocialOmni into separate level1 and level2 benchmark families | user and plan required a single benchmark surface.
- Rejected: Keep nested dataset/task package directories | did not match upstream benchmark file conventions.
- Confidence: high
- Scope-risk: moderate
- Reversibility: clean
- Directive: Keep socialomni-mini deterministic and update its manifest/tests together if dataset selection changes.
- Tested: `./.venv/bin/pytest tests/test_socialomni_benchmark.py -q`
- Tested: `./.venv/bin/python -m py_compile benchmarks/dataset/prepare.py benchmarks/dataset/socialomni.py benchmarks/tasks/socialomni.py benchmarks/eval/benchmark_omni_socialomni.py tests/test_socialomni_benchmark.py`
- Tested: `git diff --check -- benchmarks/README.md benchmarks/dataset/prepare.py benchmarks/dataset/socialomni.py benchmarks/tasks/socialomni.py benchmarks/eval/benchmark_omni_socialomni.py tests/test_socialomni_benchmark.py`
- Not-tested: Live smoke against a real sglang-omni served model (local OpenAI-compatible stubs were used previously for flow validation).
Pull request overview
Adds a PR-ready SocialOmni benchmark flow (dataset prep + loaders + eval entrypoint) so omni models can be evaluated through the existing OpenAI-compatible /v1/chat/completions serving API.
Changes:
- Introduces SocialOmni dataset preparation (full + deterministic socialomni-mini) and sample loaders.
- Adds SocialOmni benchmark task logic for level1 scoring and level2 Q1/Q2 + judge workflow.
- Adds a new user-facing eval entrypoint plus documentation and unit tests.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| benchmarks/dataset/socialomni.py | Implements SocialOmni dataset materialization (full/mini) and level1/level2 sample loaders. |
| benchmarks/tasks/socialomni.py | Adds request helpers, prompt building, parsing/scoring, level2 workflow, judge orchestration, and metrics. |
| benchmarks/eval/benchmark_omni_socialomni.py | Adds CLI entrypoint to run level1/level2 benchmarks against OpenAI-compatible endpoints (plus judge preflight). |
| benchmarks/dataset/prepare.py | Extends the dataset preparation CLI to support socialomni and socialomni-mini via a dedicated handler. |
| benchmarks/README.md | Documents SocialOmni dataset preparation and eval usage. |
| tests/test_socialomni_benchmark.py | Adds unit tests for dataset prep/loading and level1/level2 benchmark flows (including judge preflight). |
```python
completed = await asyncio.gather(*[_limited(sample) for sample in samples])

per_sample = [item[0] for item in completed]
primary_requests = [request for _, requests, _ in completed for request in requests]
judge_requests = [request for _, _, requests in completed for request in requests]
```
`run_socialomni_level2_benchmark` builds a list of coroutines for all samples and submits them to `asyncio.gather(...)` at once. For large SocialOmni runs this can create thousands of pending tasks and significantly increase memory/overhead. Consider a bounded producer/consumer pattern (e.g., `asyncio.as_completed` with a fixed task set, or chunking) so only ~`max_concurrency` tasks are live at a time.
```diff
-completed = await asyncio.gather(*[_limited(sample) for sample in samples])
-per_sample = [item[0] for item in completed]
-primary_requests = [request for _, requests, _ in completed for request in requests]
-judge_requests = [request for _, _, requests in completed for request in requests]
+completed: list[
+    tuple[dict[str, Any], list[RequestResult], list[RequestResult]] | None
+] = [None] * len(samples)
+sample_iter = iter(enumerate(samples))
+pending: dict[asyncio.Task[tuple[int, tuple[dict[str, Any], list[RequestResult], list[RequestResult]]]], int] = {}
+
+def _schedule_next() -> bool:
+    try:
+        index, sample = next(sample_iter)
+    except StopIteration:
+        return False
+
+    async def _run_one(
+        sample_index: int, sample_value: SocialOmniLevel2Sample
+    ) -> tuple[int, tuple[dict[str, Any], list[RequestResult], list[RequestResult]]]:
+        return sample_index, await _limited(sample_value)
+
+    task = asyncio.create_task(_run_one(index, sample))
+    pending[task] = index
+    return True
+
+for _ in range(min(max(1, max_concurrency), len(samples))):
+    if not _schedule_next():
+        break
+
+while pending:
+    done, _ = await asyncio.wait(pending, return_when=asyncio.FIRST_COMPLETED)
+    for task in done:
+        pending.pop(task, None)
+        index, result = await task
+        completed[index] = result
+        _schedule_next()
+
+ordered_completed = [item for item in completed if item is not None]
+per_sample = [item[0] for item in ordered_completed]
+primary_requests = [request for _, requests, _ in ordered_completed for request in requests]
+judge_requests = [request for _, _, requests in ordered_completed for request in requests]
```
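If the full producer/consumer machinery feels heavy, a smaller change that still bounds live tasks is to gather in fixed-size chunks. A minimal sketch, assuming the surrounding `_limited` coroutine, `samples` list, and `max_concurrency` value from this function:

```python
# Minimal chunking sketch (assumes the surrounding _limited coroutine,
# samples list, and max_concurrency). Only one chunk of tasks is live at a
# time; the tradeoff is that each chunk waits on its slowest sample.
completed = []
chunk_size = max(1, max_concurrency)
for start in range(0, len(samples), chunk_size):
    chunk = samples[start : start + chunk_size]
    completed.extend(await asyncio.gather(*(_limited(sample) for sample in chunk)))
```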
```python
rel_name = Path(str(row["video_path"]).strip()).name
_copy_file(
    source_root / LEVEL1_VIDEOS_DIR / rel_name, level1_videos_dir / rel_name
)
```
In `_materialize_mini_dataset`, level1 videos are copied using only `Path(row["video_path"]).name`, but the written `dataset.json` keeps the original `video_path` values unchanged. If the upstream dataset contains any subdirectory components in `video_path`, the prepared mini dataset will reference paths that were not copied (and `load_socialomni_level1_samples` will fail to find the files). Consider either (a) preserving the relative directory structure when copying, or (b) normalizing/re-writing each row's `video_path` to the basename you actually copy.
```diff
-rel_name = Path(str(row["video_path"]).strip()).name
-_copy_file(
-    source_root / LEVEL1_VIDEOS_DIR / rel_name, level1_videos_dir / rel_name
+rel_path = Path(str(row["video_path"]).strip())
+destination_path = level1_videos_dir / rel_path
+destination_path.parent.mkdir(parents=True, exist_ok=True)
+_copy_file(
+    source_root / LEVEL1_VIDEOS_DIR / rel_path, destination_path
```
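Option (b) from the comment would instead keep a flat copy and rewrite the manifest row. A minimal sketch, assuming `row` is a level1 dict later serialized to `dataset.json`:

```python
# Sketch of option (b): copy from the original relative location, flatten the
# destination to the basename, and rewrite the row so dataset.json matches
# what was actually copied. Assumes `row` is a level1 dict later serialized
# to dataset.json; basename collisions across subdirectories are not handled.
rel_path = Path(str(row["video_path"]).strip())
_copy_file(
    source_root / LEVEL1_VIDEOS_DIR / rel_path, level1_videos_dir / rel_path.name
)
row["video_path"] = rel_path.name
```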
```python
for row in filtered_level2_rows:
    rel_name = Path(str(row["video_file"]).strip()).name
    _copy_file(
        source_root / LEVEL2_VIDEOS_DIR / rel_name, level2_videos_dir / rel_name
    )
```
Same issue for level2: videos are copied using `Path(row["video_file"]).name` but the `annotations.json` rows are written without updating `video_file`. If `video_file` contains directories in the source annotations, the mini dataset will end up with broken references. Please normalize `video_file` in the emitted rows (or copy using the original relative path), e.g. as in the sketch below.
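A minimal sketch of that normalization, assuming `filtered_level2_rows` is the list serialized to `annotations.json`:

```python
# Sketch: copy from the original relative location, flatten to the basename,
# and rewrite the emitted row so annotations.json matches the copied file.
# Basename collisions across subdirectories are not handled here.
for row in filtered_level2_rows:
    rel_path = Path(str(row["video_file"]).strip())
    _copy_file(
        source_root / LEVEL2_VIDEOS_DIR / rel_path, level2_videos_dir / rel_path.name
    )
    row["video_file"] = rel_path.name
```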
```python
with tempfile.TemporaryDirectory(
    prefix=f"socialomni_{sample.sample_id}_"
) as tmpdir:
    cut_path = str(Path(tmpdir) / Path(sample.video_path).name)
    cut_video_prefix(
        sample.video_path,
        parse_level2_timestamp_to_seconds(sample.question_1.timestamp),
        cut_path,
    )
```
`cut_video_prefix()` is invoked via `subprocess.run()` inside `run_socialomni_level2_sample()` (an async function). This blocks the event loop during ffmpeg execution, so increasing `max_concurrency` won't help and other in-flight HTTP requests will stall. Consider running the prefix cut via `asyncio.to_thread(...)` / an executor, or using `asyncio.create_subprocess_exec` to avoid blocking the loop.
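A minimal sketch of the `asyncio.to_thread` variant, assuming `cut_video_prefix` keeps its current synchronous signature:

```python
# Sketch: run the blocking ffmpeg wrapper in a worker thread so the event
# loop keeps serving other in-flight requests. Assumes cut_video_prefix
# keeps its synchronous (source, seconds, destination) signature.
await asyncio.to_thread(
    cut_video_prefix,
    sample.video_path,
    parse_level2_timestamp_to_seconds(sample.question_1.timestamp),
    cut_path,
)
```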
```python
session,
judge=judge,
sample_id=sample.sample_id,
video_path=sample.video_path,
```
The judge requests are sent with `video_path=sample.video_path` (the full, uncut video), while the primary model Q1/Q2 requests use `cut_path` (prefix-cut video). This means judges can see post-timestamp context the model did not, which can skew scores and diverges from the intended "judge the interruption at that moment" protocol. Pass `cut_path` (or otherwise ensure judges evaluate the same prefix clip) to `run_socialomni_judge`.
```diff
-video_path=sample.video_path,
+video_path=cut_path,
```
Could you submit your results on the full set? @Alexisxty
Motivation
Add a PR-ready SocialOmni benchmark path for sglang-omni so users can evaluate omni models through the existing OpenAI-compatible serving API.
The benchmark follows the existing repository pattern: users start the target model service separately, then point the benchmark entrypoint to that endpoint with `--base-url`/`--model`.
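For example, a hypothetical invocation under that pattern (only `--base-url` and `--model` are named in this PR; the level flag is illustrative):

```bash
# Hypothetical usage sketch: serve the model first, then point the benchmark
# at the OpenAI-compatible endpoint. --level is an assumed flag name.
python benchmarks/eval/benchmark_omni_socialomni.py \
  --base-url http://127.0.0.1:30000/v1 \
  --model my-omni-model \
  --level level2
```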
Modifications

- Extends dataset preparation with `socialomni` and deterministic `socialomni-mini` preparation.
- Adds dataset materialization and sample loaders in `benchmarks/dataset/socialomni.py`.
- Adds `benchmarks/tasks/socialomni.py` for request helpers, prompt building, parsing/scoring, the level2 workflow, judge orchestration, and metrics.
- Adds the eval entrypoint `benchmarks/eval/benchmark_omni_socialomni.py`.
- Documents dataset preparation and eval usage in `benchmarks/README.md`.

Related Issues
N/A
Accuracy Test
N/A. This PR adds benchmark plumbing and does not modify model architecture, kernels, or inference numerics.
Benchmark & Profiling
N/A. This PR adds a benchmark entrypoint and is not expected to change model serving throughput or latency.
Local API-flow smoke testing was performed with OpenAI-compatible mock main and judge endpoints for both SocialOmni level1 and level2. The benchmark relies on the existing sglang-omni serving stack for real model startup.
Checklist
Verification: