Add a PR-ready SocialOmni benchmark path for sglang-omni#352

Open
Alexisxty wants to merge 1 commit into sgl-project:main from Alexisxty:socialomni-benchmark

Conversation


@Alexisxty Alexisxty commented Apr 25, 2026

Motivation

Add a PR-ready SocialOmni benchmark path for sglang-omni so users can evaluate omni models through the existing OpenAI-compatible serving API.

The benchmark follows the existing repository pattern: users start the target model service separately, then point the benchmark entrypoint to that endpoint with --base-url / --model.
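
For illustration, a minimal sketch of the kind of base-URL normalization such an entrypoint typically performs before sending requests (normalize_chat_endpoint is a hypothetical name; the actual helper in benchmarks/tasks/socialomni.py may differ):

def normalize_chat_endpoint(base_url: str) -> str:
    # Hypothetical sketch: accept bare hosts, /v1 roots, or full chat URLs
    # and normalize them to the OpenAI-compatible chat completions path.
    base = base_url.rstrip("/")
    if base.endswith("/v1/chat/completions"):
        return base
    if base.endswith("/v1"):
        return base + "/chat/completions"
    return base + "/v1/chat/completions"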

Modifications

  • Add SocialOmni dataset preparation and loading support, including socialomni and deterministic socialomni-mini preparation.
  • Add a single-file SocialOmni dataset module at benchmarks/dataset/socialomni.py.
  • Add a single-file SocialOmni task module at benchmarks/tasks/socialomni.py for:
    • level1 prompt construction, request sending, answer parsing, and accuracy summary (see the parsing sketch after this list);
    • level2 Q1/Q2 workflow, video prefix cutting, judge requests, and result aggregation;
    • main endpoint and judge endpoint preflight checks.
  • Add the user-facing entrypoint benchmarks/eval/benchmark_omni_socialomni.py.
  • Document SocialOmni usage in benchmarks/README.md.
  • Add unit tests covering dataset materialization, endpoint normalization/preflight, level1 scoring, level2 branching, judge parsing/aggregation, and entrypoint behavior.
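
For orientation, a minimal sketch of level1 answer parsing and accuracy aggregation (illustrative only; the real logic lives in benchmarks/tasks/socialomni.py and may differ, e.g., in option range or fallback handling):

import re

# Hypothetical sketch: pull the first standalone A-D option letter out of a
# free-form model reply, then score exact matches against references.
_CHOICE_RE = re.compile(r"\b([A-D])\b")


def parse_level1_answer(text: str) -> str | None:
    match = _CHOICE_RE.search(text.strip().upper())
    return match.group(1) if match else None


def level1_accuracy(predictions: list[str | None], references: list[str]) -> float:
    if not references:
        return 0.0
    return sum(p == r for p, r in zip(predictions, references)) / len(references)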

Related Issues

N/A

Accuracy Test

N/A. This PR adds benchmark plumbing and does not modify model architecture, kernels, or inference numerics.

Benchmark & Profiling

N/A. This PR adds a benchmark entrypoint and is not expected to change model serving throughput or latency.

Local API-flow smoke was validated with OpenAI-compatible mock main and judge endpoints for both SocialOmni level1 and level2. The benchmark relies on the existing sglang-omni serving stack for real model startup.
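
For reference, a minimal sketch of the kind of OpenAI-compatible mock endpoint such a smoke run can target (illustrative; the actual test doubles live in tests/test_socialomni_benchmark.py and may differ):

import json
from http.server import BaseHTTPRequestHandler, HTTPServer


class MockChatHandler(BaseHTTPRequestHandler):
    # Hypothetical stub: always answers "A" on /v1/chat/completions so the
    # benchmark's request/parse/score flow can be exercised without a model.
    def do_POST(self):
        if self.path.endswith("/v1/chat/completions"):
            body = json.dumps(
                {"choices": [{"message": {"role": "assistant", "content": "A"}}]}
            ).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()


if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8000), MockChatHandler).serve_forever()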

Checklist

  • Format your code with pre-commit.
  • Add unit tests.
  • Update documentation / docstrings / example tutorials as needed.
  • Provide throughput / latency benchmark results and accuracy evaluation results as needed.

Verification:

uv run --with pre-commit pre-commit run --files \
  benchmarks/README.md \
  benchmarks/dataset/prepare.py \
  benchmarks/dataset/socialomni.py \
  benchmarks/eval/benchmark_omni_socialomni.py \
  benchmarks/tasks/socialomni.py \
  tests/test_socialomni_benchmark.py

./.venv/bin/pytest tests/test_socialomni_benchmark.py -q

./.venv/bin/python -m py_compile \
  benchmarks/dataset/prepare.py \
  benchmarks/dataset/socialomni.py \
  benchmarks/tasks/socialomni.py \
  benchmarks/eval/benchmark_omni_socialomni.py \
  tests/test_socialomni_benchmark.py

git diff --check

Results:

- pre-commit run --files ...: passed
- pytest tests/test_socialomni_benchmark.py -q: 13 passed
- py_compile: passed
- git diff --check: passed

This wires SocialOmni into the existing benchmarks surface with a single
entrypoint that supports both level1 and level2, deterministic mini data
materialization, and 1-judge/3-judge evaluation flows. The implementation is
flattened into benchmark-style single files so it matches the surrounding repo
conventions instead of introducing a nested package layout.
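
For illustration, deterministic mini selection can be as simple as a fixed seed over a stable ordering (a sketch of the assumed approach; the actual selection in benchmarks/dataset/socialomni.py may differ):

import random
from typing import Any


def select_mini_rows(
    rows: list[dict[str, Any]], k: int = 50, seed: int = 0
) -> list[dict[str, Any]]:
    # Stable sort plus a fixed-seed RNG makes the mini subset reproducible
    # across runs and machines; "sample_id" is an assumed key name.
    ordered = sorted(rows, key=lambda row: str(row["sample_id"]))
    return random.Random(seed).sample(ordered, min(k, len(ordered)))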

The integration also hardens the live path by validating judge configuration
up front, preflighting OpenAI-compatible endpoints before long runs, and keeping
Q2 averages faithful to returned zero scores rather than silently dropping them.
README and dataset preparation guidance are updated so maintainers can prepare
full or mini SocialOmni data with the same benchmark workflow used by the other
benchmarks.
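
The zero-faithful Q2 averaging amounts to the following (a minimal sketch; the actual aggregation helper may be named differently):

def average_judge_scores(scores: list[float | None]) -> float:
    # Explicit zeros returned by a judge count toward the mean; only missing
    # (None) scores are dropped, so a zero cannot silently vanish.
    kept = [score for score in scores if score is not None]
    return sum(kept) / len(kept) if kept else 0.0


# Example with three judges: the returned 0.0 still pulls the average down.
assert average_judge_scores([4.0, 0.0, 2.0]) == 2.0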

Constraint: The upstream benchmark surface should stay aligned with existing single-file benchmark conventions
Constraint: SocialOmni must support both level1/level2 and 1-judge/3-judge flows from one entrypoint
Rejected: Split SocialOmni into separate level1 and level2 benchmark families | user and plan required a single benchmark surface
Rejected: Keep nested dataset/task package directories | did not match upstream benchmark file conventions
Confidence: high
Scope-risk: moderate
Reversibility: clean
Directive: Keep socialomni-mini deterministic and update its manifest/tests together if dataset selection changes
Tested: ./.venv/bin/pytest tests/test_socialomni_benchmark.py -q
Tested: ./.venv/bin/python -m py_compile benchmarks/dataset/prepare.py benchmarks/dataset/socialomni.py benchmarks/tasks/socialomni.py benchmarks/eval/benchmark_omni_socialomni.py tests/test_socialomni_benchmark.py
Tested: git diff --check -- benchmarks/README.md benchmarks/dataset/prepare.py benchmarks/dataset/socialomni.py benchmarks/tasks/socialomni.py benchmarks/eval/benchmark_omni_socialomni.py tests/test_socialomni_benchmark.py
Not-tested: Live smoke against a real sglang-omni served model (local OpenAI-compatible stubs were used previously for flow validation)
Copilot AI review requested due to automatic review settings April 25, 2026 06:21

Copilot AI left a comment


Pull request overview

Adds a PR-ready SocialOmni benchmark flow (dataset prep + loaders + eval entrypoint) so omni models can be evaluated through the existing OpenAI-compatible /v1/chat/completions serving API.

Changes:

  • Introduces SocialOmni dataset preparation (full + deterministic socialomni-mini) and sample loaders.
  • Adds SocialOmni benchmark task logic for level1 scoring and level2 Q1/Q2 + judge workflow.
  • Adds a new user-facing eval entrypoint plus documentation and unit tests.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.

Summary per file:

  • benchmarks/dataset/socialomni.py: Implements SocialOmni dataset materialization (full/mini) and level1/level2 sample loaders.
  • benchmarks/tasks/socialomni.py: Adds request helpers, prompt building, parsing/scoring, level2 workflow, judge orchestration, and metrics.
  • benchmarks/eval/benchmark_omni_socialomni.py: Adds CLI entrypoint to run level1/level2 benchmarks against OpenAI-compatible endpoints (plus judge preflight).
  • benchmarks/dataset/prepare.py: Extends the dataset preparation CLI to support socialomni and socialomni-mini via a dedicated handler.
  • benchmarks/README.md: Documents SocialOmni dataset preparation and eval usage.
  • tests/test_socialomni_benchmark.py: Adds unit tests for dataset prep/loading and level1/level2 benchmark flows (including judge preflight).


Comment on lines +782 to +786
    completed = await asyncio.gather(*[_limited(sample) for sample in samples])

    per_sample = [item[0] for item in completed]
    primary_requests = [request for _, requests, _ in completed for request in requests]
    judge_requests = [request for _, _, requests in completed for request in requests]

Copilot AI Apr 25, 2026


run_socialomni_level2_benchmark builds a list of coroutines for all samples and submits them to asyncio.gather(...) at once. For large SocialOmni runs this can create thousands of pending tasks and significantly increase memory/overhead. Consider a bounded producer/consumer pattern (e.g., asyncio.as_completed with a fixed task set, or chunking) so only ~max_concurrency tasks are live at a time.

Suggested change
-    completed = await asyncio.gather(*[_limited(sample) for sample in samples])
-
-    per_sample = [item[0] for item in completed]
-    primary_requests = [request for _, requests, _ in completed for request in requests]
-    judge_requests = [request for _, _, requests in completed for request in requests]
+    completed: list[
+        tuple[dict[str, Any], list[RequestResult], list[RequestResult]] | None
+    ] = [None] * len(samples)
+    sample_iter = iter(enumerate(samples))
+    pending: dict[
+        asyncio.Task[
+            tuple[int, tuple[dict[str, Any], list[RequestResult], list[RequestResult]]]
+        ],
+        int,
+    ] = {}
+
+    def _schedule_next() -> bool:
+        try:
+            index, sample = next(sample_iter)
+        except StopIteration:
+            return False
+
+        async def _run_one(
+            sample_index: int, sample_value: SocialOmniLevel2Sample
+        ) -> tuple[int, tuple[dict[str, Any], list[RequestResult], list[RequestResult]]]:
+            return sample_index, await _limited(sample_value)
+
+        task = asyncio.create_task(_run_one(index, sample))
+        pending[task] = index
+        return True
+
+    for _ in range(min(max(1, max_concurrency), len(samples))):
+        if not _schedule_next():
+            break
+
+    while pending:
+        done, _ = await asyncio.wait(pending, return_when=asyncio.FIRST_COMPLETED)
+        for task in done:
+            pending.pop(task, None)
+            index, result = await task
+            completed[index] = result
+            _schedule_next()
+
+    ordered_completed = [item for item in completed if item is not None]
+    per_sample = [item[0] for item in ordered_completed]
+    primary_requests = [request for _, requests, _ in ordered_completed for request in requests]
+    judge_requests = [request for _, _, requests in ordered_completed for request in requests]

Comment on lines +230 to +232
        rel_name = Path(str(row["video_path"]).strip()).name
        _copy_file(
            source_root / LEVEL1_VIDEOS_DIR / rel_name, level1_videos_dir / rel_name

Copilot AI Apr 25, 2026


In _materialize_mini_dataset, level1 videos are copied using only Path(row["video_path"]).name, but the written dataset.json keeps the original video_path values unchanged. If the upstream dataset contains any subdirectory components in video_path, the prepared mini dataset will reference paths that were not copied (and load_socialomni_level1_samples will fail to find the files). Consider either (a) preserving the relative directory structure when copying, or (b) normalizing/re-writing each row’s video_path to the basename you actually copy.

Suggested change
-        rel_name = Path(str(row["video_path"]).strip()).name
-        _copy_file(
-            source_root / LEVEL1_VIDEOS_DIR / rel_name, level1_videos_dir / rel_name
+        rel_path = Path(str(row["video_path"]).strip())
+        destination_path = level1_videos_dir / rel_path
+        destination_path.parent.mkdir(parents=True, exist_ok=True)
+        _copy_file(
+            source_root / LEVEL1_VIDEOS_DIR / rel_path, destination_path

Comment on lines +267 to +271
    for row in filtered_level2_rows:
        rel_name = Path(str(row["video_file"]).strip()).name
        _copy_file(
            source_root / LEVEL2_VIDEOS_DIR / rel_name, level2_videos_dir / rel_name
        )

Copilot AI Apr 25, 2026


Same issue for level2: videos are copied using Path(row["video_file"]).name but the annotations.json rows are written without updating video_file. If video_file contains directories in the source annotations, the mini dataset will end up with broken references. Please normalize video_file in the emitted rows (or copy using the original relative path).
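
A minimal sketch of option (b) from this comment, rewriting video_file to the basename actually copied (names mirror the snippet above; illustrative, not the PR's code):

for row in filtered_level2_rows:
    rel_name = Path(str(row["video_file"]).strip()).name
    row["video_file"] = rel_name  # normalize before emitting annotations.json
    _copy_file(
        source_root / LEVEL2_VIDEOS_DIR / rel_name, level2_videos_dir / rel_name
    )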

Comment on lines +637 to +645
    with tempfile.TemporaryDirectory(
        prefix=f"socialomni_{sample.sample_id}_"
    ) as tmpdir:
        cut_path = str(Path(tmpdir) / Path(sample.video_path).name)
        cut_video_prefix(
            sample.video_path,
            parse_level2_timestamp_to_seconds(sample.question_1.timestamp),
            cut_path,
        )

Copilot AI Apr 25, 2026


cut_video_prefix() is invoked via subprocess.run() inside run_socialomni_level2_sample() (an async function). This blocks the event loop during ffmpeg execution, so increasing max_concurrency won’t help and other in-flight HTTP requests will stall. Consider running the prefix cut via asyncio.to_thread(...) / an executor, or using asyncio.create_subprocess_exec to avoid blocking the loop.
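
A minimal sketch of the to_thread variant (hypothetical names; the ffmpeg invocation is an assumed stand-in for the existing cut_video_prefix helper):

import asyncio
import subprocess


def _cut_video_prefix_blocking(src: str, end_seconds: float, dst: str) -> None:
    # Blocking stand-in: stream-copy the first end_seconds of src into dst.
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-t", f"{end_seconds:.3f}", "-c", "copy", dst],
        check=True,
        capture_output=True,
    )


async def cut_video_prefix_async(src: str, end_seconds: float, dst: str) -> None:
    # Offload the blocking ffmpeg call to a worker thread so the event loop
    # keeps servicing other in-flight requests.
    await asyncio.to_thread(_cut_video_prefix_blocking, src, end_seconds, dst)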

        session,
        judge=judge,
        sample_id=sample.sample_id,
        video_path=sample.video_path,

Copilot AI Apr 25, 2026


The judge requests are sent with video_path=sample.video_path (the full, uncut video), while the primary model Q1/Q2 requests use cut_path (prefix-cut video). This means judges can see post-timestamp context the model did not, which can skew scores and diverges from the intended “judge the interruption at that moment” protocol. Pass cut_path (or otherwise ensure judges evaluate the same prefix clip) to run_socialomni_judge.

Suggested change
-        video_path=sample.video_path,
+        video_path=cut_path,

@zhaochenyang20 (Collaborator)

Could you submit your results on the full set? @Alexisxty

@zhaochenyang20 zhaochenyang20 added the run-ci Triggers GPU CI workflows label Apr 28, 2026