[Skill] Add running-eval-suite for refreshing reference benchmark #400
Open

yxs wants to merge 2 commits into sgl-project:main from
Conversation
Pull request overview
Adds a new "running-eval-suite" Claude skill backend intended to run full-set reference benchmarks and refresh the embedded markdown reference tables under `benchmarks/eval/benchmark_omni_*.py`.

Changes:
- Introduces a new skill backend (`runner.py`) with `precheck`, `run`, `apply-plan`, and `report` subcommands.
- Adds a shipped model configuration (`models/qwen3-omni/config.yaml`) describing benchmark rows, locators, and JSON→table-cell mappings.
- Adds documentation and unit tests for markdown parsing / row matching / apply-plan generation, plus ignores `.eval-runs/` artifacts.
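The four subcommands could be wired as a small argparse CLI. The sketch below is an assumption about the shape of `runner.py`, not its actual interface; the `--model` and `--rows` flags and their defaults are illustrative.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Hypothetical skeleton: one sub-parser per runner.py subcommand.
    parser = argparse.ArgumentParser(prog="runner.py")
    sub = parser.add_subparsers(dest="command", required=True)
    for name in ("precheck", "run", "apply-plan", "report"):
        p = sub.add_parser(name)
        p.add_argument("--model", default="qwen3-omni")  # assumed default
        p.add_argument("--rows", default="all")          # assumed default
    return parser
```

With `required=True`, invoking the CLI without a subcommand fails fast with a parse error rather than silently doing nothing.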
Reviewed changes
Copilot reviewed 5 out of 6 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| `tests/test_running_eval_suite.py` | Unit tests for table parsing, locators, JSON-path traversal, source rewriting, and apply-plan edit construction. |
| `docs/contributing/running-eval-suite.md` | Contributor guide for running the eval suite in a container and using `/running-eval-suite`. |
| `.gitignore` | Ignores `.eval-runs/` run artifacts. |
| `.claude/skills/running-eval-suite/SKILL.md` | Skill contract/spec describing intended end-to-end behavior. |
| `.claude/skills/running-eval-suite/runner.py` | New CLI backend implementing precheck/run/apply-plan/report and markdown table row rewriting plan generation. |
| `.claude/skills/running-eval-suite/models/qwen3-omni/config.yaml` | Shipped config defining 16 benchmark rows and how to locate/update their reference table cells. |
Comment on lines +539 to +542:

```python
port=port, output_dir=str(out_dir), model=model_id,
)
if py and "{python}" in row.get("client", ""):
    client_cmd = client_cmd.replace("{python}", py)
```
Comment on lines +378 to +386:

```python
if gpus_csv:
    env["CUDA_VISIBLE_DEVICES"] = gpus_csv
log_path.parent.mkdir(parents=True, exist_ok=True)
log_fh = open(log_path, "w")
popen = subprocess.Popen(
    cmd, shell=True, stdout=log_fh, stderr=subprocess.STDOUT,
    env=env, start_new_session=True,
)
return ServerHandle(popen, port, log_path)
```
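The hunk above launches the server in its own session so it can be torn down cleanly. A minimal standalone sketch of that pattern (the command, log path, and function names here are illustrative, not the PR's `ServerHandle` API):

```python
import os
import signal
import subprocess

def launch(cmd: str, log_path: str) -> subprocess.Popen:
    # start_new_session=True puts the shell and the server in a fresh
    # process group, so shutdown can signal the whole group at once.
    log_fh = open(log_path, "w")
    return subprocess.Popen(
        cmd, shell=True, stdout=log_fh, stderr=subprocess.STDOUT,
        start_new_session=True,
    )

def shutdown(popen: subprocess.Popen) -> None:
    # Terminate the entire group (shell + server + any children).
    os.killpg(os.getpgid(popen.pid), signal.SIGTERM)
    popen.wait(timeout=10)
```

Signaling the group rather than the single PID matters because `shell=True` means the tracked PID is the shell, not the server it spawned.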
```yaml
file: "benchmarks/eval/benchmark_omni_mmmu.py"
locate:
  config_substring: "enable_audio=True"
  source_substring: "50-sample subset**, c=1, max_tokens=2048"
```
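A hedged sketch of how such substring locators might resolve to exactly one reference-table row in the benchmark source; the row format and the exactly-one-match rule are assumptions about the runner's behavior:

```python
def locate_row(lines: list[str], config_substring: str, source_substring: str) -> int:
    # A row matches only if it contains both locator substrings; anything
    # other than exactly one match is treated as a config error.
    matches = [
        i for i, line in enumerate(lines)
        if config_substring in line and source_substring in line
    ]
    if len(matches) != 1:
        raise ValueError(f"expected exactly 1 matching row, found {len(matches)}")
    return matches[0]
```

Requiring a unique match makes a stale locator fail loudly instead of silently rewriting the wrong cell.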
Comment on lines +64 to +81:

>   (model `qwen3-omni`, all rows, 1 round, real apply, auto-commit).
> - `/running-eval-suite --rows mmsu-text-h200-c8-accuracy,mmsu-text-h200-c8-speed`
> - `/running-eval-suite --rounds 3` (multi-round; only round 1 is used
>   for apply — multi-round aggregation is a follow-up)
> - `/running-eval-suite --smoke 50` (`--max-samples 50` injected; apply
>   is refused for smoke runs to prevent committing partial-set numbers)
>
> **The skill takes no other input.** I never call `AskUserQuestion` for
> model / rows / rounds / mode / proceed. Defaults handle every case;
> errors halt the run with a printed reason. If you want a different
> behavior, pass the flags above. This matches the `tune-ci-thresholds`
> contract: type `/running-eval-suite` and walk away.
>
> ## Steps I follow
>
> 1. Apply defaults to the invocation: `model=qwen3-omni`, `rows=all`,
>    `rounds=1`, `smoke=null`. **No `AskUserQuestion`.** If the user
>    passed unknown flags, stop with the parsing error.
Comment on lines +157 to +164:

> 7. Auto-commit the refresh:
>
>    ```
>    git add benchmarks/eval/
>    git commit -m "[Docs] Refresh reference benchmarks via /running-eval-suite (<utc-ts>)"
>    ```
>
>    No `git push`. If `git diff --cached --quiet` says there's nothing
>    to commit (e.g. every row was skipped), skip the commit and print
>    a one-liner so the user knows the run produced no changes.
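The commit step quoted above can be scripted as a guard around `git diff --cached --quiet`. This is a sketch under assumptions: the `commit_refresh` helper name, return value, and timestamp format are not from the PR.

```python
import subprocess
from datetime import datetime, timezone

def commit_refresh(repo_dir: str) -> bool:
    # Stage the refreshed reference tables.
    subprocess.run(["git", "add", "benchmarks/eval/"], cwd=repo_dir, check=True)
    # `git diff --cached --quiet` exits 0 when nothing is staged.
    nothing_staged = subprocess.run(
        ["git", "diff", "--cached", "--quiet"], cwd=repo_dir
    ).returncode == 0
    if nothing_staged:
        print("running-eval-suite: every row was skipped; nothing to commit")
        return False
    ts = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    subprocess.run(
        ["git", "commit", "-m",
         f"[Docs] Refresh reference benchmarks via /running-eval-suite ({ts})"],
        cwd=repo_dir, check=True,
    )
    return True  # no `git push`: the commit stays local for review
```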
> **full-set** benchmarks and rewrites the reference cells, not the CI
> thresholds.
>
> [ci-doc]: ./tune-ci-thresholds.md
Hardware-agnostic; runs all benchmarks under `benchmarks/eval/`, fills reference table cells in `benchmark_*.py`, auto-commits for review.
yxs force-pushed from `d483ebb` to `5ecf476` (Compare)
Comment on lines +444 to +448:

```python
log_fh = open(log_path, "w")
popen = subprocess.Popen(
    cmd, shell=True, stdout=log_fh, stderr=subprocess.STDOUT,
    env=env, start_new_session=True,
)
```
Comment on lines +51 to +52:

> - Free GPUs. The skill **never** kills another user's processes; if
>   GPUs are busy, precheck fails with the busy PID list and stops.
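The busy-GPU precheck could, for example, parse the output of `nvidia-smi --query-compute-apps=pid,used_memory --format=csv,noheader` and report the PIDs rather than kill anything. The parsing helper below is an assumption about how the skill might implement this, not code from the PR:

```python
def busy_pids(nvidia_smi_csv: str) -> list[int]:
    # Each output line looks like "12345, 1024 MiB"; the first field is the PID.
    pids = []
    for line in nvidia_smi_csv.strip().splitlines():
        if line.strip():
            pids.append(int(line.split(",")[0].strip()))
    return pids
```

An empty query result (no compute apps) then means the GPUs are free and precheck can proceed.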
Comment on lines +1 to +8:

```yaml
# running-eval-suite config — every reference table row across the
# 6 benchmarks/eval/benchmark_*.py files is described here.
#
# `running-eval-suite` reads each row, runs the server + client, then
# either replaces the matching row in the benchmark .py file (same
# hardware tag) or appends a new row at the end of that section's
# table (new hardware). "Local v1 Pipeline Result" sections are
# always skipped — those are contributors' personal experiments, not
```
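The replace-or-append behavior the config comment describes can be sketched as a simple upsert over table rows; the row and hardware-tag formats used here are assumptions for illustration:

```python
def upsert_row(table_rows: list[str], hardware_tag: str, new_row: str) -> list[str]:
    out = list(table_rows)
    for i, row in enumerate(out):
        if hardware_tag in row:
            out[i] = new_row  # same hardware: replace the row in place
            return out
    out.append(new_row)       # new hardware: append at the table's end
    return out
```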
Adds a `running-eval-suite` skill so a contributor types one slash command and the skill runs every benchmark, refreshes the reference table cells in place, and auto-commits for review.

Usage

After the run the skill creates a local commit named `[Docs] Refresh reference benchmarks via /running-eval-suite (<utc-ts>)`. The skill never pushes; the contributor reviews via `git show HEAD` and pushes to their fork manually.