[Skill] Add running-eval-suite for refreshing reference benchmark#400

Open
yxs wants to merge 2 commits into sgl-project:main from yxs:running-eval-suite-skill

Conversation


@yxs yxs commented May 6, 2026

Adds a `running-eval-suite` skill so a contributor can type one slash command and have the skill run every benchmark, refresh the reference table cells in place, and auto-commit the result for review.

Usage

```
cd /sgl-workspace/sglang-omni
claude
# inside Claude Code:
/running-eval-suite                        # default: all benchmarks, 1 round
/running-eval-suite --benchmarks mmsu      # subset
/running-eval-suite --rounds 3             # variance observation
/running-eval-suite --smoke 50             # cap to 50 samples; no edit applied
```

After the run, the skill creates a local commit named
`[Docs] Refresh reference benchmarks via /running-eval-suite (<utc-ts>)`.
The skill never pushes; the contributor reviews via `git show HEAD` and pushes to their fork manually.
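
To make the in-place rewrite concrete, here is a minimal sketch of the kind of edit the apply step performs. The function and the call at the bottom are illustrative, not the actual runner.py API; the sketch assumes each benchmark file embeds a markdown table whose target row can be found by a unique substring:

```python
from pathlib import Path

def replace_row_cell(source: str, row_substring: str, column: int, new_value: str) -> str:
    """Replace one cell in the first markdown table row containing row_substring.

    Sketch only: the real runner.py drives this from config-declared locators
    (config_substring / source_substring), not a single hard-coded key.
    """
    lines = source.splitlines(keepends=True)
    for i, line in enumerate(lines):
        if row_substring in line and line.lstrip().startswith("|"):
            cells = line.split("|")      # column N sits between pipes N and N+1
            cells[column] = f" {new_value} "
            lines[i] = "|".join(cells)
            return "".join(lines)
    raise ValueError(f"no table row matching {row_substring!r}")

# hypothetical usage: refresh the third cell of the row tagged enable_audio=True
path = Path("benchmarks/eval/benchmark_omni_mmmu.py")
path.write_text(replace_row_cell(path.read_text(), "enable_audio=True", 3, "71.2"))
```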

Copilot AI review requested due to automatic review settings May 6, 2026 05:23
@yxs yxs marked this pull request as draft May 6, 2026 05:24
@yxs yxs changed the title from "[Skill] Add running-eval-suite for refreshing reference benchmark cells" to "[Skill] Add running-eval-suite for refreshing reference benchmark" May 6, 2026

Copilot AI left a comment


Pull request overview

Adds a new “running-eval-suite” Claude skill backend intended to run full-set reference benchmarks and refresh the embedded markdown reference tables under benchmarks/eval/benchmark_omni_*.py.

Changes:

  • Introduces a new skill backend (runner.py) with precheck, run, apply-plan, and report subcommands.
  • Adds a shipped model configuration (models/qwen3-omni/config.yaml) describing benchmark rows, locators, and JSON→table-cell mappings (see the traversal sketch after this list).
  • Adds documentation and unit tests for markdown parsing / row matching / apply-plan generation, plus ignores .eval-runs/ artifacts.
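
As a rough illustration of the JSON→table-cell mapping, the sketch below shows how a dotted JSON path could pull a metric out of a client's results payload. The path syntax, field names, and payload shape are assumptions for illustration, not the shipped config format:

```python
import json
from functools import reduce

def json_path_get(doc, path: str):
    """Walk a dotted path like "metrics.accuracy" through nested dicts/lists."""
    def step(node, key):
        return node[int(key)] if isinstance(node, list) else node[key]
    return reduce(step, path.split("."), doc)

# hypothetical client output; the config would map such paths to table cells
results = json.loads('{"metrics": {"accuracy": 0.712, "latency_ms": [812, 790]}}')
assert json_path_get(results, "metrics.accuracy") == 0.712
assert json_path_get(results, "metrics.latency_ms.1") == 790
```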

Reviewed changes

Copilot reviewed 5 out of 6 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| tests/test_running_eval_suite.py | Unit tests for table parsing, locators, JSON-path traversal, source rewriting, and apply-plan edit construction. |
| docs/contributing/running-eval-suite.md | Contributor guide for running the eval suite in a container and using /running-eval-suite. |
| .gitignore | Ignores .eval-runs/ run artifacts. |
| .claude/skills/running-eval-suite/SKILL.md | Skill contract/spec describing the intended end-to-end behavior. |
| .claude/skills/running-eval-suite/runner.py | New CLI backend implementing precheck/run/apply-plan/report and generation of markdown table-row rewrite plans. |
| .claude/skills/running-eval-suite/models/qwen3-omni/config.yaml | Shipped config defining 16 benchmark rows and how to locate/update their reference table cells. |


Comment on lines +539 to +542
```python
        port=port, output_dir=str(out_dir), model=model_id,
    )
    if py and "{python}" in row.get("client", ""):
        client_cmd = client_cmd.replace("{python}", py)
```
Comment on lines +378 to +386
```python
    if gpus_csv:
        env["CUDA_VISIBLE_DEVICES"] = gpus_csv
    log_path.parent.mkdir(parents=True, exist_ok=True)
    log_fh = open(log_path, "w")
    popen = subprocess.Popen(
        cmd, shell=True, stdout=log_fh, stderr=subprocess.STDOUT,
        env=env, start_new_session=True,
    )
    return ServerHandle(popen, port, log_path)
```
file: "benchmarks/eval/benchmark_omni_mmmu.py"
locate:
config_substring: "enable_audio=True"
source_substring: "50-sample subset**, c=1, max_tokens=2048"
Comment on lines +64 to +81
(model `qwen3-omni`, all rows, 1 round, real apply, auto-commit).
- `/running-eval-suite --rows mmsu-text-h200-c8-accuracy,mmsu-text-h200-c8-speed`
- `/running-eval-suite --rounds 3` (multi-round; only round 1 is used
for apply — multi-round aggregation is a follow-up)
- `/running-eval-suite --smoke 50` (`--max-samples 50` injected; apply
is refused for smoke runs to prevent committing partial-set numbers)

**The skill takes no other input.** I never call `AskUserQuestion` for
model / rows / rounds / mode / proceed. Defaults handle every case;
errors halt the run with a printed reason. If you want a different
behavior, pass the flags above. This matches the `tune-ci-thresholds`
contract: type `/running-eval-suite` and walk away.

## Steps I follow

1. Apply defaults to the invocation: `model=qwen3-omni`, `rows=all`,
`rounds=1`, `smoke=null`. **No `AskUserQuestion`.** If the user
passed unknown flags, stop with the parsing error.
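
A minimal sketch of that defaults-and-halt contract, using the flag names from the excerpt above (everything else, including the prog name, is assumed):

```python
import argparse

parser = argparse.ArgumentParser(prog="/running-eval-suite")
parser.add_argument("--model", default="qwen3-omni")
parser.add_argument("--rows", default="all")            # comma-separated row ids, or "all"
parser.add_argument("--rounds", type=int, default=1)
parser.add_argument("--smoke", type=int, default=None)  # sample cap; apply is refused

# argparse itself exits with a printed error on unknown flags, which matches
# the "no AskUserQuestion, just stop with the parsing error" behavior above.
args = parser.parse_args()
```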
Comment on lines +157 to +164
7. Auto-commit the refresh:
```
git add benchmarks/eval/
git commit -m "[Docs] Refresh reference benchmarks via /running-eval-suite (<utc-ts>)"
```
No `git push`. If `git diff --cached --quiet` says there's nothing
to commit (e.g. every row was skipped), skip the commit and print
a one-liner so the user knows the run produced no changes.
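
For illustration, the same guard expressed in Python via subprocess (a sketch; the skill may shell out differently):

```python
import subprocess

subprocess.run(["git", "add", "benchmarks/eval/"], check=True)
# `git diff --cached --quiet` exits 0 when the index holds no changes
if subprocess.run(["git", "diff", "--cached", "--quiet"]).returncode == 0:
    print("running-eval-suite: every row was skipped; nothing to commit")
else:
    subprocess.run(
        ["git", "commit", "-m",
         "[Docs] Refresh reference benchmarks via /running-eval-suite (<utc-ts>)"],
        check=True,
    )
```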
Comment thread docs/contributing/running-eval-suite.md Outdated
**full-set** benchmarks and rewrites the reference cells, not the CI
thresholds.

[ci-doc]: ./tune-ci-thresholds.md
Hardware-agnostic; runs all benchmarks under benchmarks/eval/, fills
reference table cells in benchmark_*.py, auto-commits for review.
@yxs yxs force-pushed the running-eval-suite-skill branch from d483ebb to 5ecf476 on May 6, 2026 08:18
@yxs yxs marked this pull request as ready for review May 6, 2026 20:19
Copilot AI review requested due to automatic review settings May 6, 2026 20:19

Copilot AI left a comment


Pull request overview

Copilot reviewed 3 out of 4 changed files in this pull request and generated 3 comments.

Comment on lines +444 to +448
```python
    log_fh = open(log_path, "w")
    popen = subprocess.Popen(
        cmd, shell=True, stdout=log_fh, stderr=subprocess.STDOUT,
        env=env, start_new_session=True,
    )
```
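
A likely motivation for `start_new_session=True` is clean teardown: the server runs in its own process group, so the whole tree can be signalled at once. A sketch of such a teardown, with the `ServerHandle` field names assumed from the excerpt above:

```python
import os
import signal
import subprocess

def stop_server(handle) -> None:
    """Signal the server's whole process group, escalating if it won't exit."""
    os.killpg(os.getpgid(handle.popen.pid), signal.SIGTERM)
    try:
        handle.popen.wait(timeout=10)
    except subprocess.TimeoutExpired:
        os.killpg(os.getpgid(handle.popen.pid), signal.SIGKILL)
        handle.popen.wait()
```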
Comment on lines +51 to +52
- Free GPUs. The skill **never** kills another user's processes; if
GPUs are busy, precheck fails with the busy PID list and stops.
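
One plausible shape for that precheck, assuming nvidia-smi is available in the container (the query flags are standard nvidia-smi; the function name and exit message are illustrative):

```python
import subprocess

def busy_gpu_pids() -> list[int]:
    """Return PIDs of compute processes currently holding any GPU."""
    out = subprocess.run(
        ["nvidia-smi", "--query-compute-apps=pid", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [int(tok) for tok in out.split()]

pids = busy_gpu_pids()
if pids:
    raise SystemExit(f"precheck: GPUs busy (PIDs {pids}); refusing to start")
```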
Comment on lines +1 to +8
```yaml
# running-eval-suite config — every reference table row across the
# 6 benchmarks/eval/benchmark_*.py files is described here.
#
# `running-eval-suite` reads each row, runs the server + client, then
# either replaces the matching row in the benchmark .py file (same
# hardware tag) or appends a new row at the end of that section's
# table (new hardware). "Local v1 Pipeline Result" sections are
# always skipped — those are contributors' personal experiments, not
```