[Skill] Add running-eval-suite for refreshing reference benchmark #400
Open

yxs wants to merge 2 commits into sgl-project:main from
Conversation
Pull request overview
Adds a new "running-eval-suite" Claude skill backend intended to run full-set reference benchmarks and refresh the embedded markdown reference tables under `benchmarks/eval/benchmark_omni_*.py`.

Changes:
- Introduces a new skill backend (`runner.py`) with `precheck`, `run`, `apply-plan`, and `report` subcommands.
- Adds a shipped model configuration (`models/qwen3-omni/config.yaml`) describing benchmark rows, locators, and JSON→table-cell mappings.
- Adds documentation and unit tests for markdown parsing / row matching / apply-plan generation, plus ignores `.eval-runs/` artifacts.
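The four subcommands could be wired as a small argparse CLI. The sketch below is an assumption about the shape of `runner.py`, not its actual interface; the `--model` and `--rows` flags and their defaults are illustrative.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Hypothetical skeleton: one sub-parser per runner.py subcommand.
    parser = argparse.ArgumentParser(prog="runner.py")
    sub = parser.add_subparsers(dest="command", required=True)
    for name in ("precheck", "run", "apply-plan", "report"):
        p = sub.add_parser(name)
        p.add_argument("--model", default="qwen3-omni")  # assumed default
        p.add_argument("--rows", default="all")          # assumed default
    return parser
```

With `required=True`, invoking the CLI without a subcommand fails fast with a parse error rather than silently doing nothing.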
Reviewed changes
Copilot reviewed 5 out of 6 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| `tests/test_running_eval_suite.py` | Unit tests for table parsing, locators, JSON-path traversal, source rewriting, and apply-plan edit construction. |
| `docs/contributing/running-eval-suite.md` | Contributor guide for running the eval suite in a container and using `/running-eval-suite`. |
| `.gitignore` | Ignores `.eval-runs/` run artifacts. |
| `.claude/skills/running-eval-suite/SKILL.md` | Skill contract/spec describing intended end-to-end behavior. |
| `.claude/skills/running-eval-suite/runner.py` | New CLI backend implementing precheck/run/apply-plan/report and markdown table row rewriting plan generation. |
| `.claude/skills/running-eval-suite/models/qwen3-omni/config.yaml` | Shipped config defining 16 benchmark rows and how to locate/update their reference table cells. |
Comment on lines +539 to +542:

```python
port=port, output_dir=str(out_dir), model=model_id,
)
if py and "{python}" in row.get("client", ""):
    client_cmd = client_cmd.replace("{python}", py)
```
Comment on lines +378 to +386:

```python
if gpus_csv:
    env["CUDA_VISIBLE_DEVICES"] = gpus_csv
log_path.parent.mkdir(parents=True, exist_ok=True)
log_fh = open(log_path, "w")
popen = subprocess.Popen(
    cmd, shell=True, stdout=log_fh, stderr=subprocess.STDOUT,
    env=env, start_new_session=True,
)
return ServerHandle(popen, port, log_path)
```
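The hunk above launches the server in its own session so it can be torn down cleanly. A minimal standalone sketch of that pattern (the command, log path, and function names here are illustrative, not the PR's `ServerHandle` API):

```python
import os
import signal
import subprocess

def launch(cmd: str, log_path: str) -> subprocess.Popen:
    # start_new_session=True puts the shell and the server in a fresh
    # process group, so shutdown can signal the whole group at once.
    log_fh = open(log_path, "w")
    return subprocess.Popen(
        cmd, shell=True, stdout=log_fh, stderr=subprocess.STDOUT,
        start_new_session=True,
    )

def shutdown(popen: subprocess.Popen) -> None:
    # Terminate the entire group (shell + server + any children).
    os.killpg(os.getpgid(popen.pid), signal.SIGTERM)
    popen.wait(timeout=10)
```

Signaling the group rather than the single PID matters because `shell=True` means the tracked PID is the shell, not the server it spawned.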
```yaml
file: "benchmarks/eval/benchmark_omni_mmmu.py"
locate:
  config_substring: "enable_audio=True"
  source_substring: "50-sample subset**, c=1, max_tokens=2048"
```
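A hedged sketch of how such substring locators might resolve to exactly one reference-table row in the benchmark source; the row format and the exactly-one-match rule are assumptions about the runner's behavior:

```python
def locate_row(lines: list[str], config_substring: str, source_substring: str) -> int:
    # A row matches only if it contains both locator substrings; anything
    # other than exactly one match is treated as a config error.
    matches = [
        i for i, line in enumerate(lines)
        if config_substring in line and source_substring in line
    ]
    if len(matches) != 1:
        raise ValueError(f"expected exactly 1 matching row, found {len(matches)}")
    return matches[0]
```

Requiring a unique match makes a stale locator fail loudly instead of silently rewriting the wrong cell.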
Comment on lines +64 to +81:

>   (model `qwen3-omni`, all rows, 1 round, real apply, auto-commit).
> - `/running-eval-suite --rows mmsu-text-h200-c8-accuracy,mmsu-text-h200-c8-speed`
> - `/running-eval-suite --rounds 3` (multi-round; only round 1 is used
>   for apply — multi-round aggregation is a follow-up)
> - `/running-eval-suite --smoke 50` (`--max-samples 50` injected; apply
>   is refused for smoke runs to prevent committing partial-set numbers)
>
> **The skill takes no other input.** I never call `AskUserQuestion` for
> model / rows / rounds / mode / proceed. Defaults handle every case;
> errors halt the run with a printed reason. If you want a different
> behavior, pass the flags above. This matches the `tune-ci-thresholds`
> contract: type `/running-eval-suite` and walk away.
>
> ## Steps I follow
>
> 1. Apply defaults to the invocation: `model=qwen3-omni`, `rows=all`,
>    `rounds=1`, `smoke=null`. **No `AskUserQuestion`.** If the user
>    passed unknown flags, stop with the parsing error.
Comment on lines +157 to +164:

> 7. Auto-commit the refresh:
>
>    ```
>    git add benchmarks/eval/
>    git commit -m "[Docs] Refresh reference benchmarks via /running-eval-suite (<utc-ts>)"
>    ```
>
>    No `git push`. If `git diff --cached --quiet` says there's nothing
>    to commit (e.g. every row was skipped), skip the commit and print
>    a one-liner so the user knows the run produced no changes.
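The commit step quoted above can be scripted as a guard around `git diff --cached --quiet`. This is a sketch under assumptions: the `commit_refresh` helper name, return value, and timestamp format are not from the PR.

```python
import subprocess
from datetime import datetime, timezone

def commit_refresh(repo_dir: str) -> bool:
    # Stage the refreshed reference tables.
    subprocess.run(["git", "add", "benchmarks/eval/"], cwd=repo_dir, check=True)
    # `git diff --cached --quiet` exits 0 when nothing is staged.
    nothing_staged = subprocess.run(
        ["git", "diff", "--cached", "--quiet"], cwd=repo_dir
    ).returncode == 0
    if nothing_staged:
        print("running-eval-suite: every row was skipped; nothing to commit")
        return False
    ts = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    subprocess.run(
        ["git", "commit", "-m",
         f"[Docs] Refresh reference benchmarks via /running-eval-suite ({ts})"],
        cwd=repo_dir, check=True,
    )
    return True  # no `git push`: the commit stays local for review
```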
> **full-set** benchmarks and rewrites the reference cells, not the CI
> thresholds.
>
> [ci-doc]: ./tune-ci-thresholds.md
Hardware-agnostic; runs all benchmarks under `benchmarks/eval/`, fills reference table cells in `benchmark_*.py`, auto-commits for review.
yxs force-pushed from `d483ebb` to `5ecf476` (Compare)
Comment on lines +444 to +448:

```python
log_fh = open(log_path, "w")
popen = subprocess.Popen(
    cmd, shell=True, stdout=log_fh, stderr=subprocess.STDOUT,
    env=env, start_new_session=True,
)
```
Comment on lines +51 to +52:

> - Free GPUs. The skill **never** kills another user's processes; if
>   GPUs are busy, precheck fails with the busy PID list and stops.
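The busy-GPU precheck could, for example, parse the output of `nvidia-smi --query-compute-apps=pid,used_memory --format=csv,noheader` and report the PIDs rather than kill anything. The parsing helper below is an assumption about how the skill might implement this, not code from the PR:

```python
def busy_pids(nvidia_smi_csv: str) -> list[int]:
    # Each output line looks like "12345, 1024 MiB"; the first field is the PID.
    pids = []
    for line in nvidia_smi_csv.strip().splitlines():
        if line.strip():
            pids.append(int(line.split(",")[0].strip()))
    return pids
```

An empty query result (no compute apps) then means the GPUs are free and precheck can proceed.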
Comment on lines +1 to +8:

```yaml
# running-eval-suite config — every reference table row across the
# 6 benchmarks/eval/benchmark_*.py files is described here.
#
# `running-eval-suite` reads each row, runs the server + client, then
# either replaces the matching row in the benchmark .py file (same
# hardware tag) or appends a new row at the end of that section's
# table (new hardware). "Local v1 Pipeline Result" sections are
# always skipped — those are contributors' personal experiments, not
```
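The replace-or-append behavior the config comment describes can be sketched as a simple upsert over table rows; the row and hardware-tag formats used here are assumptions for illustration:

```python
def upsert_row(table_rows: list[str], hardware_tag: str, new_row: str) -> list[str]:
    out = list(table_rows)
    for i, row in enumerate(out):
        if hardware_tag in row:
            out[i] = new_row  # same hardware: replace the row in place
            return out
    out.append(new_row)       # new hardware: append at the table's end
    return out
```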
Adds a `running-eval-suite` skill so a contributor types one slash command and the skill runs every benchmark, refreshes the reference table cells in place, and auto-commits for review.

Usage

After the run the skill creates a local commit named `[Docs] Refresh reference benchmarks via /running-eval-suite (<utc-ts>)`. The skill never pushes; the contributor reviews via `git show HEAD` and pushes to their fork manually.