Feature Description
Add a pruning strategy to evalscope that selects a reduced sample set from a benchmark while preserving the model-ranking signal. This enables fast go/no-go evaluation without running full benchmark suites.
Background
Running full benchmarks (e.g. 315 LiveCodeBench samples, 100 AA-LCR samples) for every candidate model is expensive. For sales-driven evaluation ("is this model good enough?"), we only need a subset that ranks models correctly. Analysis shows 50–60% of benchmark items are either trivially easy (all models pass) or trivially hard (all models fail); these carry no ranking information and can be safely pruned.
Expected Behavior
Users can invoke pruning via dataset_args:
evalscope eval --model <model> --datasets live_code_bench \
--dataset-args '{"live_code_bench": {"pruning_strategy": "variance_stratified", "prune_ratio": 0.6, "review_dir": "./reviews"}}'
Or use the standalone comparison CLI:
python -m evalscope.cli.pruning_compare \
--review-dir ./reviews --benchmark live_code_bench_v5 --score-key pass --prune-ratio 0.6
Design:
Compute per-item difficulty (mean pass rate) and discrimination (score variance) from historical review JSONL files
Stratify items into 4 difficulty buckets
Within each stratum, select highest-variance items
Include calibration anchors from extremes
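The four design steps above can be sketched roughly as follows. This is a minimal illustration, not the code from the implementation PR: the function name `select_items`, the parameters `keep_ratio`, `n_buckets` and `anchors_per_extreme`, the round-robin allocation across strata, and the choice to skip zero-variance items outside the anchors are all assumptions. Scores are assumed to lie in [0, 1] and to be already loaded from the review files into a dict of per-model scores per item. The LiveCodeBench numbers (315 to 189 at ratio 0.6) suggest the ratio is the fraction kept, which is what `keep_ratio` models here.

```python
from statistics import mean, pvariance


def select_items(scores, keep_ratio=0.6, n_buckets=4, anchors_per_extreme=1):
    """Pick item ids by variance within difficulty strata (hypothetical sketch).

    scores: dict mapping item id -> list of per-model scores in [0, 1].
    May return fewer items than the budget if informative items run out.
    """
    if not scores:
        return []
    # Step 1: difficulty = mean pass rate, discrimination = score variance.
    stats = {item: (mean(s), pvariance(s)) for item, s in scores.items()}
    budget = max(1, round(keep_ratio * len(stats)))

    # Step 4 (done first so anchors are always kept): easiest and hardest items.
    by_difficulty = sorted(stats, key=lambda i: (stats[i][0], i))
    anchors = by_difficulty[:anchors_per_extreme] + by_difficulty[-anchors_per_extreme:]
    selected = list(dict.fromkeys(anchors))[:budget]

    # Step 2: stratify the remaining informative items into difficulty buckets.
    buckets = [[] for _ in range(n_buckets)]
    for item, (difficulty, variance) in stats.items():
        if item in selected or variance == 0:
            continue  # zero-variance items cannot separate models
        buckets[min(int(difficulty * n_buckets), n_buckets - 1)].append(item)

    # Step 3: within each stratum, highest variance first (id breaks ties).
    for bucket in buckets:
        bucket.sort(key=lambda i: (-stats[i][1], i))

    # Take one item per stratum per round until the budget is filled.
    rank = 0
    while len(selected) < budget and any(rank < len(b) for b in buckets):
        for bucket in buckets:
            if rank < len(bucket) and len(selected) < budget:
                selected.append(bucket[rank])
        rank += 1
    return selected


# Toy data: 10 items scored by 3 models (1 = pass, 0 = fail).
toy_scores = {
    "a": [1, 1, 1], "b": [0, 0, 0], "c": [1, 0, 0], "d": [1, 1, 0],
    "e": [1, 1, 1], "f": [0, 0, 0], "g": [0, 1, 0], "h": [1, 0, 1],
    "i": [1, 1, 1], "j": [0, 0, 0],
}
picked = select_items(toy_scores, keep_ratio=0.6)
```

On the toy data, the two anchors come from the all-fail and all-pass extremes and the remaining four picks are the only items on which the models disagree.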
Validated results:
| Benchmark | Full | Pruned | Reduction | Rank Preserved |
| --- | --- | --- | --- | --- |
| LiveCodeBench v5 | 315 | 189 | 40% | Yes (Kendall τ = 1.0) |
| AA-LCR | 100 | 50 | 50% | Yes (Kendall τ = 1.0) |
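For reference, a rank-preservation check of this kind could look like the sketch below. It is an illustration under assumptions, not the validation code behind the table: `kendall_tau` implements the plain tau-a variant (tied pairs count as neither concordant nor discordant), `model_means` is a hypothetical helper, and the toy scores are made up.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall tau-a between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 entries")
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


def model_means(scores, items):
    """Mean score of each model over the given item ids (hypothetical helper)."""
    n_models = len(next(iter(scores.values())))
    return [sum(scores[i][m] for i in items) / len(items) for m in range(n_models)]


# Toy data: 5 items scored by 3 models (1 = pass, 0 = fail).
toy = {
    "q1": [1, 1, 1], "q2": [0, 0, 0], "q3": [1, 1, 0],
    "q4": [1, 0, 0], "q5": [1, 1, 0],
}
full = model_means(toy, list(toy))        # per-model mean on all items
pruned = model_means(toy, ["q3", "q4"])   # per-model mean on a pruned subset
tau = kendall_tau(full, pruned)           # 1.0 means identical model ordering
```

A value of 1.0 only says the model ordering is unchanged for the models in the review set; it does not bound the error on absolute scores.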
Additional Information
Integration options:
Option A: PruningMixin added to existing adapters (minimal core changes)
Option B: New --prune-strategy CLI flag in evalscope eval
Option C: Separate evalscope prune subcommand that pre-computes the pruned index set
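To make Option C more concrete, here is a rough sketch of what a pre-computed pruned index set could look like on disk and how it might be applied at load time. Everything here is hypothetical: the JSON layout, the function names, and the `id` sample key are illustrative assumptions and do not reflect existing evalscope APIs.

```python
import json
from pathlib import Path


def save_pruned_index(path, benchmark, item_ids):
    """Write the kept item ids for one benchmark to a JSON file."""
    payload = {"benchmark": benchmark, "item_ids": sorted(item_ids)}
    Path(path).write_text(json.dumps(payload, indent=2), encoding="utf-8")


def load_pruned_index(path):
    """Read the kept item ids back as a set for fast membership tests."""
    payload = json.loads(Path(path).read_text(encoding="utf-8"))
    return set(payload["item_ids"])


def filter_samples(samples, keep_ids, id_key="id"):
    """Keep only samples whose id is in the pruned index, preserving order."""
    return [s for s in samples if s[id_key] in keep_ids]
```

Keeping the index as a separate artifact would let the selection step run once per benchmark, independent of any single evaluation run.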
Implementation PR: #1391
Looking for feedback on preferred integration path before wiring into the full pipeline.