Feature proposal: Benchmark pruning via variance-weighted stratified sampling #1393

@Shashank-mankala1

Feature Description

Add a pruning strategy to evalscope that selects a minimal subset of a benchmark's samples while preserving the signal needed to rank models. This enables fast go/no-go evaluation without running the full benchmark suite.

Background

Running full benchmarks (e.g. 315 LiveCodeBench samples, 100 AA-LCR samples) for every candidate model is expensive. For sales-driven evaluation ("is this model good enough?"), we only need a subset that ranks models correctly. Analysis shows that 50–60% of benchmark items are either trivially easy (all models pass) or trivially hard (all models fail); such items carry no ranking information and can be safely pruned.
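
To make the "no ranking information" point concrete: an item on which every model gets the same score cannot separate any pair of models. The snippet below is a minimal sketch, assuming scores are available as a mapping from item id to a list of per-model scores; the function name and the toy data are illustrative and not part of evalscope.

```python
def uninformative_items(scores):
    """Return ids of items on which all models score identically.

    `scores` maps item id -> list of per-model scores (hypothetical layout).
    Identical scores mean zero variance, so the item cannot change a ranking.
    """
    return sorted(
        item for item, vals in scores.items() if max(vals) == min(vals)
    )


# Toy data for illustration only (three models, binary pass/fail).
scores = {
    "q1": [1, 1, 1],  # trivially easy: every model passes
    "q2": [0, 0, 0],  # trivially hard: every model fails
    "q3": [1, 0, 1],  # models disagree, so this item carries signal
}

print(uninformative_items(scores))
```

Removing such items shifts every model's total by the same amount, so the ordering of models is unaffected.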

Expected Behavior

Users can invoke pruning via dataset_args:

evalscope eval --model <model> --datasets live_code_bench \
    --dataset-args '{"live_code_bench": {"pruning_strategy": "variance_stratified", "prune_ratio": 0.6, "review_dir": "./reviews"}}'

Or use the standalone comparison CLI:

python -m evalscope.cli.pruning_compare \
    --review-dir ./reviews --benchmark live_code_bench_v5 --score-key pass --prune-ratio 0.6

Design:

1. Compute per-item difficulty (mean pass rate) and discrimination (score variance) from historical review JSONL files.
2. Stratify items into 4 difficulty buckets.
3. Within each stratum, select the highest-variance items.
4. Include calibration anchors from the extremes.
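
The design steps above can be sketched roughly as follows. This is an illustrative sketch, not the code in the implementation PR: the function name, the `keep_ratio` parameter (fraction of items retained), equal-width buckets, the round-robin allocation across strata, and the choice to draw anchors from the easiest and hardest items are all assumptions. Parsing of the review JSONL files is omitted; scores are assumed to be already loaded as a mapping from item id to per-model scores in [0, 1].

```python
from statistics import mean, pvariance


def select_items(scores, keep_ratio=0.6, n_buckets=4, n_anchors=1):
    """Pick a subset of item ids (hypothetical helper, not an evalscope API)."""
    # Step 1: difficulty = mean pass rate, discrimination = score variance.
    stats = {i: (mean(v), pvariance(v)) for i, v in scores.items()}
    budget = max(1, round(keep_ratio * len(scores)))

    # Step 4: calibration anchors taken from the easiest and hardest items.
    by_difficulty = sorted(stats, key=lambda i: (stats[i][0], i))
    anchors = by_difficulty[:n_anchors] + by_difficulty[-n_anchors:]
    selected = list(dict.fromkeys(anchors))[:budget]

    # Step 2: equal-width difficulty buckets over items that discriminate
    # (zero-variance items are left out of the strata by assumption).
    buckets = [[] for _ in range(n_buckets)]
    for item, (difficulty, variance) in stats.items():
        if item in selected or variance == 0:
            continue
        idx = min(int(difficulty * n_buckets), n_buckets - 1)
        buckets[idx].append(item)

    # Step 3: within each stratum, highest variance first (ties by id).
    for bucket in buckets:
        bucket.sort(key=lambda i: (-stats[i][1], i))

    # Assumed allocation: take one item per stratum in turn until the
    # budget is used up or no discriminating items remain.
    while len(selected) < budget and any(buckets):
        for bucket in buckets:
            if bucket and len(selected) < budget:
                selected.append(bucket.pop(0))
    return sorted(selected)


# Toy data for illustration only: 10 items scored by 4 models.
scores = {
    "a": [1, 1, 1, 1], "b": [0, 0, 0, 0], "c": [1, 0, 0, 0],
    "d": [1, 1, 0, 0], "e": [1, 1, 1, 0], "f": [1, 1, 1, 1],
    "g": [0, 0, 0, 0], "h": [0, 1, 0, 1], "i": [1, 1, 1, 1],
    "j": [0, 0, 0, 0],
}
print(select_items(scores, keep_ratio=0.6))
```

With this allocation the function may return fewer items than the budget when a benchmark has few discriminating items; whether to pad with additional anchors in that case is a design decision left open here.
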

Validated results:

| Benchmark | Full | Pruned | Reduction | Rank Preserved |
| --- | --- | --- | --- | --- |
| LiveCodeBench v5 | 315 | 189 | 40% | Yes (Kendall τ = 1.0) |
| AA-LCR | 100 | 50 | 50% | Yes (Kendall τ = 1.0) |
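
For readers unfamiliar with the rank-preservation check, the sketch below shows one way to compare model rankings on the full and pruned item sets using Kendall's τ. It is a minimal tau-a implementation that ignores ties, written with the standard library only; the helper names and toy scores are illustrative and do not reproduce the results reported above.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall tau-a between two equal-length score lists (ties count as 0)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


def model_scores(scores, items, n_models):
    """Mean score of each model over the given item ids."""
    return [
        sum(scores[i][m] for i in items) / len(items) for m in range(n_models)
    ]


# Toy data for illustration only: 5 items scored by 3 models.
scores = {
    "q1": [1, 1, 1],
    "q2": [1, 1, 0],
    "q3": [1, 0, 0],
    "q4": [0, 0, 0],
    "q5": [1, 1, 0],
}
full = model_scores(scores, list(scores), n_models=3)
pruned = model_scores(scores, ["q2", "q3"], n_models=3)
print(kendall_tau(full, pruned))
```

A value of 1.0 means every pair of models is ordered the same way on both item sets. With only a handful of models, τ = 1.0 is a coarse check, so it is worth validating against as many historical models as are available.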

Additional Information

Integration options:

Option A: PruningMixin added to existing adapters (minimal core changes)
Option B: New --prune-strategy CLI flag in evalscope eval
Option C: Separate evalscope prune subcommand that pre-computes the pruned index set

Implementation PR: #1391

Looking for feedback on preferred integration path before wiring into the full pipeline.
