Feature proposal: Benchmark pruning via variance-weighted stratified sampling (#1393)

## Feature Description
Add a pruning strategy to evalscope that selects a minimal subset of a benchmark's samples while preserving the model-ranking signal. This enables fast go/no-go evaluation without running full benchmark suites.

## Background
Running full benchmarks (e.g. 315 LiveCodeBench samples, 100 AA-LCR samples) for every candidate model is expensive. For sales-driven evaluation ("is this model good enough?"), we only need a subset that ranks models correctly. Analysis shows that 50–60% of benchmark items are either trivially easy (all models pass) or trivially hard (all models fail). Such items carry no ranking information and can be safely pruned.

## Expected Behavior
Users can enable pruning through `--dataset-args`:

```
evalscope eval --model <model> --datasets live_code_bench \
    --dataset-args '{"live_code_bench": {"pruning_strategy": "variance_stratified", "prune_ratio": 0.6, "review_dir": "./reviews"}}'
```
Or use the standalone comparison CLI:

```
python -m evalscope.cli.pruning_compare \
    --review-dir ./reviews --benchmark live_code_bench_v5 --score-key pass --prune-ratio 0.6
```

### Design

1. Compute per-item difficulty (mean pass rate) and discrimination (score variance) from historical review JSONL files.
2. Stratify items into 4 difficulty buckets.
3. Within each stratum, select the highest-variance items.
4. Include calibration anchors from the extremes.

**Validated results:**

| Benchmark        | Full | Pruned | Reduction | Rank Preserved          |
|------------------|------|--------|-----------|-------------------------|
| LiveCodeBench v5 | 315  | 189    | 40%       | Yes (Kendall τ = 1.0)   |
| AA-LCR           | 100  | 50     | 50%       | Yes (Kendall τ = 1.0)   |

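The four design steps can be sketched in plain Python as follows. This is a minimal illustration, not the code from the implementation PR: the function name `select_items`, the `keep_ratio` and `n_anchors` parameters, the equal-width difficulty buckets, the round-robin fill across strata, and the choice to admit zero-variance items only as anchors are all assumptions made for this sketch. It also assumes scores lie in [0, 1] with one score per model for each item, and it uses `keep_ratio` (fraction of items retained) rather than guessing the exact semantics of `prune_ratio`.

```python
from statistics import mean, pvariance


def select_items(scores, keep_ratio=0.6, n_strata=4, n_anchors=1):
    """Pick item ids to keep; scores maps item id -> list of per-model scores."""
    # Step 1: difficulty = mean pass rate, discrimination = score variance.
    stats = {item: (mean(s), pvariance(s)) for item, s in scores.items()}
    budget = max(1, round(keep_ratio * len(stats)))

    # Step 4 (done first so anchors count against the budget): keep the
    # hardest and easiest items as calibration anchors.
    by_difficulty = sorted(stats, key=lambda item: (stats[item][0], item))
    hardest = by_difficulty[:n_anchors]
    easiest = by_difficulty[max(0, len(by_difficulty) - n_anchors):]
    selected = []
    for item in easiest + hardest:
        if item not in selected and len(selected) < budget:
            selected.append(item)

    # Step 2: stratify the remaining informative items (variance > 0)
    # into equal-width difficulty buckets.
    strata = [[] for _ in range(n_strata)]
    for item, (difficulty, variance) in stats.items():
        if item not in selected and variance > 0:
            strata[min(int(difficulty * n_strata), n_strata - 1)].append(item)

    # Step 3: rank each stratum by variance (highest first), then take
    # items round-robin across strata until the budget is filled.
    for bucket in strata:
        bucket.sort(key=lambda item: (-stats[item][1], item))
    while len(selected) < budget and any(strata):
        for bucket in strata:
            if bucket and len(selected) < budget:
                selected.append(bucket.pop(0))
    return sorted(selected)


# Toy data: 10 items scored pass/fail (1/0) by 3 models.
demo_scores = {
    0: [1, 1, 1], 1: [0, 0, 0], 2: [1, 0, 1], 3: [0, 1, 0], 4: [1, 1, 0],
    5: [1, 1, 1], 6: [0, 0, 0], 7: [1, 0, 0], 8: [1, 1, 1], 9: [0, 0, 0],
}
print(select_items(demo_scores))  # [1, 2, 3, 4, 7, 8]
```

In the toy data, six of the ten items are passed or failed by every model; the sketch keeps one of each as an anchor and fills the rest of the budget with the four items on which the models disagree.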

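The "Rank Preserved" column can be checked by scoring each model on the full and the pruned item sets and comparing the two orderings with Kendall's τ. The sketch below uses toy data and a hand-written tau-a (tied pairs count as neither concordant nor discordant); the helper names `kendall_tau` and `model_scores` are illustrative and are not taken from the proposal's code.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall tau-a between two equally long lists of model scores."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


def model_scores(scores, items):
    """Mean score of each model over the given item ids."""
    n_models = len(next(iter(scores.values())))
    return [sum(scores[i][m] for i in items) / len(items) for m in range(n_models)]


# Toy data: item id -> pass/fail per model (3 models).
scores = {0: [1, 1, 1], 1: [1, 0, 0], 2: [1, 1, 0],
          3: [0, 0, 0], 4: [1, 0, 1], 5: [0, 1, 0]}
pruned_items = [1, 2, 4, 5]  # items 0 and 3 are identical across models

full = model_scores(scores, list(scores))
pruned = model_scores(scores, pruned_items)
print(kendall_tau(full, pruned))  # 1.0, i.e. the model ordering is unchanged
```

If SciPy is available, `scipy.stats.kendalltau` provides a comparable statistic (tau-b by default), which handles ties differently from this sketch.
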
## Additional Information
Integration options:

- Option A: `PruningMixin` added to existing adapters (minimal core changes)
- Option B: New `--prune-strategy` CLI flag in `evalscope eval`
- Option C: Separate `evalscope prune` subcommand that pre-computes the pruned index set

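To make Option A concrete, here is one way a mixin could filter an adapter's samples down to a pre-computed index set (which also touches on Option C). Everything in it is hypothetical: the `load_samples` method, the `pruned_index_path` attribute, and the `ToyAdapter` stand-in are invented for illustration and do not reflect evalscope's actual adapter interface or the code in the implementation PR.

```python
import json
from pathlib import Path


class PruningMixin:
    """Hypothetical mixin: keeps only samples whose index is in a saved set."""

    # Hypothetical attribute; a real integration might fill it from dataset_args.
    pruned_index_path = None

    def load_samples(self):
        samples = super().load_samples()
        if self.pruned_index_path is None:
            return samples  # pruning disabled: behave like the base adapter
        keep = set(json.loads(Path(self.pruned_index_path).read_text()))
        return [s for idx, s in enumerate(samples) if idx in keep]


class ToyAdapter:
    """Stand-in for a benchmark adapter; not an evalscope class."""

    def load_samples(self):
        return [f"sample-{i}" for i in range(5)]


class PrunedToyAdapter(PruningMixin, ToyAdapter):
    """The mixin is listed first so its load_samples wraps the adapter's."""
```
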
Implementation PR: #1391 

Looking for feedback on the preferred integration path before wiring this into the full pipeline.

