
Commit ebcc6c0

Lawhy and claude committed
refactor(eval): reorganize benchmark registry and CLI structure
- Move benchmarks to eval/benchmarks/ subdirectory for cleaner auto-discovery
- Remove _SKIP_MODULES in favor of dedicated benchmarks/ directory
- Add list_unavailable_benchmarks() to track modules with missing dependencies
- Refactor CLI: strands-env list -> strands-env eval list
- Refactor CLI: strands-env eval <benchmark> -> strands-env eval run <benchmark>
- Split CLI: move eval commands to cli/eval.py for better organization
- Update tests and documentation

Co-Authored-By: Claude Opus 4.5 <[email protected]>
1 parent 3dcfd7f commit ebcc6c0

File tree

10 files changed: +485 −362 lines changed


CLAUDE.md

Lines changed: 7 additions & 3 deletions
````diff
@@ -61,7 +61,9 @@ The package lives in `src/strands_env/` with these modules:
 
 ### `cli/`
 
-**__init__.py** — CLI entry point with `strands-env` command group. `list` shows registered benchmarks. `eval` runs benchmark evaluation with environment and optional evaluator hooks.
+**__init__.py** — CLI entry point with `strands-env` command group. Registers subcommand groups.
+
+**eval.py** — Evaluation CLI commands: `strands-env eval list` shows registered/unavailable benchmarks, `strands-env eval run` executes benchmark evaluation with environment and optional evaluator hooks.
 
 **config.py** — Configuration dataclasses: `SamplingConfig`, `ModelConfig`, `EnvConfig`, `EvalConfig`. Each has `to_dict()` for serialization. Config saved to output directory for reproducibility.
 
@@ -71,11 +73,13 @@ The package lives in `src/strands_env/` with these modules:
 
 **evaluator.py** — `Evaluator` class orchestrates concurrent rollouts with checkpointing and pass@k metrics. Takes an async `env_factory` for flexible environment creation. Uses tqdm with `logging_redirect_tqdm` for clean progress output. Subclasses implement `load_dataset()` for different benchmarks.
 
-**registry.py** — Benchmark registry with `@register_eval(name)` decorator. `get_benchmark(name)` and `list_benchmarks()` for discovery.
+**registry.py** — Benchmark registry with `@register_eval(name)` decorator. Auto-discovers benchmark modules from `benchmarks/` subdirectory on first access. `get_benchmark(name)`, `list_benchmarks()`, and `list_unavailable_benchmarks()` for discovery. Modules with missing dependencies are tracked as unavailable.
 
 **metrics.py** — `compute_pass_at_k` implements the unbiased pass@k estimator. `MetricFn` type alias for pluggable metrics.
 
-**aime.py** — `AIMEEvaluator` base class for AIME benchmarks. `AIME2024Evaluator` and `AIME2025Evaluator` registered as separate benchmarks with different dataset paths.
+**benchmarks/** — Benchmark evaluator modules. Each module uses `@register_eval` decorator. Auto-discovered on first registry access; missing dependencies cause module to be skipped with warning.
+
+**benchmarks/aime.py** — `AIMEEvaluator` base class for AIME benchmarks. `AIME2024Evaluator` and `AIME2025Evaluator` registered as separate benchmarks with different dataset paths.
 
 ### `utils/`
````
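The auto-discovery behavior that the updated `registry.py` entry describes could look roughly like the sketch below. This is an illustrative reconstruction, not the committed code: the public names (`register_eval`, `get_benchmark`, `list_benchmarks`, `list_unavailable_benchmarks`) and the `benchmarks/` scan come from the diff above, while the internal dictionaries and import logic are assumptions.

```python
# Illustrative sketch of eval/registry.py auto-discovery; internal names are assumptions.
import importlib
import logging
import pkgutil

logger = logging.getLogger(__name__)

_REGISTRY: dict[str, type] = {}    # benchmark name -> Evaluator subclass
_UNAVAILABLE: dict[str, str] = {}  # module name -> import error message
_DISCOVERED = False


def register_eval(name: str):
    """Decorator used by benchmark modules to register an Evaluator subclass."""
    def decorator(cls: type) -> type:
        _REGISTRY[name] = cls
        return cls
    return decorator


def _discover() -> None:
    """Import every module under eval/benchmarks/, recording ones that fail to import."""
    global _DISCOVERED
    if _DISCOVERED:
        return
    pkg = importlib.import_module("strands_env.eval.benchmarks")
    for mod in pkgutil.iter_modules(pkg.__path__):
        try:
            importlib.import_module(f"{pkg.__name__}.{mod.name}")
        except ImportError as exc:  # missing optional dependency
            _UNAVAILABLE[mod.name] = str(exc)
            logger.warning("Benchmark module %s unavailable: %s", mod.name, exc)
    _DISCOVERED = True


def get_benchmark(name: str) -> type:
    _discover()
    return _REGISTRY[name]


def list_benchmarks() -> list[str]:
    _discover()
    return sorted(_REGISTRY)


def list_unavailable_benchmarks() -> dict[str, str]:
    _discover()
    return dict(_UNAVAILABLE)
```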

docs/evaluation.md

Lines changed: 16 additions & 11 deletions
````diff
@@ -9,17 +9,17 @@ The `strands-env` CLI provides commands for running benchmark evaluations.
 
 ### List Benchmarks
 
 ```bash
-strands-env list
+strands-env eval list
 ```
 
 ### Run Evaluation
 
 ```bash
 # Using a registered benchmark
-strands-env eval <benchmark> --env <hook_file> [options]
+strands-env eval run <benchmark> --env <hook_file> [options]
 
 # Using a custom evaluator hook
-strands-env eval --evaluator <evaluator_file> --env <hook_file> [options]
+strands-env eval run --evaluator <evaluator_file> --env <hook_file> [options]
 ```
 
 **Required arguments:**
````
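The `eval list` / `eval run` split documented above could be wired up roughly as sketched below. This assumes a click-style command group; the actual CLI framework and the real contents of `cli/__init__.py` and `cli/eval.py` are not shown in this commit excerpt, so treat every name and option here as illustrative.

```python
# Illustrative sketch only: assumes click; the real cli/eval.py may differ.
import click


@click.group()
def cli() -> None:
    """`strands-env` entry point (cli/__init__.py): registers subcommand groups."""


@click.group(name="eval")
def eval_group() -> None:
    """Evaluation commands, split out into cli/eval.py."""


@eval_group.command(name="list")
def eval_list() -> None:
    """Show registered benchmarks plus any that are unavailable."""
    # Import path assumed; registry functions come from eval/registry.py.
    from strands_env.eval.registry import list_benchmarks, list_unavailable_benchmarks

    for name in list_benchmarks():
        click.echo(name)
    for name, error in list_unavailable_benchmarks().items():
        click.echo(f"{name}  (unavailable: {error})")


@eval_group.command(name="run")
@click.argument("benchmark", required=False)
@click.option("--env", "env_hook", required=True, help="Environment hook file.")
@click.option("--evaluator", "evaluator_hook", default=None, help="Custom evaluator hook file.")
def eval_run(benchmark: str | None, env_hook: str, evaluator_hook: str | None) -> None:
    """Run a benchmark evaluation by registered name or via --evaluator."""
    ...


cli.add_command(eval_group)
```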
````diff
@@ -59,25 +59,25 @@ strands-env eval --evaluator <evaluator_file> --env <hook_file> [options]
 
 ```bash
 # Using registered benchmark with code sandbox env
-strands-env eval aime-2024 \
+strands-env eval run aime-2024 \
   --env examples/eval/aime_code/code_sandbox_env.py \
   --base-url http://localhost:30000
 
 # Using custom evaluator hook (custom benchmark)
-strands-env eval \
+strands-env eval run \
   --evaluator examples/eval/simple_math/simple_math_evaluator.py \
   --env examples/eval/simple_math/calculator_env.py \
   --base-url http://localhost:30000
 
 # Pass@8 evaluation with high concurrency
-strands-env eval aime-2024 \
+strands-env eval run aime-2024 \
   --env examples/eval/simple_math/calculator_env.py \
   --base-url http://localhost:30000 \
   --n-samples-per-prompt 8 \
   --max-concurrency 30
 
 # With custom tool parser
-strands-env eval aime-2024 \
+strands-env eval run aime-2024 \
   --env examples/eval/simple_math/calculator_env.py \
   --base-url http://localhost:30000 \
   --tool-parser qwen_xml
````
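The pass@8 run above relies on the unbiased pass@k estimator that CLAUDE.md attributes to `metrics.py` (`compute_pass_at_k`). A minimal sketch of that estimator, with an assumed signature:

```python
# Unbiased pass@k estimator (Chen et al., 2021); the exact signature in metrics.py is assumed.
from math import comb


def compute_pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples is correct, given c correct out of n."""
    if n - c < k:
        return 1.0  # fewer than k incorrect samples, so any k-sample draw contains a correct one
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 8 samples per prompt, 3 of them correct
print(compute_pass_at_k(n=8, c=3, k=4))  # ~0.93
```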
````diff
@@ -193,18 +193,21 @@ EvaluatorClass = MyEvaluator
 
 Then run:
 ```bash
-strands-env eval --evaluator my_evaluator.py --env my_env.py --base-url http://localhost:30000
+strands-env eval run --evaluator my_evaluator.py --env my_env.py --base-url http://localhost:30000
 ```
 
 ### Registered Evaluator
 
-Alternatively, use `@register_eval` to make it available by name:
+To add a built-in benchmark, create a module in `src/strands_env/eval/benchmarks/` and use `@register_eval`:
 
 ```python
+# src/strands_env/eval/benchmarks/my_benchmark.py
 from collections.abc import Iterable
 
 from strands_env.core import Action, TaskContext
-from strands_env.eval import Evaluator, register_eval
+
+from ..evaluator import Evaluator
+from ..registry import register_eval
 
 @register_eval("my-benchmark")
 class MyEvaluator(Evaluator):
@@ -222,6 +225,8 @@ class MyEvaluator(Evaluator):
         )
 ```
 
+Benchmarks are auto-discovered from the `benchmarks/` subdirectory. If a benchmark has missing dependencies, it will be listed as unavailable in `strands-env eval list` with the import error message.
+
 ### Programmatic Usage
 
 ```python
````
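Once the new module is importable, it is picked up by the registry auto-discovery and can also be used programmatically through the functions listed in CLAUDE.md. A short sketch (the import path is an assumption):

```python
# Illustrative lookup of a registered benchmark; import path assumed.
from strands_env.eval.registry import (
    get_benchmark,
    list_benchmarks,
    list_unavailable_benchmarks,
)

print(list_benchmarks())              # includes "my-benchmark" once its module imports cleanly
print(list_unavailable_benchmarks())  # benchmark modules skipped due to missing dependencies

MyEvaluator = get_benchmark("my-benchmark")  # returns the registered Evaluator subclass
```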
````diff
@@ -296,7 +301,7 @@ ToolParserClass = MyToolParser
 
 Then use:
 ```bash
-strands-env eval aime-2024 \
+strands-env eval run aime-2024 \
   --env my_env.py \
   --base-url http://localhost:30000 \
   --tool-parser my_tool_parser.py
````
