
Commit ebcc6c0

Lawhy and claude committed
refactor(eval): reorganize benchmark registry and CLI structure
- Move benchmarks to eval/benchmarks/ subdirectory for cleaner auto-discovery
- Remove _SKIP_MODULES in favor of dedicated benchmarks/ directory
- Add list_unavailable_benchmarks() to track modules with missing dependencies
- Refactor CLI: strands-env list -> strands-env eval list
- Refactor CLI: strands-env eval <benchmark> -> strands-env eval run <benchmark>
- Split CLI: move eval commands to cli/eval.py for better organization
- Update tests and documentation

Co-Authored-By: Claude Opus 4.5 <[email protected]>
1 parent 3dcfd7f commit ebcc6c0

File tree

10 files changed: +485 −362 lines changed


CLAUDE.md

Lines changed: 7 additions & 3 deletions
````diff
@@ -61,7 +61,9 @@ The package lives in `src/strands_env/` with these modules:
 
 ### `cli/`
 
-**__init__.py** — CLI entry point with `strands-env` command group. `list` shows registered benchmarks. `eval` runs benchmark evaluation with environment and optional evaluator hooks.
+**__init__.py** — CLI entry point with `strands-env` command group. Registers subcommand groups.
+
+**eval.py** — Evaluation CLI commands: `strands-env eval list` shows registered/unavailable benchmarks, `strands-env eval run` executes benchmark evaluation with environment and optional evaluator hooks.
 
 **config.py** — Configuration dataclasses: `SamplingConfig`, `ModelConfig`, `EnvConfig`, `EvalConfig`. Each has `to_dict()` for serialization. Config saved to output directory for reproducibility.
 
@@ -71,11 +73,13 @@ The package lives in `src/strands_env/` with these modules:
 
 **evaluator.py** — `Evaluator` class orchestrates concurrent rollouts with checkpointing and pass@k metrics. Takes an async `env_factory` for flexible environment creation. Uses tqdm with `logging_redirect_tqdm` for clean progress output. Subclasses implement `load_dataset()` for different benchmarks.
 
-**registry.py** — Benchmark registry with `@register_eval(name)` decorator. `get_benchmark(name)` and `list_benchmarks()` for discovery.
+**registry.py** — Benchmark registry with `@register_eval(name)` decorator. Auto-discovers benchmark modules from `benchmarks/` subdirectory on first access. `get_benchmark(name)`, `list_benchmarks()`, and `list_unavailable_benchmarks()` for discovery. Modules with missing dependencies are tracked as unavailable.
 
 **metrics.py** — `compute_pass_at_k` implements the unbiased pass@k estimator. `MetricFn` type alias for pluggable metrics.
 
-**aime.py** — `AIMEEvaluator` base class for AIME benchmarks. `AIME2024Evaluator` and `AIME2025Evaluator` registered as separate benchmarks with different dataset paths.
+**benchmarks/** — Benchmark evaluator modules. Each module uses `@register_eval` decorator. Auto-discovered on first registry access; missing dependencies cause module to be skipped with warning.
+
+**benchmarks/aime.py** — `AIMEEvaluator` base class for AIME benchmarks. `AIME2024Evaluator` and `AIME2025Evaluator` registered as separate benchmarks with different dataset paths.
 
 ### `utils/`
````
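The auto-discovery behavior that the updated `registry.py` entry describes could look roughly like the sketch below. This is an illustrative reconstruction, not the committed code: the public names (`register_eval`, `get_benchmark`, `list_benchmarks`, `list_unavailable_benchmarks`) and the `benchmarks/` scan come from the diff above, while the internal dictionaries and import logic are assumptions.

```python
# Illustrative sketch of eval/registry.py auto-discovery; internal names are assumptions.
import importlib
import logging
import pkgutil

logger = logging.getLogger(__name__)

_REGISTRY: dict[str, type] = {}    # benchmark name -> Evaluator subclass
_UNAVAILABLE: dict[str, str] = {}  # module name -> import error message
_DISCOVERED = False


def register_eval(name: str):
    """Decorator used by benchmark modules to register an Evaluator subclass."""
    def decorator(cls: type) -> type:
        _REGISTRY[name] = cls
        return cls
    return decorator


def _discover() -> None:
    """Import every module under eval/benchmarks/, recording ones that fail to import."""
    global _DISCOVERED
    if _DISCOVERED:
        return
    pkg = importlib.import_module("strands_env.eval.benchmarks")
    for mod in pkgutil.iter_modules(pkg.__path__):
        try:
            importlib.import_module(f"{pkg.__name__}.{mod.name}")
        except ImportError as exc:  # missing optional dependency
            _UNAVAILABLE[mod.name] = str(exc)
            logger.warning("Benchmark module %s unavailable: %s", mod.name, exc)
    _DISCOVERED = True


def get_benchmark(name: str) -> type:
    _discover()
    return _REGISTRY[name]


def list_benchmarks() -> list[str]:
    _discover()
    return sorted(_REGISTRY)


def list_unavailable_benchmarks() -> dict[str, str]:
    _discover()
    return dict(_UNAVAILABLE)
```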

docs/evaluation.md

Lines changed: 16 additions & 11 deletions
````diff
@@ -9,17 +9,17 @@ The `strands-env` CLI provides commands for running benchmark evaluations.
 
 ### List Benchmarks
 
 ```bash
-strands-env list
+strands-env eval list
 ```
 
 ### Run Evaluation
 
 ```bash
 # Using a registered benchmark
-strands-env eval <benchmark> --env <hook_file> [options]
+strands-env eval run <benchmark> --env <hook_file> [options]
 
 # Using a custom evaluator hook
-strands-env eval --evaluator <evaluator_file> --env <hook_file> [options]
+strands-env eval run --evaluator <evaluator_file> --env <hook_file> [options]
 ```
 
 **Required arguments:**
````
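The `eval list` / `eval run` split documented above could be wired up roughly as sketched below. This assumes a click-style command group; the actual CLI framework and the real contents of `cli/__init__.py` and `cli/eval.py` are not shown in this commit excerpt, so treat every name and option here as illustrative.

```python
# Illustrative sketch only: assumes click; the real cli/eval.py may differ.
import click


@click.group()
def cli() -> None:
    """`strands-env` entry point (cli/__init__.py): registers subcommand groups."""


@click.group(name="eval")
def eval_group() -> None:
    """Evaluation commands, split out into cli/eval.py."""


@eval_group.command(name="list")
def eval_list() -> None:
    """Show registered benchmarks plus any that are unavailable."""
    # Import path assumed; registry functions come from eval/registry.py.
    from strands_env.eval.registry import list_benchmarks, list_unavailable_benchmarks

    for name in list_benchmarks():
        click.echo(name)
    for name, error in list_unavailable_benchmarks().items():
        click.echo(f"{name}  (unavailable: {error})")


@eval_group.command(name="run")
@click.argument("benchmark", required=False)
@click.option("--env", "env_hook", required=True, help="Environment hook file.")
@click.option("--evaluator", "evaluator_hook", default=None, help="Custom evaluator hook file.")
def eval_run(benchmark: str | None, env_hook: str, evaluator_hook: str | None) -> None:
    """Run a benchmark evaluation by registered name or via --evaluator."""
    ...


cli.add_command(eval_group)
```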
````diff
@@ -59,25 +59,25 @@ strands-env eval --evaluator <evaluator_file> --env <hook_file> [options]
 
 ```bash
 # Using registered benchmark with code sandbox env
-strands-env eval aime-2024 \
+strands-env eval run aime-2024 \
   --env examples/eval/aime_code/code_sandbox_env.py \
   --base-url http://localhost:30000
 
 # Using custom evaluator hook (custom benchmark)
-strands-env eval \
+strands-env eval run \
   --evaluator examples/eval/simple_math/simple_math_evaluator.py \
   --env examples/eval/simple_math/calculator_env.py \
   --base-url http://localhost:30000
 
 # Pass@8 evaluation with high concurrency
-strands-env eval aime-2024 \
+strands-env eval run aime-2024 \
   --env examples/eval/simple_math/calculator_env.py \
   --base-url http://localhost:30000 \
   --n-samples-per-prompt 8 \
   --max-concurrency 30
 
 # With custom tool parser
-strands-env eval aime-2024 \
+strands-env eval run aime-2024 \
   --env examples/eval/simple_math/calculator_env.py \
   --base-url http://localhost:30000 \
   --tool-parser qwen_xml
````
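The pass@8 run above relies on the unbiased pass@k estimator that CLAUDE.md attributes to `metrics.py` (`compute_pass_at_k`). A minimal sketch of that estimator, with an assumed signature:

```python
# Unbiased pass@k estimator (Chen et al., 2021); the exact signature in metrics.py is assumed.
from math import comb


def compute_pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples is correct, given c correct out of n."""
    if n - c < k:
        return 1.0  # fewer than k incorrect samples, so any k-sample draw contains a correct one
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 8 samples per prompt, 3 of them correct
print(compute_pass_at_k(n=8, c=3, k=4))  # ~0.93
```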
````diff
@@ -193,18 +193,21 @@ EvaluatorClass = MyEvaluator
 
 Then run:
 ```bash
-strands-env eval --evaluator my_evaluator.py --env my_env.py --base-url http://localhost:30000
+strands-env eval run --evaluator my_evaluator.py --env my_env.py --base-url http://localhost:30000
 ```
 
 ### Registered Evaluator
 
-Alternatively, use `@register_eval` to make it available by name:
+To add a built-in benchmark, create a module in `src/strands_env/eval/benchmarks/` and use `@register_eval`:
 
 ```python
+# src/strands_env/eval/benchmarks/my_benchmark.py
 from collections.abc import Iterable
 
 from strands_env.core import Action, TaskContext
-from strands_env.eval import Evaluator, register_eval
+
+from ..evaluator import Evaluator
+from ..registry import register_eval
 
 @register_eval("my-benchmark")
 class MyEvaluator(Evaluator):
@@ -222,6 +225,8 @@ class MyEvaluator(Evaluator):
         )
 ```
 
+Benchmarks are auto-discovered from the `benchmarks/` subdirectory. If a benchmark has missing dependencies, it will be listed as unavailable in `strands-env eval list` with the import error message.
+
 ### Programmatic Usage
 
 ```python
````
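Once the new module is importable, it is picked up by the registry auto-discovery and can also be used programmatically through the functions listed in CLAUDE.md. A short sketch (the import path is an assumption):

```python
# Illustrative lookup of a registered benchmark; import path assumed.
from strands_env.eval.registry import (
    get_benchmark,
    list_benchmarks,
    list_unavailable_benchmarks,
)

print(list_benchmarks())              # includes "my-benchmark" once its module imports cleanly
print(list_unavailable_benchmarks())  # benchmark modules skipped due to missing dependencies

MyEvaluator = get_benchmark("my-benchmark")  # returns the registered Evaluator subclass
```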
````diff
@@ -296,7 +301,7 @@ ToolParserClass = MyToolParser
 
 Then use:
 ```bash
-strands-env eval aime-2024 \
+strands-env eval run aime-2024 \
   --env my_env.py \
   --base-url http://localhost:30000 \
   --tool-parser my_tool_parser.py
````
