
Commit 76d8b57

Lawhy and claude committed
feat(cli): add strands-env CLI with benchmark registry
- Add CLI module with `strands-env list` and `strands-env eval` commands
- Add benchmark registry with `@register` decorator for evaluator discovery
- Add hook file pattern for environment definition (create_env_factory)
- Add SGLang server health check with clear error messages
- Add `@override` decorator from typing_extensions for explicit overrides
- Rename AIMEEvaluatorBase to AIMEEvaluator, register AIME2024/2025 separately
- Reorganize examples: add calculator_demo.py, move hook files to examples/envs/
- Update documentation (CHANGELOG, CLAUDE.md, README)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
1 parent 0263a9c commit 76d8b57

File tree

20 files changed: +1031 / -394 lines


CHANGELOG.md

Lines changed: 20 additions & 3 deletions
@@ -7,6 +7,26 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 ## [Unreleased]

+### Added
+
+- **CLI** (`strands-env`)
+  - `strands-env list`: List registered benchmarks.
+  - `strands-env eval <benchmark> --env <hook_file>`: Run benchmark evaluation.
+  - Hook file pattern: Python files exporting `create_env_factory(model_factory, env_config)`.
+  - Support for `--backend sglang|bedrock`, `--profile`, `--role-arn`, and sampling options.
+  - SGLang server health check with clear error messages.
+- **Benchmark Registry**
+  - `@register(name)` decorator for registering benchmark evaluators.
+  - `get_benchmark(name)` and `list_benchmarks()` for discovery.
+  - `AIME2024Evaluator` and `AIME2025Evaluator` as separate registered benchmarks.
+- **Code Quality**
+  - `@override` decorator from `typing_extensions` for explicit method overrides.
+
+### Changed
+
+- Reorganized examples: removed `aime_eval.py` and `common.py`, added `calculator_demo.py`.
+- Hook files moved to `examples/envs/`.
 ## [0.1.1] - 2026-02-06

 ### Added
@@ -23,9 +43,6 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - `utils/aws.py`: AWS boto3 session caching with `RefreshableCredentials` for auto-refresh.
 - **Tools**
   - `CodeInterpreterToolkit`: `execute_code` and `execute_command` for sandboxed execution.
-- **Examples**
-  - `aime_eval.py`: Support `--env chat` and `--env code` modes with `--role-arn` option.
-  - `common.py`: Use cached SGLang client with connection pooling.

 ## [0.1.0] - 2026-02-03
CLAUDE.md

Lines changed: 12 additions & 2 deletions
@@ -59,17 +59,27 @@ The package lives in `src/strands_env/` with these modules:
 **environment.py** — Base `Environment` class. `step(action)` creates a fresh model via factory, attaches a `TokenManager`, builds an `Agent` with tools/hooks (always includes `ToolIterationLimiter`), runs `invoke_async`, then collects metrics and optional reward. Subclasses override `get_tools()` and `get_hooks()` to customize. Messages are sliced so only new messages from the current step appear in the observation.
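A minimal sketch of that subclassing pattern, also using the `@override` decorator this commit adopts. This is illustrative only; the import path and class name are assumptions, not code from the repo:

```python
from typing_extensions import override

from strands_env.environment import Environment  # import path assumed


class ChatOnlyEnv(Environment):
    """Hypothetical subclass that customizes only the tool/hook surface."""

    @override
    def get_tools(self):
        return []  # tools exposed to the Agent for this environment

    @override
    def get_hooks(self):
        return []  # appended alongside the always-included ToolIterationLimiter
```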

+### `cli/`
+
+**__init__.py** — CLI entry point with the `strands-env` command group. `list` shows registered benchmarks. `eval` runs a benchmark evaluation with a hook file.
+
+**config.py** — Configuration dataclasses: `SamplingConfig`, `ModelConfig`, `EnvConfig`, `EvalConfig`. Passed to hook files and factory builders.
+
+**utils.py** — `build_model_factory(config, max_concurrency)` creates SGLang or Bedrock model factories. `load_env_hook(path)` loads hook files. SGLang health check with clear error messages.
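A rough sketch of how a hook-file loader such as `load_env_hook(path)` can work with `importlib`. This is an assumption about the implementation, not the actual code:

```python
import importlib.util
from pathlib import Path


def load_env_hook(path: str):
    """Import a hook file by path and return its create_env_factory callable."""
    spec = importlib.util.spec_from_file_location(Path(path).stem, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module.create_env_factory
```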
 ### `eval/`

 **evaluator.py** — `Evaluator` class orchestrates concurrent rollouts with checkpointing and pass@k metrics. Takes an async `env_factory` for flexible environment creation. Uses tqdm with `logging_redirect_tqdm` for clean progress output. Subclasses implement `load_dataset()` for different benchmarks.
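For illustration, a generic pattern for bounding concurrent rollouts with a semaphore while keeping tqdm output clean. This sketches the technique, not the `Evaluator` implementation itself:

```python
import asyncio

from tqdm import tqdm
from tqdm.contrib.logging import logging_redirect_tqdm


async def run_bounded(coros, max_concurrency=30):
    """Run coroutines with bounded concurrency and a clean progress bar."""
    sem = asyncio.Semaphore(max_concurrency)

    async def guarded(coro):
        async with sem:
            return await coro

    results = []
    with logging_redirect_tqdm():
        pending = [guarded(c) for c in coros]
        for fut in tqdm(asyncio.as_completed(pending), total=len(pending)):
            results.append(await fut)
    return results
```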

+**registry.py** — Benchmark registry with `@register(name)` decorator. `get_benchmark(name)` and `list_benchmarks()` for discovery.
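A minimal sketch of what such a decorator-based registry usually looks like (illustrative; the real module may differ):

```python
from typing import Callable

_BENCHMARKS: dict[str, type] = {}


def register(name: str) -> Callable[[type], type]:
    """Class decorator: record an evaluator class under a benchmark name."""
    def decorator(cls: type) -> type:
        _BENCHMARKS[name] = cls
        return cls
    return decorator


def get_benchmark(name: str) -> type:
    return _BENCHMARKS[name]


def list_benchmarks() -> list[str]:
    return sorted(_BENCHMARKS)
```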
 **metrics.py** — `pass_at_k_metric` implements the unbiased pass@k estimator. `MetricFn` type alias for pluggable metrics.
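For reference, the standard unbiased pass@k estimator (Chen et al., 2021) that a metric like `pass_at_k_metric` computes, for n samples with c correct; the exact code in `metrics.py` may differ:

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem with n samples, c of them correct."""
    if n - c < k:
        return 1.0  # every size-k draw contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```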

-**aime.py** — `AIMEEvaluator` subclass for AIME benchmark evaluation.
+**aime.py** — `AIMEEvaluator` base class for AIME benchmarks. `AIME2024Evaluator` and `AIME2025Evaluator` registered as separate benchmarks with different dataset paths.

 ### `utils/`
-**sglang.py** — SGLang client caching with `@lru_cache`. `get_cached_client(base_url, max_connections)` for connection pooling. `get_cached_client_from_slime_args(args)` for slime RL training integration.
+**sglang.py** — SGLang client caching with `@lru_cache`. `get_cached_client(base_url, max_connections)` for connection pooling. `get_cached_client_from_slime_args(args)` for slime RL training integration. `check_server_health(base_url)` for early validation.
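A sketch of the caching pattern described here, assuming an `httpx.AsyncClient`; the actual client type, constructor arguments, and health endpoint path are assumptions, not taken from the repo:

```python
from functools import lru_cache

import httpx


@lru_cache(maxsize=None)
def get_cached_client(base_url: str, max_connections: int = 100) -> httpx.AsyncClient:
    """Return one shared async client per (base_url, max_connections) pair."""
    limits = httpx.Limits(max_connections=max_connections)
    return httpx.AsyncClient(base_url=base_url, limits=limits)


def check_server_health(base_url: str, timeout: float = 5.0) -> None:
    """Fail fast if the server is unreachable ("/health" endpoint assumed)."""
    httpx.get(f"{base_url}/health", timeout=timeout).raise_for_status()
```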

 **aws.py** — AWS boto3 session caching. `get_boto3_session(region, profile_name)` with `@lru_cache` (boto3 handles credential refresh). `get_assumed_role_session(role_arn, region)` uses `RefreshableCredentials` for programmatic role assumption with auto-refresh.
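A sketch of the common botocore pattern behind a helper like `get_assumed_role_session` (illustrative; the session name and credential-injection details are assumptions, not the code in `aws.py`):

```python
import boto3
from botocore.credentials import RefreshableCredentials
from botocore.session import get_session


def get_assumed_role_session(role_arn: str, region: str) -> boto3.Session:
    """Assume a role and return a boto3 Session whose credentials auto-refresh."""
    sts = boto3.client("sts", region_name=region)

    def _refresh() -> dict:
        creds = sts.assume_role(
            RoleArn=role_arn,
            RoleSessionName="strands-env",  # hypothetical session name
        )["Credentials"]
        return {
            "access_key": creds["AccessKeyId"],
            "secret_key": creds["SecretAccessKey"],
            "token": creds["SessionToken"],
            "expiry_time": creds["Expiration"].isoformat(),
        }

    refreshable = RefreshableCredentials.create_from_metadata(
        metadata=_refresh(),
        refresh_using=_refresh,
        method="sts-assume-role",
    )
    botocore_session = get_session()
    botocore_session._credentials = refreshable  # common, if private, injection point
    botocore_session.set_config_variable("region", region)
    return boto3.Session(botocore_session=botocore_session)
```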

README.md

Lines changed: 49 additions & 25 deletions
@@ -62,10 +62,10 @@ result.reward.reward # 1.0
 result.termination_reason # TerminationReason.TASK_COMPLETE
 ```

-See [`examples/math_env.py`](examples/math_env.py) for a complete example:
+See [`examples/calculator_demo.py`](examples/calculator_demo.py) for a complete example:

 ```bash
-python examples/math_env.py --backend sglang --sglang-base-url http://localhost:30000
+python examples/calculator_demo.py --backend sglang --base-url http://localhost:30000
 ```
## RL Training
@@ -116,38 +116,62 @@ Key points:
 ## Evaluation

-The `Evaluator` orchestrates concurrent rollouts with checkpointing and pass@k metrics. It takes an async `env_factory` for flexible environment creation per sample, and subclasses implement `load_dataset` for different benchmarks:
-
-```python
-...
-from strands_env.eval import Evaluator
-
-class YourEvaluator(Evaluator):
-    benchmark_name = "YourBenchmark"
-
-    def load_dataset(self) -> Iterable[Action]:
-        ...
-
-async def env_factory(action: Action) -> Environment:
-    ...
-
-evaluator = YourEvaluator(
-    env_factory=env_factory,
-    n_samples_per_prompt=8,
-    max_concurrency=30,
-    keep_tokens=False,  # Set True if requiring token-level trajectories (SGLang only)
-    metrics_fns=[...],  # Define more metrics, pass@k has been included by default
-)
-
-actions = evaluator.load_dataset()
-results = await evaluator.run(actions)
-metrics = evaluator.compute_metrics(results)  # {"pass@1": 0.75, "pass@8": 0.95}
-```
-
-See [`examples/aime_eval.py`](examples/aime_eval.py) for a complete example:
-
-```bash
-python examples/aime_eval.py --backend sglang --sglang-base-url http://localhost:30000
-```
+### CLI
+
+The `strands-env` CLI provides commands for running benchmark evaluations:
+
+```bash
+# List available benchmarks
+strands-env list
+
+# Run AIME 2024 evaluation with SGLang
+strands-env eval aime-2024 --env examples/envs/calculator_env.py --backend sglang
+
+# Run with Bedrock
+strands-env eval aime-2024 --env examples/envs/code_sandbox_env.py --backend bedrock --model-id us.anthropic.claude-sonnet-4-20250514
+
+# With multiple samples for pass@k
+strands-env eval aime-2024 --env examples/envs/calculator_env.py --backend sglang --n-samples 8 --max-concurrency 30
+```
+
+### Hook Files
+
+Environment hook files define how environments are created. They export a `create_env_factory` function:
+
+```python
+# examples/envs/calculator_env.py
+from strands_env.cli.config import EnvConfig
+from strands_env.core.models import ModelFactory
+from strands_env.environments.calculator import CalculatorEnv
+from strands_env.rewards.math_reward import MathRewardFunction
+
+def create_env_factory(model_factory: ModelFactory, env_config: EnvConfig):
+    reward_fn = MathRewardFunction()
+
+    async def env_factory(_action):
+        return CalculatorEnv(
+            model_factory=model_factory,
+            reward_fn=reward_fn,
+            system_prompt=env_config.system_prompt,
+            max_tool_iterations=env_config.max_tool_iterations,
+        )
+
+    return env_factory
+```
+
+### Programmatic Usage
+
+For custom evaluators, subclass `Evaluator` and implement `load_dataset`:
+
+```python
+from strands_env.eval import Evaluator, register
+
+@register("my-benchmark")
+class MyEvaluator(Evaluator):
+    benchmark_name = "my-benchmark"
+
+    def load_dataset(self) -> Iterable[Action]:
+        ...
+```

## Development

examples/aime_eval.py

Lines changed: 0 additions & 177 deletions
This file was deleted.
