
Commit 76d8b57

Lawhy and claude committed
feat(cli): add strands-env CLI with benchmark registry
- Add CLI module with `strands-env list` and `strands-env eval` commands
- Add benchmark registry with `@register` decorator for evaluator discovery
- Add hook file pattern for environment definition (create_env_factory)
- Add SGLang server health check with clear error messages
- Add `@override` decorator from typing_extensions for explicit overrides
- Rename AIMEEvaluatorBase to AIMEEvaluator, register AIME2024/2025 separately
- Reorganize examples: add calculator_demo.py, move hook files to examples/envs/
- Update documentation (CHANGELOG, CLAUDE.md, README)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
1 parent 0263a9c commit 76d8b57

File tree

20 files changed: +1031 / -394 lines


CHANGELOG.md

Lines changed: 20 additions & 3 deletions
@@ -7,6 +7,26 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 ## [Unreleased]

+### Added
+
+- **CLI** (`strands-env`)
+  - `strands-env list`: List registered benchmarks.
+  - `strands-env eval <benchmark> --env <hook_file>`: Run benchmark evaluation.
+  - Hook file pattern: Python files exporting `create_env_factory(model_factory, env_config)`.
+  - Support for `--backend sglang|bedrock`, `--profile`, `--role-arn`, and sampling options.
+  - SGLang server health check with clear error messages.
+- **Benchmark Registry**
+  - `@register(name)` decorator for registering benchmark evaluators.
+  - `get_benchmark(name)` and `list_benchmarks()` for discovery.
+  - `AIME2024Evaluator` and `AIME2025Evaluator` as separate registered benchmarks.
+- **Code Quality**
+  - `@override` decorator from `typing_extensions` for explicit method overrides.
+
+### Changed
+
+- Reorganized examples: removed `aime_eval.py` and `common.py`, added `calculator_demo.py`.
+- Hook files moved to `examples/envs/`.
 ## [0.1.1] - 2026-02-06

 ### Added
@@ -23,9 +43,6 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - `utils/aws.py`: AWS boto3 session caching with `RefreshableCredentials` for auto-refresh.
 - **Tools**
   - `CodeInterpreterToolkit`: `execute_code` and `execute_command` for sandboxed execution.
-- **Examples**
-  - `aime_eval.py`: Support `--env chat` and `--env code` modes with `--role-arn` option.
-  - `common.py`: Use cached SGLang client with connection pooling.

 ## [0.1.0] - 2026-02-03
CLAUDE.md

Lines changed: 12 additions & 2 deletions
@@ -59,17 +59,27 @@ The package lives in `src/strands_env/` with these modules:
 **environment.py** — Base `Environment` class. `step(action)` creates a fresh model via factory, attaches a `TokenManager`, builds an `Agent` with tools/hooks (always includes `ToolIterationLimiter`), runs `invoke_async`, then collects metrics and optional reward. Subclasses override `get_tools()` and `get_hooks()` to customize. Messages are sliced so only new messages from the current step appear in the observation.
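A minimal sketch of that subclassing pattern, also using the `@override` decorator this commit adopts. This is illustrative only; the import path and class name are assumptions, not code from the repo:

```python
from typing_extensions import override

from strands_env.environment import Environment  # import path assumed


class ChatOnlyEnv(Environment):
    """Hypothetical subclass that customizes only the tool/hook surface."""

    @override
    def get_tools(self):
        return []  # tools exposed to the Agent for this environment

    @override
    def get_hooks(self):
        return []  # appended alongside the always-included ToolIterationLimiter
```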

+### `cli/`
+
+**__init__.py** — CLI entry point with the `strands-env` command group. `list` shows registered benchmarks. `eval` runs a benchmark evaluation with a hook file.
+
+**config.py** — Configuration dataclasses: `SamplingConfig`, `ModelConfig`, `EnvConfig`, `EvalConfig`. Passed to hook files and factory builders.
+
+**utils.py** — `build_model_factory(config, max_concurrency)` creates SGLang or Bedrock model factories. `load_env_hook(path)` loads hook files. SGLang health check with clear error messages.
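A rough sketch of how a hook-file loader such as `load_env_hook(path)` can work with `importlib`. This is an assumption about the implementation, not the actual code:

```python
import importlib.util
from pathlib import Path


def load_env_hook(path: str):
    """Import a hook file by path and return its create_env_factory callable."""
    spec = importlib.util.spec_from_file_location(Path(path).stem, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module.create_env_factory
```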
 ### `eval/`

 **evaluator.py** — `Evaluator` class orchestrates concurrent rollouts with checkpointing and pass@k metrics. Takes an async `env_factory` for flexible environment creation. Uses tqdm with `logging_redirect_tqdm` for clean progress output. Subclasses implement `load_dataset()` for different benchmarks.
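For illustration, a generic pattern for bounding concurrent rollouts with a semaphore while keeping tqdm output clean. This sketches the technique, not the `Evaluator` implementation itself:

```python
import asyncio

from tqdm import tqdm
from tqdm.contrib.logging import logging_redirect_tqdm


async def run_bounded(coros, max_concurrency=30):
    """Run coroutines with bounded concurrency and a clean progress bar."""
    sem = asyncio.Semaphore(max_concurrency)

    async def guarded(coro):
        async with sem:
            return await coro

    results = []
    with logging_redirect_tqdm():
        pending = [guarded(c) for c in coros]
        for fut in tqdm(asyncio.as_completed(pending), total=len(pending)):
            results.append(await fut)
    return results
```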

+**registry.py** — Benchmark registry with `@register(name)` decorator. `get_benchmark(name)` and `list_benchmarks()` for discovery.
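A minimal sketch of what such a decorator-based registry usually looks like (illustrative; the real module may differ):

```python
from typing import Callable

_BENCHMARKS: dict[str, type] = {}


def register(name: str) -> Callable[[type], type]:
    """Class decorator: record an evaluator class under a benchmark name."""
    def decorator(cls: type) -> type:
        _BENCHMARKS[name] = cls
        return cls
    return decorator


def get_benchmark(name: str) -> type:
    return _BENCHMARKS[name]


def list_benchmarks() -> list[str]:
    return sorted(_BENCHMARKS)
```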
 **metrics.py** — `pass_at_k_metric` implements the unbiased pass@k estimator. `MetricFn` type alias for pluggable metrics.
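For reference, the standard unbiased pass@k estimator (Chen et al., 2021) that a metric like `pass_at_k_metric` computes, for n samples with c correct; the exact code in `metrics.py` may differ:

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem with n samples, c of them correct."""
    if n - c < k:
        return 1.0  # every size-k draw contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```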

-**aime.py** — `AIMEEvaluator` subclass for AIME benchmark evaluation.
+**aime.py** — `AIMEEvaluator` base class for AIME benchmarks. `AIME2024Evaluator` and `AIME2025Evaluator` registered as separate benchmarks with different dataset paths.

 ### `utils/`
-**sglang.py** — SGLang client caching with `@lru_cache`. `get_cached_client(base_url, max_connections)` for connection pooling. `get_cached_client_from_slime_args(args)` for slime RL training integration.
+**sglang.py** — SGLang client caching with `@lru_cache`. `get_cached_client(base_url, max_connections)` for connection pooling. `get_cached_client_from_slime_args(args)` for slime RL training integration. `check_server_health(base_url)` for early validation.
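A sketch of the caching pattern described here, assuming an `httpx.AsyncClient`; the actual client type, constructor arguments, and health endpoint path are assumptions, not taken from the repo:

```python
from functools import lru_cache

import httpx


@lru_cache(maxsize=None)
def get_cached_client(base_url: str, max_connections: int = 100) -> httpx.AsyncClient:
    """Return one shared async client per (base_url, max_connections) pair."""
    limits = httpx.Limits(max_connections=max_connections)
    return httpx.AsyncClient(base_url=base_url, limits=limits)


def check_server_health(base_url: str, timeout: float = 5.0) -> None:
    """Fail fast if the server is unreachable ("/health" endpoint assumed)."""
    httpx.get(f"{base_url}/health", timeout=timeout).raise_for_status()
```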

 **aws.py** — AWS boto3 session caching. `get_boto3_session(region, profile_name)` with `@lru_cache` (boto3 handles credential refresh). `get_assumed_role_session(role_arn, region)` uses `RefreshableCredentials` for programmatic role assumption with auto-refresh.
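A sketch of the common botocore pattern behind a helper like `get_assumed_role_session` (illustrative; the session name and credential-injection details are assumptions, not the code in `aws.py`):

```python
import boto3
from botocore.credentials import RefreshableCredentials
from botocore.session import get_session


def get_assumed_role_session(role_arn: str, region: str) -> boto3.Session:
    """Assume a role and return a boto3 Session whose credentials auto-refresh."""
    sts = boto3.client("sts", region_name=region)

    def _refresh() -> dict:
        creds = sts.assume_role(
            RoleArn=role_arn,
            RoleSessionName="strands-env",  # hypothetical session name
        )["Credentials"]
        return {
            "access_key": creds["AccessKeyId"],
            "secret_key": creds["SecretAccessKey"],
            "token": creds["SessionToken"],
            "expiry_time": creds["Expiration"].isoformat(),
        }

    refreshable = RefreshableCredentials.create_from_metadata(
        metadata=_refresh(),
        refresh_using=_refresh,
        method="sts-assume-role",
    )
    botocore_session = get_session()
    botocore_session._credentials = refreshable  # common, if private, injection point
    botocore_session.set_config_variable("region", region)
    return boto3.Session(botocore_session=botocore_session)
```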

README.md

Lines changed: 49 additions & 25 deletions
@@ -62,10 +62,10 @@ result.reward.reward # 1.0
 result.termination_reason # TerminationReason.TASK_COMPLETE
 ```

-See [`examples/math_env.py`](examples/math_env.py) for a complete example:
+See [`examples/calculator_demo.py`](examples/calculator_demo.py) for a complete example:

 ```bash
-python examples/math_env.py --backend sglang --sglang-base-url http://localhost:30000
+python examples/calculator_demo.py --backend sglang --base-url http://localhost:30000
 ```
## RL Training
@@ -116,38 +116,62 @@ Key points:
 ## Evaluation

-The `Evaluator` orchestrates concurrent rollouts with checkpointing and pass@k metrics. It takes an async `env_factory` for flexible environment creation per sample, and subclasses implement `load_dataset` for different benchmarks:
-
-```python
-...
-from strands_env.eval import Evaluator
-
-class YourEvaluator(Evaluator):
-    benchmark_name = "YourBenchmark"
-
-    def load_dataset(self) -> Iterable[Action]:
-        ...
-
-async def env_factory(action: Action) -> Environment:
-    ...
-
-evaluator = YourEvaluator(
-    env_factory=env_factory,
-    n_samples_per_prompt=8,
-    max_concurrency=30,
-    keep_tokens=False,  # Set True if requiring token-level trajectories (SGLang only)
-    metrics_fns=[...],  # Define more metrics, pass@k has been included by default
-)
-
-actions = evaluator.load_dataset()
-results = await evaluator.run(actions)
-metrics = evaluator.compute_metrics(results)  # {"pass@1": 0.75, "pass@8": 0.95}
-```
-
-See [`examples/aime_eval.py`](examples/aime_eval.py) for a complete example:
-
-```bash
-python examples/aime_eval.py --backend sglang --sglang-base-url http://localhost:30000
-```
+### CLI
+
+The `strands-env` CLI provides commands for running benchmark evaluations:
+
+```bash
+# List available benchmarks
+strands-env list
+
+# Run AIME 2024 evaluation with SGLang
+strands-env eval aime-2024 --env examples/envs/calculator_env.py --backend sglang
+
+# Run with Bedrock
+strands-env eval aime-2024 --env examples/envs/code_sandbox_env.py --backend bedrock --model-id us.anthropic.claude-sonnet-4-20250514
+
+# With multiple samples for pass@k
+strands-env eval aime-2024 --env examples/envs/calculator_env.py --backend sglang --n-samples 8 --max-concurrency 30
+```
+
+### Hook Files
+
+Environment hook files define how environments are created. They export a `create_env_factory` function:
+
+```python
+# examples/envs/calculator_env.py
+from strands_env.cli.config import EnvConfig
+from strands_env.core.models import ModelFactory
+from strands_env.environments.calculator import CalculatorEnv
+from strands_env.rewards.math_reward import MathRewardFunction
+
+def create_env_factory(model_factory: ModelFactory, env_config: EnvConfig):
+    reward_fn = MathRewardFunction()
+
+    async def env_factory(_action):
+        return CalculatorEnv(
+            model_factory=model_factory,
+            reward_fn=reward_fn,
+            system_prompt=env_config.system_prompt,
+            max_tool_iterations=env_config.max_tool_iterations,
+        )
+
+    return env_factory
+```
+
+### Programmatic Usage
+
+For custom evaluators, subclass `Evaluator` and implement `load_dataset`:
+
+```python
+from strands_env.eval import Evaluator, register
+
+@register("my-benchmark")
+class MyEvaluator(Evaluator):
+    benchmark_name = "my-benchmark"
+
+    def load_dataset(self) -> Iterable[Action]:
+        ...
+```

## Development

examples/aime_eval.py

Lines changed: 0 additions & 177 deletions
This file was deleted.
