> `strands-agents` is designed for serving, not training. `strands-env` integrates [`strands-sglang`](https://github.com/horizon-rl/strands-sglang) to bridge this gap.
This package standardizes agent environments by treating each `env.step()` as a full agent loop (`prompt → (tool_call, tool_response)* → response`) rather than a single model call. It is built on the [strands](https://github.com/strands-agents/sdk-python) agent loop and uses [`strands-sglang`](https://github.com/horizon-rl/strands-sglang) for RL training.

## Features
- **Define environments easily** — subclass `Environment` and implement tools as `@tool` functions
- **Capture token-level observations** — token-in/token-out (TITO) data for on-policy RL training (SGLang backend)
- **Plug in reward functions** — evaluate agent outputs with a custom `RewardFunction`
- **Run benchmarks** — `Evaluator` with pass@k metrics, checkpointing, and resume

## Define an environment

Subclass `Environment` and customize your tools:
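As a minimal sketch (the `Environment` import path and the `get_tools` hook below are assumptions for illustration; only the `@tool` decorator is standard strands):

```python
from strands import tool

from strands_env.core import Environment  # assumed import path


@tool
def calculator(expression: str) -> str:
    """Evaluate a simple arithmetic expression."""
    # eval() keeps the toy example short; use a safe parser in real tools.
    return str(eval(expression))


class MathEnvironment(Environment):
    """Toy math environment exposing one calculator tool."""

    def get_tools(self):
        # Hypothetical hook: expose the tools available inside each step().
        return [calculator]
```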
Each `step()` runs a full agent loop (reasoning + tool calls), not a single model call. Strands' hook-based design makes it easy to customize what happens within each step.
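A small sketch of such a customization, assuming the hook interfaces shipped in recent strands-agents releases (exact event and registry names may vary by version):

```python
from strands.hooks import AfterInvocationEvent, HookProvider, HookRegistry


class StepTimer(HookProvider):
    """Observe the end of each agent loop (i.e., one env.step())."""

    def register_hooks(self, registry: HookRegistry, **kwargs) -> None:
        registry.add_callback(AfterInvocationEvent, self.on_done)

    def on_done(self, event: AfterInvocationEvent) -> None:
        # event.agent is the strands Agent that just completed its loop.
        print(f"step finished for agent: {event.agent.name}")
```

How the provider reaches the agent is up to your `Environment` subclass; strands agents accept hook providers at construction (e.g. `Agent(hooks=[StepTimer()])`).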
See [`examples/math_env.py`](examples/math_env.py) for a complete example.

## Install
## Train with slime

For RL training with [slime](https://github.com/THUDM/slime/), customize the `generate` and `reward_func` methods to replace single-shot generation with an agentic rollout:
```python
from strands_env.core import Action, TaskContext
from strands_env.core.models import sglang_model_factory
from strands_env.utils import get_cached_client_from_slime_args
```

See [`examples/math_env.py`](examples/math_env.py) for a complete runnable example.
Key points (illustrated in the sketch below):

- `get_cached_client_from_slime_args(args)` provides connection pooling across rollouts
- `TokenObservation` contains token IDs and logprobs for on-policy training
- Reward is computed separately to allow async/batched reward computation
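To make these concrete, here is a rough sketch of an agentic `generate` replacement. The slime hook signature and every `strands_env` call signature shown are assumptions for illustration; the real wiring lives in [`examples/math_env.py`](examples/math_env.py).

```python
from strands_env.core import Action, TaskContext
from strands_env.core.models import sglang_model_factory
from strands_env.utils import get_cached_client_from_slime_args


# Hypothetical sketch: slime's hook signature and the strands_env call
# signatures below are assumed, not confirmed; see examples/math_env.py.
async def generate(args, sample):
    # One pooled HTTP client shared across rollouts (no per-rollout reconnect).
    client = get_cached_client_from_slime_args(args)

    # A TITO-capable model backed by the SGLang server (assumed factory call).
    model = sglang_model_factory(client)

    # MathEnvironment is the toy subclass sketched earlier in this README.
    env = MathEnvironment(model=model)  # assumed constructor

    # One step() == one full agent loop; the returned observation carries
    # token IDs and logprobs (TokenObservation) for on-policy training.
    observation = await env.step(
        Action(prompt=sample["prompt"]),  # assumed Action fields
        TaskContext(metadata=sample),     # assumed TaskContext fields
    )

    # Reward is intentionally not computed here; reward_func runs separately
    # so rewards can be batched or awaited asynchronously.
    return observation
```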
## Evaluation
The `Evaluator` orchestrates concurrent rollouts with checkpointing and pass@k metrics. It takes an async `env_factory` for flexible environment creation per sample, and subclasses implement `load_dataset` for different benchmarks:
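A rough sketch, assuming an import path and sample schema that the text above does not specify (only `load_dataset` and the async `env_factory` are documented here):

```python
from strands_env import Evaluator  # assumed import path


class ToyMathBenchmark(Evaluator):
    """Hypothetical benchmark: load_dataset() is the documented hook."""

    def load_dataset(self):
        # Return the samples to evaluate; the schema is illustrative.
        return [
            {"prompt": "What is 12 * 7?", "answer": "84"},
            {"prompt": "What is 3 ** 4?", "answer": "81"},
        ]


async def env_factory(sample):
    # Fresh environment per sample (assumed factory contract); MathEnvironment
    # is the toy subclass sketched in "Define an environment" above.
    return MathEnvironment()
```

For reference, the standard unbiased pass@k estimator over `n` rollouts with `c` successes is `1 - comb(n - c, k) / comb(n, k)`.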