## What's New

### CLI for Benchmark Evaluation

- `strands-env list` - List registered benchmarks
- `strands-env eval <benchmark> --env <hook_file>` - Run evaluations with SGLang or Bedrock backends
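
For instance, listing the registered benchmarks and then starting a run could look like the snippet below; the hook-file path is a placeholder, and `--backend bedrock` is an assumed value inferred from the backends named above (only `sglang` appears verbatim in the example further down):

```bash
# Show which benchmarks are registered
strands-env list

# Evaluate a built-in benchmark with a custom environment hook
# (my_env.py is a placeholder; "bedrock" is an assumed backend value)
strands-env eval aime-2024 --env my_env.py --backend bedrock
```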

### Evaluator Hooks

- Custom evaluator support via the `--evaluator` flag for implementing benchmarks
- Environment hooks for flexible environment configuration (environments are not necessarily tied to benchmarks); see the sketch after this list
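
The hook-file interface is not documented in these notes, so the following is only a hypothetical Python sketch of what a hook file such as `examples/envs/calculator_env.py` might contain. The names `calculator`, `build_environment`, and `evaluate`, along with their signatures and return shapes, are all assumptions, not the library's actual API:

```python
# HYPOTHETICAL sketch of a strands-env hook file. The real hook interface
# is not shown in these release notes; every name and signature below is
# an assumption, not the library's API.

def calculator(expression: str) -> str:
    """Toy tool: evaluate a basic arithmetic expression.

    eval() on untrusted input is unsafe; this is for local illustration only.
    """
    return str(eval(expression, {"__builtins__": {}}, {}))

def build_environment() -> dict:
    """Assumed entry point loaded from the file passed to --env.

    Returns the tools (and any other configuration) available to the agent
    during evaluation; the exact return shape is an assumption.
    """
    return {"tools": [calculator]}

def evaluate(completion: str, reference: str) -> float:
    """Assumed custom-evaluator entry point for a file passed to --evaluator.

    Scores one completion against a reference answer by exact match; a real
    benchmark evaluator (e.g. for AIME) would parse the final answer instead.
    """
    return 1.0 if completion.strip() == reference.strip() else 0.0
```

Because environment hooks are decoupled from benchmarks, the same `--env` file can be reused across `aime-2024` and `aime-2025`, or swapped per run.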

### Reproducibility

- `config.json` saved to the output directory with the full configuration
- Auto-backfill of `model_id`, `tokenizer_path`, and `system_prompt`
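
As an illustration only, a saved `config.json` might look like the sketch below; the exact schema is not shown in these notes, so the keys are inferred from the CLI flags and the backfilled fields named above, with placeholder values:

```json
{
  "benchmark": "aime-2024",
  "env": "examples/envs/calculator_env.py",
  "backend": "sglang",
  "n_samples_per_prompt": 8,
  "max_concurrency": 30,
  "model_id": "...",
  "tokenizer_path": "...",
  "system_prompt": "..."
}
```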

### Built-in Benchmarks

- `aime-2024` - AIME 2024 math competition
- `aime-2025` - AIME 2025 math competition

### Example

```bash
strands-env eval aime-2024 \
  --env examples/envs/calculator_env.py \
  --backend sglang \
  --n-samples-per-prompt 8 \
  --max-concurrency 30
```

**Full Changelog**: v0.1.1...v0.1.2