## What's New

### CLI for Benchmark Evaluation

- `strands-env list` - List registered benchmarks
- `strands-env eval <benchmark> --env <hook_file>` - Run evaluations with SGLang or Bedrock backends
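
For instance, listing the registered benchmarks and then starting a run could look like the snippet below; the hook-file path is a placeholder, and `--backend bedrock` is an assumed value inferred from the backends named above (only `sglang` appears verbatim in the example further down):

```bash
# Show which benchmarks are registered
strands-env list

# Evaluate a built-in benchmark with a custom environment hook
# (my_env.py is a placeholder; "bedrock" is an assumed backend value)
strands-env eval aime-2024 --env my_env.py --backend bedrock
```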

### Evaluator Hooks

- Custom evaluator support via the `--evaluator` flag for implementing benchmarks
- Environment hooks for flexible environment configuration (environments are not necessarily tied to benchmarks); see the sketch after this list
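
The hook-file interface is not documented in these notes, so the following is only a hypothetical Python sketch of what a hook file such as `examples/envs/calculator_env.py` might contain. The names `calculator`, `build_environment`, and `evaluate`, along with their signatures and return shapes, are all assumptions, not the library's actual API:

```python
# HYPOTHETICAL sketch of a strands-env hook file. The real hook interface
# is not shown in these release notes; every name and signature below is
# an assumption, not the library's API.

def calculator(expression: str) -> str:
    """Toy tool: evaluate a basic arithmetic expression.

    eval() on untrusted input is unsafe; this is for local illustration only.
    """
    return str(eval(expression, {"__builtins__": {}}, {}))

def build_environment() -> dict:
    """Assumed entry point loaded from the file passed to --env.

    Returns the tools (and any other configuration) available to the agent
    during evaluation; the exact return shape is an assumption.
    """
    return {"tools": [calculator]}

def evaluate(completion: str, reference: str) -> float:
    """Assumed custom-evaluator entry point for a file passed to --evaluator.

    Scores one completion against a reference answer by exact match; a real
    benchmark evaluator (e.g. for AIME) would parse the final answer instead.
    """
    return 1.0 if completion.strip() == reference.strip() else 0.0
```

Because environment hooks are decoupled from benchmarks, the same `--env` file can be reused across `aime-2024` and `aime-2025`, or swapped per run.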

### Reproducibility

- `config.json` saved to the output directory with the full configuration
- Auto-backfill of `model_id`, `tokenizer_path`, and `system_prompt`
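
As an illustration only, a saved `config.json` might look like the sketch below; the exact schema is not shown in these notes, so the keys are inferred from the CLI flags and the backfilled fields named above, with placeholder values:

```json
{
  "benchmark": "aime-2024",
  "env": "examples/envs/calculator_env.py",
  "backend": "sglang",
  "n_samples_per_prompt": 8,
  "max_concurrency": 30,
  "model_id": "...",
  "tokenizer_path": "...",
  "system_prompt": "..."
}
```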

### Built-in Benchmarks

- `aime-2024` - AIME 2024 math competition
- `aime-2025` - AIME 2025 math competition

### Example

```bash
strands-env eval aime-2024 \
  --env examples/envs/calculator_env.py \
  --backend sglang \
  --n-samples-per-prompt 8 \
  --max-concurrency 30
```

**Full Changelog**: v0.1.1...v0.1.2