v0.1.2: CLI and Benchmark Registry

Released by @Lawhy on 07 Feb 07:20

What's New

CLI for Benchmark Evaluation

  • strands-env list - List registered benchmarks
  • strands-env eval <benchmark> --env <hook_file> - Run evaluations with SGLang or Bedrock backends

Evaluator and Environment Hooks

  • Custom evaluator support via the --evaluator flag, for implementing new benchmarks
  • Environment hooks for flexible environment configuration; environments are not necessarily tied to benchmarks (see the sketch below)
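
To make the second bullet concrete, here is a purely illustrative sketch of the kind of logic an environment file such as examples/envs/calculator_env.py might contain: a small, safely evaluated calculator. The function name and its behaviour are assumptions for illustration only; the actual hook interface that strands-env expects from an --env file is not reproduced here.

# Illustrative only: NOT the actual strands-env hook contract.
import ast
import operator

# Operators the illustrative calculator is willing to evaluate.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Pow: operator.pow,
    ast.USub: operator.neg,
}

def calculator(expression: str) -> float:
    """Safely evaluate a basic arithmetic expression, e.g. '3 * (4 + 5)'."""
    def _eval(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.operand))
        raise ValueError(f"Unsupported expression: {expression!r}")
    return _eval(ast.parse(expression, mode="eval").body)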

Reproducibility

  • config.json saved to the output directory with the full run configuration; see the snippet below for a quick way to inspect it
  • Auto-backfill of model_id, tokenizer_path, and system_prompt
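
As a quick sanity check, the saved configuration can be read back with a few lines of Python. The output-directory path below is hypothetical (point it at your own run's output directory); the field names come from the release notes.

# Inspect the reproducibility record written by an eval run.
# The path "outputs/aime-2024" is hypothetical; replace it with your output directory.
import json
from pathlib import Path

config_path = Path("outputs/aime-2024") / "config.json"
config = json.loads(config_path.read_text())

# Fields backfilled automatically, per the release notes.
for key in ("model_id", "tokenizer_path", "system_prompt"):
    print(f"{key}: {config.get(key)!r}")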

Built-in Benchmarks

  • aime-2024 - AIME 2024 math competition
  • aime-2025 - AIME 2025 math competition

Example

strands-env eval aime-2024 \
  --env examples/envs/calculator_env.py \
  --backend sglang \
  --n-samples-per-prompt 8 \
  --max-concurrency 30
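
This command evaluates aime-2024 with the calculator environment hook against an SGLang backend, drawing 8 samples per prompt and allowing up to 30 concurrent requests.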

Full Changelog: v0.1.1...v0.1.2