feat(hackathon): add benchmarking strandly script #1087
Assessment: Comment
This is a well-structured benchmarking tool for the
Review Categories
Nice addition to the developer tooling. The integration with ContextBench's evaluation framework and the CloudWatch metrics emission for trending are particularly useful.
Really nice addition: clean separation (loader/runner/evaluator/reporter/cloudwatch), thorough README, and integrating ContextBench is exactly the right call. Since it's a draft devtool I skipped style nits and focused on bugs that would make the benchmark numbers wrong, because that's what makes a benchmark misleading rather than just rough. I checked out the branch locally and tested the trajectory parser empirically.

TL;DR: three issues in the measurement layer systematically bias the cross-config comparison the tool exists to make. None block merging as a prototype, but I'd hold off publishing the numbers as a strategy comparison until #1–#3 are addressed.

🔴 Critical (these change the scores)
1. Trajectory is reconstructed from
2. The bash path regex misses most common investigation commands.
3. Symbol / Span / EditLoc metrics are structurally always

Details, evidence & suggested fixes for #1–#3

#1: post-run trajectory extraction vs. in-place mutation
// runner.ts: runs AFTER agent.invoke() completes
const trajectory = extractTrajectory(agent.messages, repoDir)
Net: coverage is under-counted for compressing configs, and under-counted more the more aggressively they compress, so the comparison is apples-to-oranges.
Suggested fix: capture tool calls live with a

#2: regex parser (tested locally against
The

#3: always-zero metrics

🟠 Worth addressing for reproducibility
#4 nondeterminism · #5 path normalization
4.
5.

🟡 Minor (fine for a devtool)
#6 timer leak · #7 token definition · #8 double checkout

Great foundation overall. Fixing #1 (event-hook trajectory capture) and #2 (flag/multi-file/grep handling) would make the cross-config comparison trustworthy, and #3 is a quick "implement or hide" decision. Happy to help if useful!
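To make #1 and #2 concrete, here is a minimal sketch of what live capture plus token-based path extraction could look like. It is only an illustration under stated assumptions: `ToolCallEvent`, `TrajectoryRecorder`, `splitCommands`, `extractPaths`, and the `isRepoFile` predicate are hypothetical names, not the Strands SDK hook API and not code from this branch. The idea is to record each bash command at the moment it is issued (so later message compression cannot erase it) and to tokenize the command instead of matching it with a single regex.

```typescript
// Hypothetical event shape; the real hook payload in the SDK may differ.
interface ToolCallEvent {
  toolName: string;
  input: { command?: string };
}

// Commands whose first positional argument is a pattern or script, not a path.
const PATTERN_FIRST = new Set(["grep", "egrep", "fgrep", "rg", "sed", "awk"]);

// Split a shell line into commands (on |, ;, &) and tokens (on whitespace),
// keeping quoted strings together. Backslash escapes are not handled.
function splitCommands(command: string): string[][] {
  const commands: string[][] = [];
  let current: string[] = [];
  let token = "";
  let hasToken = false;
  let quote: string | null = null;

  const flushToken = (): void => {
    if (hasToken) current.push(token);
    token = "";
    hasToken = false;
  };
  const flushCommand = (): void => {
    flushToken();
    if (current.length > 0) commands.push(current);
    current = [];
  };

  for (let i = 0; i < command.length; i++) {
    const ch = command.charAt(i);
    if (quote !== null) {
      if (ch === quote) quote = null;
      else token += ch;
    } else if (ch === '"' || ch === "'") {
      quote = ch;
      hasToken = true;
    } else if (/\s/.test(ch)) {
      flushToken();
    } else if (ch === "|" || ch === ";" || ch === "&") {
      flushCommand();
    } else {
      token += ch;
      hasToken = true;
    }
  }
  flushCommand();
  return commands;
}

// Return every positional argument that could be a path. Flags and
// redirections are skipped; flag values (e.g. the 20 in `head -n 20`)
// are NOT skipped here and must be filtered by the caller.
function extractPaths(command: string): string[] {
  const paths: string[] = [];
  for (const tokens of splitCommands(command)) {
    const name = tokens[0] ?? "";
    const positional = tokens
      .slice(1)
      .filter((arg) => arg.charAt(0) !== "-" && !/[<>]/.test(arg));
    const candidates = PATTERN_FIRST.has(name) ? positional.slice(1) : positional;
    paths.push(...candidates);
  }
  return paths;
}

// Records files as tool calls happen, independent of agent.messages.
class TrajectoryRecorder {
  private readonly seen = new Set<string>();
  private readonly isRepoFile: (path: string) => boolean;

  constructor(isRepoFile: (path: string) => boolean) {
    this.isRepoFile = isRepoFile;
  }

  record(event: ToolCallEvent): void {
    if (event.toolName !== "bash" || !event.input.command) return;
    for (const candidate of extractPaths(event.input.command)) {
      if (this.isRepoFile(candidate)) this.seen.add(candidate);
    }
  }

  files(): string[] {
    return Array.from(this.seen).sort();
  }
}
```

In this sketch the recorder would be wired to whatever "before tool call" event the agent exposes, and `isRepoFile` would check candidates against the checked-out repo, which also discards pattern arguments and flag values that slip through. Filtering against real files is a deliberate choice: it keeps the tokenizer simple instead of trying to model every command's flag grammar.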
Partial fix to strands-agents#1069 - previously the agent would prematurely exit if the agent generated a tool with an invalid name; this avoids that by ensuring the agent loop continues with zero tool-uses. --------- Co-authored-by: Mackenzie Zastrow <zastrowm@users.noreply.github.com>
Description
Adds a strandly benchmark command that runs Strands agents against ContextBench, a code investigation benchmark that measures how well an agent finds relevant code for real GitHub issues.
The benchmark:
bash tool to investigate the target repo
Includes 6 built-in configs testing different context management strategies (control, offloader, offloader-aggressive, summarizing, sliding-proactive, offloader-summarizing; these will be updated to the built-in context management strategies), support for custom agent files via --agent-file, a configurable model via --model, and a --min-coverage flag for future CI gating.
Usage:
Related Issues
N/A
Documentation PR
N/A. README included at strandly/src/benchmark/README.md
Type of Change
New feature
Testing
How have you tested the change?
django__django-15987 task: achieved 100% file coverage
--agent-file custom config loading
--min-coverage threshold gating
npx tsc --noEmit --project strandly/tsconfig.json)
npm run check
Checklist
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.