Steps to reproduce
- Run
overstory
- Issue a task - in my case I've asked to monitor a Lambda service on port 5000 that was unreachable to reconnect automatically when it becomes available
- As per
coordinator feedback:
Background info on the issue itself (from my chat with the coordinator):
Found the root cause. checkLambdaAvailable() uses GetFunctionCommand — this is the Lambda Management API (GET /2015-03-31/functions/{FunctionName}). But serverless-offline
only emulates the Lambda Invoke API, not the management API. When GetFunctionCommand hits serverless-offline, it returns an error WITHOUT a proper AWS requestId, because
serverless-offline doesn't know how to handle management endpoints.
Me: If you can see the bug now, why couldn't you see the bug BEFORE you merged it?
You're right. I could see it now because you told me the symptom — "reconnection doesn't work." If I had actually thought about what GetFunctionCommand does against
serverless-offline before merging, I would have caught it. I read the diff, saw the code was syntactically correct, and merged without asking "does serverless-offline even
support this API?" That's the core failure — I was checking code quality instead of functional correctness.
This is from https://github.com/jayminwest/overstory/blob/main/.claude/commands/pr-reviews.md:
b. Code quality review
Check for correctness — does the code do what the PR claims?
Check for bugs, edge cases, and error handling gaps
Check adherence to project conventions (see CLAUDE.md): strict TypeScript, zero runtime deps, Biome formatting, tab indentation, 100-char line width
Check test coverage — are new code paths tested? Do tests follow the "never mock what you can use for real" philosophy?
Flag any security concerns (injection, unsafe input handling, etc.)
There's clearly a big gap here between what the reviewer is tasked to do and what the lead and coordinator agent are then doing. This is big problem as the AI agent has a tendency to shortcut tasks (maybe due to using a less superior model and a limited context) and deliver code that isn't doing what was requested.
As both reviewer and lead agents fail to detect this, such code gets into the repo as the coordinator assumes that both reviewer and lead agents did their job properly and happily merges code as long as the code is SYNTACTICALLY is correct (as per his own words).
This isn't the first time I've observed this. It just happened now twice on the same original task to handle an "unreachable" connection for this specific Lambda service.
While I certainly can always ask the coordinator to do a proper job I think the main failure is already at the reviewer and lead agent level where both are failing to do a proper job.
Expected behavior
Both reviewer as well as lead agent should identify code that isn't functionally correct and push back. Right now it seems we have a linter instead of a reviewer and the lead agent is basically an useless overhead as coordinator could also spawn builder, scout and reviewer agents himself (or herself, I guess and AI agent doesn't have a gender).
Actual behavior
Functionally wrong code got "approved" by the reviewer agent, got accepted by the lead agent was happily merged by the coordinator agent.
All 3 agents failed in doing their job properly, bad code got into the repo and the human had to do their job and push back. That turns the whole overstory AI agent "help" on its head. It becomes a burden instead of a help, I sometimes feel like "Kindergarden cop".
overstory version
0.9.1
Bun version
1.3.10
Operating system
macOS 26.3.1
Steps to reproduce
overstorycoordinatorfeedback:Background info on the issue itself (from my chat with the
coordinator):Me:
If you can see the bug now, why couldn't you see the bug BEFORE you merged it?This is from
https://github.com/jayminwest/overstory/blob/main/.claude/commands/pr-reviews.md:There's clearly a big gap here between what the
revieweris tasked to do and what theleadandcoordinatoragent are then doing. This is big problem as the AI agent has a tendency to shortcut tasks (maybe due to using a less superior model and a limited context) and deliver code that isn't doing what was requested.As both
reviewerandleadagents fail to detect this, such code gets into the repo as thecoordinatorassumes that bothreviewerandleadagents did their job properly and happily merges code as long as the code is SYNTACTICALLY is correct (as per his own words).This isn't the first time I've observed this. It just happened now twice on the same original task to handle an "unreachable" connection for this specific Lambda service.
While I certainly can always ask the
coordinatorto do a proper job I think the main failure is already at thereviewerandleadagent level where both are failing to do a proper job.Expected behavior
Both
revieweras well asleadagent should identify code that isn't functionally correct and push back. Right now it seems we have alinterinstead of areviewerand theleadagent is basically an useless overhead ascoordinatorcould also spawnbuilder,scoutandrevieweragents himself (or herself, I guess and AI agent doesn't have a gender).Actual behavior
Functionally wrong code got "approved" by the
revieweragent, got accepted by theleadagent was happily merged by thecoordinatoragent.All 3 agents failed in doing their job properly, bad code got into the repo and the human had to do their job and push back. That turns the whole overstory AI agent "help" on its head. It becomes a burden instead of a help, I sometimes feel like "Kindergarden cop".
overstory version
0.9.1
Bun version
1.3.10
Operating system
macOS 26.3.1