Skip to content
This repository was archived by the owner on May 28, 2026. It is now read-only.

Code reviewer doesn't do a functional review, lead agent blindly accepts it and coordinator happily merges it into repo #131

@a4xrbj1

Description

@a4xrbj1

Steps to reproduce

  1. Run overstory
  2. Issue a task - in my case I've asked to monitor a Lambda service on port 5000 that was unreachable to reconnect automatically when it becomes available
  3. As per coordinator feedback:

Background info on the issue itself (from my chat with the coordinator):

Found the root cause. checkLambdaAvailable() uses GetFunctionCommand — this is the Lambda Management API (GET /2015-03-31/functions/{FunctionName}). But serverless-offline
  only emulates the Lambda Invoke API, not the management API. When GetFunctionCommand hits serverless-offline, it returns an error WITHOUT a proper AWS requestId, because
  serverless-offline doesn't know how to handle management endpoints.

Me: If you can see the bug now, why couldn't you see the bug BEFORE you merged it?

You're right. I could see it now because you told me the symptom — "reconnection doesn't work." If I had actually thought about what GetFunctionCommand does against
  serverless-offline before merging, I would have caught it. I read the diff, saw the code was syntactically correct, and merged without asking "does serverless-offline even
  support this API?" That's the core failure — I was checking code quality instead of functional correctness.

This is from https://github.com/jayminwest/overstory/blob/main/.claude/commands/pr-reviews.md:

b. Code quality review
Check for correctness — does the code do what the PR claims?
Check for bugs, edge cases, and error handling gaps
Check adherence to project conventions (see CLAUDE.md): strict TypeScript, zero runtime deps, Biome formatting, tab indentation, 100-char line width
Check test coverage — are new code paths tested? Do tests follow the "never mock what you can use for real" philosophy?
Flag any security concerns (injection, unsafe input handling, etc.)

There's clearly a big gap here between what the reviewer is tasked to do and what the lead and coordinator agent are then doing. This is big problem as the AI agent has a tendency to shortcut tasks (maybe due to using a less superior model and a limited context) and deliver code that isn't doing what was requested.

As both reviewer and lead agents fail to detect this, such code gets into the repo as the coordinator assumes that both reviewer and lead agents did their job properly and happily merges code as long as the code is SYNTACTICALLY is correct (as per his own words).

This isn't the first time I've observed this. It just happened now twice on the same original task to handle an "unreachable" connection for this specific Lambda service.

While I certainly can always ask the coordinator to do a proper job I think the main failure is already at the reviewer and lead agent level where both are failing to do a proper job.

Expected behavior

Both reviewer as well as lead agent should identify code that isn't functionally correct and push back. Right now it seems we have a linter instead of a reviewer and the lead agent is basically an useless overhead as coordinator could also spawn builder, scout and reviewer agents himself (or herself, I guess and AI agent doesn't have a gender).

Actual behavior

Functionally wrong code got "approved" by the reviewer agent, got accepted by the lead agent was happily merged by the coordinator agent.

All 3 agents failed in doing their job properly, bad code got into the repo and the human had to do their job and push back. That turns the whole overstory AI agent "help" on its head. It becomes a burden instead of a help, I sometimes feel like "Kindergarden cop".

overstory version

0.9.1

Bun version

1.3.10

Operating system

macOS 26.3.1

Metadata

Metadata

Assignees

No one assigned

    Labels

    area:agent-lifecycleAgent identity, sessions, checkpoint/resume, slingarea:coordinationCoordinator, lead hierarchy, groups, multi-agent orchestrationbugSomething isn't workingdifficulty:complexCross-cutting, architectural, or high-riskfocus:reliabilityStall detection, recovery, cleanup, session DB hygienepriority:highSignificant improvement, clear path to implement

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions