Skip to content

Add optional skills-mode improve and verify stages#144

Open
lee-to wants to merge 5 commits into
mainfrom
feature/issue-131-skills-mode-improve-verify
Open

Add optional skills-mode improve and verify stages#144
lee-to wants to merge 5 commits into
mainfrom
feature/issue-131-skills-mode-improve-verify

Conversation

@lee-to

@lee-to lee-to commented Jun 29, 2026

Copy link
Copy Markdown
Owner

Summary

  • Add optional skills-mode runPlanImprove and runPostVerify task flags, defaulted off and exposed only when subagents are disabled.
  • Add Improve and Verify coordinator stages backed by /aif-improve and /aif-verify.
  • Run Verify after implementation and before Review; when Review is skipped, Verify moves directly to Done.
  • Update API, data, schema, UI, documentation, and tests for the expanded workflow.

Validation

  • npm test --workspace=@aif/agent -- coordinator.test.ts
  • npm run ai:validate

Related issue: #131

@lee-to lee-to requested a review from ichinya June 29, 2026 14:32
@ichinya

ichinya commented Jun 29, 2026

Copy link
Copy Markdown
Collaborator

Summary

This PR adds two optional skills-mode pipeline stages (improve and verify) gated by per-task boolean flags. The design is clean: purely additive, zero impact on existing subagent-mode tasks, and the coordinator routing logic is well-separated. The code follows existing patterns throughout.


Architecture & Design ✅

The approach is solid:

  1. Additive, not in-place. New statuses, new columns (default 0), new pipeline entries. No existing behavior changes. This is the right way to extend the state machine.

  2. Defense-in-depth on the guard. The API forces runPlanImprove/runPostVerify to false when useSubagents=true, AND the coordinator checks !task.useSubagents in shouldRunSkillsModeImprove/Verify. Belt and suspenders — good.

  3. Dynamic routing via getStageSuccessStatus. Clean abstraction. The pipeline entries still declare static onSuccess, but the coordinator overrides at runtime based on task flags. This avoids a combinatorial explosion of pipeline entries.

  4. Non-destructive rollback in improver. If /aif-improve produces garbage output, the original plan is preserved. The looksLikeFullPlanUpdate() gate is the right safety net.

  5. Verify gate pattern. Parsing aif-gate-result code fences is a reasonable convention. The gate blocks only on fail/blocking=true, silently passes on warn/pass. This matches the existing review gate semantics.


Concerns

1. Verify retry loop on blocking gate

When the verify gate throws (status: "fail" or blocking: true), the coordinator leaves the task in verify status. On the next poll cycle, findCoordinatorTaskCandidates("verify") picks it up again and re-runs runVerifier. This creates an implicit retry loop with no backoff or retry limit.

Is this intentional? The existing review stage has maxReviewIterations to cap retries. The verify stage has no equivalent. If the implementation is genuinely broken, the verify stage will keep retrying every 30 seconds indefinitely until the stale-stage watchdog kicks in.

Suggestion: either add a verifyIterationCount field (mirroring reviewIterationCount) or have the verify gate throw to blocked_external instead of staying in verify.

2. Budget borrowing

  • runImprover borrows planCheckerMaxBudgetUsd — makes sense, it's a plan-scoped operation.
  • runVerifier borrows reviewSidecarMaxBudgetUsd — also reasonable, it's a review-scoped operation.

But the naming is misleading. If someone configures planCheckerMaxBudgetUsd expecting it to only affect the plan-checker, they'll be surprised when the improver uses it too. Consider either:

  • Renaming to planStageMaxBudgetUsd (covers both plan-checker and improver)
  • Or adding explicit improverMaxBudgetUsd/verifierMaxBudgetUsd fields (more granular)

This is a follow-up concern, not a blocker.

3. improve status in the state machine

The improve status has no human actions in HUMAN_ACTIONS_BY_STATUS. This is correct — it's agent-controlled. But there's no way for a human to cancel a task stuck in improve without manually setting it to blocked_external or similar. The existing planning status has the same limitation, so this is consistent, but worth noting.


Specific Code Comments

packages/agent/src/coordinator.tsgetStageSuccessStatus

function getStageSuccessStatus(task: TaskRow, stage: StatusTransition): TaskStatus {
  if (stage.label === "planner" && shouldRunSkillsModeImprove(task)) {
    return "improve";
  }
  if (stage.label === "implementer" && shouldRunSkillsModeVerify(task)) {
    return "verify";
  }
  if (stage.label === "verifier" && task.skipReview) {
    return "done";
  }
  return stage.onSuccess;
}

Clean. The three conditions are mutually exclusive (different stage labels), so no ordering ambiguity. The fallback to stage.onSuccess preserves existing behavior.

packages/agent/src/subagents/improver.tslooksLikeFullPlanUpdate guard

if (!looksLikeFullPlanUpdate(currentPlan, improvedPlan)) {
  log.warn(/* ... */);
  persistTaskPlanForTask({ /* ... planText: currentPlan ... */ });
  return;
}

Good defensive pattern. The improver won't accidentally nuke a plan with partial output. The warning log provides enough context for debugging.

packages/agent/src/subagents/verifier.tsextractVerifyGateResult

function extractVerifyGateResult(resultText: string): VerifyGateResult | null {
  const fence = resultText.match(/```aif-gate-result\s*([\s\S]*?)```/);
  if (!fence) return null;
  // ... JSON.parse with type narrowing
}

The parsing is defensive (returns null on malformed input, validates field types). One edge case: if the LLM produces multiple aif-gate-result fences, only the first is parsed. This is probably fine for now but worth documenting.

packages/data/src/index.tsfindCoordinatorTaskCandidates

The conditional chain for stage matching is getting long:

stage === "implementer"
  ? or(
      eq(tasks.status, "implementing"),
      and(eq(tasks.status, "plan_ready"), eq(tasks.autoMode, true)),
    )
  : stage === "improver"
    ? inArray(tasks.status, ["improve"])
    : stage === "plan-checker"
      ? and(eq(tasks.status, "plan_ready"), eq(tasks.autoMode, true))
      : stage === "planner"
        ? inArray(tasks.status, ["planning"])
        : stage === "verifier"
          ? inArray(tasks.status, ["verify"])
          : inArray(tasks.status, ["review"]);

This is a nested ternary chain with 6 branches. It works but is hard to read. Consider refactoring to a switch statement or a Record<CoordinatorStage, SQL> map in a follow-up. Not a blocker for this PR.


Minor Issues

  1. Migration version 24 — Confirm no collision with other open PRs. The Migration Version Rule requires append-only versioning.

  2. Test mock for improver.ts — The mock in coordinator.test.ts uses vi.fn().mockResolvedValue(undefined). This doesn't test the improver's rollback path. Consider adding a test where the improver throws to verify the coordinator handles it correctly.

  3. Column.tsx owner badgesimprove and verify both get the amber "AI controlled" badge. Consistent with planning and review. Good.

  4. CSS variables — Light theme: --color-status-improve: #d97706 (amber-600), --color-status-verify: #0ea5e9 (sky-500). Dark theme: --color-status-improve: #b45309 (amber-700), --color-status-verify: #0284c7 (sky-600). These have sufficient contrast against their respective backgrounds.


Questions for the Author

  1. Verify retry behavior: When the verify gate blocks, should the task stay in verify (implicit retry) or move to blocked_external (explicit human intervention)? The current behavior is the former, but there's no iteration cap.

  2. Future extensibility: The aif-gate-result pattern in the verifier — is this intended to be reused by other stages (e.g., plan-checker could emit a gate result too)? If so, should the parsing be extracted to a shared utility?

  3. Budget granularity: Any plans to add dedicated improverMaxBudgetUsd / verifierMaxBudgetUsd project settings, or is the current borrowing from plan-checker/review-sidecar budgets intentional long-term?


Files I'd Want to See Tests For

  • improver.ts — ✅ has tests, covers happy path and non-plan output detection
  • verifier.ts — ✅ has tests, covers pass and blocking gate
  • coordinator.ts — ✅ has tests for improve/verify routing, skipReview+verify combo
  • Missing: coordinator test where improver/verifier throws (error path)

Overall: Clean, well-structured PR. The additive design, defense-in-depth guards, and defensive parsing make this safe to merge. The verify retry loop is the only behavioral concern worth discussing before merge.

@lee-to

lee-to commented Jun 30, 2026

Copy link
Copy Markdown
Owner Author

Thanks for the review. I addressed the blocking verify retry concern in 1e1277a.

Changes made:

  • Added a structured StageManualBlockError for intentional stage gate blocks.
  • The verifier now throws that structured error for failing/blocking aif-gate-result output after saving the verification report.
  • The coordinator maps that error to blocked_external with retryAfter=null, so a blocking verify result requires manual intervention instead of being retried every poll.
  • Added regression coverage for the stage error classifier and for a second coordinator poll confirming the verifier is not re-run.

Validation:

  • npx vitest run packages/agent/src/tests/stageErrorHandler.test.ts packages/agent/src/tests/verifier.test.ts packages/agent/src/tests/coordinator.test.ts
  • npm run ai:validate
  • npm run format:check after the commit hook formatting pass

I left the budget naming and data query refactor notes as follow-up/non-blocking, since they do not affect the retry-loop behavior.

@ichinya

ichinya commented Jul 1, 2026

Copy link
Copy Markdown
Collaborator

PR Review

Summary

PR extends the workflow engine by adding optional Improve and Verify stages for skills-mode (useSubagents=false), as well as new flags, FSM states, and changes to the coordinator logic.

This is not a point change, but an extension of the execution model with conditional stages.


Strengths

  • Correct opt-in design:
    • features are disabled by default
    • ignored when useSubagents=true
  • Fixed an important bug with a retry-loop in verify:
    • introduced StageManualBlockError
    • added a structural transition in blocked_external
  • Logically improved pipeline:
    • Improve → Plan refinement
    • Verify → Post-execution validation
  • Documentation is synchronized with the code

Risks / Concerns

1. Growth of state machine complexity

Conditional stages (improve, verify) are added, which increases the number of hidden transitions and complicates the workflow mental model.


2. Flag interaction complexity

Behavior depends on the combination of:

  • useSubagents
  • skipReview
  • runPlanImprove
  • runPostVerify

There is no explicit truth table, which complicates the system's predictability.


3. Verify semantics

verify is actually a validation gate, but it is treated as a full-fledged stage with watchdog/retry logic, which may result in an architectural mix of abstraction levels.


4. Watchdog behavior

Extending the stale-stage watchdog to verify is logical, but it may be redundant for a short-lived validation step.


Suggestions

  • Add an explicit truth table for flag combinations
  • Clearly define the semantic distinction:
    • Improve = plan refinement
    • Verify = execution validation gate
  • Consider separating verify as a "validation layer" rather than a full stage
  • Add tests for all combinations of flags + edge cases retry-loop fix

Conclusion

PR qualitatively expands the pipeline and fixes a critical retry-loop bug in verify.

The main trade-off is an increase in FSM complexity and conditional branching, which requires a more strict specification of behavior.

@lee-to

lee-to commented Jul 1, 2026

Copy link
Copy Markdown
Owner Author

Addressed the latest review feedback in 7017e09. Added an explicit skills-mode flag truth table to docs/architecture.md, clarified Improve as plan refinement vs Verify as the execution validation gate, and documented why verify remains a coordinator stage for lifecycle/claim/timeout/profile/activity-log reuse. Added coordinator coverage for all flag combinations, including subagent-mode ignore behavior and skipReview interactions. Validation: npx vitest run packages/agent/src/tests/coordinator.test.ts, npm run format:check, git diff --check, npm run ai:validate.

@ichinya

ichinya commented Jul 2, 2026

Copy link
Copy Markdown
Collaborator

Review: Request changes

Thanks for the update. The structured StageManualBlockError path addresses the previous blocking-verify retry loop, and the coordinator/API/UI changes are generally consistent. CI is green on the current head.

I found one blocking behavior issue: in the normal runPostVerify=true + skipReview=false path, runVerifier writes the verification output to reviewComments, then the following reviewer stage overwrites reviewComments with its own combined review/security output. That means pass/warn verification output is not retained after the standard pipeline, so any non-blocking verification findings disappear from the task once review completes.

Please either persist verification output in a dedicated field/log, or make the reviewer preserve/merge the existing Verification section when writing its final reviewComments.

Non-blocking note: issue #131 describes Review -> Verify -> Done, while this PR implements Implementing -> Verify -> Review/Done. The PR body and docs are internally consistent with the new order, but it would be worth confirming/updating the issue acceptance criteria so the final behavior is explicit.

Inline comment

File: packages/agent/src/subagents/verifier.ts
Line: 107

runVerifier persists the verification report into reviewComments, but the normal non-skipReview path immediately runs runReviewer, which writes setTaskFields({ reviewComments: combinedReview }) and overwrites this field. As a result pass/warn verification output is lost after review. Please either persist verification separately or have the reviewer merge/preserve the existing Verification section before overwriting reviewComments.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants