Skip to content

Make routine dispatch failures visible#3599

Open
qazsero wants to merge 7 commits intopaperclipai:masterfrom
qazsero:fix/ven-314-routine-failsafe
Open

Make routine dispatch failures visible#3599
qazsero wants to merge 7 commits intopaperclipai:masterfrom
qazsero:fix/ven-314-routine-failsafe

Conversation

@qazsero
Copy link
Copy Markdown

@qazsero qazsero commented Apr 13, 2026

Thinking Path

  • Paperclip orchestrates AI agents for zero-human companies.
  • Routines turn scheduled, manual, API, and webhook triggers into execution issues for agents.
  • Visual/product routines are only useful if a scheduled run creates screenshot-backed review work or a visible blocker.
  • The Apr 13 visual and growth routine runs failed before creating linked issues because stale issue numbering collided with issues_identifier_idx.
  • The first PR version made automated dispatch failures visible, but Staff found two remaining gaps: blocker issues could be orphaned if the dispatch transaction rolled back, and webhook dispatch validation could fail before a run existed.
  • This pull request closes those gaps by keeping automated validation inside the run lifecycle and cleaning up externally created routine issues if the transaction aborts.
  • The benefit is that automated visual routines either produce actionable execution work or leave a precise, linked blocked issue instead of silently disappearing.

What Changed

  • Moved schedule/webhook assignee and routine-variable validation into the routine-run transaction after the run row is created, so authenticated automated dispatch validation failures become failed runs linked to blocked issues.
  • Added compensating cleanup for routine execution/blocker issues created through issueService.create() if the enclosing dispatch transaction later rolls back.
  • Preserved manual/API preflight behavior while continuing to mark post-run dispatch failures as failed runs.
  • Added blocker owner copy for legacy/drifted routines that have no default assignee.
  • Added regression coverage for stale issue counters, rollback cleanup, automated failure blocker linkage, webhook missing-variable failures, and webhook missing-assignee failures.
  • Documented automated schedule/webhook fail-safe behavior in the routines API docs.

Verification

  • pnpm exec vitest run server/src/__tests__/routines-service.test.ts passed: 22 tests.
  • pnpm exec tsc --noEmit --pretty false --project server/tsconfig.json passed.
  • pnpm run typecheck passed.
  • pnpm build passed.
  • env -u DATABASE_URL PATH="/Users/fabiannitka/.nvm/versions/node/v24.12.0/bin:/opt/homebrew/bin:/opt/homebrew/sbin:/usr/bin:/bin:/usr/sbin:/sbin" pnpm test:run passed: 229 files, 1259 passed, 1 skipped.

Risks

  • Low migration risk: no schema changes.
  • Behavioral shift: authenticated schedule/webhook dispatch validation failures now create failed routine runs and blocked issues instead of throwing before run creation.
  • Cleanup is compensating rather than fully atomic because issueService.create() owns its own transaction; regression coverage forces rollback after blocker creation and asserts no orphan issues remain.

Model Used

  • OpenAI GPT-5.4 via Codex CLI, high reasoning mode, with local shell/tool execution.

Checklist

  • I have included a thinking path that traces from project context to this change
  • I have specified the model used (with version and capability details)
  • I have run tests locally and they pass
  • I have added or updated tests where applicable
  • If this change affects the UI, I have included before/after screenshots
  • I have updated relevant documentation to reflect my changes
  • I have considered and documented any risks above
  • I will address all Greptile and reviewer comments before requesting merge

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Apr 13, 2026

Greptile Summary

This PR adds fail-safe dispatch handling for scheduled and webhook routines: when an automated run fails to dispatch, it creates a visible blocked issue assigned to the routine assignee and links it to the run record. It also adds two regression tests (stale issue-counter correction and failure-blocker linkage) and documents the behavior in the API docs.

  • Missing PR template sections: The description omits the required Thinking Path, Risks, Model Used, and Checklist sections mandated by CONTRIBUTING.md and the PR template. Please fill these in before merging.
  • The createAutomatedFailureIssue helper (called inside the db.transaction() catch block) uses the outer db connection rather than txDb, so the blocker issue INSERT is committed outside the enclosing transaction. If finalizeRun or updateRoutineTouchedState subsequently fail and the transaction rolls back, the blocker issue persists with a stale run ID in its description.

Confidence Score: 4/5

Not safe to merge until the transactional atomicity bug in createAutomatedFailureIssue is fixed and the PR description is completed per the template.

One P1 finding: the blocker issue is written to the DB outside the enclosing transaction, which can produce an orphaned visible issue if the transaction rolls back. Additionally, the PR description is missing required template sections (Thinking Path, Risks, Model Used, Checklist).

server/src/services/routines.ts — specifically the createAutomatedFailureIssue function and its call site inside the transaction catch block.

Important Files Changed

Filename Overview
server/src/services/routines.ts Adds failure-blocker issue creation for schedule/webhook dispatch errors, but the helper uses the outer db connection instead of txDb, breaking transactional atomicity.
server/src/tests/routines-service.test.ts Adds two new regression tests: stale issue-counter correction on scheduled dispatch, and dispatch-failure blocker linkage. Tests are well-scoped and verify the right DB state.
docs/api/routines.md Adds a paragraph documenting fail-safe run behavior; accurate and clear.
Prompt To Fix All With AI
This is a comment left during a code review.
Path: server/src/services/routines.ts
Line: 698-708

Comment:
**Blocker issue created outside the enclosing transaction**

`issueSvc.create` is called with the outer `db` connection, not `txDb`, so the INSERT is committed immediately and independently of the surrounding `db.transaction()`. If either of the subsequent `finalizeRun(…, txDb)` or `updateRoutineTouchedState(…, txDb)` calls throw (e.g. a constraint violation on `routine_runs`), the transaction rolls back and `createdRun` is gone — but the blocker issue already exists in the DB, referencing a non-existent run ID in its description. The fix is to thread `txDb` through to `createAutomatedFailureIssue` and pass it to `issueSvc.create`:

```typescript
async function createAutomatedFailureIssue(
  input: { ... },
  dbConn: Db,
) {
  if (input.source !== "schedule" && input.source !== "webhook") return null;
  return issueSvc.create(input.routine.companyId, { ... }, dbConn);
}
```

and at the call site inside the catch block:

```typescript
failureIssue = await createAutomatedFailureIssue({...}, txDb);
```

How can I resolve this? If you propose a fix, please make it concise.

Reviews (1): Last reviewed commit: "fix: make routine dispatch failures visi..." | Re-trigger Greptile

Comment thread server/src/services/routines.ts Outdated
Comment on lines +698 to +708
return issueSvc.create(input.routine.companyId, {
projectId: input.routine.projectId,
goalId: input.routine.goalId,
parentId: input.routine.parentIssueId,
title: `Blocked routine: ${input.routine.title}`,
description: formatRoutineFailureDescription(input),
status: "blocked",
priority: input.routine.priority,
assigneeAgentId: input.routine.assigneeAgentId,
originKind: "manual",
});
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Blocker issue created outside the enclosing transaction

issueSvc.create is called with the outer db connection, not txDb, so the INSERT is committed immediately and independently of the surrounding db.transaction(). If either of the subsequent finalizeRun(…, txDb) or updateRoutineTouchedState(…, txDb) calls throw (e.g. a constraint violation on routine_runs), the transaction rolls back and createdRun is gone — but the blocker issue already exists in the DB, referencing a non-existent run ID in its description. The fix is to thread txDb through to createAutomatedFailureIssue and pass it to issueSvc.create:

async function createAutomatedFailureIssue(
  input: { ... },
  dbConn: Db,
) {
  if (input.source !== "schedule" && input.source !== "webhook") return null;
  return issueSvc.create(input.routine.companyId, { ... }, dbConn);
}

and at the call site inside the catch block:

failureIssue = await createAutomatedFailureIssue({...}, txDb);
Prompt To Fix With AI
This is a comment left during a code review.
Path: server/src/services/routines.ts
Line: 698-708

Comment:
**Blocker issue created outside the enclosing transaction**

`issueSvc.create` is called with the outer `db` connection, not `txDb`, so the INSERT is committed immediately and independently of the surrounding `db.transaction()`. If either of the subsequent `finalizeRun(…, txDb)` or `updateRoutineTouchedState(…, txDb)` calls throw (e.g. a constraint violation on `routine_runs`), the transaction rolls back and `createdRun` is gone — but the blocker issue already exists in the DB, referencing a non-existent run ID in its description. The fix is to thread `txDb` through to `createAutomatedFailureIssue` and pass it to `issueSvc.create`:

```typescript
async function createAutomatedFailureIssue(
  input: { ... },
  dbConn: Db,
) {
  if (input.source !== "schedule" && input.source !== "webhook") return null;
  return issueSvc.create(input.routine.companyId, { ... }, dbConn);
}
```

and at the call site inside the catch block:

```typescript
failureIssue = await createAutomatedFailureIssue({...}, txDb);
```

How can I resolve this? If you propose a fix, please make it concise.

@qazsero
Copy link
Copy Markdown
Author

qazsero commented Apr 13, 2026

Staff review: changes requested before QA.

  • server/src/services/routines.ts: the failure blocker is created through the outer issueSvc.create() while the dispatch run is still inside db.transaction(). If later run finalization/touched-state update fails, the transaction rolls back but the visible blocked issue can remain orphaned with only description text tying it to the run. Please make blocker creation atomic with the run update, or add a compensating cleanup path and regression coverage.
  • server/src/services/routines.ts: automated webhook failures are documented as fail-safe, but variable/assignee validation happens before the routineRuns row and failure-blocker catch. Either bring webhook validation failures into the fail-safe path or narrow the docs and add webhook queue-failure coverage so the claim is precise.
  • PR hygiene: the PR body does not use .github/PULL_REQUEST_TEMPLATE.md; it is missing Thinking Path, Risks, Model Used, and Checklist.

Local verification from Staff passed: pnpm exec vitest run server/src/__tests__/routines-service.test.ts, pnpm -r typecheck, pnpm build, and full env -u DATABASE_URL ... pnpm test:run (229 files, 1256 passed, 1 skipped).

Note: GitHub would not allow this token to submit a formal request-changes review because it is treated as the PR author, so I am leaving the blocker as a PR comment and routing the Paperclip issue back to Builder.

@qazsero
Copy link
Copy Markdown
Author

qazsero commented Apr 13, 2026

Staff review: changes requested before QA.

  • releases/v2026.410.0.md is deleted in this branch. That file is unrelated to the routine dispatch fail-safe work and contains the current security release notes, including the GHSA advisory and upgrade guidance. Restore it so this PR stays scoped to routines.
  • Focused verification passed on my side: pnpm exec vitest run server/src/__tests__/routines-service.test.ts and pnpm exec tsc --noEmit --pretty false --project server/tsconfig.json.

The routine fail-safe changes look materially improved from the previous review; I am holding this at Staff only because of the unrelated release-note deletion.

@qazsero
Copy link
Copy Markdown
Author

qazsero commented Apr 13, 2026

Staff review: approved for QA.

  • The previous routine transaction/webhook blockers are addressed: automated schedule/webhook validation now lands inside the routine-run lifecycle, failure runs link to visible blocked issues, and externally-created dispatch/blocker issues are cleaned up if the enclosing transaction aborts.
  • The release note file is byte-identical to current origin/master; it appears in the three-dot diff only because the branch point is older, not because this PR changes release content.
  • Local verification passed: pnpm exec vitest run server/src/tests/routines-service.test.ts; pnpm exec tsc --noEmit --pretty false --project server/tsconfig.json; pnpm run typecheck; pnpm build; git diff --check FETCH_HEAD...HEAD.
  • Full env -u DATABASE_URL ... pnpm test:run hit one unrelated suite-order failure in issue-update-comment-wakeup-routes.test.ts; that exact file passed immediately in isolation. GitHub checks are green.

Next gate is QA validation for the screenshot-backed routine/run behavior requested by VEN-314.

@serenakeyitan
Copy link
Copy Markdown

serenakeyitan commented Apr 13, 2026

🌱 gardener · ✅ ALIGNED · severity: low · commit: c0dd7633c02e

What is this? repo-gardener checks whether PRs and issues fit the project's product decisions, architecture, and roadmap — not code correctness. Think of it as a product-context review layer. For code review, see Greptile/CodeRabbit.

Context fit

Area This PR Tree guidance Fit
Visible failure blockers for automated routine dispatch VEN-314 intent: surface routine dispatch failures via a linked "blocker" issue instead of silent failed runs product/NODE.md decision #6 — "Surface problems, don't hide them. No silent auto-recovery." ✅ Aligned
Routine lifecycle hardening Latest 3 commits harden cleanup: cancel running wakeups on cleanup; close dispatch cleanup gaps; harden wakeup race cleanup product/routines/NODE.md — concurrency policies and dispatch are part of the routine contract; failure visibility falls under the same decision surface ✅ Aligned
Staff review action item (@qazsero) Reviewer flags a gap: automated failure-blocker creation at routines.ts:708 / :1157 reuses input.routine.assigneeAgentId, which can be rejected by issueService.create() (issues.ts:571-590) when the saved assignee is later pending_approval or terminated — leaving a failed run with no linkedIssueId This is a correctness edge within the VEN-314 intent — the tree-level decision (visible failures) is unchanged; the fix is execution detail ✅ Aligned (tree); action item lives with Builder
Scope 1279+153 LOC across routine service + tests; no new routes, no schema changes, no governance-surface changes engineering/backend/NODE.md — routine service is canonical; in-scope ✅ Aligned
Tree nodes referenced

No concerns at the tree-alignment layer. The @qazsero staff review flags a legitimate edge case (stored assignee no longer assignable → failure-blocker creation fails too → no visible blocker), but that's a code-correctness question within the already-aligned VEN-314 intent. The fix path the reviewer recommends (retry unassigned / fallback owner / guarantee a visible error issue) stays consistent with decision #6; whichever mechanism Builder picks, the tree-level decision doesn't move.


Reviewed commit: c0dd7633c02e · Tree snapshot: 0986440f · Commands: @gardener re-review · @gardener pause · @gardener ignore

🌱 Posted by repo-gardener — an open-source context-aware review bot built on First-Tree. Reviews this repo against serenakeyitan/paperclip-tree, a user-maintained context tree. Not affiliated with this project's maintainers.

@qazsero
Copy link
Copy Markdown
Author

qazsero commented Apr 15, 2026

Staff review: changes requested

This is close, but I do not think it is safe for QA yet.

Blocking issues:

  • server/src/services/routines.ts:925 queues the heartbeat wakeup before the run is finalized at server/src/services/routines.ts:934. If finalizeRun() or updateRoutineTouchedState() fails after the wakeup succeeds, the catch path deletes the execution issue at server/src/services/routines.ts:949, and the outer cleanup only deletes issue rows at server/src/services/routines.ts:995. The queued heartbeat_runs / agent_wakeup_requests rows can remain pointed at an issue that was deleted and no longer has a linked routine run. The rollback test at server/src/__tests__/routines-service.test.ts:773 only covers the easier case where wakeup throws before wakeup artifacts commit, so this guardrail is still unproven.

  • tickScheduledTriggers() advances routineTriggers.nextRunAt at server/src/services/routines.ts:1626 before dispatching at server/src/services/routines.ts:1644. If dispatch aborts after that claim, the scheduled occurrence is consumed with no run and no visible blocker. The existing rollback test asserts no run/issue remains, but does not assert trigger state. That still conflicts with the goal that automated visual routines cannot disappear silently.

Please either make those side effects atomic, or add compensating cleanup/retry behavior for both the wakeup artifacts and schedule claim, with regression coverage that forces the post-wakeup finalization failure path.

Verification I ran on a merge-result worktree against current origin/master:

  • pnpm install --frozen-lockfile completed, with existing plugin bin warnings for unbuilt plugin SDK artifacts.
  • pnpm exec vitest run server/src/__tests__/routines-service.test.ts passed: 22 tests.
  • git diff --check origin/master passed.

I did not run broader gates after finding the blocking review issue.

@qazsero
Copy link
Copy Markdown
Author

qazsero commented Apr 15, 2026

Staff Review

Not ready for QA yet. The latest patch still does not cover the production wakeup lifecycle for the post-wakeup finalization failure case.

Blocker:

  • server/src/services/routines.ts cleanup only selects heartbeat_runs with status = 'queued' and only deletes wakeup requests with status in ('queued', 'coalesced').
  • The real heartbeat.wakeup() path inserts the queued run, then immediately calls startNextQueuedRunForAgent() before returning. That can claim the run as running and mark the wakeup request claimed before dispatchRoutineRun() reaches finalizeRun().
  • If finalizeRun() or touched-state update fails after wakeup returns, cleanupIssueWakeupArtifacts() can delete the routine execution issue while leaving a running heartbeat and claimed wakeup request pointed at the deleted issue. That is the same class of orphaned wakeup artifact the fix is meant to close.
  • The new regression test uses a fake wakeup that leaves the run/request queued, so it does not exercise the production path where wakeup() starts the run before returning.

What I would expect before QA:

  • Make the cleanup/failure path safe when the wakeup has already been claimed/running, or change dispatch ordering so the run cannot be claimed until routine run finalization is durable.
  • Add regression coverage that uses the real heartbeat wakeup behavior or faithfully simulates queued -> running and queued -> claimed before post-wakeup finalization fails.

Verification I ran:

  • pnpm exec vitest run server/src/__tests__/routines-service.test.ts passed: 24 tests.
  • pnpm exec tsc --noEmit --pretty false --project server/tsconfig.json passed.
  • git diff --check origin/master...HEAD passed.

@qazsero
Copy link
Copy Markdown
Author

qazsero commented Apr 15, 2026

Staff Review

Not ready for QA yet. The latest patch improves the steady queued and steady running cases, but there is still a race in the cleanup path.

Blocking issue:

  • cleanupIssueWakeupArtifacts() snapshots live runs, classifies queued runs, then later deletes the queued run ids without re-checking status. If a queued run is claimed between the select and the delete, the cleanup can delete a now-running heartbeat_runs row while leaving the wakeup request in claimed status. The follow-up wakeup delete only covers queued/coalesced, so the claimed agent_wakeup_requests row can survive pointing at an issue that the caller then deletes.
  • Relevant flow: server/src/services/routines.ts:724 selects live runs, server/src/services/routines.ts:739 classifies queued/running, server/src/services/routines.ts:804 deletes queued run ids without a status predicate, and server/src/services/routines.ts:816 only deletes queued/coalesced wakeup requests. The issue row can then be deleted at server/src/services/routines.ts:1080 or server/src/services/routines.ts:844.
  • The new tests cover stable queued and stable running artifacts, but not the queued-to-running race between cleanup snapshot and cleanup delete.

Expected before QA:

  • Revalidate run status before deleting queued run ids, or handle all selected live runs through a cancellation path that remains safe if they are claimed mid-cleanup.
  • Ensure claimed wakeup requests cannot survive for a transient routine issue that is deleted.
  • Add regression coverage that forces a queued artifact to become running/claimed during cleanup before the issue deletion proceeds.

Verification I ran:

  • pnpm exec vitest run server/src/__tests__/routines-service.test.ts passed: 25 tests.
  • pnpm exec tsc --noEmit --pretty false --project server/tsconfig.json passed.
  • git diff --check origin/master...HEAD passed.

@qazsero
Copy link
Copy Markdown
Author

qazsero commented Apr 15, 2026

Staff Review

Approved for QA.

  • The previous queued-to-running cleanup race is addressed: routine cleanup now reconciles live heartbeat runs over bounded passes, deletes only still-queued runs, cancels running/claimed runs through heartbeat.cancelRun(), and leaves the transient issue in place if cleanup cannot prove wakeup artifacts are inactive.
  • Scheduled trigger claim restoration is covered for dispatch aborts after the trigger has advanced.
  • Regression coverage now includes stale issue counters, visible automated blockers, rollback cleanup, webhook validation failures, post-wakeup finalization failure, claimed/running wakeups, and the queued-to-running cleanup race.

Verification run locally on c0dd7633:

  • pnpm exec vitest run server/src/__tests__/routines-service.test.ts passed: 26 tests.
  • pnpm exec tsc --noEmit --pretty false --project server/tsconfig.json passed.
  • pnpm run typecheck passed.
  • pnpm build passed.
  • env -u DATABASE_URL PATH=<without /usr/local/bin> pnpm test:run passed: 229 files, 1263 passed, 1 skipped.
  • git diff --check passed.

GitHub checks are green at review time: policy, verify, e2e, and Snyk passed; PR merge state is CLEAN.

Next owner: QA Engineer for the screenshot-backed routine/run validation gate on VEN-314.

@qazsero
Copy link
Copy Markdown
Author

qazsero commented Apr 19, 2026

Staff Review

Changes requested before QA.

Blocking issue:

  • server/src/services/routines.ts:708 creates the automated failure blocker with assigneeAgentId: input.routine.assigneeAgentId.
  • The normal dispatch path can fail because that same saved routine assignee is no longer assignable, for example an agent was later marked pending_approval or terminated. issueService.create() rejects those assignees in server/src/services/issues.ts:571-590.
  • In that case, the failure-blocker creation at server/src/services/routines.ts:1157 retries with the same invalid assignee and fails too. The catch only records a failed routine run with no linkedIssueId at server/src/services/routines.ts:1177-1182, so an automated schedule/webhook routine can still fail without leaving the visible blocked issue promised by VEN-314.

Expected before QA:

  • Make automated failure blocker creation independent of the broken routine assignee path: retry unassigned, route to a safe fallback owner, or otherwise guarantee a visible blocked/error issue when the stored assignee is invalid.
  • Add regression coverage for a scheduled or webhook routine whose existing assignee becomes terminated or pending_approval before dispatch.

Verification I ran:

  • pnpm exec vitest run server/src/__tests__/routines-service.test.ts passed: 26 tests.
  • pnpm exec tsc --noEmit --pretty false --project server/tsconfig.json passed.
  • git diff --check origin/master...HEAD passed.

GitHub checks are green, but this fail-safe edge is still inside the VEN-314 acceptance criteria, so I am routing the Paperclip issue back to Builder.

@qazsero qazsero force-pushed the fix/ven-314-routine-failsafe branch from d2b0ef1 to 8134189 Compare April 20, 2026 22:11
@qazsero
Copy link
Copy Markdown
Author

qazsero commented Apr 20, 2026

Staff review approved for QA.

Reviewed refreshed head 8134189a against base 1266954a after the rebase. No code blockers found.

Checked:

  • scheduled/webhook routine dispatch failure path creates linked visible blocked issues;
  • unassignable saved routine assignees fall back to an unassigned blocked issue with owner/next-action copy;
  • rollback cleanup covers externally-created execution/blocker issues and wakeup artifacts;
  • queued/running/claimed wakeup cleanup race remains covered;
  • workspace branch interpolation path from current master is preserved;
  • stale companies.issueCounter scheduled-routine regression remains covered.

Verification:

  • pnpm vitest run server/src/__tests__/routines-service.test.ts passed: 29 tests.
  • pnpm --filter @paperclipai/plugin-sdk build passed; this is needed before direct server tsc in this worktree.
  • pnpm exec tsc --noEmit --pretty false --project server/tsconfig.json passed.
  • git diff --check 1266954a4ee33ecca3589d47e4356fa1bdeaac5d...HEAD passed.
  • GitHub checks are green: policy, verify, e2e, and Snyk; merge state is CLEAN.

Residual QA gate: validate the live visual routine path with browser/screenshot evidence or a precise blocked visual-review issue. The code change cannot by itself prove the live Product Designer/Growth routine prompts in production.

qazsero and others added 6 commits April 25, 2026 09:08
Co-Authored-By: Paperclip <noreply@paperclip.ing>
Co-Authored-By: Paperclip <noreply@paperclip.ing>
Co-Authored-By: Paperclip <noreply@paperclip.ing>
Co-Authored-By: Paperclip <noreply@paperclip.ing>
Co-Authored-By: Paperclip <noreply@paperclip.ing>
Co-Authored-By: Paperclip <noreply@paperclip.ing>
@qazsero qazsero force-pushed the fix/ven-314-routine-failsafe branch from 8134189 to 3329965 Compare April 25, 2026 07:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants