Skip to content

fix(heartbeat): escalate stranded issue when recovery retry succeeds without execution path#4459

Open
supertaz wants to merge 2 commits intopaperclipai:masterfrom
supertaz:fix/stranded-issue-recovery-escalation
Open

fix(heartbeat): escalate stranded issue when recovery retry succeeds without execution path#4459
supertaz wants to merge 2 commits intopaperclipai:masterfrom
supertaz:fix/stranded-issue-recovery-escalation

Conversation

@supertaz
Copy link
Copy Markdown

@supertaz supertaz commented Apr 25, 2026

Problem

When an assigned in_progress issue has no live execution path, the stranded-issue reconciler queues a recovery retry with retryReason: issue_continuation_needed. The escalation guard — didAutomaticRecoveryFail — was supposed to detect that the retry was exhausted and escalate the issue to blocked. However, the guard only checked for unsuccessful terminal statuses (failed, cancelled, timed_out). If the recovery run exited with succeeded (e.g., the agent posted a comment and returned cleanly without re-establishing an execution path), the guard returned false, the issue remained in_progress with no execution path, and the reconciler re-queued another recovery run on the next scheduler tick (default 30 s). This produced an indefinite loop that could only be stopped by manual intervention.

The same logic applies to the assignment_recovery path for todo issues.

Root cause

// recovery/service.ts (before)
function didAutomaticRecoveryFail(latestRun, expectedRetryReason) {
  ...
  return latestRetryReason === expectedRetryReason &&
    UNSUCCESSFUL_HEARTBEAT_RUN_TERMINAL_STATUSES.includes(latestRun.status);
    // ["failed", "cancelled", "timed_out"] — succeeded is NOT here
}

The call site checks hasActiveExecutionPath before calling this guard, so by the time we reach the escalation check the issue is already known to be stranded. A succeeded recovery run that left the issue stranded is semantically exhausted — it should escalate, not re-queue.

Fix

Two commits:

Commit 1 (f9f382ec): Replace UNSUCCESSFUL_HEARTBEAT_RUN_TERMINAL_STATUSES with HEARTBEAT_RUN_TERMINAL_STATUSES (which includes succeeded) in recovery/service.ts and heartbeat.ts. Rename didAutomaticRecoveryFaildidAutomaticRecoveryExhaust to reflect that succeeded retries are now also considered exhausted.

// recovery/service.ts (after)
export function didAutomaticRecoveryExhaust(latestRun, expectedRetryReason) {
  ...
  // A succeeded recovery run is also considered exhausted: call sites verify there is no
  // active execution path before reaching this check, so a run that exited successfully
  // without re-establishing one left the issue stranded and should trigger escalation.
  return latestRetryReason === expectedRetryReason &&
    HEARTBEAT_RUN_TERMINAL_STATUSES.includes(latestRun.status);
    // ["succeeded", "failed", "cancelled", "timed_out"]
}

Commit 2 (14604ce8): Remove the parallel copy of didAutomaticRecoveryExhaust from heartbeat.ts. Export the canonical implementation from recovery/service.ts and import it at the one remaining call site in heartbeat.ts. Completes the consolidation that was started when reconcileStrandedAssignedIssues was moved into recovery/service.ts.

Reproducer

  1. Create an agent-assigned issue in in_progress state with no active execution_run_id.
  2. Create a heartbeat_run for that issue with status = succeeded and contextSnapshot.retryReason = "issue_continuation_needed".
  3. Call reconcileStrandedAssignedIssues().
  4. Before this fix: a new recovery run is queued (loop continues indefinitely). After this fix: the issue is escalated to blocked.

Testing

  • Updated the existing test "re-enqueues continuation when the latest automatic continuation succeeded without closing the issue" to reflect the correct post-fix behavior (escalates to blocked).
  • All 20 tests in heartbeat-process-recovery.test.ts pass.
  • TypeCheck: pnpm --filter @paperclipai/server typecheck passes.
  • Self-reviewed by gpt-5.4-mini at xhigh reasoning effort before submission. Two findings raised: (1) P1 false positive — expectStrandedRecoveryArtifacts helper exists in test file at line 572, outside the diff window; (2) P2 false positive — the todo-branch early-exit at recovery/service.ts:1456 (latestRun.status === "succeeded" → skip) fires before didAutomaticRecoveryExhaust is called, so the concern about todo escalation does not apply. No actionable findings remained after analysis.

…without execution path

The reconciler queues an issue_continuation_needed (or assignment_recovery)
retry when an assigned issue has no live execution path. The escalation
gate previously only tripped on failed/cancelled/timed_out terminal
statuses, so a recovery run that exited successfully (e.g., posted a
comment and returned) without re-establishing a real execution path would
leave the issue in the same state, causing the reconciler to re-queue
another recovery run on every tick (default 30s). This produced an
indefinite loop until manual intervention.

The hasActiveExecutionPath check earlier in the same branch already
guarantees we only reach this guard when the issue is still stranded, so
any terminal status of the recovery retry — including succeeded — should
trigger escalation to blocked.

Rename didAutomaticRecoveryFail to didAutomaticRecoveryExhaust to reflect
that succeeded retries are now also considered exhausted.
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Apr 25, 2026

Greptile Summary

This PR fixes an infinite recovery loop in the stranded-issue reconciler: when a heartbeat recovery run exits with succeeded but leaves the issue still stranded (no active execution path), didAutomaticRecoveryFail would return false and re-queue another recovery run indefinitely. The fix extends the exhaustion check to include succeeded in the terminal-status set and renames the function to didAutomaticRecoveryExhaust to reflect the updated semantics — applied symmetrically to both heartbeat.ts and recovery/service.ts.

The PR description is detailed and technically correct, but it does not follow the required PR template from CONTRIBUTING.md. The following required sections are missing: Thinking Path, Risks, Model Used, and the Checklist. Please fill in those sections before requesting merge — the template lives at .github/PULL_REQUEST_TEMPLATE.md.

Confidence Score: 4/5

Safe to merge after cleaning up unused constants and filling in the PR template; the fix is correct and well-tested.

Only P2 findings: one dead constant left in two files and a missing PR template. The core logic change is correct, symmetrically applied, and the updated test explicitly verifies the new escalation path.

Minor cleanup needed in server/src/services/recovery/service.ts and server/src/services/heartbeat.ts to remove the now-unused UNSUCCESSFUL_HEARTBEAT_RUN_TERMINAL_STATUSES constant.

Important Files Changed

Filename Overview
server/src/services/recovery/service.ts Renames didAutomaticRecoveryFail → didAutomaticRecoveryExhaust and adds HEARTBEAT_RUN_TERMINAL_STATUSES (including "succeeded"); UNSUCCESSFUL_HEARTBEAT_RUN_TERMINAL_STATUSES is now unused dead code
server/src/services/heartbeat.ts Applies the same rename and constant swap in the parallel copy of the function; UNSUCCESSFUL_HEARTBEAT_RUN_TERMINAL_STATUSES (line 154) is now unused
server/src/tests/heartbeat-process-recovery.test.ts Correctly flips test expectations from continuationRequeued=1/escalated=0 to continuationRequeued=0/escalated=1 and verifies blocked status + escalation artifacts
Prompt To Fix All With AI
This is a comment left during a code review.
Path: server/src/services/recovery/service.ts
Line: 37

Comment:
**Unused constant left behind**

`UNSUCCESSFUL_HEARTBEAT_RUN_TERMINAL_STATUSES` is declared here but is no longer referenced anywhere in `recovery/service.ts` after the swap to `HEARTBEAT_RUN_TERMINAL_STATUSES`. The same dead constant also remains in `heartbeat.ts` (line 154). Both can be removed to avoid confusion about which set of statuses represents an "unsuccessful" terminal state.

```suggestion
const HEARTBEAT_RUN_TERMINAL_STATUSES = ["succeeded", "failed", "cancelled", "timed_out"] as const;
```

How can I resolve this? If you propose a fix, please make it concise.

Reviews (1): Last reviewed commit: "fix(heartbeat): escalate stranded issue ..." | Re-trigger Greptile

@@ -35,6 +35,7 @@ import { isAutomaticRecoverySuppressedByPauseHold } from "./pause-hold-guard.js"

const EXECUTION_PATH_HEARTBEAT_RUN_STATUSES = ["queued", "running", "scheduled_retry"] as const;
const UNSUCCESSFUL_HEARTBEAT_RUN_TERMINAL_STATUSES = ["failed", "cancelled", "timed_out"] as const;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Unused constant left behind

UNSUCCESSFUL_HEARTBEAT_RUN_TERMINAL_STATUSES is declared here but is no longer referenced anywhere in recovery/service.ts after the swap to HEARTBEAT_RUN_TERMINAL_STATUSES. The same dead constant also remains in heartbeat.ts (line 154). Both can be removed to avoid confusion about which set of statuses represents an "unsuccessful" terminal state.

Suggested change
const UNSUCCESSFUL_HEARTBEAT_RUN_TERMINAL_STATUSES = ["failed", "cancelled", "timed_out"] as const;
const HEARTBEAT_RUN_TERMINAL_STATUSES = ["succeeded", "failed", "cancelled", "timed_out"] as const;
Prompt To Fix With AI
This is a comment left during a code review.
Path: server/src/services/recovery/service.ts
Line: 37

Comment:
**Unused constant left behind**

`UNSUCCESSFUL_HEARTBEAT_RUN_TERMINAL_STATUSES` is declared here but is no longer referenced anywhere in `recovery/service.ts` after the swap to `HEARTBEAT_RUN_TERMINAL_STATUSES`. The same dead constant also remains in `heartbeat.ts` (line 154). Both can be removed to avoid confusion about which set of statuses represents an "unsuccessful" terminal state.

```suggestion
const HEARTBEAT_RUN_TERMINAL_STATUSES = ["succeeded", "failed", "cancelled", "timed_out"] as const;
```

How can I resolve this? If you propose a fix, please make it concise.

…overy service

Remove the parallel copy of didAutomaticRecoveryExhaust from heartbeat.ts
and export the canonical implementation from recovery/service.ts. The
execution-path recovery caller in heartbeat.ts now imports it directly,
completing the consolidation started when reconcileStrandedAssignedIssues
was moved into recovery/service.ts.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant