fix(heartbeat): teach reaper about out-of-process K8s adapter liveness (FAR-108)#4162
cpfarhood wants to merge 3 commits into paperclipai:master
Conversation
Adds a `hasOutOfProcessLiveness` capability flag to `ServerAdapterModule` so adapters like `claude_k8s` can declare that their execution runs in a remote Kubernetes Job — not a local child process. The reaper now:

- Skips cold-startup sweeps for flagged adapters so still-running K8s Jobs survive server restarts without being killed.
- Emits the `adapter_liveness_lost` error code and a K8s-aware message when a genuinely stale out-of-process run is reaped, replacing the confusing "child pid -1" sentinel.
- Omits the `process_lost` retry queue entry (a local pid disappearing is irrelevant for remote adapters).

Adds an `inferHeartbeatRunStopReason` mapping for `adapter_liveness_lost` and three regression tests in heartbeat-process-recovery covering the cold-startup no-reap, the stale reap with correct codes, and a guard that local adapters with no pid still reach the classic `process_lost` path.

Closes FAR-108.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
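As a minimal sketch of the opt-in surface this PR describes (the shapes here are illustrative; the real `ServerAdapterModule` in `packages/adapter-utils/src/types.ts` has more fields):

```typescript
// Hypothetical, trimmed-down shape of the adapter module type.
interface ServerAdapterModule {
  adapterType: string;
  // When true, execution runs out-of-process (e.g. a K8s Job), so the
  // reaper must not rely on local pid liveness checks.
  hasOutOfProcessLiveness?: boolean;
}

// A remote adapter opts in explicitly.
const claudeK8sAdapter: ServerAdapterModule = {
  adapterType: "claude_k8s",
  hasOutOfProcessLiveness: true,
};

// A local adapter simply omits the flag and keeps the old behaviour.
const localAdapter: ServerAdapterModule = { adapterType: "claude_local" };

// Helper mirroring the adapterHasOutOfProcessLiveness() check named below;
// treats a missing flag as false.
function adapterHasOutOfProcessLiveness(m: ServerAdapterModule): boolean {
  return m.hasOutOfProcessLiveness === true;
}
```

Because the flag defaults to absent, every existing adapter keeps the classic local-pid reaper path unless it opts in.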
Greptile Summary

Confidence Score: 5/5. Safe to merge — the new reaper path is fully opt-in behind the `hasOutOfProcessLiveness` flag. All remaining findings are P2 (a test-cleanup performance nit). Core logic, staleness gating, cold-startup guard, error code wiring, and stop-metadata inference are all correct. Existing tests are unaffected. No files require special attention for correctness.

Important Files Changed
This is a comment left during a code review.
Path: server/src/__tests__/heartbeat-process-recovery.test.ts
Line: 588-609
Comment:
**Cold-startup test leaves a stuck "running" run that slows `afterEach` cleanup**
The cold-startup test intentionally preserves the run in `"running"` state with `processPid = null` / `processGroupId = null`. The `afterEach` polling loop treats any run matching `status = "running" && !processPid && !processGroupId` as "managed execution still active" and waits until `idlePolls >= 3`. Because nothing ever transitions this run to a terminal state, the loop spins all 100 iterations (≈5 s) before the DB cleanup finally deletes it.
A lightweight fix is to mark the run terminal at the end of the test since the assertion only needs the returned `result.reaped`:
```ts
// After the reapOrphanedRuns assertion, mark the run failed so afterEach
// cleanup does not spin the full 100-iteration timeout.
await db
.update(heartbeatRuns)
.set({ status: "failed" })
.where(eq(heartbeatRuns.id, runId));
```
How can I resolve this? If you propose a fix, please make it concise.

Reviews (2): last reviewed commit "refactor(heartbeat): remove unreachable ..."
…venessLostMessage

The `staleThresholdMs <= 0` fallback was dead code — the sole call site already guards with `if (staleThresholdMs <= 0) continue` before invoking this function, so `staleThresholdMs` is always positive. Addresses Greptile P2 finding on PR paperclipai#4162.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
@greptile-apps The dead branch in …
Mark the intentionally-preserved "running" run as failed at the end of the cold-startup reaper test so the afterEach polling loop does not spin its full 100-iteration (~5 s) timeout waiting for the run to settle. Co-Authored-By: Paperclip <noreply@paperclip.ing>
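For orientation, the opt-in reaper branch these commits implement could be sketched roughly as follows. All names here (`ReapDecision`, `decideOutOfProcessReap`) are hypothetical stand-ins, not the actual code in `server/src/services/heartbeat.ts`:

```typescript
// Illustrative decision logic for runs owned by an out-of-process adapter.
type ReapDecision =
  | { action: "skip"; reason: string }
  | { action: "reap"; errorCode: string; message: string };

function decideOutOfProcessReap(opts: {
  staleThresholdMs: number; // <= 0 signals a cold-startup sweep
  msSinceLastHeartbeat: number;
}): ReapDecision {
  // (a) Cold-startup sweeps are skipped so a still-running K8s Job
  // survives a server restart instead of being killed.
  if (opts.staleThresholdMs <= 0) {
    return { action: "skip", reason: "cold-startup sweep" };
  }
  // Runs with fresh heartbeats are left alone.
  if (opts.msSinceLastHeartbeat < opts.staleThresholdMs) {
    return { action: "skip", reason: "heartbeat still fresh" };
  }
  // (b) Genuinely stale runs are failed with the K8s-aware error code
  // instead of the "child pid -1" sentinel; (c) no process_lost retry
  // entry is queued, since a local pid is meaningless for remote runs.
  return {
    action: "reap",
    errorCode: "adapter_liveness_lost",
    message: "Remote execution lost liveness (no heartbeat within threshold)",
  };
}
```

The cold-startup guard is the same `staleThresholdMs <= 0` condition the dead-code cleanup commit above refers to: it is checked once at the call site, which is why the fallback inside the message builder was unreachable.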
Thinking Path
What Changed
- `packages/adapter-utils/src/types.ts` — Added `hasOutOfProcessLiveness?: boolean` to `ServerAdapterModule`. Documents the semantics: skip local pid checks, defer cold-startup reap, use the `adapter_liveness_lost` error code.
- `server/src/services/heartbeat.ts` — Added an `adapterHasOutOfProcessLiveness()` helper (reads the new flag via `getServerAdapter`). In `reapOrphanedRuns`, inserted an early branch for flagged adapters that: (a) skips the cold-startup sweep (`staleThresholdMs <= 0`), (b) sets the run to `failed` with `errorCode: adapter_liveness_lost` and a K8s-aware message, (c) calls `releaseIssueExecutionAndPromote` without the `process_lost` retry, and (d) emits a `lifecycle` run event.
- `server/src/services/heartbeat-stop-metadata.ts` — Added `"adapter_liveness_lost"` to the `HeartbeatRunStopReason` union and wired it into `inferHeartbeatRunStopReason`.
- `server/src/__tests__/heartbeat-process-recovery.test.ts` — Updated the mock to accept `adapterType` and return `hasOutOfProcessLiveness: true` for `test_k8s_out_of_process`. Added three new tests: the cold-startup no-reap (`staleThresholdMs === 0`); the stale reap with `errorCode: adapter_liveness_lost`, no "child pid" wording, and no `process_lost` retry row; and a guard that local adapters with no pid still reach the classic `process_lost` path (not `adapter_liveness_lost`).

Verification
Deploy a `claude_k8s` adapter with `hasOutOfProcessLiveness: true`, start a long-running run, restart the server, and confirm the run is not immediately killed and that its `updatedAt` keeps refreshing.

Risks
- The new behaviour only activates when `hasOutOfProcessLiveness === true`. No built-in adapter sets this flag today; all existing behaviour is unchanged.
- Until adapters set `hasOutOfProcessLiveness: true` in their `ServerAdapterModule`, `claude_k8s` and similar adapters will continue to see the old `process_lost` path. This is intentional — the fix is opt-in to avoid unintended changes.

Model Used
claude-sonnet-4-6

Checklist
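The stop-metadata wiring described under "What Changed" can be pictured with a minimal sketch. The union members other than `"adapter_liveness_lost"` and the exact run shape are assumptions for illustration; the real definitions live in `server/src/services/heartbeat-stop-metadata.ts`:

```typescript
// Illustrative subset of the stop-reason union; "adapter_liveness_lost"
// is the member this PR adds, the others are assumed for the example.
type HeartbeatRunStopReason = "completed" | "process_lost" | "adapter_liveness_lost";

// Hypothetical minimal run shape used only in this sketch.
interface RunLike {
  status: string;
  errorCode?: string | null;
}

// Sketch of mapping a finished run to a stop reason: completed runs map
// directly, failed runs are classified by their error code.
function inferHeartbeatRunStopReason(run: RunLike): HeartbeatRunStopReason | null {
  if (run.status === "completed") return "completed";
  if (run.status !== "failed") return null;
  if (run.errorCode === "adapter_liveness_lost") return "adapter_liveness_lost";
  if (run.errorCode === "process_lost") return "process_lost";
  return null;
}
```

Keeping the classification keyed on `errorCode` is what lets the reaper's new branch surface as a distinct stop reason without touching the existing `process_lost` handling.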