fix(heartbeat): teach reaper about out-of-process K8s adapter liveness (FAR-108) by cpfarhood · Pull Request #4162 · paperclipai/paperclip

cpfarhood · 2026-04-20T20:43:03Z

Thinking Path

Paperclip orchestrates AI agents for zero-human companies

Agents can run via local adapters (child processes) or remote adapters (Kubernetes Jobs, gateway HTTP services, etc.)

The heartbeat reaper kills stale runs by checking local pid / process-group liveness — but K8s-style adapters have no local pid; their "process" is a remote Kubernetes Job

When the reaper finds processPid = null for a claude_k8s run it still emits the misleading child pid -1 sentinel and a process_lost retry, neither of which make sense for a remote Job

This PR adds a hasOutOfProcessLiveness capability flag to ServerAdapterModule so K8s-style adapters can opt into a separate reaper path that (a) defers at cold startup, (b) emits a meaningful adapter_liveness_lost error code, and (c) skips the process_lost retry

The benefit is that long-running claude_k8s runs survive server restarts, operators see actionable error messages when a K8s Job genuinely disappears, and the fix is fully backward-compatible with all existing local adapters

What Changed

packages/adapter-utils/src/types.ts — Added hasOutOfProcessLiveness?: boolean to ServerAdapterModule. Documents semantics: skip local pid checks, defer cold-startup reap, use adapter_liveness_lost error code.
server/src/services/heartbeat.ts — Added adapterHasOutOfProcessLiveness() helper (reads the new flag via getServerAdapter). In reapOrphanedRuns, inserted an early branch for flagged adapters that: (a) skips cold-startup sweep (staleThresholdMs <= 0), (b) sets run to failed with errorCode: adapter_liveness_lost and a K8s-aware message, (c) calls releaseIssueExecutionAndPromote without the process_lost retry, and (d) emits a lifecycle run event.
server/src/services/heartbeat-stop-metadata.ts — Added "adapter_liveness_lost" to HeartbeatRunStopReason union and wired it into inferHeartbeatRunStopReason.
server/src/__tests__/heartbeat-process-recovery.test.ts — Updated mock to accept adapterType and return hasOutOfProcessLiveness: true for test_k8s_out_of_process. Added three new tests:
1. Cold-startup sweep does not reap an out-of-process run (staleThresholdMs === 0).
2. Stale out-of-process run is reaped with errorCode: adapter_liveness_lost, no child pid wording, no process_lost retry row.
3. Regression guard: local adapter with no pid still reaches the classic process_lost path (not adapter_liveness_lost).

Verification

Run existing and new heartbeat reaper tests:

pnpm vitest run server/src/__tests__/heartbeat-process-recovery.test.ts

All three new tests pass; existing tests unaffected.
To validate end-to-end: configure a claude_k8s adapter with hasOutOfProcessLiveness: true, start a long-running run, restart the server, confirm the run is not immediately killed and that its updatedAt keeps refreshing.

Risks

Low risk for existing local adapters. The new code path is gated strictly on hasOutOfProcessLiveness === true. No built-in adapter sets this flag today; all existing behaviour is unchanged.
K8s adapters must opt in. Without setting hasOutOfProcessLiveness: true in their ServerAdapterModule, claude_k8s and similar adapters will continue to see the old process_lost path. This is intentional — the fix is opt-in to avoid unintended changes.
Cold-startup window. Deferring the cold-startup reap for out-of-process adapters means a truly dead K8s run (Job deleted while server was down) won't be cleaned up until the 5-minute periodic sweep. Acceptable trade-off vs. killing live Jobs on every restart.

For core feature work, check ROADMAP.md first and discuss it in #dev before opening the PR. Feature PRs that overlap with planned core work may need to be redirected — check the roadmap first. See CONTRIBUTING.md.

Model Used

Provider: Anthropic
Model: Claude Sonnet 4.6 (claude-sonnet-4-6)
Context window: 200K tokens
Capabilities: Tool use, code execution, extended reasoning, multi-file editing
Mode: Paperclip heartbeat agent (Maverick) — autonomous execution with issue context

Checklist

I have included a thinking path that traces from project context to this change
I have specified the model used (with version and capability details)
I have checked ROADMAP.md and confirmed this PR does not duplicate planned core work
I have run tests locally and they pass
I have added or updated tests where applicable
If this change affects the UI, I have included before/after screenshots (no UI changes)
I have updated relevant documentation to reflect my changes
I have considered and documented any risks above
I will address all Greptile and reviewer comments before requesting merge

Adds `hasOutOfProcessLiveness` capability flag to `ServerAdapterModule` so adapters like `claude_k8s` can declare that their execution runs in a remote Kubernetes Job — not a local child process. The reaper now: - Skips cold-startup sweeps for flagged adapters so still-running K8s Jobs survive server restarts without being killed. - Emits `adapter_liveness_lost` error code and a K8s-aware message when a genuinely stale out-of-process run is reaped, replacing the confusing "child pid -1" sentinel. - Omits the `process_lost` retry queue entry (local pid disappearing is irrelevant for remote adapters). Adds `inferHeartbeatRunStopReason` mapping for `adapter_liveness_lost` and three regression tests in heartbeat-process-recovery that cover the cold-startup no-reap, stale reap with correct codes, and a guard that local adapters with no pid still reach the classic `process_lost` path. Closes FAR-108. Co-Authored-By: Paperclip <noreply@paperclip.ing>

greptile-apps · 2026-04-20T20:45:51Z

Greptile Summary

Adds hasOutOfProcessLiveness?: boolean to ServerAdapterModule and threads a new reaper branch through reapOrphanedRuns so K8s-style adapters (a) survive server restarts without being killed at cold startup, (b) emit a meaningful adapter_liveness_lost error code instead of the misleading "child pid -1" sentinel, and (c) skip the process_lost retry that only makes sense for local child processes. The change is strictly opt-in and all three new tests pass, with the existing tests unaffected.

Confidence Score: 5/5

Safe to merge — new reaper path is fully opt-in behind the hasOutOfProcessLiveness flag and no built-in adapter sets it today.

All remaining findings are P2 (a test-cleanup performance nit). Core logic, staleness gating, cold-startup guard, error code wiring, and stop-metadata inference are all correct. Existing tests are unaffected.

No files require special attention for correctness; heartbeat-process-recovery.test.ts has a minor afterEach slowdown for the cold-startup test case.

Important Files Changed

Filename	Overview
packages/adapter-utils/src/types.ts	Added `hasOutOfProcessLiveness?: boolean` to `ServerAdapterModule` with thorough JSDoc explaining the three reaper behaviours when set to `true`.
server/src/services/heartbeat.ts	Added `adapterHasOutOfProcessLiveness` helper and out-of-process reaper branch; logic is gated correctly, cold-startup guard and `adapter_liveness_lost` error code wired properly, no `process_lost` retry queued.
server/src/services/heartbeat-stop-metadata.ts	Extended `HeartbeatRunStopReason` union and `inferHeartbeatRunStopReason` to handle the new `adapter_liveness_lost` error code cleanly.
server/src/tests/heartbeat-process-recovery.test.ts	Three new tests cover cold-startup skip, stale reap with `adapter_liveness_lost`, and the regression guard; the cold-startup test leaves a "running" run with no pid in the DB, causing the `afterEach` cleanup loop to spin ~5 s unnecessarily.

Prompt To Fix All With AI

This is a comment left during a code review.
Path: server/src/__tests__/heartbeat-process-recovery.test.ts
Line: 588-609

Comment:
**Cold-startup test leaves a stuck "running" run that slows `afterEach` cleanup**

The cold-startup test intentionally preserves the run in `"running"` state with `processPid = null` / `processGroupId = null`. The `afterEach` polling loop treats any run matching `status = "running" && !processPid && !processGroupId` as "managed execution still active" and waits until `idlePolls >= 3`. Because nothing ever transitions this run to a terminal state, the loop spins all 100 iterations (≈5 s) before the DB cleanup finally deletes it.

A lightweight fix is to mark the run terminal at the end of the test since the assertion only needs the returned `result.reaped`:

```ts
// After the reapOrphanedRuns assertion, mark the run failed so afterEach
// cleanup does not spin the full 100-iteration timeout.
await db
  .update(heartbeatRuns)
  .set({ status: "failed" })
  .where(eq(heartbeatRuns.id, runId));
```

How can I resolve this? If you propose a fix, please make it concise.

_{Reviews (2): Last reviewed commit: "refactor(heartbeat): remove unreachable ..." | Re-trigger Greptile}

…venessLostMessage The staleThresholdMs <= 0 fallback was dead code — the sole call site already guards with `if (staleThresholdMs <= 0) continue` before invoking this function, so staleThresholdMs is always positive. Addresses Greptile P2 finding on PR paperclipai#4162. Co-Authored-By: Paperclip <noreply@paperclip.ing>

cpfarhood · 2026-04-20T21:03:48Z

@greptile-apps The dead branch in buildAdapterLivenessLostMessage has been removed. The unreachable staleThresholdMs <= 0 fallback is gone — the function now unconditionally formats with the staleness duration (since the call site already guards with if (staleThresholdMs <= 0) continue before ever reaching it). Please re-review.

…venessLostMessage The staleThresholdMs <= 0 fallback was dead code — the sole call site already guards with `if (staleThresholdMs <= 0) continue` before invoking this function, so staleThresholdMs is always positive. Addresses Greptile P2 finding on PR paperclipai#4162. Co-Authored-By: Paperclip <noreply@paperclip.ing>

Mark the intentionally-preserved "running" run as failed at the end of the cold-startup reaper test so the afterEach polling loop does not spin its full 100-iteration (~5 s) timeout waiting for the run to settle. Co-Authored-By: Paperclip <noreply@paperclip.ing>

greptile-apps bot reviewed Apr 20, 2026

View reviewed changes

Comment thread server/src/services/heartbeat.ts

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(heartbeat): teach reaper about out-of-process K8s adapter liveness (FAR-108)#4162

fix(heartbeat): teach reaper about out-of-process K8s adapter liveness (FAR-108)#4162
cpfarhood wants to merge 3 commits intopaperclipai:masterfrom
farhoodliquor:fix/far-108-k8s-adapter-reaper-liveness

cpfarhood commented Apr 20, 2026

Uh oh!

greptile-apps bot commented Apr 20, 2026 •

edited

Loading

Uh oh!

Uh oh!

cpfarhood commented Apr 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

cpfarhood commented Apr 20, 2026

Thinking Path

What Changed

Verification

Risks

Model Used

Checklist

Uh oh!

greptile-apps bot commented Apr 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Uh oh!

Uh oh!

cpfarhood commented Apr 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

greptile-apps bot commented Apr 20, 2026 •

edited

Loading