Skip to content

fix(heartbeat): teach reaper about out-of-process K8s adapter liveness (FAR-108)#4162

Open
cpfarhood wants to merge 3 commits intopaperclipai:masterfrom
farhoodliquor:fix/far-108-k8s-adapter-reaper-liveness
Open

fix(heartbeat): teach reaper about out-of-process K8s adapter liveness (FAR-108)#4162
cpfarhood wants to merge 3 commits intopaperclipai:masterfrom
farhoodliquor:fix/far-108-k8s-adapter-reaper-liveness

Conversation

@cpfarhood
Copy link
Copy Markdown
Contributor

Thinking Path

  • Paperclip orchestrates AI agents for zero-human companies
  • Agents can run via local adapters (child processes) or remote adapters (Kubernetes Jobs, gateway HTTP services, etc.)
  • The heartbeat reaper kills stale runs by checking local pid / process-group liveness — but K8s-style adapters have no local pid; their "process" is a remote Kubernetes Job
  • When the reaper finds processPid = null for a claude_k8s run it still emits the misleading child pid -1 sentinel and a process_lost retry, neither of which make sense for a remote Job
  • This PR adds a hasOutOfProcessLiveness capability flag to ServerAdapterModule so K8s-style adapters can opt into a separate reaper path that (a) defers at cold startup, (b) emits a meaningful adapter_liveness_lost error code, and (c) skips the process_lost retry
  • The benefit is that long-running claude_k8s runs survive server restarts, operators see actionable error messages when a K8s Job genuinely disappears, and the fix is fully backward-compatible with all existing local adapters

What Changed

  • packages/adapter-utils/src/types.ts — Added hasOutOfProcessLiveness?: boolean to ServerAdapterModule. Documents semantics: skip local pid checks, defer cold-startup reap, use adapter_liveness_lost error code.
  • server/src/services/heartbeat.ts — Added adapterHasOutOfProcessLiveness() helper (reads the new flag via getServerAdapter). In reapOrphanedRuns, inserted an early branch for flagged adapters that: (a) skips cold-startup sweep (staleThresholdMs <= 0), (b) sets run to failed with errorCode: adapter_liveness_lost and a K8s-aware message, (c) calls releaseIssueExecutionAndPromote without the process_lost retry, and (d) emits a lifecycle run event.
  • server/src/services/heartbeat-stop-metadata.ts — Added "adapter_liveness_lost" to HeartbeatRunStopReason union and wired it into inferHeartbeatRunStopReason.
  • server/src/__tests__/heartbeat-process-recovery.test.ts — Updated mock to accept adapterType and return hasOutOfProcessLiveness: true for test_k8s_out_of_process. Added three new tests:
    1. Cold-startup sweep does not reap an out-of-process run (staleThresholdMs === 0).
    2. Stale out-of-process run is reaped with errorCode: adapter_liveness_lost, no child pid wording, no process_lost retry row.
    3. Regression guard: local adapter with no pid still reaches the classic process_lost path (not adapter_liveness_lost).

Verification

  • Run existing and new heartbeat reaper tests:
    pnpm vitest run server/src/__tests__/heartbeat-process-recovery.test.ts
    
  • All three new tests pass; existing tests unaffected.
  • To validate end-to-end: configure a claude_k8s adapter with hasOutOfProcessLiveness: true, start a long-running run, restart the server, confirm the run is not immediately killed and that its updatedAt keeps refreshing.

Risks

  • Low risk for existing local adapters. The new code path is gated strictly on hasOutOfProcessLiveness === true. No built-in adapter sets this flag today; all existing behaviour is unchanged.
  • K8s adapters must opt in. Without setting hasOutOfProcessLiveness: true in their ServerAdapterModule, claude_k8s and similar adapters will continue to see the old process_lost path. This is intentional — the fix is opt-in to avoid unintended changes.
  • Cold-startup window. Deferring the cold-startup reap for out-of-process adapters means a truly dead K8s run (Job deleted while server was down) won't be cleaned up until the 5-minute periodic sweep. Acceptable trade-off vs. killing live Jobs on every restart.

For core feature work, check ROADMAP.md first and discuss it in #dev before opening the PR. Feature PRs that overlap with planned core work may need to be redirected — check the roadmap first. See CONTRIBUTING.md.

Model Used

  • Provider: Anthropic
  • Model: Claude Sonnet 4.6 (claude-sonnet-4-6)
  • Context window: 200K tokens
  • Capabilities: Tool use, code execution, extended reasoning, multi-file editing
  • Mode: Paperclip heartbeat agent (Maverick) — autonomous execution with issue context

Checklist

  • I have included a thinking path that traces from project context to this change
  • I have specified the model used (with version and capability details)
  • I have checked ROADMAP.md and confirmed this PR does not duplicate planned core work
  • I have run tests locally and they pass
  • I have added or updated tests where applicable
  • If this change affects the UI, I have included before/after screenshots (no UI changes)
  • I have updated relevant documentation to reflect my changes
  • I have considered and documented any risks above
  • I will address all Greptile and reviewer comments before requesting merge

Adds `hasOutOfProcessLiveness` capability flag to `ServerAdapterModule`
so adapters like `claude_k8s` can declare that their execution runs in a
remote Kubernetes Job — not a local child process. The reaper now:

- Skips cold-startup sweeps for flagged adapters so still-running K8s
  Jobs survive server restarts without being killed.
- Emits `adapter_liveness_lost` error code and a K8s-aware message when
  a genuinely stale out-of-process run is reaped, replacing the
  confusing "child pid -1" sentinel.
- Omits the `process_lost` retry queue entry (local pid disappearing is
  irrelevant for remote adapters).

Adds `inferHeartbeatRunStopReason` mapping for `adapter_liveness_lost`
and three regression tests in heartbeat-process-recovery that cover the
cold-startup no-reap, stale reap with correct codes, and a guard that
local adapters with no pid still reach the classic `process_lost` path.

Closes FAR-108.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps bot commented Apr 20, 2026

Greptile Summary

Adds hasOutOfProcessLiveness?: boolean to ServerAdapterModule and threads a new reaper branch through reapOrphanedRuns so K8s-style adapters (a) survive server restarts without being killed at cold startup, (b) emit a meaningful adapter_liveness_lost error code instead of the misleading "child pid -1" sentinel, and (c) skip the process_lost retry that only makes sense for local child processes. The change is strictly opt-in and all three new tests pass, with the existing tests unaffected.

Confidence Score: 5/5

Safe to merge — new reaper path is fully opt-in behind the hasOutOfProcessLiveness flag and no built-in adapter sets it today.

All remaining findings are P2 (a test-cleanup performance nit). Core logic, staleness gating, cold-startup guard, error code wiring, and stop-metadata inference are all correct. Existing tests are unaffected.

No files require special attention for correctness; heartbeat-process-recovery.test.ts has a minor afterEach slowdown for the cold-startup test case.

Important Files Changed

Filename Overview
packages/adapter-utils/src/types.ts Added hasOutOfProcessLiveness?: boolean to ServerAdapterModule with thorough JSDoc explaining the three reaper behaviours when set to true.
server/src/services/heartbeat.ts Added adapterHasOutOfProcessLiveness helper and out-of-process reaper branch; logic is gated correctly, cold-startup guard and adapter_liveness_lost error code wired properly, no process_lost retry queued.
server/src/services/heartbeat-stop-metadata.ts Extended HeartbeatRunStopReason union and inferHeartbeatRunStopReason to handle the new adapter_liveness_lost error code cleanly.
server/src/tests/heartbeat-process-recovery.test.ts Three new tests cover cold-startup skip, stale reap with adapter_liveness_lost, and the regression guard; the cold-startup test leaves a "running" run with no pid in the DB, causing the afterEach cleanup loop to spin ~5 s unnecessarily.
Prompt To Fix All With AI
This is a comment left during a code review.
Path: server/src/__tests__/heartbeat-process-recovery.test.ts
Line: 588-609

Comment:
**Cold-startup test leaves a stuck "running" run that slows `afterEach` cleanup**

The cold-startup test intentionally preserves the run in `"running"` state with `processPid = null` / `processGroupId = null`. The `afterEach` polling loop treats any run matching `status = "running" && !processPid && !processGroupId` as "managed execution still active" and waits until `idlePolls >= 3`. Because nothing ever transitions this run to a terminal state, the loop spins all 100 iterations (≈5 s) before the DB cleanup finally deletes it.

A lightweight fix is to mark the run terminal at the end of the test since the assertion only needs the returned `result.reaped`:

```ts
// After the reapOrphanedRuns assertion, mark the run failed so afterEach
// cleanup does not spin the full 100-iteration timeout.
await db
  .update(heartbeatRuns)
  .set({ status: "failed" })
  .where(eq(heartbeatRuns.id, runId));
```

How can I resolve this? If you propose a fix, please make it concise.

Reviews (2): Last reviewed commit: "refactor(heartbeat): remove unreachable ..." | Re-trigger Greptile

Comment thread server/src/services/heartbeat.ts
…venessLostMessage

The staleThresholdMs <= 0 fallback was dead code — the sole call site
already guards with `if (staleThresholdMs <= 0) continue` before
invoking this function, so staleThresholdMs is always positive.

Addresses Greptile P2 finding on PR paperclipai#4162.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
@cpfarhood
Copy link
Copy Markdown
Contributor Author

@greptile-apps The dead branch in buildAdapterLivenessLostMessage has been removed. The unreachable staleThresholdMs <= 0 fallback is gone — the function now unconditionally formats with the staleness duration (since the call site already guards with if (staleThresholdMs <= 0) continue before ever reaching it). Please re-review.

github-actions bot pushed a commit to farhoodliquor/paperclip that referenced this pull request Apr 21, 2026
…venessLostMessage

The staleThresholdMs <= 0 fallback was dead code — the sole call site
already guards with `if (staleThresholdMs <= 0) continue` before
invoking this function, so staleThresholdMs is always positive.

Addresses Greptile P2 finding on PR paperclipai#4162.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
github-actions bot pushed a commit to farhoodliquor/paperclip that referenced this pull request Apr 21, 2026
…venessLostMessage

The staleThresholdMs <= 0 fallback was dead code — the sole call site
already guards with `if (staleThresholdMs <= 0) continue` before
invoking this function, so staleThresholdMs is always positive.

Addresses Greptile P2 finding on PR paperclipai#4162.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
github-actions bot pushed a commit to farhoodliquor/paperclip that referenced this pull request Apr 21, 2026
…venessLostMessage

The staleThresholdMs <= 0 fallback was dead code — the sole call site
already guards with `if (staleThresholdMs <= 0) continue` before
invoking this function, so staleThresholdMs is always positive.

Addresses Greptile P2 finding on PR paperclipai#4162.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
Mark the intentionally-preserved "running" run as failed at the end of
the cold-startup reaper test so the afterEach polling loop does not spin
its full 100-iteration (~5 s) timeout waiting for the run to settle.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant