Skip to content

Commit 0e318a4

Browse files
jasonczcclaude
andcommitted
fix(hub): start-local-worker evicts stale worker.lock and verifies readiness
Two root causes of the reported "I created a worker, list says none" bug: 1. When a local worker loses heartbeats long enough to fall out of the "active" set — typically because the hub bounced or the network hiccupped — its bun process usually stays alive and keeps holding `~/.hapi-dev/worker.lock`. A fresh start-local-worker call would enroll a new worker, hit the lock, and immediately exit with "Another worker is already running"; the endpoint still returned `started: true` because it only checked that spawn() itself succeeded. Net effect: UI shows empty workers list forever. 2. The endpoint never waited to see if the child actually stayed up, so it lied about success even on hard failures (e.g. crash during enrollment, bun interpreter missing, etc.). Fix: - evictStaleWorkerLock() runs before each spawn. Reads <dataDir>/worker.lock, SIGTERMs the holder if still alive, unlinks the lock. Safe because the lock always belongs to a worker the hub itself spawned (same HAPI_HOME). - awaitWorkerReady() polls the child's logs up to 4s after spawn, looking for either the "Worker starting..." milestone (runner loop entered) or a terminal exit. On failure returns a structured 500 with reason=worker_lock_conflict|worker_exited plus the tail of child logs so the UI can surface something concrete. - Successful responses now carry optional evictedPid so the caller can see that a cleanup happened. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 478f44a commit 0e318a4

1 file changed

Lines changed: 104 additions & 3 deletions

File tree

hub/src/web/routes/cloud.ts

Lines changed: 104 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -1,10 +1,11 @@
11
import { Hono } from 'hono'
22
import { z } from 'zod'
33
import { execFile, spawn } from 'node:child_process'
4-
import { resolve } from 'node:path'
5-
import { readFileSync } from 'node:fs'
4+
import { join, resolve } from 'node:path'
5+
import { readFileSync, unlinkSync, existsSync } from 'node:fs'
66
import os from 'node:os'
77
import { promisify } from 'node:util'
8+
import { configuration } from '../../configuration'
89
import { CLOUD_PROVIDER_NAMES } from '../../cloud/provider'
910
import type { SyncEngine } from '../../sync/syncEngine'
1011
import type { WebAppEnv } from '../middleware/auth'
@@ -25,6 +26,75 @@ function isPidAlive(pid: number): boolean {
2526
}
2627
}
2728

29+
/**
30+
* Evict whatever holds the hub's own worker.lock.
31+
*
32+
* Background: a worker process enrolls, then periodically heartbeats the
33+
* hub. If the hub restarts or a network hiccup drops heartbeats for long
34+
* enough, the worker is marked inactive in the machine registry —
35+
* invisible in the UI — but the underlying bun process is usually still
36+
* running and still holding `worker.lock`. Any subsequent
37+
* `start-local-worker` spawns enroll fine, then exit instantly with
38+
* "Another worker is already running" and the caller sees empty lists.
39+
*
40+
* Since the lock file always lives under the hub's HAPI_HOME, we own
41+
* that pid end-to-end — it can only belong to a worker we previously
42+
* spawned. SIGTERM it and unlink the lock. Idempotent: no-op if the
43+
* lock doesn't exist or the holder is already dead.
44+
*
45+
* Returns the pid we evicted (if any), for surfacing in the response.
46+
*/
47+
function evictStaleWorkerLock(): { evictedPid?: number } {
48+
const lockPath = join(configuration.dataDir, 'worker.lock')
49+
if (!existsSync(lockPath)) return {}
50+
let pid: number | undefined
51+
try {
52+
const raw = readFileSync(lockPath, 'utf8').trim()
53+
const parsed = Number(raw)
54+
if (Number.isInteger(parsed) && parsed > 0) pid = parsed
55+
} catch {
56+
// corrupt lock — just unlink
57+
}
58+
if (pid && isPidAlive(pid)) {
59+
try { process.kill(pid, 'SIGTERM') } catch { /* already gone */ }
60+
}
61+
try { unlinkSync(lockPath) } catch { /* already unlinked */ }
62+
return { evictedPid: pid }
63+
}
64+
65+
/**
66+
* Wait up to timeoutMs for the spawned worker to either:
67+
* - exit (failure — exitCode set), or
68+
* - log that it's past enrollment + lock acquisition
69+
* ("Worker starting..." is the first line runRunnerLoop prints).
70+
*
71+
* Polls every 100ms. Returns the final liveness snapshot so the caller
72+
* can decide whether to tell the client "started" or "failed".
73+
*/
74+
async function awaitWorkerReady(
75+
proc: { pid: number; exitCode: number | null; logs: string[] },
76+
timeoutMs: number
77+
): Promise<{ ready: boolean; reason?: string }> {
78+
const deadline = Date.now() + timeoutMs
79+
while (Date.now() < deadline) {
80+
if (proc.exitCode !== null) {
81+
// Known-failure signatures we want to surface cleanly.
82+
const stuckOnLock = proc.logs.some((l) => l.includes('Another worker is already running'))
83+
return {
84+
ready: false,
85+
reason: stuckOnLock ? 'worker_lock_conflict' : 'worker_exited'
86+
}
87+
}
88+
if (proc.logs.some((l) => l.includes('Worker starting...'))) {
89+
return { ready: true }
90+
}
91+
await new Promise((r) => setTimeout(r, 100))
92+
}
93+
// Timed out without exit AND without the "starting" marker — treat
94+
// as provisional ready (the child may still be enrolling).
95+
return { ready: true }
96+
}
97+
2898
const execFileAsync = promisify(execFile)
2999
const LOCAL_RUNTIME_IMAGE = 'haqi-workspace:dev'
30100

@@ -662,6 +732,18 @@ export function createCloudRoutes(getSyncEngine: () => SyncEngine | null, hubUrl
662732
}
663733
}
664734

735+
// Evict any orphan worker that still holds the per-HAPI_HOME
736+
// worker.lock. Without this the spawn below enrolls cleanly but
737+
// exits immediately with "Another worker is already running"
738+
// whenever a previous worker lost heartbeats without exiting.
739+
const evicted = evictStaleWorkerLock()
740+
if (evicted.evictedPid) {
741+
appendTrackedWorkerLog(`[hub] Evicted stale worker.lock holder pid=${evicted.evictedPid}`)
742+
// Give it a moment to release file handles before we spawn
743+
// the replacement.
744+
await new Promise((r) => setTimeout(r, 300))
745+
}
746+
665747
// Create enrollment token
666748
const tokenResult = engine.createCloudWorkerEnrollmentToken({
667749
namespace,
@@ -727,10 +809,29 @@ export function createCloudRoutes(getSyncEngine: () => SyncEngine | null, hubUrl
727809
})
728810
child.unref()
729811

812+
// Don't lie to the caller: wait a few seconds for the child to
813+
// either (a) cross the "Worker starting..." log line (enrollment
814+
// succeeded, runner loop entered) or (b) exit. Returning
815+
// `started: true` when the child dies 800ms later to a lock
816+
// conflict is what sent users hunting in empty worker lists.
817+
const readiness = await awaitWorkerReady(localWorkerProcess, 4000)
818+
if (!readiness.ready) {
819+
return c.json({
820+
started: false,
821+
reason: readiness.reason ?? 'worker_failed_to_start',
822+
pid: child.pid,
823+
startedAt: localWorkerProcess.startedAt,
824+
exitCode: localWorkerProcess.exitCode,
825+
evictedPid: evicted.evictedPid,
826+
logs: localWorkerProcess.logs.slice(-10)
827+
}, 500)
828+
}
829+
730830
return c.json({
731831
started: true,
732832
pid: child.pid,
733-
startedAt: localWorkerProcess.startedAt
833+
startedAt: localWorkerProcess.startedAt,
834+
evictedPid: evicted.evictedPid
734835
})
735836
})
736837

0 commit comments

Comments
 (0)