Robustness: detached-worker spawn error handling, dead-worker reconciliation, atomic + serialized state writes

Thanks for the plugin. While hardening our setup, we hit several robustness issues in the job-state / worker lifecycle (v1.0.4) that can silently lose job records or leave jobs wedged. File and function references are against the v1.0.4 tag.

## Medium

1. **Detached worker spawn has no `error` handler** — `scripts/codex-companion.mjs`, `spawnDetachedTaskWorker`. `child.unref()` is called with no `child.on("error", …)`; a failed spawn (ENOENT/EACCES) goes undetected, the job is recorded with `pid: null`, and a subsequent cancel silently no-ops. Suggest attaching an `error` handler that marks the job `failed`.

2. **Crashed worker stuck `running` forever** — `scripts/codex-companion.mjs`, `waitForSingleJobSnapshot`. Status is polled with no PID-liveness check, so a `SIGKILL`'d worker never self-transitions (the `catch` that writes `status: "failed"` never runs). Suggest a `process.kill(pid, 0)` liveness probe (ESRCH → failed) or a timeout→failed fallback in the read path.

3. **`state.json` written non-atomically, and the read-modify-write is unserialized** — `scripts/lib/state.mjs` (`saveState`, `writeJobFile`, `updateState`). Two distinct problems:
   - Plain `fs.writeFileSync` can be observed mid-write; `loadState` then hits its `catch` and silently returns empty state, dropping all job records. Suggest temp-file + `fs.renameSync` for an atomic replace.
   - `updateState` does `loadState → mutate → saveState` with no lock. Concurrent writers (parent enqueue, the worker's progress updates via `createJobProgressUpdater`, and the SessionEnd `cleanupSessionJobs` in `scripts/session-lifecycle-hook.mjs`) each load the state, patch their own copy, and write it back, so the last writer wins and jobs added by the others are dropped. Suggest a lock (or compare-and-swap) around the RMW. Repro: ~25 concurrent writers adding distinct jobs → most are lost.
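
For item 3, here is a rough sketch of what the two suggestions could look like together, using only Node built-ins. The helper names (`writeFileAtomic`, `withStateLock`, `updateStateLocked`), the `.lock` suffix, the retry parameters, and the `{ jobs: [] }` default shape are all hypothetical and not taken from the plugin.

```javascript
import fs from "node:fs";
import path from "node:path";

// Hypothetical helper: write to a temp file in the same directory, then rename it
// over the target. On POSIX, rename() within one filesystem replaces the target
// atomically, so readers see either the old file or the new one.
function writeFileAtomic(file, data) {
  const tmp = path.join(
    path.dirname(file),
    `.${path.basename(file)}.${process.pid}.${Date.now()}.tmp`
  );
  try {
    fs.writeFileSync(tmp, data);
    fs.renameSync(tmp, file);
  } catch (err) {
    fs.rmSync(tmp, { force: true }); // do not leave a stray temp file behind
    throw err;
  }
}

// Blocks the current thread for `ms` milliseconds without spinning.
function sleepSync(ms) {
  Atomics.wait(new Int32Array(new SharedArrayBuffer(4)), 0, 0, ms);
}

// Hypothetical lock: exclusive creation ("wx") of a sibling lock file.
// Assumption: no stale-lock recovery; a crashed holder leaves the lock in place.
function withStateLock(file, fn, { retries = 200, delayMs = 10 } = {}) {
  const lockPath = `${file}.lock`;
  for (let attempt = 0; ; attempt++) {
    try {
      fs.closeSync(fs.openSync(lockPath, "wx"));
      break; // lock acquired
    } catch (err) {
      if (err.code !== "EEXIST" || attempt >= retries) throw err;
      sleepSync(delayMs);
    }
  }
  try {
    return fn();
  } finally {
    fs.rmSync(lockPath, { force: true });
  }
}

// Read-modify-write under the lock, finishing with an atomic replace.
function updateStateLocked(file, mutate) {
  return withStateLock(file, () => {
    const state = fs.existsSync(file)
      ? JSON.parse(fs.readFileSync(file, "utf8"))
      : { jobs: [] }; // assumed default shape
    mutate(state);
    writeFileAtomic(file, JSON.stringify(state, null, 2));
    return state;
  });
}
```

A caller would use it as `updateStateLocked(stateFile, (s) => s.jobs.push(job))`. A real implementation would probably also need stale-lock detection (for example, recording the holder's PID in the lock file), which this sketch leaves out.
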

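For items 1 and 2, the two suggested fixes might look roughly like the following. This is only a sketch: `spawnDetachedWorker`, `markJobFailed`, `isProcessAlive`, and `reconcileJob` are hypothetical names, and the job object shape (`status`, `pid`) is assumed from the description above rather than copied from the plugin.

```javascript
import { spawn } from "node:child_process";

// Hypothetical wrapper. `markJobFailed(jobId, err)` stands in for whatever
// persists the job's status; it is not a function from the plugin.
function spawnDetachedWorker(command, args, jobId, markJobFailed) {
  const child = spawn(command, args, { detached: true, stdio: "ignore" });
  // Spawn failures such as ENOENT or EACCES are reported through the "error" event.
  child.once("error", (err) => markJobFailed(jobId, err));
  child.unref();
  return child.pid ?? null; // child.pid is undefined when the spawn failed
}

// Signal 0 checks that the process exists without delivering a signal.
function isProcessAlive(pid) {
  // Guard against null PIDs and against 0 or negative values,
  // which kill() interprets as process groups.
  if (!Number.isInteger(pid) || pid <= 0) return false;
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    // EPERM means the process exists but we may not signal it; ESRCH means it is gone.
    return err.code === "EPERM";
  }
}

// Read-path reconciliation: a "running" job whose worker is gone becomes "failed".
function reconcileJob(job) {
  if (job.status === "running" && !isProcessAlive(job.pid)) {
    return {
      ...job,
      status: "failed",
      error: "worker exited without reporting a result",
    };
  }
  return job;
}
```

One caveat with the liveness probe is PID reuse: a recycled PID can make a dead worker look alive, so a timeout fallback may still be worth keeping alongside it.
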
## Low

4. `generateJobId` uses `Math.random()` (`scripts/lib/state.mjs`) — prefer `crypto.randomUUID()` / `crypto.randomBytes`.
5. `ensureGitRepository` runs twice per review.
6. `handleResult` / `handleTaskResumeCandidate` are not awaited.
7. `status --wait` exits `0` on timeout (reads as success); `readJobFile` has no JSON parse guard.
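
Items 4 and 7 are small enough to sketch together. Everything named here is an assumption for illustration: the `job-` prefix, the `readJobFileSafe` result shape, and the choice of `124` for timeouts (borrowed from the GNU `timeout` convention) are not taken from the plugin.

```javascript
import { randomUUID } from "node:crypto";
import fs from "node:fs";

// Item 4: IDs from the crypto module instead of Math.random().
// The "job-" prefix is a hypothetical format.
function generateJobId() {
  return `job-${randomUUID()}`;
}

// Item 7: guard the JSON parse so a truncated or corrupt job file is reported
// to the caller instead of throwing from deep inside the read path.
function readJobFileSafe(file) {
  let raw;
  try {
    raw = fs.readFileSync(file, "utf8");
  } catch (err) {
    return { job: null, error: err.code === "ENOENT" ? "missing" : "unreadable" };
  }
  try {
    return { job: JSON.parse(raw), error: null };
  } catch {
    return { job: null, error: "invalid-json" };
  }
}

// Item 7: a distinct non-zero exit code for a timed-out wait.
// 124 is an assumed value; any documented non-zero code would do.
const EXIT_WAIT_TIMEOUT = 124;

function exitCodeForWait({ timedOut }) {
  return timedOut ? EXIT_WAIT_TIMEOUT : 0;
}
```

Returning the error reason explicitly means a corrupt file cannot be mistaken for "no job", which is the silent-loss pattern described under item 3.
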

Happy to send a PR for any of these if useful.

Robustness: detached-worker spawn error handling, dead-worker reconciliation, atomic + serialized state writes #377