
Robustness: detached-worker spawn error handling, dead-worker reconciliation, atomic + serialized state writes #377

Description

@Dvorf

Thanks for the plugin. While hardening our setup we hit several robustness issues in the job-state / worker lifecycle (v1.0.4) that can silently lose job records or leave jobs wedged. File:line are against the v1.0.4 tag.

Medium

  1. Detached worker spawn has no error handler (scripts/codex-companion.mjs, spawnDetachedTaskWorker). child.unref() is called with no child.on("error", …); a failed spawn (ENOENT/EACCES) goes undetected, the job is recorded with pid: null, and a subsequent cancel silently no-ops. Suggest attaching an error handler that marks the job failed.

  2. Crashed worker stuck running forever (scripts/codex-companion.mjs, waitForSingleJobSnapshot). Status is polled with no PID-liveness check, so a SIGKILL'd worker never self-transitions (the catch that writes status: "failed" never runs). Suggest a process.kill(pid, 0) liveness probe (ESRCH → failed) or a timeout → failed fallback in the read path.

  3. state.json written non-atomically, and the read-modify-write is unserialized (scripts/lib/state.mjs: saveState, writeJobFile, updateState). Two distinct problems:

    • Plain fs.writeFileSync can be observed mid-write; loadState then hits its catch and silently returns empty state, dropping all job records. Suggest temp-file + fs.renameSync for an atomic replace.
    • updateState does loadState → mutate → saveState with no lock. Concurrent writers (the parent enqueue, the worker's progress updates via createJobProgressUpdater, and the SessionEnd cleanupSessionJobs in scripts/session-lifecycle-hook.mjs) each load the state, patch their own copy, and save it, so the last writer wins and the other writers' jobs are dropped. Suggest a lock (or compare-and-swap) around the read-modify-write. Repro: ~25 concurrent writers adding distinct jobs → most are lost.
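
For reference, here is a rough sketch of what the three Medium fixes could look like. It is not the plugin's code: every helper name (spawnDetachedWithHandler, isProcessAlive, writeJsonAtomic, withFileLock, updateStateLocked) and the { jobs: [] } state shape are assumptions made for illustration. It uses require so it runs standalone; in the .mjs sources these would be import statements.

```javascript
// Sketch only: every helper name here is hypothetical, not the plugin's code.
const fs = require("node:fs");
const { spawn } = require("node:child_process");

// Item 1: attach an "error" listener before unref() so a failed spawn
// (ENOENT/EACCES) reaches the caller, which can then mark the job failed.
function spawnDetachedWithHandler(command, args, onSpawnError) {
  const child = spawn(command, args, { detached: true, stdio: "ignore" });
  child.on("error", onSpawnError);
  child.unref();
  return child;
}

// Item 2: signal 0 delivers nothing; it only checks that the PID exists.
function isProcessAlive(pid) {
  if (!Number.isInteger(pid) || pid <= 0) return false;
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    // EPERM means the process exists but is owned by someone else;
    // anything else (typically ESRCH) is treated as "not running".
    return err.code === "EPERM";
  }
}

// Item 3a: write to a temp file in the same directory, then rename it over
// the target, so readers see either the old or the new contents.
function writeJsonAtomic(filePath, data) {
  const tmpPath = `${filePath}.${process.pid}.${Date.now()}.tmp`;
  fs.writeFileSync(tmpPath, JSON.stringify(data, null, 2));
  fs.renameSync(tmpPath, filePath);
}

// Item 3b: a crude lock file created with the exclusive "wx" flag.
// Stale locks left by a crashed holder are NOT handled in this sketch.
function withFileLock(lockPath, fn, retries = 500, delayMs = 10) {
  for (let attempt = 0; ; attempt += 1) {
    try {
      fs.closeSync(fs.openSync(lockPath, "wx")); // throws EEXIST while held
      break;
    } catch (err) {
      if (err.code !== "EEXIST" || attempt >= retries) throw err;
      // Blocking sleep; tolerable in a short-lived CLI process.
      Atomics.wait(new Int32Array(new SharedArrayBuffer(4)), 0, 0, delayMs);
    }
  }
  try {
    return fn();
  } finally {
    fs.unlinkSync(lockPath);
  }
}

// Locked read-modify-write. The { jobs: [] } shape is an assumption.
function updateStateLocked(statePath, mutate) {
  return withFileLock(`${statePath}.lock`, () => {
    const state = fs.existsSync(statePath)
      ? JSON.parse(fs.readFileSync(statePath, "utf8"))
      : { jobs: [] };
    mutate(state);
    writeJsonAtomic(statePath, state);
    return state;
  });
}
```

A caller would use something like updateStateLocked(statePath, (s) => s.jobs.push(job)) in place of the unlocked load/mutate/save. A real fix would also need a stale-lock policy (for example, storing the holder's PID in the lock file and probing it with isProcessAlive), and the rename step is only expected to be atomic when the temp file and target are on the same filesystem.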

Low

  1. generateJobId uses Math.random() (scripts/lib/state.mjs) — prefer crypto.randomUUID() / crypto.randomBytes.
  2. ensureGitRepository runs twice per review.
  3. handleResult / handleTaskResumeCandidate are not awaited.
  4. status --wait exits 0 on timeout (reads as success); readJobFile has no JSON parse guard.
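
Two of the Low items are small enough to sketch as well. The function names, the "job" prefix, and the { ok, reason } return shape below are assumptions for illustration, not the plugin's current API; require is used only so the snippet runs standalone.

```javascript
// Sketch only: names and return shapes are hypothetical.
const crypto = require("node:crypto");
const fs = require("node:fs");

// Item 1: derive the ID from crypto.randomUUID() instead of Math.random().
// The "job" prefix is an assumption about the desired ID format.
function generateJobId(prefix = "job") {
  return `${prefix}-${crypto.randomUUID()}`;
}

// Item 4: guard JSON.parse so a truncated or corrupt job file is reported
// as such instead of surfacing as an uncaught SyntaxError.
function readJsonFileSafe(filePath) {
  let text;
  try {
    text = fs.readFileSync(filePath, "utf8");
  } catch (err) {
    if (err.code === "ENOENT") return { ok: false, reason: "missing" };
    throw err;
  }
  try {
    return { ok: true, value: JSON.parse(text) };
  } catch {
    return { ok: false, reason: "corrupt" };
  }
}
```

Returning a tagged result lets the caller tell "no such job" apart from "job file is damaged", which matters if a damaged file should mark the job failed rather than be treated as absent.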

Happy to send a PR for any of these if useful.
