Thanks for the plugin. While hardening our setup we hit several robustness issues in the job-state / worker lifecycle (v1.0.4) that can silently lose job records or leave jobs wedged. File:line are against the v1.0.4 tag.
Medium
-
Detached worker spawn has no error handler — scripts/codex-companion.mjs, spawnDetachedTaskWorker. child.unref() is called with no child.on("error", …); a failed spawn (ENOENT/EACCES) goes undetected, the job is recorded with pid: null, and a subsequent cancel silently no-ops. Suggest attaching an error handler that marks the job failed.
-
Crashed worker stuck running forever — scripts/codex-companion.mjs, waitForSingleJobSnapshot. Status is polled with no PID-liveness check, so a SIGKILL'd worker never self-transitions (the catch that writes status: "failed" never runs). Suggest a process.kill(pid, 0) liveness probe (ESRCH → failed) or a timeout→failed fallback in the read path.
-
state.json written non-atomically, and the read-modify-write is unserialized — scripts/lib/state.mjs (saveState, writeJobFile, updateState). Two distinct problems:
- Plain
fs.writeFileSync can be observed mid-write; loadState then hits its catch and silently returns empty state, dropping all job records. Suggest temp-file + fs.renameSync for an atomic replace.
updateState does loadState → mutate → saveState with no lock. Concurrent writers — parent enqueue, the worker's progress updates (createJobProgressUpdater), and the SessionEnd cleanupSessionJobs in scripts/session-lifecycle-hook.mjs — each load, patch their own copy, and last-writer-wins drops other jobs. Suggest a lock (or compare-and-swap) around the RMW. Repro: ~25 concurrent writers adding distinct jobs → most are lost.
Low
generateJobId uses Math.random() (scripts/lib/state.mjs) — prefer crypto.randomUUID() / crypto.randomBytes.
ensureGitRepository runs twice per review.
handleResult / handleTaskResumeCandidate are not awaited.
status --wait exits 0 on timeout (reads as success); readJobFile has no JSON parse guard.
Happy to send a PR for any of these if useful.
Thanks for the plugin. While hardening our setup we hit several robustness issues in the job-state / worker lifecycle (v1.0.4) that can silently lose job records or leave jobs wedged. File:line are against the v1.0.4 tag.
Medium
Detached worker spawn has no
errorhandler —scripts/codex-companion.mjs,spawnDetachedTaskWorker.child.unref()is called with nochild.on("error", …); a failed spawn (ENOENT/EACCES) goes undetected, the job is recorded withpid: null, and a subsequent cancel silently no-ops. Suggest attaching anerrorhandler that marks the jobfailed.Crashed worker stuck
runningforever —scripts/codex-companion.mjs,waitForSingleJobSnapshot. Status is polled with no PID-liveness check, so aSIGKILL'd worker never self-transitions (thecatchthat writesstatus: "failed"never runs). Suggest aprocess.kill(pid, 0)liveness probe (ESRCH → failed) or a timeout→failed fallback in the read path.state.jsonwritten non-atomically, and the read-modify-write is unserialized —scripts/lib/state.mjs(saveState,writeJobFile,updateState). Two distinct problems:fs.writeFileSynccan be observed mid-write;loadStatethen hits itscatchand silently returns empty state, dropping all job records. Suggest temp-file +fs.renameSyncfor an atomic replace.updateStatedoesloadState → mutate → saveStatewith no lock. Concurrent writers — parent enqueue, the worker's progress updates (createJobProgressUpdater), and the SessionEndcleanupSessionJobsinscripts/session-lifecycle-hook.mjs— each load, patch their own copy, and last-writer-wins drops other jobs. Suggest a lock (or compare-and-swap) around the RMW. Repro: ~25 concurrent writers adding distinct jobs → most are lost.Low
generateJobIdusesMath.random()(scripts/lib/state.mjs) — prefercrypto.randomUUID()/crypto.randomBytes.ensureGitRepositoryruns twice per review.handleResult/handleTaskResumeCandidateare not awaited.status --waitexits0on timeout (reads as success);readJobFilehas no JSON parse guard.Happy to send a PR for any of these if useful.