Fix job worker reliability: retry on error, reduce poll interval #2453
mpscholten merged 1 commit into master
Conversation
After the job worker redesign (820cf00), runJobLoop exits without retrying when fetchNextJob throws a transient error (pool exhaustion, connection timeout). Since the NOTIFY signal was already consumed from the TBQueue, nothing triggers a new worker spawn, so the job sits orphaned until the 60-second poller picks it up.

The old MVar-based workers were persistent and always looped back to takeMVar after any outcome. The new on-demand workers are ephemeral, so exiting means the job is lost until the poller runs.

Add a runJobLoop call to the error branch so the worker retries after the 1-second backoff, matching how the poller handles errors.

Fixes amitaibu/ihp-sensors#18

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
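For context, a minimal self-contained sketch of the loop shape this change produces, assuming the worker is an ephemeral IO loop driven by fetchNextJob. Only `runJobLoop` and `fetchNextJob` correspond to names from this PR; the `Job` type, `runJob`, and `logError` are placeholders for illustration, not IHP's actual API.

```haskell
import qualified Control.Concurrent as Concurrent
import Control.Exception (SomeAsyncException (..), SomeException, fromException, throwIO, try)

-- Placeholders so the sketch compiles on its own; the real worker uses IHP's job types.
data Job = Job

fetchNextJob :: IO (Maybe Job) -- stands in for the real database fetch
fetchNextJob = pure Nothing

runJob :: Job -> IO ()
runJob _ = pure ()

logError :: String -> IO ()
logError = putStrLn

-- After the fix, a fetch error no longer ends the worker: it backs off for
-- one second and re-enters the loop, mirroring the poller's error handling.
runJobLoop :: IO ()
runJobLoop = do
    result <- try fetchNextJob :: IO (Either SomeException (Maybe Job))
    case result of
        Left exception
            -- Async exceptions (e.g. cancellation during shutdown) are rethrown
            -- so the worker can still be stopped cleanly.
            | Just (SomeAsyncException _) <- fromException exception -> throwIO exception
            | otherwise -> do
                logError ("Job worker: Failed to fetch next job: " <> show exception)
                Concurrent.threadDelay 1000000 -- 1s backoff to avoid tight error loops
                runJobLoop -- retry after a transient error
        Right Nothing -> pure () -- queue drained: the ephemeral worker exits
        Right (Just job) -> do
            runJob job
            runJobLoop -- keep going until no pending job is left
```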
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 35deb3b1e9
Left exception -> do
    Log.error ("Job worker: Failed to fetch next job: " <> tshow exception)
    Concurrent.threadDelay 1000000 -- 1s backoff to avoid tight error loops
    runJobLoop -- retry after transient error
Stop retry loop when worker shutdown is requested
Re-entering runJobLoop after every fetch error means an active worker never terminates while the database keeps failing, so the dispatcher’s Stop path can block indefinitely waiting for activeCount == 0 during shutdown. This is a regression from the previous behavior (worker exited on fetch failure): with subscriptions/poller already stopped, there is no new work to drain, but this retry loop keeps the worker alive forever unless a second forced cancellation signal is sent.
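One possible way to reconcile the retry with clean shutdown (a hypothetical sketch, not part of this PR): have the dispatcher share a shutdown flag with its workers and consult it before re-entering the loop, so a persistently failing database does not keep a worker alive once Stop has been requested. The `shuttingDown` TVar and the `retryOrStop` helper below are assumptions for illustration only.

```haskell
import qualified Control.Concurrent as Concurrent
import Control.Concurrent.STM (TVar, readTVarIO)

-- Hypothetical: the dispatcher sets this TVar to True when its Stop path runs.
retryOrStop :: TVar Bool -> IO () -> IO ()
retryOrStop shuttingDown runJobLoop = do
    stopRequested <- readTVarIO shuttingDown
    if stopRequested
        then pure () -- shutdown requested: let the worker exit so activeCount can reach 0
        else do
            Concurrent.threadDelay 1000000 -- same 1s backoff as in the PR
            runJobLoop -- only retry while the dispatcher is still running
```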
Summary
runJobLoop silently exits when fetchNextJob throws (pool exhaustion, connection timeout). Since the NOTIFY signal was already consumed from the TBQueue, nothing triggers a new worker; the job waits for the poller. Now it retries with a 1s backoff (async exceptions are still rethrown for clean shutdown).

Fixes amitaibu/ihp-sensors#18
Test plan
- With DEBUG=1, verify that "Received pg_notify" appears when a job is created

🤖 Generated with Claude Code