Fix job worker reliability: retry on error, reduce poll interval #2453

Merged

mpscholten merged 1 commit into master from fix/job-worker-retry-on-error on Feb 19, 2026

Conversation

@mpscholten (Member) commented on Feb 19, 2026

Summary

  • Retry on transient errors: After the job worker redesign (#2327, Redesign job worker system for reliability and performance), runJobLoop silently exits when fetchNextJob throws (pool exhaustion, connection timeout). Since the NOTIFY signal was already consumed from the TBQueue, nothing triggers a new worker, so the job waits for the poller. Now the worker retries with a 1s backoff (async exceptions are still rethrown for clean shutdown).
  • Reduce poll interval 60s → 10s: The poller is the safety net when a pg_notify is missed, and 60s is too long for a fallback. At 10s the cost is one lightweight COUNT query per job type per interval, which is negligible overhead.
  • Add notification logging: Debug log when a pg_notify is received, and warn when a notification is dropped because the queue is full (see the sketch below).
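
A minimal sketch of what the notification handling could look like, assuming the listener pushes a unit token into a bounded TBQueue; the function and variable names are illustrative and putStrLn stands in for the project's debug/warning logger, so this is not the actual IHP code:

```haskell
import Control.Concurrent.STM (TBQueue, atomically, isFullTBQueue, writeTBQueue)
import Control.Monad (unless)

-- Called from the pg_notify listener; logs receipt and warns instead of
-- silently dropping the wake-up signal when the bounded queue is full.
handleNotification :: TBQueue () -> IO ()
handleNotification queue = do
    putStrLn "Received pg_notify"  -- stands in for the debug log
    accepted <- atomically $ do
        full <- isFullTBQueue queue
        if full
            then pure False
            else writeTBQueue queue () >> pure True
    unless accepted $
        putStrLn "Dropped pg_notify: worker wake-up queue is full"  -- stands in for the warning
```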

Fixes amitaibu/ihp-sensors#18

Test plan

  • Create a job and verify it's picked up within seconds
  • Verify graceful shutdown still works (CTRL+C waits for jobs, second CTRL+C force-kills)
  • With DEBUG=1, verify "Received pg_notify" appears when a job is created
  • Verify normal job processing is unaffected

🤖 Generated with Claude Code

After the job worker redesign (820cf00), runJobLoop exits without
retrying when fetchNextJob throws a transient error (pool exhaustion,
connection timeout). Since the NOTIFY signal was already consumed from
the TBQueue, nothing triggers a new worker spawn, so the job sits
orphaned until the 60-second poller picks it up.

The old MVar-based workers were persistent and always looped back to
takeMVar after any outcome. The new on-demand workers are ephemeral,
so exiting means the job is lost until the poller runs.

Add a runJobLoop call to the error branch so the worker retries after
the 1-second backoff, matching how the poller handles errors.
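
A self-contained sketch of the resulting loop shape, with fetchJob and runJob as hypothetical stand-ins for the real fetchNextJob and job-execution code (the actual IHP implementation differs in types and logging):

```haskell
{-# LANGUAGE ScopedTypeVariables #-}
import Control.Concurrent (threadDelay)
import Control.Exception (SomeAsyncException, SomeException, fromException, throwIO, try)

-- Fetch-and-run loop: keep going until there is no pending job.
runJobLoop :: IO (Maybe job) -> (job -> IO ()) -> IO ()
runJobLoop fetchJob runJob = do
    result <- try fetchJob
    case result of
        Left (exception :: SomeException)
            -- async exceptions (e.g. cancellation during shutdown) are rethrown,
            -- not swallowed by the retry path
            | Just (asyncEx :: SomeAsyncException) <- fromException exception -> throwIO asyncEx
            | otherwise -> do
                putStrLn ("Failed to fetch next job: " <> show exception)
                threadDelay 1000000            -- 1s backoff to avoid a tight error loop
                runJobLoop fetchJob runJob     -- the fix: retry instead of exiting
        Right Nothing -> pure ()               -- nothing pending: on-demand worker exits
        Right (Just job) -> do
            runJob job
            runJobLoop fetchJob runJob         -- look for the next job
```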

Fixes amitaibu/ihp-sensors#18

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
mpscholten merged commit 528b01a into master on Feb 19, 2026
1 check passed
mpscholten deleted the fix/job-worker-retry-on-error branch on February 19, 2026 at 09:36

chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 35deb3b1e9

```haskell
Left exception -> do
    Log.error ("Job worker: Failed to fetch next job: " <> tshow exception)
    Concurrent.threadDelay 1000000 -- 1s backoff to avoid tight error loops
    runJobLoop -- retry after transient error
```

P1: Stop retry loop when worker shutdown is requested

Re-entering runJobLoop after every fetch error means an active worker never terminates while the database keeps failing, so the dispatcher’s Stop path can block indefinitely waiting for activeCount == 0 during shutdown. This is a regression from the previous behavior (worker exited on fetch failure): with subscriptions/poller already stopped, there is no new work to drain, but this retry loop keeps the worker alive forever unless a second forced cancellation signal is sent.
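
A sketch of one way the suggested guard could look, assuming a shutdownRequested TVar is threaded into the worker (the flag and its wiring are assumptions, not the actual IHP code): check the flag after the backoff and exit instead of retrying, so activeCount can still reach zero.

```haskell
import Control.Concurrent (threadDelay)
import Control.Concurrent.STM (TVar, readTVarIO)

-- Error-path helper: back off, then retry the loop only if no shutdown has
-- been requested, so a failing database cannot keep the worker alive forever.
retryUnlessStopping :: TVar Bool -> IO () -> IO ()
retryUnlessStopping shutdownRequested loop = do
    threadDelay 1000000              -- same 1s backoff as in the PR
    stopping <- readTVarIO shutdownRequested
    if stopping
        then pure ()                 -- exit so the dispatcher sees activeCount == 0
        else loop                    -- otherwise retry the fetch via the loop
```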

mpscholten changed the title from "Fix job worker exiting on transient fetch error" to "Fix job worker reliability: retry on error, reduce poll interval" on Feb 19, 2026