Fix job worker reliability: retry on error, reduce poll interval #2453

Merged

mpscholten merged 1 commit into master from fix/job-worker-retry-on-error on Feb 19, 2026

Conversation

@mpscholten (Member) commented on Feb 19, 2026

Summary

  • Retry on transient errors: After the job worker redesign (#2327, Redesign job worker system for reliability and performance), runJobLoop silently exits when fetchNextJob throws (pool exhaustion, connection timeout). Since the NOTIFY signal was already consumed from the TBQueue, nothing triggers a new worker, so the job waits for the poller. Now the worker retries with a 1s backoff (async exceptions are still rethrown for clean shutdown).
  • Reduce poll interval 60s → 10s: The poller is the safety net when a pg_notify is missed, and 60s is too long for a fallback. At 10s the cost is one lightweight COUNT query per job type per interval, which is negligible overhead.
  • Add notification logging: Debug log when a pg_notify is received, and warn when a notification is dropped because the queue is full (see the sketch below).
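
A minimal sketch of what the notification handling could look like, assuming the listener pushes a unit token into a bounded TBQueue; the function and variable names are illustrative and putStrLn stands in for the project's debug/warning logger, so this is not the actual IHP code:

```haskell
import Control.Concurrent.STM (TBQueue, atomically, isFullTBQueue, writeTBQueue)
import Control.Monad (unless)

-- Called from the pg_notify listener; logs receipt and warns instead of
-- silently dropping the wake-up signal when the bounded queue is full.
handleNotification :: TBQueue () -> IO ()
handleNotification queue = do
    putStrLn "Received pg_notify"  -- stands in for the debug log
    accepted <- atomically $ do
        full <- isFullTBQueue queue
        if full
            then pure False
            else writeTBQueue queue () >> pure True
    unless accepted $
        putStrLn "Dropped pg_notify: worker wake-up queue is full"  -- stands in for the warning
```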

Fixes amitaibu/ihp-sensors#18

Test plan

  • Create a job and verify it's picked up within seconds
  • Verify graceful shutdown still works (CTRL+C waits for jobs, second CTRL+C force-kills)
  • With DEBUG=1, verify "Received pg_notify" appears when a job is created
  • Verify normal job processing is unaffected

🤖 Generated with Claude Code

After the job worker redesign (820cf00), runJobLoop exits without
retrying when fetchNextJob throws a transient error (pool exhaustion,
connection timeout). Since the NOTIFY signal was already consumed from
the TBQueue, nothing triggers a new worker spawn, so the job sits
orphaned until the 60-second poller picks it up.

The old MVar-based workers were persistent and always looped back to
takeMVar after any outcome. The new on-demand workers are ephemeral,
so exiting means the job is lost until the poller runs.

Add a runJobLoop call to the error branch so the worker retries after
the 1-second backoff, matching how the poller handles errors.
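
A self-contained sketch of the resulting loop shape, with fetchJob and runJob as hypothetical stand-ins for the real fetchNextJob and job-execution code (the actual IHP implementation differs in types and logging):

```haskell
{-# LANGUAGE ScopedTypeVariables #-}
import Control.Concurrent (threadDelay)
import Control.Exception (SomeAsyncException, SomeException, fromException, throwIO, try)

-- Fetch-and-run loop: keep going until there is no pending job.
runJobLoop :: IO (Maybe job) -> (job -> IO ()) -> IO ()
runJobLoop fetchJob runJob = do
    result <- try fetchJob
    case result of
        Left (exception :: SomeException)
            -- async exceptions (e.g. cancellation during shutdown) are rethrown,
            -- not swallowed by the retry path
            | Just (asyncEx :: SomeAsyncException) <- fromException exception -> throwIO asyncEx
            | otherwise -> do
                putStrLn ("Failed to fetch next job: " <> show exception)
                threadDelay 1000000            -- 1s backoff to avoid a tight error loop
                runJobLoop fetchJob runJob     -- the fix: retry instead of exiting
        Right Nothing -> pure ()               -- nothing pending: on-demand worker exits
        Right (Just job) -> do
            runJob job
            runJobLoop fetchJob runJob         -- look for the next job
```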

Fixes amitaibu/ihp-sensors#18

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
mpscholten merged commit 528b01a into master on Feb 19, 2026
1 check passed
mpscholten deleted the fix/job-worker-retry-on-error branch on February 19, 2026 at 09:36

chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 35deb3b1e9

```haskell
Left exception -> do
    Log.error ("Job worker: Failed to fetch next job: " <> tshow exception)
    Concurrent.threadDelay 1000000 -- 1s backoff to avoid tight error loops
    runJobLoop -- retry after transient error
```

P1: Stop retry loop when worker shutdown is requested

Re-entering runJobLoop after every fetch error means an active worker never terminates while the database keeps failing, so the dispatcher’s Stop path can block indefinitely waiting for activeCount == 0 during shutdown. This is a regression from the previous behavior (worker exited on fetch failure): with subscriptions/poller already stopped, there is no new work to drain, but this retry loop keeps the worker alive forever unless a second forced cancellation signal is sent.
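
A sketch of one way the suggested guard could look, assuming a shutdownRequested TVar is threaded into the worker (the flag and its wiring are assumptions, not the actual IHP code): check the flag after the backoff and exit instead of retrying, so activeCount can still reach zero.

```haskell
import Control.Concurrent (threadDelay)
import Control.Concurrent.STM (TVar, readTVarIO)

-- Error-path helper: back off, then retry the loop only if no shutdown has
-- been requested, so a failing database cannot keep the worker alive forever.
retryUnlessStopping :: TVar Bool -> IO () -> IO ()
retryUnlessStopping shutdownRequested loop = do
    threadDelay 1000000              -- same 1s backoff as in the PR
    stopping <- readTVarIO shutdownRequested
    if stopping
        then pure ()                 -- exit so the dispatcher sees activeCount == 0
        else loop                    -- otherwise retry the fetch via the loop
```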

mpscholten changed the title from "Fix job worker exiting on transient fetch error" to "Fix job worker reliability: retry on error, reduce poll interval" on Feb 19, 2026