Fix job worker silently exiting on transient fetchNextJob error

mpscholten · claude · mpscholten · commit 35deb3b1e93b · 2026-02-19T10:34:18.000+01:00
After the job worker redesign (820cf00), runJobLoop exits without retrying when fetchNextJob throws a transient error (pool exhaustion, connection timeout). Since the NOTIFY signal was already consumed from the TBQueue, nothing triggers a new worker spawn, so the job sits orphaned until the 60-second poller picks it up.

The old MVar-based workers were persistent and always looped back to takeMVar after any outcome. The new on-demand workers are ephemeral, so exiting means the job is lost until the poller runs.

Add runJobLoop call to the error branch so the worker retries after the 1-second backoff, matching how the poller handles errors.

Fixes amitaibu/ihp-sensors#18

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
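
For illustration only, the retry behaviour described above can be sketched as a standalone program that uses nothing beyond `base`. This is not the IHP code: `mkFetch` is a hypothetical stand-in for fetchNextJob that fails twice before yielding a single job, the `runJobLoop` signature here is invented for the sketch, and the assumptions that the loop continues after a job and stops on `Nothing` are mine, since the hunk below does not show those branches.

```haskell
{-# LANGUAGE ScopedTypeVariables #-}
module Main where

import Control.Concurrent (threadDelay)
import Control.Exception (ErrorCall (..), SomeException, throwIO, try)
import Data.IORef (atomicModifyIORef', modifyIORef', newIORef, readIORef)

-- Hypothetical stand-in for fetchNextJob: the first two calls throw a
-- "transient" error, the third returns a job, later calls return Nothing.
mkFetch :: IO (IO (Maybe String))
mkFetch = do
    calls <- newIORef (0 :: Int)
    pure $ do
        n <- atomicModifyIORef' calls (\c -> (c + 1, c))
        if n < 2
            then throwIO (ErrorCall "pool exhausted")
            else pure (if n == 2 then Just "job-1" else Nothing)

-- Simplified worker loop (assumed shape, not the real signature).
-- The point of the fix is the recursive call in the Left branch:
-- without it the worker would return after the backoff and the job
-- would wait for some other mechanism to pick it up.
runJobLoop :: Int -> IO (Maybe String) -> (String -> IO ()) -> IO ()
runJobLoop backoffMicros fetch runJob = loop
  where
    loop = do
        result <- try fetch
        case result of
            -- Catching SomeException keeps the sketch short; it also
            -- catches asynchronous exceptions, which real code may not want.
            Left (exception :: SomeException) -> do
                putStrLn ("Failed to fetch next job: " <> show exception)
                threadDelay backoffMicros -- backoff to avoid a tight error loop
                loop                      -- retry after transient error
            Right (Just job) -> do
                runJob job
                loop                      -- assumption: keep draining the queue
            Right Nothing -> pure ()      -- assumption: nothing left, worker exits

main :: IO ()
main = do
    fetch <- mkFetch
    done <- newIORef []
    -- 1 ms backoff so the demo finishes quickly; the commit uses 1 s.
    runJobLoop 1000 fetch (\job -> modifyIORef' done (job :))
    readIORef done >>= print
```

With the recursive call in the `Left` branch removed, the same program would return after the first failed fetch and never run the job, which is the behaviour this commit fixes.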
diff --git a/ihp/IHP/Job/Runner.hs b/ihp/IHP/Job/Runner.hs
@@ -171,6 +171,7 @@ jobWorkerFetchAndRunLoop JobWorkerArgs { .. } = do
                 Left exception -> do
                     Log.error ("Job worker: Failed to fetch next job: " <> tshow exception)
                     Concurrent.threadDelay 1000000  -- 1s backoff to avoid tight error loops
+                    runJobLoop -- retry after transient error
                 Right (Just job) -> do
                     Log.info ("Starting job: " <> tshow job)