Add exponential backoff and log suppression for scheduler disconnection #6

Merged: phillipleblanc merged 1 commit into spiceai-50 from phillip/260113-executor-backoff on Jan 13, 2026
@phillipleblanc

Summary

When the scheduler disconnects or becomes unavailable, the executor poll loop previously logged a WARN message every ~100ms, spamming logs during graceful shutdown scenarios (e.g., when shutting down the scheduler before the executor).

Before:

2025-12-15T23:02:44.486012Z  WARN ballista_executor::execution_loop: Executor poll work loop failed. If this continues to happen the Scheduler might be marked as dead. Error: status: Unavailable, message: "tcp connect error", details: [], metadata: MetadataMap { headers: {} }
2025-12-15T23:02:44.588781Z  WARN ballista_executor::execution_loop: Executor poll work loop failed...
[repeats every ~100ms]

After:

  • First 5 failures: WARN with attempt count
  • Subsequent failures: DEBUG level (reduces noise)
  • Exponential backoff from 100ms to 30s between retries
  • INFO log when connection is restored
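
The log-level behavior listed above can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: `LogLevel`, `ConnectionState`, `failure_log_level`, and `WARN_ATTEMPTS` are hypothetical names, and only the threshold of 5 and the WARN/DEBUG/INFO levels come from the description.

```rust
/// Hypothetical log levels used only for this sketch.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum LogLevel {
    Warn,
    Debug,
}

/// Number of consecutive failures logged at WARN (from the PR description).
const WARN_ATTEMPTS: u32 = 5;

/// Pick the level for the Nth consecutive failure (1-based).
fn failure_log_level(attempt: u32) -> LogLevel {
    if attempt <= WARN_ATTEMPTS {
        LogLevel::Warn
    } else {
        LogLevel::Debug
    }
}

/// Hypothetical tracker for consecutive scheduler connection failures.
#[derive(Default)]
struct ConnectionState {
    consecutive_failures: u32,
}

impl ConnectionState {
    /// Record a failed poll; returns the attempt count and the level to log at.
    fn on_failure(&mut self) -> (u32, LogLevel) {
        self.consecutive_failures = self.consecutive_failures.saturating_add(1);
        (
            self.consecutive_failures,
            failure_log_level(self.consecutive_failures),
        )
    }

    /// Record a successful poll; returns true when a "connection restored"
    /// INFO message should be emitted (that is, the loop was failing before).
    fn on_success(&mut self) -> bool {
        let was_failing = self.consecutive_failures > 0;
        self.consecutive_failures = 0;
        was_failing
    }
}
```

Under these assumptions, a poll loop would call `on_failure` on each error and log at the returned level, and call `on_success` after each successful poll to decide whether to log the restoration message.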

Changes

  • Add exponential backoff (100ms initial, 30s max) when scheduler connection fails
  • Reduce log level from WARN to DEBUG after 5 consecutive failures
  • Log restoration message when connection is re-established
  • Prevent log spam when the scheduler shuts down before the executor
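
As a rough illustration of the backoff schedule, the delay before each retry could be computed as below. This is a sketch under stated assumptions: the 100ms initial delay and 30s cap come from this PR, but the doubling factor and the `backoff_delay` helper are assumptions, since the description does not state the multiplier or show the implementation.

```rust
use std::time::Duration;

/// Initial and maximum retry delays, as stated in the PR description.
const INITIAL_BACKOFF: Duration = Duration::from_millis(100);
const MAX_BACKOFF: Duration = Duration::from_secs(30);

/// Hypothetical helper: delay to wait after the Nth consecutive failure
/// (1-based). Assumes the delay doubles on every failure, capped at 30s.
fn backoff_delay(consecutive_failures: u32) -> Duration {
    if consecutive_failures == 0 {
        // No failure yet, so no extra delay.
        return Duration::ZERO;
    }
    // Clamp the exponent so the shift below cannot overflow a u32.
    let exponent = (consecutive_failures - 1).min(16);
    let factor = 1u32 << exponent;
    INITIAL_BACKOFF.saturating_mul(factor).min(MAX_BACKOFF)
}
```

With doubling, the delays would run 100ms, 200ms, 400ms, and so on, reaching the 30s cap on the tenth consecutive failure. Resetting the failure count after a successful poll returns the loop to its normal cadence.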

@phillipleblanc phillipleblanc self-assigned this Jan 13, 2026
@phillipleblanc phillipleblanc force-pushed the phillip/260113-executor-backoff branch from 44ea3ba to e18eab0 Compare January 13, 2026 02:43
@phillipleblanc phillipleblanc force-pushed the phillip/260113-executor-backoff branch from e18eab0 to b9b273d Compare January 13, 2026 03:31
@phillipleblanc phillipleblanc added the bug Something isn't working label Jan 13, 2026
@phillipleblanc phillipleblanc requested a review from a team January 13, 2026 04:04
@phillipleblanc phillipleblanc merged commit b9b273d into spiceai-50 Jan 13, 2026
23 of 31 checks passed
@phillipleblanc phillipleblanc deleted the phillip/260113-executor-backoff branch January 13, 2026 04:58
phillipleblanc added a commit to spiceai/spiceai that referenced this pull request Jan 13, 2026
Updates datafusion-ballista fork to include fix for executor log spam when
scheduler disconnects before executor. The executor poll loop now uses
exponential backoff (100ms to 30s) and reduces log level from WARN to DEBUG
after 5 consecutive failures.

See: spiceai/datafusion-ballista#6
github-merge-queue Bot pushed a commit to spiceai/spiceai that referenced this pull request Jan 13, 2026
#8905)
lukekim pushed a commit to spiceai/spiceai that referenced this pull request Jan 15, 2026
#8905)