fix(csharp): add error resilience to heartbeat poller#372
Conversation
…nt death The PollOperationStatus method had no try-catch, so any transient exception (e.g. ObjectDisposedException from TLS connection recycling on Mono) would kill the heartbeat poller silently. Without heartbeats, the server-side commandInactivityTimeout (20 min) expires and terminates the query, causing CloudFetch failures in Power BI. This wraps the polling logic in a try-catch that logs errors via Activity telemetry and continues polling. Part of ES-1778880. Co-authored-by: Isaac
| // Wait before retrying to avoid tight error loops | ||
| try | ||
| { | ||
| await Task.Delay(TimeSpan.FromSeconds(_heartbeatIntervalSeconds), cancellationToken); |
There was a problem hiding this comment.
Can we move this to finally? so we do not have duplicate
There was a problem hiding this comment.
And we should not need try catch here inside the catch
…d try-catch - Move Task.Delay to after the try-catch block (shared by success and error paths) - Remove nested try-catch inside the error handler - On cancellation during delay, OperationCanceledException propagates to Dispose() which handles it Co-authored-by: Isaac
| { | ||
| while (!cancellationToken.IsCancellationRequested) | ||
| { | ||
| TOperationHandle? operationHandle = _response.OperationHandle; |
There was a problem hiding this comment.
This is failing this test:
Failed AdbcDrivers.Databricks.Tests.Unit.DatabricksOperationStatusPollerTests.StopsPollingOnException
There was a problem hiding this comment.
I think we need to think more about this change, would it cause any unwanted poller keep running which prevent the connection from shutting down?
There was a problem hiding this comment.
Updated the logic to avoid this scenario
There was a problem hiding this comment.
Addressed with MaxConsecutiveFailures = 10. After 10 consecutive errors (~10 min at 60s interval), the poller stops itself. Also linked the timeout token with the cancellation token so Stop()/Dispose() immediately aborts any in-flight RPC — no more blocking on hung network calls.
The heartbeat poller now continues polling through transient exceptions instead of stopping. Update the test to assert this new behavior. Co-authored-by: Isaac
The previous commit accidentally removed the try-catch. Re-add error resilience using a finally block for Task.Delay per reviewer feedback, eliminating duplication and nested try-catch. Co-authored-by: Isaac
- Add MaxConsecutiveFailures=10 (~10min at 60s interval) so persistent errors (auth expired, server gone) don't cause infinite polling - Move Task.Delay out of finally block to avoid unnecessary delay on break paths (terminal state, cancellation, null handle) - Reset failure counter on successful poll so transient blips recover Co-authored-by: Isaac
There was a problem hiding this comment.
Pull request overview
Improves resilience of the Databricks operation-status heartbeat poller so transient failures don’t permanently stop heartbeating (which can lead to server-side inactivity timeouts).
Changes:
- Wrap polling loop body with exception handling, add telemetry events for poll errors, and enforce a max consecutive failure threshold.
- Update the unit test to assert polling continues despite exceptions and ensure the poller is disposed via
using.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| csharp/src/Reader/DatabricksOperationStatusPoller.cs | Adds try/catch resilience, consecutive-failure limiting, and Activity events for poll errors. |
| csharp/test/Unit/DatabricksOperationStatusPollerTests.cs | Renames/updates exception behavior test and disposes poller via using. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| CancellationToken GetOperationStatusTimeoutToken = ApacheUtility.GetCancellationToken(_requestTimeoutSeconds, ApacheUtility.TimeUnit.Seconds); | ||
|
|
||
| var request = new TGetOperationStatusReq(operationHandle); | ||
| var response = await _statement.Client.GetOperationStatus(request, GetOperationStatusTimeoutToken); | ||
| var request = new TGetOperationStatusReq(operationHandle); | ||
| var response = await _statement.Client.GetOperationStatus(request, GetOperationStatusTimeoutToken); | ||
|
|
There was a problem hiding this comment.
GetOperationStatus is invoked with a timeout-only token, so Stop()/Dispose() cancellation may not abort an in-flight request and can block up to _requestTimeoutSeconds. Consider linking the poller’s cancellationToken with the timeout token (as used elsewhere in the repo) so shutdown is prompt even during a hung network call.
There was a problem hiding this comment.
Fixed. Using CancellationTokenSource.CreateLinkedTokenSource(cancellationToken) with CancelAfter so Stop()/Dispose() immediately aborts in-flight GetOperationStatus calls.
| CancellationToken GetOperationStatusTimeoutToken = ApacheUtility.GetCancellationToken(_requestTimeoutSeconds, ApacheUtility.TimeUnit.Seconds); | ||
|
|
||
| var request = new TGetOperationStatusReq(operationHandle); | ||
| var response = await _statement.Client.GetOperationStatus(request, GetOperationStatusTimeoutToken); | ||
| var request = new TGetOperationStatusReq(operationHandle); | ||
| var response = await _statement.Client.GetOperationStatus(request, GetOperationStatusTimeoutToken); | ||
|
|
There was a problem hiding this comment.
Local variable naming and typing here deviates from repo guidelines: locals should be camelCase and the codebase prefers explicit types over var (see csharp/CODING_GUIDELINES.md). Rename GetOperationStatusTimeoutToken to a camelCase name and consider using explicit TGetOperationStatusReq/TGetOperationStatusResp types for request/response to match conventions.
There was a problem hiding this comment.
Fixed. Renamed to camelCase (timeoutCts) and using explicit types TGetOperationStatusReq/TGetOperationStatusResp.
| var request = new TGetOperationStatusReq(operationHandle); | ||
| var response = await _statement.Client.GetOperationStatus(request, GetOperationStatusTimeoutToken); | ||
|
|
There was a problem hiding this comment.
This library code awaits without .ConfigureAwait(false), which is a stated guideline in csharp/CODING_GUIDELINES.md and is used widely across the reader code. Consider adding .ConfigureAwait(false) to the await _statement.Client.GetOperationStatus(...) call.
There was a problem hiding this comment.
Fixed. Added .ConfigureAwait(false) to both awaits.
| // Maximum number of consecutive poll failures before giving up. | ||
| // At the default 60s heartbeat interval this allows ~10 minutes of transient errors | ||
| // before the poller stops itself. | ||
| private const int MaxConsecutiveFailures = 10; | ||
|
|
There was a problem hiding this comment.
MaxConsecutiveFailures changes runtime behavior (poller stops after N consecutive errors), but there’s no unit test asserting (1) polling stops after the threshold and (2) a single successful poll resets the failure counter. Adding coverage for those cases would prevent regressions in the resilience logic.
There was a problem hiding this comment.
Good suggestion — will add in a follow-up. The existing 8 poller tests all pass with the current changes.
|
|
||
| // Wait before next poll. On cancellation this throws OperationCanceledException | ||
| // which propagates up to the caller (Dispose catches it). | ||
| await Task.Delay(TimeSpan.FromSeconds(_heartbeatIntervalSeconds), cancellationToken); |
There was a problem hiding this comment.
This library code awaits without .ConfigureAwait(false), which is a stated guideline in csharp/CODING_GUIDELINES.md. Consider adding .ConfigureAwait(false) to the await Task.Delay(...) call as well to avoid capturing a synchronization context.
| await Task.Delay(TimeSpan.FromSeconds(_heartbeatIntervalSeconds), cancellationToken); | |
| await Task.Delay(TimeSpan.FromSeconds(_heartbeatIntervalSeconds), cancellationToken).ConfigureAwait(false); |
eric-wang-1990
left a comment
There was a problem hiding this comment.
LGTM, please review comments from copilot
…ConfigureAwait - Link poller cancellation token with request timeout token so Stop()/Dispose() immediately aborts in-flight GetOperationStatus calls (Copilot feedback) - Rename GetOperationStatusTimeoutToken to camelCase (Copilot feedback) - Use explicit types for TGetOperationStatusReq/Resp (Copilot feedback) - Add .ConfigureAwait(false) to all awaits (Copilot feedback) Co-authored-by: Isaac
Summary
PollOperationStatuspolling loop in a try-catch so that transient exceptions (e.g.ObjectDisposedExceptionfrom TLS connection recycling) no longer kill the heartbeat poller silentlyMaxConsecutiveFailures = 10) so persistent errors (auth expired, server gone) don't cause infinite polling — at the default 60s heartbeat interval this gives ~10 minutes of tolerance before the poller stops itselfActivity.Current?.AddEvent()telemetry with error type, message, poll count, and consecutive failure countOperationCanceledExceptionfrom the cancellation token to still allow graceful shutdownStopsPollingOnExceptiontest toContinuesPollingOnExceptionto match the new resilient behaviorContext
Without this fix, a single transient network error permanently stops the heartbeat poller. The server-side
commandInactivityTimeout(default 20 minutes) then expires because noGetOperationStatuscalls refresh it, causing the server to terminate the query. This manifests as CloudFetch failures in Power BI (ES-1778880).Design decisions
finallyforTask.Delay? Afinallyblock runs even onbreak, which would add an unnecessary 60s delay on every clean exit path (terminal state, cancellation, null handle). Placing the delay after the try-catch means it only executes when the loop continues.GetOperationStatusTimeoutTokenthrowsOperationCanceledExceptionbut the cancellation filter (when cancellationToken.IsCancellationRequested) correctly routes it to the general catch since the main token isn't cancelled.Test plan
StopsPollingOnException→ContinuesPollingOnExceptionto assertpollCount > 1This pull request was AI-assisted by Isaac.