fix(csharp): add error resilience to heartbeat poller #372

Merged: msrathore-db merged 6 commits into main from fix/heartbeat-poller-error-resilience on Mar 29, 2026

@msrathore-db commented Mar 27, 2026

Summary

  • Wraps the PollOperationStatus polling loop in a try-catch so that transient exceptions (e.g. ObjectDisposedException from TLS connection recycling) no longer kill the heartbeat poller silently
  • Adds a max consecutive failure limit (MaxConsecutiveFailures = 10) so persistent errors (auth expired, server gone) don't cause infinite polling — at the default 60s heartbeat interval this gives ~10 minutes of tolerance before the poller stops itself
  • A single successful poll resets the failure counter, so intermittent transient errors are handled gracefully
  • Logs errors via Activity.Current?.AddEvent() telemetry with error type, message, poll count, and consecutive failure count
  • Properly handles OperationCanceledException from the cancellation token to still allow graceful shutdown
  • Updates the StopsPollingOnException test to ContinuesPollingOnException to match the new resilient behavior
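
The behavior summarized above can be sketched roughly as follows. This is a minimal illustration, not the merged code: `ResilientPoller`, `RunAsync`, and the `pollOnce` delegate are hypothetical names standing in for the real poller and its GetOperationStatus call, and the `>=` comparison against the limit is an assumption.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class ResilientPoller
{
    // Mirrors the limit named in the PR description.
    public const int MaxConsecutiveFailures = 10;

    // pollOnce is a hypothetical stand-in for the status RPC; it returns
    // true once the operation has reached a terminal state.
    public static async Task<int> RunAsync(
        Func<CancellationToken, Task<bool>> pollOnce,
        TimeSpan interval,
        CancellationToken cancellationToken)
    {
        int consecutiveFailures = 0;
        int pollCount = 0;

        while (!cancellationToken.IsCancellationRequested)
        {
            try
            {
                pollCount++;
                bool terminal = await pollOnce(cancellationToken).ConfigureAwait(false);
                consecutiveFailures = 0; // a single success resets the counter
                if (terminal)
                {
                    break;
                }
            }
            catch (OperationCanceledException) when (cancellationToken.IsCancellationRequested)
            {
                break; // graceful shutdown requested by the caller
            }
            catch (Exception)
            {
                consecutiveFailures++;
                if (consecutiveFailures >= MaxConsecutiveFailures)
                {
                    break; // persistent failure: stop heartbeating
                }
            }

            // The delay sits after the try-catch, so the break paths skip it.
            await Task.Delay(interval, cancellationToken).ConfigureAwait(false);
        }

        return pollCount;
    }
}
```

Under these assumptions, a delegate that always throws is called exactly `MaxConsecutiveFailures` times before the loop exits, while a delegate that fails intermittently keeps the loop alive because each success resets the counter.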

Context

Without this fix, a single transient network error permanently stops the heartbeat poller. The server-side commandInactivityTimeout (default 20 minutes) then expires because no GetOperationStatus calls refresh it, causing the server to terminate the query. This manifests as CloudFetch failures in Power BI (ES-1778880).

Design decisions

  • Why not finally for Task.Delay? A finally block runs even on break, which would add an unnecessary 60s delay on every clean exit path (terminal state, cancellation, null handle). Placing the delay after the try-catch means it only executes when the loop continues.
  • Why 10 max failures? At 60s intervals, 10 failures = ~10 minutes — enough to ride out transient network issues but not so long that a permanently broken connection wastes resources indefinitely.
  • Request timeouts are treated as transient errors. The per-request GetOperationStatusTimeoutToken throws OperationCanceledException when it expires, but the exception filter (when cancellationToken.IsCancellationRequested) evaluates to false because the main token isn't cancelled, so the exception falls through to the general catch and counts as a failed poll.
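
Two of the design decisions above rest on C# language behavior that is easy to check in isolation. The sketch below uses hypothetical helper names (`DesignDecisionDemo`, `FinallyRunsOnBreak`, `Classify`) and is not taken from the PR; it only demonstrates that a finally block runs when a loop exits via break, and that an exception filter on the poller's token separates shutdown from a per-request timeout.

```csharp
using System;
using System.Threading;

public static class DesignDecisionDemo
{
    // Counts how many times the finally block runs when the loop
    // leaves the try via break on its first iteration.
    public static int FinallyRunsOnBreak()
    {
        int finallyRuns = 0;
        while (true)
        {
            try
            {
                break; // clean exit path
            }
            finally
            {
                finallyRuns++; // still executes, so a delay here would also run
            }
        }
        return finallyRuns;
    }

    // Classifies an exception the way the filter described above would.
    public static string Classify(Exception ex, CancellationToken pollerToken)
    {
        try
        {
            throw ex;
        }
        catch (OperationCanceledException) when (pollerToken.IsCancellationRequested)
        {
            return "shutdown"; // the poller itself was cancelled
        }
        catch (Exception)
        {
            return "transient"; // includes request timeouts
        }
    }
}
```

With an uncancelled poller token, `Classify(new OperationCanceledException(), CancellationToken.None)` lands in the general catch, which is why a request timeout counts toward the consecutive failure limit rather than ending the loop.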

Test plan

  • Verify build succeeds (confirmed locally, 0 warnings, 0 errors)
  • Rename StopsPollingOnException to ContinuesPollingOnException and assert pollCount > 1
  • Verify existing unit tests pass
  • Manual validation: inject a transient exception during polling and confirm the poller recovers and continues heartbeating
  • Verify cancellation still stops the poller gracefully
  • Verify persistent errors stop the poller after ~10 consecutive failures

This pull request was AI-assisted by Isaac.

…nt death

The PollOperationStatus method had no try-catch, so any transient exception
(e.g. ObjectDisposedException from TLS connection recycling on Mono) would
kill the heartbeat poller silently. Without heartbeats, the server-side
commandInactivityTimeout (20 min) expires and terminates the query, causing
CloudFetch failures in Power BI. This wraps the polling logic in a try-catch
that logs errors via Activity telemetry and continues polling.

Part of ES-1778880.

Co-authored-by: Isaac
// Wait before retrying to avoid tight error loops
try
{
    await Task.Delay(TimeSpan.FromSeconds(_heartbeatIntervalSeconds), cancellationToken);

Collaborator

Can we move this to a finally block so that we do not duplicate the delay?

Collaborator

And we should not need a try-catch here inside the catch.

…d try-catch

- Move Task.Delay to after the try-catch block (shared by success and error paths)
- Remove nested try-catch inside the error handler
- On cancellation during delay, OperationCanceledException propagates to Dispose() which handles it

Co-authored-by: Isaac
@eric-wang-1990 added this pull request to the merge queue Mar 27, 2026
github-merge-queue bot removed this pull request from the merge queue due to failed status checks Mar 27, 2026
@eric-wang-1990 added this pull request to the merge queue Mar 27, 2026
github-merge-queue bot removed this pull request from the merge queue due to failed status checks Mar 27, 2026
{
    while (!cancellationToken.IsCancellationRequested)
    {
        TOperationHandle? operationHandle = _response.OperationHandle;

Collaborator

This is failing this test:
Failed AdbcDrivers.Databricks.Tests.Unit.DatabricksOperationStatusPollerTests.StopsPollingOnException

Collaborator

I think we need to think more about this change. Could it leave an unwanted poller running that prevents the connection from shutting down?

Collaborator Author

Updated the logic to avoid this scenario

Collaborator Author

Addressed with MaxConsecutiveFailures = 10. After 10 consecutive errors (~10 min at 60s interval), the poller stops itself. Also linked the timeout token with the cancellation token so Stop()/Dispose() immediately aborts any in-flight RPC — no more blocking on hung network calls.

The heartbeat poller now continues polling through transient exceptions
instead of stopping. Update the test to assert this new behavior.

Co-authored-by: Isaac
The previous commit accidentally removed the try-catch. Re-add error
resilience using a finally block for Task.Delay per reviewer feedback,
eliminating duplication and nested try-catch.

Co-authored-by: Isaac
- Add MaxConsecutiveFailures=10 (~10min at 60s interval) so persistent
  errors (auth expired, server gone) don't cause infinite polling
- Move Task.Delay out of finally block to avoid unnecessary delay on
  break paths (terminal state, cancellation, null handle)
- Reset failure counter on successful poll so transient blips recover

Co-authored-by: Isaac

Copilot AI left a comment

Pull request overview

Improves resilience of the Databricks operation-status heartbeat poller so transient failures don’t permanently stop heartbeating (which can lead to server-side inactivity timeouts).

Changes:

  • Wrap polling loop body with exception handling, add telemetry events for poll errors, and enforce a max consecutive failure threshold.
  • Update the unit test to assert polling continues despite exceptions and ensure the poller is disposed via using.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

File | Description
csharp/src/Reader/DatabricksOperationStatusPoller.cs | Adds try/catch resilience, consecutive-failure limiting, and Activity events for poll errors.
csharp/test/Unit/DatabricksOperationStatusPollerTests.cs | Renames/updates exception behavior test and disposes poller via using.

Comment on lines 99 to 103
CancellationToken GetOperationStatusTimeoutToken = ApacheUtility.GetCancellationToken(_requestTimeoutSeconds, ApacheUtility.TimeUnit.Seconds);

var request = new TGetOperationStatusReq(operationHandle);
var response = await _statement.Client.GetOperationStatus(request, GetOperationStatusTimeoutToken);

Copilot AI Mar 29, 2026

GetOperationStatus is invoked with a timeout-only token, so Stop()/Dispose() cancellation may not abort an in-flight request and can block up to _requestTimeoutSeconds. Consider linking the poller’s cancellationToken with the timeout token (as used elsewhere in the repo) so shutdown is prompt even during a hung network call.

Collaborator Author

Fixed. Using CancellationTokenSource.CreateLinkedTokenSource(cancellationToken) with CancelAfter so Stop()/Dispose() immediately aborts in-flight GetOperationStatus calls.
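
For readers unfamiliar with the pattern named in this reply, a minimal sketch follows. It is an illustration under assumptions, not the merged code: `LinkedTimeoutDemo`, `CallWithTimeoutAsync`, and the `rpc` delegate are hypothetical, and only `CancellationTokenSource.CreateLinkedTokenSource` and `CancelAfter` come from the comment above.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class LinkedTimeoutDemo
{
    // rpc is a hypothetical stand-in for the in-flight GetOperationStatus call.
    public static async Task<string> CallWithTimeoutAsync(
        Func<CancellationToken, Task> rpc,
        TimeSpan requestTimeout,
        CancellationToken pollerToken)
    {
        // The linked source is cancelled when either the poller token fires
        // or the per-request timeout elapses, whichever comes first.
        using (CancellationTokenSource timeoutCts =
            CancellationTokenSource.CreateLinkedTokenSource(pollerToken))
        {
            timeoutCts.CancelAfter(requestTimeout);

            try
            {
                await rpc(timeoutCts.Token).ConfigureAwait(false);
                return "completed";
            }
            catch (OperationCanceledException) when (pollerToken.IsCancellationRequested)
            {
                return "stopped"; // Stop()/Dispose() aborted the call
            }
            catch (OperationCanceledException)
            {
                return "timed out"; // only the request timeout fired
            }
        }
    }
}
```

Because the request observes a single token that covers both conditions, a hung call no longer has to wait out the full request timeout before shutdown can proceed, and the filter on the poller token still tells the two cases apart.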

Comment on lines 99 to 103
CancellationToken GetOperationStatusTimeoutToken = ApacheUtility.GetCancellationToken(_requestTimeoutSeconds, ApacheUtility.TimeUnit.Seconds);

var request = new TGetOperationStatusReq(operationHandle);
var response = await _statement.Client.GetOperationStatus(request, GetOperationStatusTimeoutToken);

Copilot AI Mar 29, 2026

Local variable naming and typing here deviates from repo guidelines: locals should be camelCase and the codebase prefers explicit types over var (see csharp/CODING_GUIDELINES.md). Rename GetOperationStatusTimeoutToken to a camelCase name and consider using explicit TGetOperationStatusReq/TGetOperationStatusResp types for request/response to match conventions.

Collaborator Author

Fixed. Renamed to camelCase (timeoutCts) and using explicit types TGetOperationStatusReq/TGetOperationStatusResp.

Comment on lines 101 to 103
var request = new TGetOperationStatusReq(operationHandle);
var response = await _statement.Client.GetOperationStatus(request, GetOperationStatusTimeoutToken);

Copilot AI Mar 29, 2026

This library code awaits without .ConfigureAwait(false), which is a stated guideline in csharp/CODING_GUIDELINES.md and is used widely across the reader code. Consider adding .ConfigureAwait(false) to the await _statement.Client.GetOperationStatus(...) call.

Collaborator Author

Fixed. Added .ConfigureAwait(false) to both awaits.

Comment on lines +52 to +56
// Maximum number of consecutive poll failures before giving up.
// At the default 60s heartbeat interval this allows ~10 minutes of transient errors
// before the poller stops itself.
private const int MaxConsecutiveFailures = 10;

Copilot AI Mar 29, 2026

MaxConsecutiveFailures changes runtime behavior (poller stops after N consecutive errors), but there’s no unit test asserting (1) polling stops after the threshold and (2) a single successful poll resets the failure counter. Adding coverage for those cases would prevent regressions in the resilience logic.

Collaborator Author

Good suggestion — will add in a follow-up. The existing 8 poller tests all pass with the current changes.


// Wait before next poll. On cancellation this throws OperationCanceledException
// which propagates up to the caller (Dispose catches it).
await Task.Delay(TimeSpan.FromSeconds(_heartbeatIntervalSeconds), cancellationToken);

Copilot AI Mar 29, 2026

This library code awaits without .ConfigureAwait(false), which is a stated guideline in csharp/CODING_GUIDELINES.md. Consider adding .ConfigureAwait(false) to the await Task.Delay(...) call as well to avoid capturing a synchronization context.

Suggested change
- await Task.Delay(TimeSpan.FromSeconds(_heartbeatIntervalSeconds), cancellationToken);
+ await Task.Delay(TimeSpan.FromSeconds(_heartbeatIntervalSeconds), cancellationToken).ConfigureAwait(false);

Collaborator Author

Fixed.

@eric-wang-1990 self-requested a review March 29, 2026 03:04

@eric-wang-1990 (Collaborator) left a comment

LGTM, please review the comments from Copilot.

…ConfigureAwait

- Link poller cancellation token with request timeout token so Stop()/Dispose()
  immediately aborts in-flight GetOperationStatus calls (Copilot feedback)
- Rename GetOperationStatusTimeoutToken to camelCase (Copilot feedback)
- Use explicit types for TGetOperationStatusReq/Resp (Copilot feedback)
- Add .ConfigureAwait(false) to all awaits (Copilot feedback)

Co-authored-by: Isaac
@msrathore-db added this pull request to the merge queue Mar 29, 2026
Merged via the queue into main with commit 87e5223 Mar 29, 2026
16 checks passed
@msrathore-db deleted the fix/heartbeat-poller-error-resilience branch March 29, 2026 19:45