fix(csharp): add error resilience to heartbeat poller by msrathore-db · Pull Request #372 · adbc-drivers/databricks

msrathore-db · 2026-03-27T05:51:38Z

Summary

Wraps the PollOperationStatus polling loop in a try-catch so that transient exceptions (e.g. ObjectDisposedException from TLS connection recycling) no longer kill the heartbeat poller silently
Adds a max consecutive failure limit (MaxConsecutiveFailures = 10) so persistent errors (auth expired, server gone) don't cause infinite polling — at the default 60s heartbeat interval this gives ~10 minutes of tolerance before the poller stops itself
A single successful poll resets the failure counter, so intermittent transient errors are handled gracefully
Logs errors via Activity.Current?.AddEvent() telemetry with error type, message, poll count, and consecutive failure count
Properly handles OperationCanceledException from the cancellation token to still allow graceful shutdown
Updates the StopsPollingOnException test to ContinuesPollingOnException to match the new resilient behavior

Context

Without this fix, a single transient network error permanently stops the heartbeat poller. The server-side commandInactivityTimeout (default 20 minutes) then expires because no GetOperationStatus calls refresh it, causing the server to terminate the query. This manifests as CloudFetch failures in Power BI (ES-1778880).

Design decisions

Why not finally for Task.Delay? A finally block runs even on break, which would add an unnecessary 60s delay on every clean exit path (terminal state, cancellation, null handle). Placing the delay after the try-catch means it only executes when the loop continues.
Why 10 max failures? At 60s intervals, 10 failures = ~10 minutes — enough to ride out transient network issues but not so long that a permanently broken connection wastes resources indefinitely.
Request timeouts are treated as transient errors. When the per-request timeout token fires, GetOperationStatus throws OperationCanceledException, but the exception filter (when cancellationToken.IsCancellationRequested) evaluates to false because the poller's main token isn't cancelled. The exception therefore falls through to the general catch and counts as one failure.
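The design decisions above can be illustrated with a minimal sketch of the loop shape. This is not the driver's code: `ResilientPollerSketch`, `PollAsync`, and the `pollOnce` delegate are hypothetical stand-ins for the GetOperationStatus call, and stopping when the count reaches the limit (`>=`) is an assumption about the exact comparison. Only the `MaxConsecutiveFailures = 10` value, the exception filter, and the placement of the delay come from the description.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class ResilientPollerSketch
{
    // Same value as the PR's constant; everything else is a simplified stand-in.
    public const int MaxConsecutiveFailures = 10;

    // pollOnce (hypothetical) returns true once the operation is in a terminal state.
    // The method returns the number of poll attempts that were made.
    public static async Task<int> PollAsync(
        Func<CancellationToken, Task<bool>> pollOnce,
        TimeSpan interval,
        CancellationToken cancellationToken)
    {
        int attempts = 0;
        int consecutiveFailures = 0;

        while (!cancellationToken.IsCancellationRequested)
        {
            try
            {
                attempts++;
                bool terminal = await pollOnce(cancellationToken).ConfigureAwait(false);
                consecutiveFailures = 0; // a single success resets the streak
                if (terminal)
                {
                    break; // clean exit, no trailing delay
                }
            }
            catch (OperationCanceledException) when (cancellationToken.IsCancellationRequested)
            {
                break; // shutdown was requested
            }
            catch (Exception)
            {
                // Per-request timeouts land here too: their OperationCanceledException
                // does not match the filter above while the main token is not cancelled.
                consecutiveFailures++;
                if (consecutiveFailures >= MaxConsecutiveFailures)
                {
                    break; // persistent failure, stop polling
                }
            }

            // After the try-catch rather than in finally, so the break paths skip it.
            // If the token is cancelled during the delay, the exception propagates to the caller.
            await Task.Delay(interval, cancellationToken).ConfigureAwait(false);
        }

        return attempts;
    }
}
```

With a delegate that always throws, the sketch gives up after `MaxConsecutiveFailures` attempts. With one that throws a few times and then reports a terminal state, it exits on the first success without waiting for another interval.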

Test plan

Verify build succeeds (confirmed locally, 0 warnings, 0 errors)
Update StopsPollingOnException → ContinuesPollingOnException to assert pollCount > 1
Verify existing unit tests pass
Manual validation: inject a transient exception during polling and confirm the poller recovers and continues heartbeating
Verify cancellation still stops the poller gracefully
Verify persistent errors stop the poller after ~10 consecutive failures

This pull request was AI-assisted by Isaac.

…nt death The PollOperationStatus method had no try-catch, so any transient exception (e.g. ObjectDisposedException from TLS connection recycling on Mono) would kill the heartbeat poller silently. Without heartbeats, the server-side commandInactivityTimeout (20 min) expires and terminates the query, causing CloudFetch failures in Power BI. This wraps the polling logic in a try-catch that logs errors via Activity telemetry and continues polling. Part of ES-1778880. Co-authored-by: Isaac

eric-wang-1990 · 2026-03-27T16:25:28Z

csharp/src/Reader/DatabricksOperationStatusPoller.cs

+                    // Wait before retrying to avoid tight error loops
+                    try
+                    {
+                        await Task.Delay(TimeSpan.FromSeconds(_heartbeatIntervalSeconds), cancellationToken);


Can we move this to finally, so we do not duplicate the delay?

And we should not need a try-catch here inside the catch.

…d try-catch
- Move Task.Delay to after the try-catch block (shared by success and error paths)
- Remove nested try-catch inside the error handler
- On cancellation during delay, OperationCanceledException propagates to Dispose() which handles it
Co-authored-by: Isaac

eric-wang-1990 · 2026-03-27T21:00:18Z

csharp/src/Reader/DatabricksOperationStatusPoller.cs

        {
            while (!cancellationToken.IsCancellationRequested)
            {
-                TOperationHandle? operationHandle = _response.OperationHandle;


This is failing this test:
Failed AdbcDrivers.Databricks.Tests.Unit.DatabricksOperationStatusPollerTests.StopsPollingOnException

I think we need to think more about this change. Could it leave an unwanted poller running, which would prevent the connection from shutting down?

Updated the logic to avoid this scenario

Addressed with MaxConsecutiveFailures = 10. After 10 consecutive errors (~10 min at 60s interval), the poller stops itself. Also linked the timeout token with the cancellation token so Stop()/Dispose() immediately aborts any in-flight RPC — no more blocking on hung network calls.

The heartbeat poller now continues polling through transient exceptions instead of stopping. Update the test to assert this new behavior. Co-authored-by: Isaac

The previous commit accidentally removed the try-catch. Re-add error resilience using a finally block for Task.Delay per reviewer feedback, eliminating duplication and nested try-catch. Co-authored-by: Isaac

- Add MaxConsecutiveFailures=10 (~10min at 60s interval) so persistent errors (auth expired, server gone) don't cause infinite polling
- Move Task.Delay out of finally block to avoid unnecessary delay on break paths (terminal state, cancellation, null handle)
- Reset failure counter on successful poll so transient blips recover
Co-authored-by: Isaac

Copilot

Pull request overview

Improves resilience of the Databricks operation-status heartbeat poller so transient failures don’t permanently stop heartbeating (which can lead to server-side inactivity timeouts).

Changes:

Wrap polling loop body with exception handling, add telemetry events for poll errors, and enforce a max consecutive failure threshold.
Update the unit test to assert polling continues despite exceptions and ensure the poller is disposed via using.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| csharp/src/Reader/DatabricksOperationStatusPoller.cs | Adds try/catch resilience, consecutive-failure limiting, and Activity events for poll errors. |
| csharp/test/Unit/DatabricksOperationStatusPollerTests.cs | Renames/updates exception behavior test and disposes poller via `using`. |

Copilot · 2026-03-29T02:24:27Z

csharp/src/Reader/DatabricksOperationStatusPoller.cs

+                    CancellationToken GetOperationStatusTimeoutToken = ApacheUtility.GetCancellationToken(_requestTimeoutSeconds, ApacheUtility.TimeUnit.Seconds);

-                var request = new TGetOperationStatusReq(operationHandle);
-                var response = await _statement.Client.GetOperationStatus(request, GetOperationStatusTimeoutToken);
+                    var request = new TGetOperationStatusReq(operationHandle);
+                    var response = await _statement.Client.GetOperationStatus(request, GetOperationStatusTimeoutToken);



GetOperationStatus is invoked with a timeout-only token, so Stop()/Dispose() cancellation may not abort an in-flight request and can block up to _requestTimeoutSeconds. Consider linking the poller’s cancellationToken with the timeout token (as used elsewhere in the repo) so shutdown is prompt even during a hung network call.

Fixed. Using CancellationTokenSource.CreateLinkedTokenSource(cancellationToken) with CancelAfter so Stop()/Dispose() immediately aborts in-flight GetOperationStatus calls.
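A minimal sketch of the linking pattern described in this reply, assuming a hypothetical `rpc` delegate in place of the real GetOperationStatus call. `LinkedTimeoutSketch` and `CallWithTimeoutAsync` are illustrative names, not part of the driver. `CancellationTokenSource.CreateLinkedTokenSource` and `CancelAfter` are the standard .NET APIs named in the reply.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class LinkedTimeoutSketch
{
    // rpc (hypothetical) stands in for the GetOperationStatus call.
    public static async Task<T> CallWithTimeoutAsync<T>(
        Func<CancellationToken, Task<T>> rpc,
        TimeSpan requestTimeout,
        CancellationToken cancellationToken)
    {
        // The linked source is cancelled when the caller's token is cancelled
        // or when the per-request timeout elapses, whichever happens first.
        using (CancellationTokenSource timeoutCts =
            CancellationTokenSource.CreateLinkedTokenSource(cancellationToken))
        {
            timeoutCts.CancelAfter(requestTimeout);
            return await rpc(timeoutCts.Token).ConfigureAwait(false);
        }
    }
}
```

Either trigger surfaces as OperationCanceledException. A caller can tell them apart by checking `cancellationToken.IsCancellationRequested`, which is what the poller's exception filter relies on to separate shutdown from a request timeout.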

Copilot · 2026-03-29T02:24:27Z

csharp/src/Reader/DatabricksOperationStatusPoller.cs

+                    CancellationToken GetOperationStatusTimeoutToken = ApacheUtility.GetCancellationToken(_requestTimeoutSeconds, ApacheUtility.TimeUnit.Seconds);

-                var request = new TGetOperationStatusReq(operationHandle);
-                var response = await _statement.Client.GetOperationStatus(request, GetOperationStatusTimeoutToken);
+                    var request = new TGetOperationStatusReq(operationHandle);
+                    var response = await _statement.Client.GetOperationStatus(request, GetOperationStatusTimeoutToken);



Local variable naming and typing here deviates from repo guidelines: locals should be camelCase and the codebase prefers explicit types over var (see csharp/CODING_GUIDELINES.md). Rename GetOperationStatusTimeoutToken to a camelCase name and consider using explicit TGetOperationStatusReq/TGetOperationStatusResp types for request/response to match conventions.

Fixed. Renamed to camelCase (timeoutCts) and using explicit types TGetOperationStatusReq/TGetOperationStatusResp.

Copilot · 2026-03-29T02:24:27Z

csharp/src/Reader/DatabricksOperationStatusPoller.cs

+                    var request = new TGetOperationStatusReq(operationHandle);
+                    var response = await _statement.Client.GetOperationStatus(request, GetOperationStatusTimeoutToken);



This library code awaits without .ConfigureAwait(false), which is a stated guideline in csharp/CODING_GUIDELINES.md and is used widely across the reader code. Consider adding .ConfigureAwait(false) to the await _statement.Client.GetOperationStatus(...) call.

Fixed. Added .ConfigureAwait(false) to both awaits.

Copilot · 2026-03-29T02:24:27Z

csharp/src/Reader/DatabricksOperationStatusPoller.cs

+        // Maximum number of consecutive poll failures before giving up.
+        // At the default 60s heartbeat interval this allows ~10 minutes of transient errors
+        // before the poller stops itself.
+        private const int MaxConsecutiveFailures = 10;
+


MaxConsecutiveFailures changes runtime behavior (poller stops after N consecutive errors), but there’s no unit test asserting (1) polling stops after the threshold and (2) a single successful poll resets the failure counter. Adding coverage for those cases would prevent regressions in the resilience logic.

Good suggestion — will add in a follow-up. The existing 8 poller tests all pass with the current changes.
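As a rough illustration of the two cases the review comment asks to cover, the counter logic can be modelled in isolation. `ConsecutiveFailureTracker` is a hypothetical class written for this sketch; the driver keeps this state inside its polling loop, and the `>=` comparison is an assumption.

```csharp
using System;

// Hypothetical model of the failure counter, not a type from the driver.
public sealed class ConsecutiveFailureTracker
{
    private readonly int _maxConsecutiveFailures;
    private int _consecutiveFailures;

    public ConsecutiveFailureTracker(int maxConsecutiveFailures)
    {
        _maxConsecutiveFailures = maxConsecutiveFailures;
    }

    public int ConsecutiveFailures => _consecutiveFailures;

    // A successful poll clears the streak.
    public void RecordSuccess() => _consecutiveFailures = 0;

    // Returns true when the streak has reached the limit and polling should stop.
    public bool RecordFailure()
    {
        _consecutiveFailures++;
        return _consecutiveFailures >= _maxConsecutiveFailures;
    }
}
```

A test along these lines would record nine failures, one success, and then ten more failures, expecting `RecordFailure` to return true only on the last call.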

Copilot · 2026-03-29T02:24:28Z

csharp/src/Reader/DatabricksOperationStatusPoller.cs

+
+                // Wait before next poll. On cancellation this throws OperationCanceledException
+                // which propagates up to the caller (Dispose catches it).
+                await Task.Delay(TimeSpan.FromSeconds(_heartbeatIntervalSeconds), cancellationToken);


This library code awaits without .ConfigureAwait(false), which is a stated guideline in csharp/CODING_GUIDELINES.md. Consider adding .ConfigureAwait(false) to the await Task.Delay(...) call as well to avoid capturing a synchronization context.

Suggested change

-                await Task.Delay(TimeSpan.FromSeconds(_heartbeatIntervalSeconds), cancellationToken);
+                await Task.Delay(TimeSpan.FromSeconds(_heartbeatIntervalSeconds), cancellationToken).ConfigureAwait(false);

eric-wang-1990

LGTM, please review comments from copilot

…ConfigureAwait
- Link poller cancellation token with request timeout token so Stop()/Dispose() immediately aborts in-flight GetOperationStatus calls (Copilot feedback)
- Rename GetOperationStatusTimeoutToken to camelCase (Copilot feedback)
- Use explicit types for TGetOperationStatusReq/Resp (Copilot feedback)
- Add .ConfigureAwait(false) to all awaits (Copilot feedback)
Co-authored-by: Isaac

eric-wang-1990 reviewed Mar 27, 2026

eric-wang-1990 approved these changes Mar 27, 2026

eric-wang-1990 added this pull request to the merge queue Mar 27, 2026

github-merge-queue bot removed this pull request from the merge queue due to failed status checks Mar 27, 2026

eric-wang-1990 added this pull request to the merge queue Mar 27, 2026

github-merge-queue bot removed this pull request from the merge queue due to failed status checks Mar 27, 2026

eric-wang-1990 reviewed Mar 27, 2026

msrathore-db added 3 commits March 28, 2026 12:29

test(csharp): update poller test to expect resilience on exception

4229f55

The heartbeat poller now continues polling through transient exceptions instead of stopping. Update the test to assert this new behavior. Co-authored-by: Isaac

fix(csharp): re-add error resilience with finally-based delay

af73052

The previous commit accidentally removed the try-catch. Re-add error resilience using a finally block for Task.Delay per reviewer feedback, eliminating duplication and nested try-catch. Co-authored-by: Isaac

eric-wang-1990 requested a review from Copilot March 29, 2026 02:20

Copilot started reviewing on behalf of eric-wang-1990 March 29, 2026 02:21

Copilot AI reviewed Mar 29, 2026

eric-wang-1990 self-requested a review March 29, 2026 03:04

eric-wang-1990 reviewed Mar 29, 2026

msrathore-db added this pull request to the merge queue Mar 29, 2026

Merged via the queue into main with commit 87e5223 Mar 29, 2026
16 checks passed

msrathore-db deleted the fix/heartbeat-poller-error-resilience branch March 29, 2026 19:45
