
feat(csharp): convert cloudfetch retry from limit by # of retries to limit by time#375

Merged
eric-wang-1990 merged 12 commits into main from fix/cloudfetch-transport-error-retry
Mar 29, 2026
Conversation

Collaborator

@eric-wang-1990 eric-wang-1990 commented Mar 28, 2026

Summary

  • Replace fixed 3-attempt retry with 5-minute time-budget approach using exponential backoff with jitter (matching RetryHttpHandler's pattern)
  • Retry all exceptions (502 Bad Gateway from proxy, connection drops, TCP resets, HttpClient.Timeout) — catch-all, not just transport errors
  • Add new adbc.databricks.cloudfetch.retry_timeout_seconds parameter (default 300s / 5 min)
  • Keep adbc.databricks.cloudfetch.max_retries parameter for PowerBI backward compatibility (default -1 = not set). When set, retry loop exits if either max retries or timeout is reached first
  • Add detailed retry tracing (attempt count, backoff delay, timeout budget remaining)

Problem

CloudFetch downloads used a fixed 3-retry loop with linear backoff (~1.5s total). When a proxy returns 502 or a firewall blocks cloud storage, all three retries are exhausted within seconds — not enough time for transient issues to resolve.

Solution

Switch to a time-budget approach (same as RetryHttpHandler from #373): keep retrying with exponential backoff + jitter until the 5-minute budget is exhausted. This gives transient issues (proxy errors, firewall rules, connection drops) enough time to resolve.

| | Before | After |
|---|---|---|
| Retry limit | 3 attempts | 5-minute time budget (+ optional max_retries) |
| Backoff | Linear (500ms, 1s, 1.5s) | Exponential with jitter (500ms → 1s → 2s → ... → 32s cap) |
| Total retry window | ~3 seconds | Up to 5 minutes |
| Error classification | Retry all exceptions | Retry all exceptions (same) |
| max_retries param | Required (default 3) | Optional (default -1 = not set) |
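The before/after behavior above can be sketched as a single retry loop. This is an illustrative sketch, not the driver's actual code: the delegate name `downloadOnceAsync`, the jitter range, and the parameter defaults are assumptions; only the 500ms→32s exponential shape, the 300s default budget, and the `-1 = not set` max_retries semantics come from the PR description.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Sketch of a time-budget retry loop with exponential backoff + jitter.
// maxRetries: -1 = not set (timeout only); >= 0 = also cap total attempts.
static async Task<byte[]> DownloadWithRetryAsync(
    Func<Task<byte[]>> downloadOnceAsync,
    int retryTimeoutSeconds = 300,   // adbc.databricks.cloudfetch.retry_timeout_seconds
    int maxRetries = -1,             // adbc.databricks.cloudfetch.max_retries
    CancellationToken cancellationToken = default)
{
    const int InitialBackoffMs = 500;
    const int MaxBackoffMs = 32_000;
    long retryTimeoutMs = (long)retryTimeoutSeconds * 1000; // long: avoid int overflow
    long cumulativeBackoffMs = 0;                           // budget tracks backoff sleep
    long currentBackoffMs = InitialBackoffMs;
    int attemptCount = 0;
    Exception lastException = null;
    var random = new Random(); // netstandard2.0 has no Random.Shared

    while (true)
    {
        attemptCount++;
        try
        {
            return await downloadOnceAsync().ConfigureAwait(false);
        }
        catch (Exception ex)
        {
            lastException = ex;
            bool retriesExhausted = maxRetries >= 0 && attemptCount >= maxRetries;
            bool budgetExhausted = cumulativeBackoffMs >= retryTimeoutMs;
            if (retriesExhausted || budgetExhausted)
            {
                // Note: no inner exception and no raw URL, so pre-signed
                // URL credentials cannot leak into logs/telemetry.
                throw new InvalidOperationException(
                    $"Download failed after {attemptCount} attempts. " +
                    $"Last error: {lastException?.GetType().Name ?? "unknown"}.");
            }
        }

        // Exponential backoff with jitter, capped at 32s.
        int delayMs = (int)Math.Min(currentBackoffMs, MaxBackoffMs) + random.Next(0, 250);
        await Task.Delay(delayMs, cancellationToken).ConfigureAwait(false);
        cumulativeBackoffMs += delayMs;
        currentBackoffMs = Math.Min(currentBackoffMs * 2, MaxBackoffMs);
    }
}
```

Per-attempt duration is bounded separately (e.g. by `HttpClient.Timeout`), which is why the budget here only sums backoff sleep, matching the RetryHttpHandler convention discussed later in this thread.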

Test plan

  • All 9 CloudFetch downloader unit tests pass
  • Full build succeeds (netstandard2.0 + net8.0)
  • Manual test: block cloud storage via MITM firewall, verify retries continue for 5 min
  • Manual test: unblock within 5 min, verify download resumes successfully

This pull request was AI-assisted by Isaac.

- Extract shared TransportErrorHelper from RetryHttpHandler for reuse
- Replace fixed 3-attempt retry with 5-minute time-budget approach
  using exponential backoff with jitter (matching RetryHttpHandler)
- Retry all transient errors (502, connection drops, timeouts) not
  just transport errors
- Replace adbc.databricks.cloudfetch.max_retries parameter with
  adbc.databricks.cloudfetch.retry_timeout_seconds (default 300s)
- Add detailed retry tracing (attempt count, backoff, timeout budget)

Co-authored-by: Isaac
Keep the PR scope minimal: only the CloudFetch retry-count-to-time-budget change.
RetryHttpHandler remains unchanged from main.

Co-authored-by: Isaac
…loudFetch

Restore adbc.databricks.cloudfetch.max_retries parameter (needed by PowerBI)
as an optional guard alongside the new time-budget retry approach. Default is
0 (no limit), meaning only the 5-minute timeout applies. When set, the retry
loop exits if either the max retry count or the timeout is reached first.

Co-authored-by: Isaac
Change DefaultMaxRetries from 0 to -1 so that -1 clearly means "not set"
and 0 can mean "no retries at all" if needed. Guard checks updated from
> 0 to >= 0 accordingly.

Co-authored-by: Isaac
…loader

Remove CalculateBackoffWithJitter method and MaxBackoffMs constant,
inline the single-use logic at the call site.

Co-authored-by: Isaac
@eric-wang-1990 eric-wang-1990 changed the title from "feat(csharp): add CloudFetch download retry resilience" to "feat(csharp): convert cloudfetch retry from limit by # of retries to limit by time" Mar 28, 2026

Copilot AI left a comment


Pull request overview

This PR updates the C# CloudFetch downloader retry behavior from a small fixed retry count to a time-budgeted retry loop with exponential backoff + jitter, and adds a new connection parameter to configure the retry time budget.

Changes:

  • Introduce adbc.databricks.cloudfetch.retry_timeout_seconds (default 300s) and wire it into CloudFetchDownloader.
  • Replace the fixed retry loop with a budget-based loop that logs richer retry telemetry and supports an optional max_retries cap for backward compatibility.
  • Update CloudFetch downloader tests to pass the new constructor parameters and adjust expectations/timings.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 8 comments.

| File | Description |
|---|---|
| csharp/src/Reader/CloudFetch/CloudFetchDownloader.cs | Implements the new time-budget retry loop, adds retry telemetry, and plumbs through RetryTimeoutSeconds. |
| csharp/src/Reader/CloudFetch/CloudFetchConfiguration.cs | Adds defaults and parsing for the new retry timeout parameter and changes the MaxRetries default/handling. |
| csharp/src/DatabricksParameters.cs | Adds the new parameter constant and updates parameter documentation for CloudFetch retry settings. |
| csharp/test/E2E/CloudFetch/CloudFetchDownloaderTest.cs | Updates tests to use the new retry configuration and adjusts assertions/waits for the new behavior. |


- Fix maxRetries semantics: -1=unlimited, 0=no retries, >0=max retries.
  Move check to catch block so first attempt always runs.
- Use real elapsed time (stopwatch) for retry budget instead of summing
  backoff delays, so slow/hanging attempts count against the budget.
- Use long for retryTimeoutMs to prevent int overflow on large values.
- Remove raw exception messages from trace events to avoid leaking
  pre-signed URL credentials in telemetry.
- Throw on invalid max_retries property values instead of silent fallback.
- Replace flaky Task.Delay(3000) in test with deterministic polling.

Co-authored-by: Isaac
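The "deterministic polling" replacement for a flaky `Task.Delay(3000)` mentioned in that commit can be sketched as a small helper that polls a condition with a short interval and an overall deadline. `WaitUntilAsync` is an illustrative name, not the test suite's actual API.

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

// Poll `condition` every `pollIntervalMs` until it holds or `timeout` elapses.
// Returns true as soon as the condition holds; false if the deadline passes.
static async Task<bool> WaitUntilAsync(
    Func<bool> condition,
    TimeSpan timeout,
    int pollIntervalMs = 50)
{
    var stopwatch = Stopwatch.StartNew();
    while (stopwatch.Elapsed < timeout)
    {
        if (condition()) return true;
        await Task.Delay(pollIntervalMs).ConfigureAwait(false);
    }
    return condition(); // one final check at the deadline
}
```

Unlike a fixed delay, this finishes as soon as the condition becomes true and only waits the full timeout in the failure case.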
Keep consistent with RetryHttpHandler which also tracks cumulative
backoff sleep time rather than real elapsed time. Both rely on
per-request timeouts (HttpClient.Timeout / per-request CTS) to
bound individual attempt duration.

Co-authored-by: Isaac

Copilot AI left a comment


Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.



Collaborator

@msrathore-db msrathore-db left a comment


LGTM

- Use Random.Shared instead of new Random() per retry to avoid
  identical jitter values across parallel downloads.
- Use long math for backoff doubling to prevent int overflow.
- Fix maxRetries check from > to >= for backward compatibility
  (max_retries=3 means 3 total attempts, not 4).

Co-authored-by: Isaac
0 = no limit (use timeout only), >0 = total attempts including first try.
Matches old for-loop semantics where maxRetries=3 meant 3 total attempts.
Removes -1 sentinel value, check is now before try block like the old code.

Co-authored-by: Isaac
Random.Shared requires .NET 6+, not available in netstandard2.0.

Co-authored-by: Isaac
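Since `Random.Shared` is .NET 6+ only, a netstandard2.0-compatible way to avoid identical jitter across parallel downloads is a single lock-protected `Random` shared by all callers. This is a sketch of one such alternative, not necessarily what the driver settled on.

```csharp
using System;

// Thread-safe jitter source for netstandard2.0 targets.
// A single shared Random avoids the classic bug where several Random
// instances created in the same tick share a seed and emit identical values.
internal static class Jitter
{
    private static readonly Random s_random = new Random();
    private static readonly object s_lock = new object();

    // Returns a jitter delay in [0, maxJitterMs).
    public static int NextMs(int maxJitterMs)
    {
        lock (s_lock)
        {
            return s_random.Next(0, maxJitterMs);
        }
    }
}
```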

Copilot AI left a comment


Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.



…time, backoff clamp

- Replace SanitizeUrl(url) calls with existing sanitizedUrl variable
- Use stopwatch.Elapsed in exception message for accurate elapsed time
- Clamp initial currentBackoffMs to 32s cap to prevent overflow

Co-authored-by: Isaac
@eric-wang-1990 eric-wang-1990 requested a review from Copilot March 29, 2026 03:11
@eric-wang-1990 eric-wang-1990 added this pull request to the merge queue Mar 29, 2026
github-merge-queue bot pushed a commit that referenced this pull request Mar 29, 2026
…limit by time (#375)


Copilot AI left a comment


Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.



Comment on lines +647 to +648
$"Failed to download file from {sanitizedUrl} after {attemptCount} attempts over {stopwatch.Elapsed.TotalSeconds:F1}s ({retryLimits}).",
lastException);

Copilot AI Mar 29, 2026


The thrown InvalidOperationException now includes lastException as an inner exception. Some inner exceptions (e.g., HttpRequestException) can include the original request URI in their message, which may contain pre-signed URL query parameters. Since this exception is also passed to tracing via activity?.AddException(ex, ...), the inner exception message can end up in logs/telemetry. Consider not attaching the raw exception as the inner exception, or wrapping it in a redacted exception type/message before attaching so pre-signed URL credentials can’t leak.

Suggested change
$"Failed to download file from {sanitizedUrl} after {attemptCount} attempts over {stopwatch.Elapsed.TotalSeconds:F1}s ({retryLimits}).",
lastException);
$"Failed to download file from {sanitizedUrl} after {attemptCount} attempts over {stopwatch.Elapsed.TotalSeconds:F1}s ({retryLimits}).");

Collaborator Author


✅ Done - Removed lastException as inner exception to prevent pre-signed URL leakage. Instead, appended the exception type name to the message: Last error: {lastException?.GetType().Name ?? "unknown"}. This gives operators enough context for diagnosis without exposing credentials.
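The `sanitizedUrl` referenced in this exchange comes from a `SanitizeUrl` helper mentioned earlier in the PR. A sketch of that idea (illustrative only; the driver's actual helper may differ) is to drop the query string, which is where a pre-signed URL carries its SAS token or signature:

```csharp
using System;

// Strip credentials from a pre-signed URL before it reaches logs/telemetry.
static string SanitizeUrl(string url)
{
    if (Uri.TryCreate(url, UriKind.Absolute, out var uri))
    {
        // Keep scheme, host, and path; drop the query (sig/SAS token).
        return uri.GetLeftPart(UriPartial.Path);
    }
    // If parsing fails, fall back to truncating at the first '?'.
    int q = url.IndexOf('?');
    return q >= 0 ? url.Substring(0, q) : url;
}
```

Combined with dropping the inner exception, this keeps both the message and the telemetry free of signed query parameters.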

@eric-wang-1990 eric-wang-1990 removed this pull request from the merge queue due to a manual request Mar 29, 2026
Drop lastException as inner exception from the retry-exhausted
InvalidOperationException. HttpRequestException.Message can contain
the full pre-signed URL with credentials/SAS tokens. Instead, append
the exception type name to the message for operator diagnosis.

Co-authored-by: Isaac
@eric-wang-1990 eric-wang-1990 added this pull request to the merge queue Mar 29, 2026
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Mar 29, 2026
@eric-wang-1990 eric-wang-1990 added this pull request to the merge queue Mar 29, 2026
Merged via the queue into main with commit adf4853 Mar 29, 2026
16 checks passed
@eric-wang-1990 eric-wang-1990 deleted the fix/cloudfetch-transport-error-retry branch March 29, 2026 05:56
