feat(csharp): convert cloudfetch retry from limit by # of retries to limit by time (#375)
Conversation
- Extract shared TransportErrorHelper from RetryHttpHandler for reuse
- Replace fixed 3-attempt retry with 5-minute time-budget approach using exponential backoff with jitter (matching RetryHttpHandler)
- Retry all transient errors (502, connection drops, timeouts), not just transport errors
- Replace the `adbc.databricks.cloudfetch.max_retries` parameter with `adbc.databricks.cloudfetch.retry_timeout_seconds` (default 300s)
- Add detailed retry tracing (attempt count, backoff, timeout budget)

Co-authored-by: Isaac
Keep the PR scope minimal: only the CloudFetch retry-count-to-time-budget change. RetryHttpHandler remains unchanged from main. Co-authored-by: Isaac
…loudFetch

Restore the `adbc.databricks.cloudfetch.max_retries` parameter (needed by PowerBI) as an optional guard alongside the new time-budget retry approach. The default is 0 (no limit), meaning only the 5-minute timeout applies. When set, the retry loop exits when either the max retry count or the timeout is reached, whichever comes first.

Co-authored-by: Isaac
Change DefaultMaxRetries from 0 to -1 so that -1 clearly means "not set" and 0 can mean "no retries at all" if needed. Guard checks updated from > 0 to >= 0 accordingly. Co-authored-by: Isaac
…loader

Remove the CalculateBackoffWithJitter method and the MaxBackoffMs constant, and inline the single-use logic at the call site.

Co-authored-by: Isaac
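The inlined backoff-with-jitter step can be sketched roughly as follows. This is a minimal illustration only: the constant values, jitter range, and method/class names here are assumptions, not the exact production code.

```csharp
using System;

static class BackoffSketch
{
    // Assumed constants based on the PR discussion (500ms start, 32s cap).
    const long MaxBackoffMs = 32_000;

    // Compute the next backoff delay: double the current delay, clamp at
    // the cap, then add random jitter so parallel downloads don't retry
    // in lockstep. Uses long math so repeated doubling cannot overflow int.
    public static long NextBackoffMs(long currentBackoffMs, Random random)
    {
        long doubled = Math.Min(currentBackoffMs * 2, MaxBackoffMs);
        // Illustrative jitter choice: +/- 20% of the base delay.
        long jitter = (long)(doubled * (random.NextDouble() * 0.4 - 0.2));
        return doubled + jitter;
    }
}
```

With a 500ms current delay this yields a value near 1s (between 800ms and 1.2s with the assumed ±20% jitter); the commit above inlines this kind of logic at the single call site rather than keeping a helper method.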
Pull request overview
This PR updates the C# CloudFetch downloader retry behavior from a small fixed retry count to a time-budgeted retry loop with exponential backoff + jitter, and adds a new connection parameter to configure the retry time budget.
Changes:
- Introduce `adbc.databricks.cloudfetch.retry_timeout_seconds` (default 300s) and wire it into `CloudFetchDownloader`.
- Replace the fixed retry loop with a budget-based loop that logs richer retry telemetry and supports an optional `max_retries` cap for backward compatibility.
- Update CloudFetch downloader tests to pass the new constructor parameters and adjust expectations/timings.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| csharp/src/Reader/CloudFetch/CloudFetchDownloader.cs | Implements the new time-budget retry loop, adds retry telemetry, and plumbs through RetryTimeoutSeconds. |
| csharp/src/Reader/CloudFetch/CloudFetchConfiguration.cs | Adds defaults + parsing for the new retry timeout parameter and changes MaxRetries default/handling. |
| csharp/src/DatabricksParameters.cs | Adds the new parameter constant and updates parameter documentation for CloudFetch retry settings. |
| csharp/test/E2E/CloudFetch/CloudFetchDownloaderTest.cs | Updates tests to use the new retry configuration and adjusts assertions/waits for the new behavior. |
- Fix maxRetries semantics: -1 = unlimited, 0 = no retries, >0 = max retries. Move the check to the catch block so the first attempt always runs.
- Use real elapsed time (stopwatch) for the retry budget instead of summing backoff delays, so slow or hanging attempts count against the budget.
- Use `long` for retryTimeoutMs to prevent int overflow on large values.
- Remove raw exception messages from trace events to avoid leaking pre-signed URL credentials in telemetry.
- Throw on invalid `max_retries` property values instead of silently falling back.
- Replace the flaky `Task.Delay(3000)` in tests with deterministic polling.

Co-authored-by: Isaac
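Strict property parsing, as described above, might look like the following sketch. The method and constant names are illustrative; only the parameter key and the 300s default come from the PR.

```csharp
using System;
using System.Collections.Generic;

static class RetryTimeoutParsing
{
    // Default retry budget from the PR: 300 seconds (5 minutes).
    public const int DefaultRetryTimeoutSeconds = 300;

    // Parse the retry timeout property strictly: an absent key falls back
    // to the default, but a present-yet-invalid value throws instead of
    // being silently ignored.
    public static int ParseRetryTimeoutSeconds(IReadOnlyDictionary<string, string> properties)
    {
        if (!properties.TryGetValue("adbc.databricks.cloudfetch.retry_timeout_seconds", out string value))
        {
            return DefaultRetryTimeoutSeconds;
        }
        if (!int.TryParse(value, out int seconds) || seconds <= 0)
        {
            throw new ArgumentException(
                $"Invalid adbc.databricks.cloudfetch.retry_timeout_seconds value: {value}");
        }
        return seconds;
    }
}
```

Throwing on a bad value surfaces configuration mistakes at connection time rather than masking them with a default the user did not ask for.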
Keep consistent with RetryHttpHandler, which also tracks cumulative backoff sleep time rather than real elapsed time. Both rely on per-request timeouts (HttpClient.Timeout / per-request CTS) to bound individual attempt duration. Co-authored-by: Isaac
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
- Use `Random.Shared` instead of a new `Random()` per retry to avoid identical jitter values across parallel downloads.
- Use `long` math for backoff doubling to prevent int overflow.
- Fix the maxRetries check from `>` to `>=` for backward compatibility (`max_retries=3` means 3 total attempts, not 4).

Co-authored-by: Isaac
0 = no limit (use timeout only); >0 = total attempts including the first try. Matches the old for-loop semantics, where maxRetries=3 meant 3 total attempts. Removes the -1 sentinel value; the check now runs before the try block, as in the old code. Co-authored-by: Isaac
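The combined guard semantics settled on above (0 = time budget only, >0 = total attempts including the first, checked before the attempt, with the budget tracked as cumulative backoff per the RetryHttpHandler convention) can be sketched like this. All names and the simplified structure here are assumptions for illustration, not the actual downloader code.

```csharp
using System;

static class RetryLoopSketch
{
    // Run `attempt` until it succeeds, the optional attempt cap is hit,
    // or the cumulative-backoff time budget is exhausted.
    public static int Run(Func<int, bool> attempt, int maxRetries, long retryTimeoutMs)
    {
        long cumulativeBackoffMs = 0;
        long currentBackoffMs = 500;
        int attemptCount = 0;
        while (true)
        {
            // Guard before the attempt, matching the old for-loop semantics:
            // maxRetries=3 means 3 total attempts; 0 means no attempt limit.
            if (maxRetries > 0 && attemptCount >= maxRetries)
                throw new InvalidOperationException("Retry limit reached.");
            if (cumulativeBackoffMs >= retryTimeoutMs)
                throw new InvalidOperationException("Retry time budget exhausted.");

            attemptCount++;
            if (attempt(attemptCount))
                return attemptCount; // success

            // On failure: the real code sleeps here; the budget counts the
            // cumulative backoff sleep time, not wall-clock elapsed time.
            cumulativeBackoffMs += currentBackoffMs;
            currentBackoffMs = Math.Min(currentBackoffMs * 2, 32_000);
        }
    }
}
```

Note that a hanging attempt does not consume this budget directly; as discussed above, per-request timeouts (HttpClient.Timeout or a per-request CTS) bound each individual attempt.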
`Random.Shared` requires .NET 6+ and is not available in netstandard2.0. Co-authored-by: Isaac
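On netstandard2.0, where `Random.Shared` does not exist, one common substitute is a lazily shared, lock-protected static `Random`. This is an illustrative sketch of that pattern, not necessarily what the PR ended up using.

```csharp
using System;

// Thread-safe shared jitter source for targets without Random.Shared.
// System.Random instances are not thread-safe, so guard access with a lock;
// sharing one instance avoids identical seeds (and thus identical jitter)
// across Random objects created in quick succession by parallel downloads.
static class JitterRandom
{
    private static readonly Random s_random = new Random();
    private static readonly object s_lock = new object();

    public static double NextDouble()
    {
        lock (s_lock)
        {
            return s_random.NextDouble();
        }
    }
}
```

On .NET 6+ the same call sites could simply use `Random.Shared.NextDouble()` behind a target-framework conditional.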
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
…time, backoff clamp

- Replace `SanitizeUrl(url)` calls with the existing `sanitizedUrl` variable.
- Use `stopwatch.Elapsed` in the exception message for accurate elapsed time.
- Clamp the initial `currentBackoffMs` to the 32s cap to prevent overflow.

Co-authored-by: Isaac
…limit by time (#375)

## Summary

- Replace fixed 3-attempt retry with a **5-minute time-budget** approach using exponential backoff with jitter (matching RetryHttpHandler's pattern)
- Retry all exceptions (502 Bad Gateway from a proxy, connection drops, TCP resets, HttpClient.Timeout) — catch-all, not just transport errors
- Add a new `adbc.databricks.cloudfetch.retry_timeout_seconds` parameter (default 300s / 5 min)
- Keep the `adbc.databricks.cloudfetch.max_retries` parameter for PowerBI backward compatibility (default -1 = not set). When set, the retry loop exits if either max retries or the timeout is reached first
- Add detailed retry tracing (attempt count, backoff delay, timeout budget remaining)

### Problem

CloudFetch downloads used a fixed 3-retry loop with linear backoff (~1.5s total). When a proxy returns 502 or a firewall blocks cloud storage, all 3 retries exhaust within seconds — not enough time for transient issues to resolve.

### Solution

Switch to a time-budget approach (same as RetryHttpHandler from #373): keep retrying with exponential backoff + jitter until the 5-minute budget is exhausted. This gives transient issues (proxy errors, firewall rules, connection drops) enough time to resolve.

| | Before | After |
|---|---|---|
| Retry limit | 3 attempts | 5-minute time budget (+ optional max_retries) |
| Backoff | Linear (500ms, 1s, 1.5s) | Exponential with jitter (500ms → 1s → 2s → … → 32s cap) |
| Total retry window | ~3 seconds | Up to 5 minutes |
| Error classification | Retry all exceptions | Retry all exceptions (same) |
| max_retries param | Required (default 3) | Optional (default -1 = not set) |

## Test plan

- [x] All 9 CloudFetch downloader unit tests pass
- [x] Full build succeeds (netstandard2.0 + net8.0)
- [ ] Manual test: block cloud storage via MITM firewall, verify retries continue for 5 min
- [ ] Manual test: unblock within 5 min, verify download resumes successfully

This pull request was AI-assisted by Isaac.
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
```csharp
    $"Failed to download file from {sanitizedUrl} after {attemptCount} attempts over {stopwatch.Elapsed.TotalSeconds:F1}s ({retryLimits}).",
    lastException);
```
The thrown InvalidOperationException now includes lastException as an inner exception. Some inner exceptions (e.g., HttpRequestException) can include the original request URI in their message, which may contain pre-signed URL query parameters. Since this exception is also passed to tracing via activity?.AddException(ex, ...), the inner exception message can end up in logs/telemetry. Consider not attaching the raw exception as the inner exception, or wrapping it in a redacted exception type/message before attaching so pre-signed URL credentials can’t leak.
Suggested change:

```diff
-    $"Failed to download file from {sanitizedUrl} after {attemptCount} attempts over {stopwatch.Elapsed.TotalSeconds:F1}s ({retryLimits}).",
-    lastException);
+    $"Failed to download file from {sanitizedUrl} after {attemptCount} attempts over {stopwatch.Elapsed.TotalSeconds:F1}s ({retryLimits}).");
```
✅ Done - Removed lastException as the inner exception to prevent pre-signed URL leakage. Instead, appended the exception type name to the message: `Last error: {lastException?.GetType().Name ?? "unknown"}`. This gives operators enough context for diagnosis without exposing credentials.
Drop lastException as inner exception from the retry-exhausted InvalidOperationException. HttpRequestException.Message can contain the full pre-signed URL with credentials/SAS tokens. Instead, append the exception type name to the message for operator diagnosis. Co-authored-by: Isaac
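The redaction approach described above might look like the following sketch: drop the raw exception (whose `Message` may embed the pre-signed URL) and surface only its type name. The helper name and signature here are illustrative, not the PR's actual code.

```csharp
using System;

static class RedactedFailure
{
    // Build the retry-exhausted error without attaching lastException as an
    // inner exception, so its message (which can contain credentials/SAS
    // tokens from the pre-signed URL) never reaches logs or telemetry.
    // Only the exception type name is kept for operator diagnosis.
    public static InvalidOperationException Build(
        string sanitizedUrl, int attemptCount, double elapsedSeconds, Exception lastException)
    {
        return new InvalidOperationException(
            $"Failed to download file from {sanitizedUrl} after {attemptCount} attempts " +
            $"over {elapsedSeconds:F1}s. Last error: {lastException?.GetType().Name ?? "unknown"}.");
    }
}
```

The trade-off is losing the stack trace of the final failure; the retry tracing added earlier in the PR is what preserves per-attempt context instead.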
Summary
- Add new `adbc.databricks.cloudfetch.retry_timeout_seconds` parameter (default 300s / 5 min)
- Keep `adbc.databricks.cloudfetch.max_retries` parameter for PowerBI backward compatibility (default -1 = not set). When set, the retry loop exits if either max retries or the timeout is reached first

Problem
CloudFetch downloads used a fixed 3-retry loop with linear backoff (~1.5s total). When a proxy returns 502 or a firewall blocks cloud storage, all 3 retries exhaust within seconds — not enough time for transient issues to resolve.
Solution
Switch to a time-budget approach (same as RetryHttpHandler from #373): keep retrying with exponential backoff + jitter until the 5-minute budget is exhausted. This gives transient issues (proxy errors, firewall rules, connection drops) enough time to resolve.
Test plan
This pull request was AI-assisted by Isaac.