feat(rust): implement idempotency-aware retry logic [PECOBLR-2091]#370
feat(rust): implement idempotency-aware retry logic [PECOBLR-2091]#370vikrantpuppala wants to merge 1 commit intomainfrom
Conversation
8b81b96 to
3855476
Compare
Range-diff: main (8b81b96 -> 3855476)
Reproduce locally: |
3855476 to
57ddb54
Compare
Range-diff: main (3855476 -> 57ddb54)
Reproduce locally: |
57ddb54 to
dbcd69a
Compare
Range-diff: main (57ddb54 -> dbcd69a)
Reproduce locally: |
dbcd69a to
5095574
Compare
Range-diff: main (dbcd69a -> 5095574)
Reproduce locally: |
5095574 to
d194c2f
Compare
Range-diff: main (5095574 -> d194c2f)
Reproduce locally: |
d194c2f to
de83d2e
Compare
Range-diff: main (d194c2f -> de83d2e)
Reproduce locally: |
|
|
||
| let http_client = Arc::new( | ||
| DatabricksHttpClient::new(HttpClientConfig::default()) | ||
| DatabricksHttpClient::with_default_retry(HttpClientConfig::default()) |
There was a problem hiding this comment.
we may need to have different retry configurations for Auth and APIs
There was a problem hiding this comment.
Done — added full config wiring in this PR. The infrastructure already supported per-category configs via build_retry_configs(global, overrides), and now it's exposed through ADBC options:
Global options:
databricks.retry.min_wait_ms(default: 1000)databricks.retry.max_wait_ms(default: 60000)databricks.retry.overall_timeout_ms(default: 900000)databricks.retry.max_retries(default: 5)
Per-category overrides (inherit from global when unset):
databricks.retry.{sea,auth,cloudfetch}.min_wait_msdatabricks.retry.{sea,auth,cloudfetch}.max_wait_msdatabricks.retry.{sea,auth,cloudfetch}.overall_timeout_msdatabricks.retry.{sea,auth,cloudfetch}.max_retries
Auth category defaults to overall_timeout=30s, max_retries=3 (see next comment).
| - Requests classified as **idempotent** or **non-idempotent** with distinct retry strategies | ||
| - `Retry-After` header honored, with min/max clamping | ||
| - Exponential backoff with random jitter (50ms–750ms) | ||
| - Cumulative timeout (default 900s) in addition to per-attempt count |
There was a problem hiding this comment.
will this be applicable to all requests? 900s is too high for requests like OAuth, and even create connection
There was a problem hiding this comment.
Agreed — 900s is way too long for auth/session creation. Added per-category defaults:
- Auth (
AuthTokenRequest,AuthDiscovery):overall_timeout=30s,max_retries=3 - Sea (
CreateSession,ExecuteStatement, etc.): inherits global900s/5retries - CloudFetch: inherits global
900s/5retries
The Auth default is set in Database::default() so it applies out of the box. Users can further customize via databricks.retry.auth.overall_timeout_ms etc.
CreateSession/CloseSession are in the Sea category along with query execution. We could split them into their own category if needed, but for now they share the same retry config. Let me know if you think session management should have its own timeout.
rust/src/client/retry.rs
Outdated
| // Non-idempotent: retry only when the server explicitly signals retry | ||
| // via Retry-After header on 429/503 | ||
| RequestIdempotency::NonIdempotent => { | ||
| NON_IDEMPOTENT_RETRYABLE.contains(&code) && has_retry_after |
There was a problem hiding this comment.
that means even for 429, if there is no retry_after header, we won't retry with exponential backoff
There was a problem hiding this comment.
This is intentional per Mehmet's review feedback on the design doc — for non-idempotent requests (ExecuteStatement), we only retry 429/503 when the server explicitly signals it via Retry-After. Without the header, there's ambiguity about whether the server started processing the request before returning the error.
That said, I think there's a reasonable argument that 429 specifically is always safe to retry (the request was rate-limited, not executed). We could treat 429 as unconditionally retryable for non-idempotent while keeping 503 gated on Retry-After. Let me check with Mehmet on whether he'd be OK with that refinement.
There was a problem hiding this comment.
Updated — 429 is now unconditionally retryable for non-idempotent requests (rate-limiting means the request was rejected before execution). Only 503 requires Retry-After header for non-idempotent, since 503 is ambiguous about whether processing started.
Tests updated:
test_non_idempotent_strategy_429_always_retryable— verifies 429 retries with and without Retry-Aftertest_non_idempotent_strategy_503_requires_retry_after— verifies 503 only retries with headertest_non_idempotent_retries_on_429_without_retry_after(wiremock) — E2E prooftest_non_idempotent_no_retry_on_503_without_retry_after(wiremock) — E2E proof
rust/src/client/retry.rs
Outdated
| }; | ||
|
|
||
| // Add random jitter between 50ms and 750ms | ||
| let jitter_ms = rand::rng().random_range(50..=750); |
There was a problem hiding this comment.
should we add a proportional jitter? So, that it remains proportional to the backoff?
There was a problem hiding this comment.
Good idea. The spec defines jitter as random(50ms, 750ms) (fixed range), so the current implementation follows that. At higher backoff values (32s, 60s), a fixed 50-750ms jitter becomes negligible and doesn't meaningfully spread retries across clients.
Proportional jitter (e.g., 5-25% of base backoff) would scale better:
- Attempt 1 (2s base): 100-500ms jitter
- Attempt 3 (8s base): 400ms-2s jitter
- Attempt 5 (32s base): 1.6-8s jitter
I'd prefer to keep the spec-defined fixed jitter for now to stay aligned with other connectors, and revisit proportional jitter as a follow-up if we see thundering-herd issues in practice. Does that work for you?
There was a problem hiding this comment.
Done — switched to proportional jitter: random 5-25% of base wait, with a 50ms floor.
This scales naturally with backoff:
- Attempt 1 (2s base): 100-500ms jitter
- Attempt 3 (8s base): 400-2000ms jitter
- Attempt 5 (32s base): 1.6-8s jitter
Tests:
test_proportional_jitter— verifies jitter scales with base at both low and high attemptstest_jitter_minimum_50ms— verifies 50ms floor for very small base waits- All timing assertions updated for proportional ranges
de83d2e to
c2cd59b
Compare
Range-diff: main (de83d2e -> c2cd59b)
Reproduce locally: |
Implements the standardized DBSQL connector retry specification with: - RequestType enum as single source of truth for request classification - Idempotent vs non-idempotent retry strategies with distinct behavior - Exponential backoff with random jitter (50-750ms) - Retry-After header support (seconds and HTTP-date formats) - Cumulative overall timeout (default 900s) - Per-category retry config (SEA, CloudFetch, Auth) with global defaults - Structured DEBUG logging per attempt and per request summary - Metadata queries (list_*) classified as idempotent via ExecuteMetadataQuery Co-authored-by: Isaac
c2cd59b to
7f47f62
Compare
Range-diff: main (c2cd59b -> 7f47f62)
Reproduce locally: |
🥞 Stacked PR
Use this link to review incremental changes.
Summary
Implements the standardized DBSQL connector retry specification in the Rust ADBC driver, replacing the previous one-size-fits-all retry logic.
RequestTypeenum as single source of truth for request classification — callers pass one value, HTTP client resolves category (for config) and idempotency (for strategy) internallyExecuteStatementonly): retries only on connection errors (request never reached server) and 429/503Retry-Afterheader honored first, exponential backoff fallback, both clamped to[min_wait, max_wait], with random jitter (50–750ms)RetryConfigwith global defaultslist_catalogs,list_schemas, etc.) classified as idempotent viaExecuteMetadataQueryDesign doc: #369
Test plan
cargo clippy --all-targets -- -D warningscleancargo +stable fmt --allcleanThis pull request was AI-assisted by Isaac.