feat: add robust retry/backoff for LIST/GET/HEAD #5393
base: main
Conversation
…ttling as transient; reduce glob fanout; add minimal backoff tests
Greptile Overview
Summary
This PR adds robust retry/backoff logic for S3-like storage operations (LIST/GET/HEAD) with exponential backoff and 45s max wait time. It also classifies TOS (Volcano Engine) throttling errors as transient and reduces the glob fanout limit from 1024 to 256.
Key Changes:
- Modified the `ExponentialBackoff::retry` signature from `Fn` to `FnMut` so closures can capture and mutate state (see the sketch after this list)
- Added retry logic with transient error classification to `single_url_get` and `single_url_get_size` in `lib.rs`
- Wrapped all S3 list operations in retry logic with proper error classification
- Added TOS-specific throttling error codes (`ExceedAccountQPSLimit`, `ExceedAccountRateLimit`, etc.)
- Reduced `DEFAULT_GLOB_FANOUT_LIMIT` from 1024 to 256
- Added basic unit tests for retry behavior
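To make the first bullet concrete, here is a minimal sketch of what the `Fn` → `FnMut` change buys, assuming a simplified `ExponentialBackoff` and `RetryError` (the field names, defaults, and backoff math here are illustrative, not the crate's actual definitions): an `FnMut` closure can hold mutable state, such as an attempt counter or a rebuilt request, across attempts.

```rust
use std::future::Future;
use std::time::Duration;

// Illustrative stand-ins; the real crate's types differ in detail.
enum RetryError<E> {
    Transient(E),
    Permanent(E),
}

struct ExponentialBackoff {
    initial_ms: u64,
    max_tries: usize,
    max_waittime_ms: Option<u64>,
}

impl ExponentialBackoff {
    // Taking `FnMut` (instead of `Fn`) lets the caller's closure capture and
    // mutate state between attempts, e.g. a counter or a rebuilt request.
    async fn retry<T, E, Fut>(&self, mut f: impl FnMut() -> Fut) -> Result<T, E>
    where
        Fut: Future<Output = Result<T, RetryError<E>>>,
    {
        let mut delay = self.initial_ms;
        let mut waited = 0u64;
        for attempt in 1usize.. {
            match f().await {
                Ok(v) => return Ok(v),
                Err(RetryError::Permanent(e)) => return Err(e),
                Err(RetryError::Transient(e)) => {
                    let out_of_budget = attempt >= self.max_tries
                        || self.max_waittime_ms.is_some_and(|max| waited >= max);
                    if out_of_budget {
                        return Err(e);
                    }
                    tokio::time::sleep(Duration::from_millis(delay)).await;
                    waited += delay;
                    delay = delay.saturating_mul(2); // exponential growth
                }
            }
        }
        unreachable!("loop only exits via return")
    }
}
```

A caller can then write something like `backoff.retry(|| { attempts += 1; make_request() }).await`, which the old `Fn` bound rejected because the closure mutates its captured state.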
Critical Issue Found:
- The semaphore acquisition pattern in all three LIST retry blocks has a logic error: `connection_pool_sema.acquire()` is called outside the async block, preventing proper retry behavior (a corrected pattern is sketched below)
Confidence Score: 2/5
- This PR has critical logic errors in semaphore acquisition that prevent proper retry behavior
 - Score reflects three critical logic bugs in the retry implementation where semaphore permits are acquired outside the async block, meaning retries will fail or behave incorrectly. The retry logic itself is sound, and the transient error classification is appropriate, but the semaphore acquisition pattern breaks the retry mechanism in all three LIST operations.
 - src/daft-io/src/s3_like.rs requires immediate attention - all three retry blocks (lines 1259, 1310, 1372) have the same semaphore acquisition bug
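The corrected pattern the review is asking for, as a hedged sketch (the `retry_transient` helper, `list_with_retry`, and the variable names are stand-ins, not daft-io's actual API): clone the `Arc<Semaphore>` and acquire the permit inside the closure that gets retried, so every attempt holds its own connection-pool permit.

```rust
use std::future::Future;
use std::sync::Arc;
use tokio::sync::Semaphore;

// Stand-in retry driver so this sketch compiles on its own; the real code
// goes through ExponentialBackoff::retry.
async fn retry_transient<T, E, Fut>(mut f: impl FnMut() -> Fut, max_tries: u32) -> Result<T, E>
where
    Fut: Future<Output = Result<T, E>>,
{
    let mut last_err = None;
    for _ in 0..max_tries {
        match f().await {
            Ok(v) => return Ok(v),
            Err(e) => last_err = Some(e),
        }
    }
    Err(last_err.expect("max_tries must be >= 1"))
}

// Buggy shape flagged by the review: the permit is acquired once, outside the
// retried future, so later attempts run without re-acquiring it.
//
//     let _permit = connection_pool_sema.acquire().await?;
//     backoff.retry(|| async { /* LIST request */ }).await
//
// Corrected shape: clone the Arc and acquire *inside* the closure, so each
// attempt holds a connection-pool permit for exactly its own duration.
async fn list_with_retry(pool: Arc<Semaphore>) -> Result<Vec<String>, String> {
    retry_transient(
        || {
            let pool = Arc::clone(&pool);
            async move {
                let _permit = pool.acquire_owned().await.map_err(|e| e.to_string())?;
                // ... issue the LIST call here; in the real code the error is
                // classified as transient (retried) or permanent (returned) ...
                Ok(Vec::<String>::new())
            }
        },
        3,
    )
    .await
}
```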
 
Important Files Changed
File Analysis
| Filename | Score | Overview | 
|---|---|---|
| src/daft-io/src/retry.rs | 5/5 | Changed retry method from Fn to FnMut to allow closures to mutate state; added basic unit tests for transient and permanent error handling | 
| src/daft-io/src/lib.rs | 4/5 | Added exponential backoff retry logic to GET and HEAD operations with a 45s max wait time; classifies transient errors for retry (see the sketch after this table) | 
| src/daft-io/src/s3_like.rs | 3/5 | Added TOS throttling errors and retry logic to S3 list operations; reduced glob fanout from 1024 to 256; critical issue with semaphore acquisition in retry closures | 
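To illustrate the lib.rs and s3_like.rs rows, here is a hedged sketch of the two ideas they describe: a string-code check that treats throttling (including the TOS codes named in the summary) as transient, and a GET wrapper capped at roughly 45 s of cumulative backoff. The `StorageError` type, the code list beyond the two TOS codes named above, and the starting delay are assumptions, and the synchronous `do_get` closure is a simplification of the real async GET.

```rust
use std::time::{Duration, Instant};

// Assumed error shape carrying a provider error code; daft-io's real error
// enum is richer (timeouts, socket errors, etc.).
#[derive(Debug)]
struct StorageError {
    code: String,
}

// Throttling codes treated as transient; the two TOS codes are the ones named
// in the PR summary, the others are common S3 throttling codes.
fn is_transient(err: &StorageError) -> bool {
    matches!(
        err.code.as_str(),
        "Throttling" | "SlowDown" | "ExceedAccountQPSLimit" | "ExceedAccountRateLimit"
    )
}

// GET wrapped in exponential backoff with a ~45 s cumulative wait cap,
// mirroring `max_waittime_ms: Some(45_000)` from the reviewed diff.
async fn get_with_backoff(
    mut do_get: impl FnMut() -> Result<Vec<u8>, StorageError>,
) -> Result<Vec<u8>, StorageError> {
    let budget = Duration::from_millis(45_000);
    let start = Instant::now();
    let mut delay = Duration::from_millis(100);
    loop {
        match do_get() {
            Ok(bytes) => return Ok(bytes),
            Err(e) if is_transient(&e) && start.elapsed() + delay <= budget => {
                tokio::time::sleep(delay).await;
                delay = delay.saturating_mul(2);
            }
            // Permanent error, or a transient one that exhausted the budget.
            Err(e) => return Err(e),
        }
    }
}
```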
Sequence Diagram
sequenceDiagram
    participant Client
    participant IOClient
    participant ExponentialBackoff
    participant S3LikeSource
    participant AWS_S3
    
    Client->>IOClient: single_url_get(url)
    IOClient->>ExponentialBackoff: retry(closure)
    
    loop Until success/max_tries/timeout
        ExponentialBackoff->>S3LikeSource: get(path, range)
        S3LikeSource->>AWS_S3: GET request
        
        alt Success
            AWS_S3-->>S3LikeSource: data
            S3LikeSource-->>ExponentialBackoff: Ok(data)
            ExponentialBackoff-->>IOClient: Ok(data)
        else Transient Error (Throttle/Timeout/Socket)
            AWS_S3-->>S3LikeSource: error
            S3LikeSource-->>ExponentialBackoff: RetryError::Transient
            ExponentialBackoff->>ExponentialBackoff: sleep with backoff
        else Permanent Error
            AWS_S3-->>S3LikeSource: error
            S3LikeSource-->>ExponentialBackoff: RetryError::Permanent
            ExponentialBackoff-->>IOClient: Err(error)
        end
    end
    
    IOClient-->>Client: Result
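The retry loop in the diagram is the behavior the new unit tests pin down. A minimal standalone sketch of such tests, using a tiny synchronous stand-in for the real backoff driver (the `Attempt` enum and `retry_n` are illustrative, not the crate's API), checks that a transiently failing closure is re-invoked until it succeeds and that exhausting the budget surfaces the last error; the mutable `calls` counter also shows why the closure must be `FnMut`.

```rust
#[cfg(test)]
mod backoff_behavior_tests {
    // Illustrative stand-in for the crate's retry driver so the tests run on
    // their own; the real tests exercise ExponentialBackoff::retry directly.
    enum Attempt<T, E> {
        Transient(E),
        Done(T),
    }

    fn retry_n<T, E>(max_tries: u32, mut f: impl FnMut() -> Attempt<T, E>) -> Result<T, E> {
        let mut last = None;
        for _ in 0..max_tries {
            match f() {
                Attempt::Done(v) => return Ok(v),
                Attempt::Transient(e) => last = Some(e),
            }
        }
        Err(last.expect("max_tries must be >= 1"))
    }

    #[test]
    fn transient_errors_are_retried_until_success() {
        let mut calls = 0;
        let result = retry_n(5, || {
            calls += 1;
            if calls < 3 {
                Attempt::Transient("throttled")
            } else {
                Attempt::Done(calls)
            }
        });
        assert_eq!(result, Ok(3));
        assert_eq!(calls, 3);
    }

    #[test]
    fn exhausted_budget_surfaces_the_last_error() {
        let result: Result<u32, _> = retry_n(2, || Attempt::Transient("throttled"));
        assert_eq!(result, Err("throttled"));
    }
}
```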
    3 files reviewed, 3 comments
Codecov Report ❌

Additional details and impacted files:

@@            Coverage Diff             @@
##             main    #5393      +/-   ##
==========================================
- Coverage   79.19%   71.46%   -7.73%     
==========================================
  Files         893      990      +97     
  Lines      124095   125496    +1401     
==========================================
- Hits        98279    89690    -8589     
- Misses      25816    35806    +9990     
Reviewed diff context:

    let get_result = source
        .get(path.as_ref(), range.clone(), io_stats.clone())
    let backoff = ExponentialBackoff { max_waittime_ms: Some(45_000), ..Default::default() };
shall we consider using the retry classifier of s3?
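For context on that question, a hedged sketch of the knob the reviewer is likely referring to: the AWS SDK for Rust ships a built-in standard retry mode that already classifies throttling and transient errors and applies jittered exponential backoff. The names below follow the public aws-config / aws-sdk-s3 APIs; how (or whether) this would slot into daft-io's client construction is an assumption.

```rust
use aws_config::retry::RetryConfig;
use aws_config::BehaviorVersion;

// Configure the S3 client to use the SDK's standard retry strategy instead
// of, or in addition to, an application-level backoff loop.
async fn s3_client_with_sdk_retries() -> aws_sdk_s3::Client {
    let shared = aws_config::defaults(BehaviorVersion::latest()).load().await;
    let conf = aws_sdk_s3::config::Builder::from(&shared)
        .retry_config(RetryConfig::standard().with_max_attempts(5))
        .build();
    aws_sdk_s3::Client::from_conf(conf)
}
```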
@Jay-ju let us know when it's ready for review, thanks!
Changes Made
Related Issues
Checklist
- docs/mkdocs.yml navigation