
Conversation

@Jay-ju Jay-ju (Contributor) commented Oct 16, 2025

Changes Made

Related Issues

Checklist

  • Documented in API Docs (if applicable)
  • Documented in User Guide (if applicable)
  • If adding a new documentation page, doc is added to docs/mkdocs.yml navigation
  • Documentation builds and is formatted properly

…ttling as transient; reduce glob fanout; add minimal backoff tests
@github-actions github-actions bot added the feat label Oct 16, 2025
@greptile-apps greptile-apps bot left a comment

Greptile Overview

Summary

This PR adds retry logic with exponential backoff and a 45s maximum wait time to S3-like storage operations (LIST, GET, HEAD). It also classifies TOS (Volcano Engine) throttling errors as transient and reduces the glob fanout limit from 1024 to 256.

Key Changes:

  • Modified ExponentialBackoff::retry signature from Fn to FnMut to allow closures to capture and mutate state
  • Added retry logic with transient error classification to single_url_get and single_url_get_size in lib.rs (sketched after this list)
  • Wrapped all S3 list operations in retry logic with proper error classification
  • Added TOS-specific throttling error codes (ExceedAccountQPSLimit, ExceedAccountRateLimit, etc.)
  • Reduced DEFAULT_GLOB_FANOUT_LIMIT from 1024 to 256
  • Added basic unit tests for retry behavior
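
For orientation, here is a sketch of the wrapped GET path. ExponentialBackoff, max_waittime_ms: Some(45_000), RetryError::Transient/Permanent, the source.get(path, range, io_stats) call, and the TOS throttling codes all appear in this PR and review; the is_transient helper and the per-attempt cloning are illustrative assumptions, not the PR's exact code.

```rust
// Sketch only; see the note above for which names are assumptions.
let backoff = ExponentialBackoff {
    max_waittime_ms: Some(45_000), // cap total backoff at 45s
    ..Default::default()
};

let get_result = backoff
    .retry(|| {
        // `retry` now takes FnMut, so this closure may also capture and
        // mutate state (counters, permits, ...) across attempts.
        let (source, path, range, io_stats) =
            (source.clone(), path.clone(), range.clone(), io_stats.clone());
        async move {
            match source.get(path.as_ref(), range, io_stats).await {
                Ok(r) => Ok(r),
                // Throttling (incl. TOS ExceedAccountQPSLimit / ExceedAccountRateLimit),
                // timeouts and socket errors are transient: retry with backoff.
                Err(e) if is_transient(&e) => Err(RetryError::Transient(e)),
                // Anything else (404, access denied, ...) fails immediately.
                Err(e) => Err(RetryError::Permanent(e)),
            }
        }
    })
    .await?;
```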

Critical Issue Found:

  • The semaphore acquisition pattern in all three LIST retry blocks has a logic error: connection_pool_sema.acquire() is called outside the async block, which prevents proper retry behavior (see the sketch below)
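
To make the flagged pattern concrete, a hypothetical before/after sketch follows. connection_pool_sema comes from the review comment; backoff, list_objects, and the surrounding plumbing are illustrative. The point is that acquisition must happen inside the retried future so every attempt re-acquires a permit.

```rust
// Flagged shape (illustrative): the permit is acquired once, outside the
// retried future, so acquisition is never part of what gets retried and
// subsequent attempts do not behave as intended.
let _permit = self.connection_pool_sema.acquire().await?;
let page = backoff
    .retry(|| async { self.list_objects(&path).await })
    .await?;

// Requested shape (illustrative): acquire inside the async block so that
// every attempt re-acquires a permit before issuing its LIST call.
let page = backoff
    .retry(|| async {
        let _permit = self.connection_pool_sema.acquire().await?;
        self.list_objects(&path).await
    })
    .await?;
```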

Confidence Score: 2/5

  • This PR has critical logic errors in semaphore acquisition that prevent proper retry behavior
  • Score reflects three critical logic bugs in the retry implementation where semaphore permits are acquired outside the async block, meaning retries will fail or behave incorrectly. The retry logic itself is sound, and the transient error classification is appropriate, but the semaphore acquisition pattern breaks the retry mechanism in all three LIST operations.
  • src/daft-io/src/s3_like.rs requires immediate attention - all three retry blocks (lines 1259, 1310, 1372) have the same semaphore acquisition bug

Important Files Changed

File Analysis

| Filename | Score | Overview |
|---|---|---|
| src/daft-io/src/retry.rs | 5/5 | Changed retry method from Fn to FnMut so closures can mutate state; added basic unit tests for transient and permanent error handling (test sketch below) |
| src/daft-io/src/lib.rs | 4/5 | Added exponential backoff retry logic to GET and HEAD operations with a 45s max wait time; classifies transient errors for retry |
| src/daft-io/src/s3_like.rs | 3/5 | Added TOS throttling errors and retry logic to S3 list operations; reduced glob fanout from 1024 to 256; critical issue with semaphore acquisition in retry closures |
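
The new retry.rs unit tests are not shown in this thread. Below is a minimal sketch of what "retry on transient, fail fast on permanent" tests could look like, assuming retry takes an FnMut closure returning a future of Result<T, RetryError<E>>, resolves to Result<T, E>, and runs under tokio.

```rust
#[cfg(test)]
mod retry_sketch_tests {
    use super::{ExponentialBackoff, RetryError};

    #[tokio::test]
    async fn transient_errors_are_retried_until_success() {
        let backoff = ExponentialBackoff {
            max_waittime_ms: Some(1_000),
            ..Default::default()
        };

        // Mutating a captured counter is exactly what the Fn -> FnMut change enables.
        let mut attempts = 0u32;
        let result = backoff
            .retry(|| {
                attempts += 1;
                let n = attempts;
                async move {
                    if n < 3 {
                        Err(RetryError::Transient("throttled"))
                    } else {
                        Ok(n)
                    }
                }
            })
            .await;

        assert_eq!(result.unwrap(), 3);
        assert_eq!(attempts, 3);
    }

    #[tokio::test]
    async fn permanent_errors_fail_immediately() {
        let backoff = ExponentialBackoff::default();
        let result: Result<(), &str> = backoff
            .retry(|| async { Err(RetryError::Permanent("not found")) })
            .await;
        assert!(result.is_err());
    }
}
```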

Sequence Diagram

```mermaid
sequenceDiagram
    participant Client
    participant IOClient
    participant ExponentialBackoff
    participant S3LikeSource
    participant AWS_S3

    Client->>IOClient: single_url_get(url)
    IOClient->>ExponentialBackoff: retry(closure)

    loop Until success/max_tries/timeout
        ExponentialBackoff->>S3LikeSource: get(path, range)
        S3LikeSource->>AWS_S3: GET request

        alt Success
            AWS_S3-->>S3LikeSource: data
            S3LikeSource-->>ExponentialBackoff: Ok(data)
            ExponentialBackoff-->>IOClient: Ok(data)
        else Transient Error (Throttle/Timeout/Socket)
            AWS_S3-->>S3LikeSource: error
            S3LikeSource-->>ExponentialBackoff: RetryError::Transient
            ExponentialBackoff->>ExponentialBackoff: sleep with backoff
        else Permanent Error
            AWS_S3-->>S3LikeSource: error
            S3LikeSource-->>ExponentialBackoff: RetryError::Permanent
            ExponentialBackoff-->>IOClient: Err(error)
        end
    end

    IOClient-->>Client: Result
```

3 files reviewed, 3 comments


@codecov codecov bot commented Oct 16, 2025

Codecov Report

❌ Patch coverage is 83.83838% with 16 lines in your changes missing coverage. Please review.
✅ Project coverage is 71.46%. Comparing base (c323ad0) to head (1085d95).
⚠️ Report is 337 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/daft-io/src/s3_like.rs | 68.75% | 10 Missing ⚠️ |
| src/daft-io/src/lib.rs | 81.81% | 6 Missing ⚠️ |
Additional details and impacted files

Impacted file tree graph

```
@@            Coverage Diff             @@
##             main    #5393      +/-   ##
==========================================
- Coverage   79.19%   71.46%   -7.73%
==========================================
  Files         893      990      +97
  Lines      124095   125496    +1401
==========================================
- Hits        98279    89690    -8589
- Misses      25816    35806    +9990
```
| Files with missing lines | Coverage Δ |
|---|---|
| src/daft-io/src/retry.rs | 100.00% <100.00%> (+100.00%) ⬆️ |
| src/daft-io/src/lib.rs | 72.76% <81.81%> (-0.61%) ⬇️ |
| src/daft-io/src/s3_like.rs | 61.00% <68.75%> (-8.04%) ⬇️ |

... and 885 files with indirect coverage changes



Review comment context (excerpts from the diff):

```rust
let get_result = source
    .get(path.as_ref(), range.clone(), io_stats.clone())
```

```rust
let backoff = ExponentialBackoff { max_waittime_ms: Some(45_000), ..Default::default() };
```

A contributor commented on this hunk:

shall we consider using the retry classifier of s3?
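
For context, the AWS SDK for Rust ships a retry classifier that already treats throttling and transient transport errors as retryable; opting into it, instead of or alongside the hand-rolled backoff, would look roughly like the sketch below. Builder names are from aws-config/aws-sdk-s3 and should be checked against the SDK version Daft pins; TOS-specific error codes would still need custom classification.

```rust
use std::time::Duration;

use aws_config::{retry::RetryConfig, BehaviorVersion};

// Sketch: let the SDK's standard retry mode classify throttling/transient
// errors and apply jittered exponential backoff for us.
async fn s3_client_with_sdk_retries() -> aws_sdk_s3::Client {
    let retry = RetryConfig::standard()
        .with_max_attempts(5)
        .with_max_backoff(Duration::from_secs(45));

    let conf = aws_config::defaults(BehaviorVersion::latest())
        .retry_config(retry)
        .load()
        .await;

    aws_sdk_s3::Client::new(&conf)
}
```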

@ohbh ohbh marked this pull request as draft October 17, 2025 22:19
@ohbh ohbh (Contributor) commented Oct 17, 2025

@Jay-ju let us know when it's ready for review, thanks!
