feat(csharp/src/Drivers/Apache): Add prefetch functionality to CloudFetch in Spark ADBC driver #2678

Merged
merged 1 commit into apache:main on Apr 24, 2025

Conversation

jadewang-db
Contributor

@jadewang-db jadewang-db commented Apr 7, 2025

Add Prefetch Functionality to CloudFetch in Spark ADBC Driver

This PR adds prefetching to the CloudFetch feature in the Spark ADBC driver. Fetching multiple batches of results ahead of time improves performance, because data is already available when the reader asks for it.

Changes

CloudFetchResultFetcher Enhancements

  • Initial Prefetch: When the fetcher starts, it now prefetches multiple batches up front, so data is available as soon as it is needed.
  • State Management: The fetcher now tracks the current batch offset and batch size, and resets this state each time it starts.
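
The two points above can be illustrated with a minimal, language-neutral sketch. It is not the driver's C# code: the names `Batch`, `PrefetchingFetcher`, `fetch_batch`, and `prefetch_count` are all hypothetical, and the sketch runs synchronously, whereas a real fetcher would more likely fill its buffer from a background task.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, Optional


@dataclass
class Batch:
    """One batch of results, identified by where it starts and how many rows it holds."""
    start_offset: int
    row_count: int


class PrefetchingFetcher:
    """Hypothetical fetcher that keeps a few batches buffered ahead of the reader."""

    def __init__(self, fetch_batch: Callable[[int], Optional[Batch]], prefetch_count: int = 2):
        # fetch_batch(offset) is an assumed callback: it returns the batch that
        # starts at `offset`, or None when there are no more results.
        self._fetch_batch = fetch_batch
        self._prefetch_count = prefetch_count
        self._buffer: Deque[Batch] = deque()
        self._next_offset = 0
        self._last_batch_size = 0
        self._done = False

    @property
    def buffered(self) -> int:
        return len(self._buffer)

    def start(self) -> None:
        # State reset: a restarted fetcher must not reuse old offsets or batches.
        self._buffer.clear()
        self._next_offset = 0
        self._last_batch_size = 0
        self._done = False
        # Initial prefetch: fill the buffer before the reader asks for anything.
        for _ in range(self._prefetch_count):
            if not self._fetch_one():
                break

    def _fetch_one(self) -> bool:
        if self._done:
            return False
        batch = self._fetch_batch(self._next_offset)
        if batch is None:
            self._done = True
            return False
        self._buffer.append(batch)
        # Track offset and size so the next request continues where this one ended.
        self._next_offset = batch.start_offset + batch.row_count
        self._last_batch_size = batch.row_count
        return True

    def next_batch(self) -> Optional[Batch]:
        if not self._buffer:
            self._fetch_one()
        if not self._buffer:
            return None
        batch = self._buffer.popleft()
        self._fetch_one()  # top the buffer back up after handing out a batch
        return batch
```

In this sketch the first call to `next_batch()` after `start()` returns immediately from the buffer, which is the effect the initial prefetch is meant to achieve.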

Interface Updates

  • Added new methods to ICloudFetchResultFetcher interface:

Testing Infrastructure

  • Created ITestableHiveServer2Statement interface to facilitate testing
  • Updated tests to account for prefetch behavior
  • Ensured all tests pass with the new prefetch functionality
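
As a rough illustration of why a dedicated interface helps with testing, the sketch below lets code depend on a small protocol so that a test can substitute a fake for the real statement. The members of `ITestableHiveServer2Statement` are not listed in this description, so `StatementLike`, `fetch_results`, `drain`, and `FakeStatement` are all hypothetical names chosen for the example.

```python
from typing import List, Protocol


class StatementLike(Protocol):
    """Hypothetical test seam: the only operation the code under test needs."""

    def fetch_results(self, start_offset: int) -> List[int]:
        ...


def drain(statement: StatementLike) -> List[int]:
    """Code under test: reads chunks until the statement returns an empty one."""
    rows: List[int] = []
    offset = 0
    while True:
        chunk = statement.fetch_results(offset)
        if not chunk:
            return rows
        rows.extend(chunk)
        offset += len(chunk)


class FakeStatement:
    """In-memory stand-in that records which offsets were requested."""

    def __init__(self, rows: List[int], chunk_size: int):
        self._rows = rows
        self._chunk_size = chunk_size
        self.requested_offsets: List[int] = []

    def fetch_results(self, start_offset: int) -> List[int]:
        self.requested_offsets.append(start_offset)
        return self._rows[start_offset:start_offset + self._chunk_size]
```

Because the fake records every requested offset, a test can check both the rows returned and the order in which they were requested, which matters once prefetching changes when requests are issued.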

Benefits

  • Improved Performance: By prefetching multiple batches, data is available sooner, reducing wait times.
  • Better Reliability: Enhanced error handling and state management make the system more robust.
  • More Efficient Resource Usage: Link caching reduces unnecessary server requests.
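
The description does not say how link caching is implemented. One plausible shape, sketched here under stated assumptions, is a cache keyed by batch start offset that only goes back to the server when a link is missing or has expired. `ResultLink`, `LinkCache`, `request_link`, and the expiry field are hypothetical and are not taken from the driver.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class ResultLink:
    """Hypothetical download link for the batch starting at start_offset."""
    start_offset: int
    url: str
    expires_at: float  # assumed: links stop being valid after this time


class LinkCache:
    """Returns a cached link while it is still valid; otherwise asks the server."""

    def __init__(self, request_link: Callable[[int], ResultLink], now: Callable[[], float]):
        # request_link and now are injected so the cache can be tested
        # without a real server or a real clock.
        self._request_link = request_link
        self._now = now
        self._links: Dict[int, ResultLink] = {}
        self.server_requests = 0

    def get(self, start_offset: int) -> ResultLink:
        link = self._links.get(start_offset)
        if link is None or link.expires_at <= self._now():
            # Cache miss or expired link: this is the only path that hits the server.
            self.server_requests += 1
            link = self._request_link(start_offset)
            self._links[start_offset] = link
        return link
```

Repeated lookups for the same offset are served from memory in this sketch, so the server is contacted once per link lifetime rather than once per read.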

This implementation maintains backward compatibility while improving performance for CloudFetch operations.

@github-actions github-actions bot added this to the ADBC Libraries 18 milestone Apr 7, 2025
@jadewang-db jadewang-db force-pushed the cloudfetch-pipeline branch 2 times, most recently from 01daf70 to a388213 April 14, 2025 19:49
Contributor

@CurtHagenlocher CurtHagenlocher left a comment

Thanks! I'm still reviewing this logic but thought I'd give some initial feedback.

Also, please take a look at the linter output and make changes accordingly.

Update DatabricksParameters.cs

address comments

fix linter

rebase to master

refactor to fix unit test

refactor

some code refactoring

refactor

Delete CloudFetchDownloadManagerTest.cs

Initital changes
Contributor

@CurtHagenlocher CurtHagenlocher left a comment

Thanks! Looks great!

@CurtHagenlocher CurtHagenlocher merged commit 7f3d33b into apache:main Apr 24, 2025
6 checks passed
colin-rogers-dbt pushed a commit to dbt-labs/arrow-adbc that referenced this pull request Jun 10, 2025
…etch in Spark ADBC driver (apache#2678)
