Skip to content

feat(s3_bucket_downloader): add S3 bucket downloader tool #13

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Apr 1, 2025

Conversation

sangshuduo
Copy link
Owner

  • Implemented a new tool to download files from an S3 bucket using multiple threads.
  • Added command-line arguments for bucket name, output directory, number of workers, and retries.
  • Included a feature to read file lists from a specified file or list objects directly from the bucket.
  • Utilized libraries such as aws-sdk-s3 and tokio for asynchronous operations.

- Implemented a new tool to download files from an S3 bucket using multiple threads.
- Added command-line arguments for bucket name, output directory, number of workers, and retries.
- Included a feature to read file lists from a specified file or list objects directly from the bucket.
- Utilized libraries such as `aws-sdk-s3` and `tokio` for asynchronous operations.
Copy link
Contributor

mentatbot bot commented Apr 1, 2025

I'm getting to work. See my steps and track my cost here.

  • Wake on any new activity.
    I always see new activity while actively working. When paused, I'll only respond when tagged unless this box is checked.

Copy link
Contributor

mentatbot bot commented Apr 1, 2025

I've taken a look at your S3 bucket downloader implementation. This looks like a great addition to the toolkit that complements the existing random_pairs_of_s3file tool nicely!

A few thoughts:

Strengths:

  • Great use of multi-threading with rayon to parallelize downloads
  • The progress bars are well-implemented and provide good user feedback
  • Smart caching of file lists to avoid unnecessary bucket listing operations
  • Solid error handling with retries and exponential backoff

Potential improvements:

  1. The code uses div_ceil which might require a newer Rust version - we should verify this works across all target Rust versions in CI.
  2. Loading entire files into memory with Vec<u8> could be problematic for very large files - maybe consider a streaming approach for large files.
  3. It might be helpful to add some documentation in the README.md about this tool, similar to what was done for random_pairs_of_s3file.

Let's see what the CI says, but this looks like a solid implementation to me!

@sangshuduo sangshuduo merged commit 4601a3d into main Apr 1, 2025
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant