A robust, high-performance Python tool for uploading massive amounts of files to Amazon S3 with confidence.
Let's face it – uploading hundreds of thousands of files to S3 is a special kind of headache. You start a simple script, go grab coffee, return to find it crashed halfway through, and now you're playing the "which files actually made it?" detective game. Not anymore!
This tool was born from the battle scars of managing 50k+ directories containing 400k+ images. It handles the complexities so you don't have to.
- Parallel Processing: Concurrent uploads to maximize throughput
- Robust Error Handling: Automatic retries with exponential backoff (sketched just below this list)
- Structure Preservation: Maintains your folder hierarchy in S3
- Comprehensive Logging: Detailed logs of everything that happens
- Performance Metrics: Track speed, progress, and completion statistics
- Selective Uploads: Upload only specific files from a provided list
- Dry Run Mode: Test your setup without actually uploading anything
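For the curious, "automatic retries with exponential backoff" boils down to the classic retry loop below. This is a minimal illustration, not the tool's exact implementation; the real script honours `--retry-count` and runs many uploads in parallel.

```python
import time

import boto3
from botocore.exceptions import ClientError


def upload_with_retries(s3_client, local_path, bucket, key, retry_count=3):
    """Attempt an upload, backing off 1s, 2s, 4s, ... between failures."""
    for attempt in range(retry_count + 1):
        try:
            s3_client.upload_file(local_path, bucket, key)
            return True
        except (ClientError, boto3.exceptions.S3UploadFailedError):
            if attempt == retry_count:
                raise                    # retries exhausted: surface the error
            time.sleep(2 ** attempt)     # exponential backoff: 1, 2, 4, ... seconds


# Example (placeholder names):
# upload_with_retries(boto3.client("s3"), "/data/images/a.jpg", "my-bucket", "a.jpg")
```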
# Clone the repository
git clone https://github.com/poacosta/aws-s3-batch-uploader.git
cd aws-s3-batch-uploader
# Create a virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
- boto3
- tqdm
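If you'd rather skip the requirements file, installing the two dependencies directly works just as well: `pip install boto3 tqdm`.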
Properly configuring your AWS CLI can significantly improve S3 transfer speeds. Here's how to optimize it:
Run the standard interactive configuration first:
aws configure
Then apply the region and S3 performance settings:
# Set the region (basic configuration)
aws configure set region us-east-1
# Configure S3 performance parameters
aws configure set s3.max_concurrent_requests 100
aws configure set s3.max_queue_size 10000
aws configure set s3.multipart_threshold 64MB
aws configure set s3.multipart_chunksize 16MB
aws configure set s3.max_bandwidth 50MB/s
aws configure set s3.use_accelerate_endpoint true
aws configure set s3.addressing_style virtual
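Note that `s3.use_accelerate_endpoint` only helps if Transfer Acceleration is actually enabled on the bucket; if it isn't, the bucket owner can turn it on with `aws s3api put-bucket-accelerate-configuration --bucket my-bucket --accelerate-configuration Status=Enabled` (substitute your bucket name).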
Performance Parameters Explained:

Parameter | Description | Recommendation |
---|---|---|
`max_concurrent_requests` | Maximum concurrent API requests | 100 for high-bandwidth connections |
`max_queue_size` | Maximum number of tasks in queue | 10000 for large batch operations |
`multipart_threshold` | Size threshold for multipart uploads | 64MB balances performance and reliability |
`multipart_chunksize` | Size of each multipart chunk | 16MB works well for most connections |
`max_bandwidth` | Bandwidth limit | Set based on your network capacity |
`use_accelerate_endpoint` | Use S3 Transfer Acceleration | Enable only if the bucket has it configured |
`addressing_style` | S3 endpoint addressing style | Virtual offers better compatibility |
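Keep in mind that these settings tune the aws CLI itself; they do not affect boto3 scripts. If you want comparable behaviour in your own boto3 code, the analogous knobs live in `boto3.s3.transfer.TransferConfig`. The sketch below mirrors the CLI values with placeholder file and bucket names; it is not necessarily how `s3_uploader.py` configures its transfers.

```python
import boto3
from boto3.s3.transfer import TransferConfig

MB = 1024 ** 2

# Roughly mirrors the CLI tuning above for in-process boto3 transfers.
transfer_config = TransferConfig(
    multipart_threshold=64 * MB,   # switch to multipart uploads above 64 MB
    multipart_chunksize=16 * MB,   # 16 MB parts
    max_concurrency=10,            # worker threads per individual transfer
    use_threads=True,
)

s3 = boto3.client("s3")
s3.upload_file("photo.jpg", "my-bucket", "photos/photo.jpg", Config=transfer_config)
```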
- Ensure your AWS credentials are configured (via environment variables, ~/.aws/credentials, or IAM role)
- Create a text file listing paths to upload (one path per line)
- Run the uploader:
python s3_uploader.py paths_to_upload.txt my-bucket-name
That's it! The tool will handle the rest, including logging, progress display, and error handling.
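For reference, the input file really is just one path per line. A hypothetical paths_to_upload.txt (made-up paths, purely illustrative):

```
/data/images/00001/photo_0001.jpg
/data/images/00001/photo_0002.jpg
/data/images/00002/scan_0001.png
```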
# Basic upload with default settings
python s3_uploader.py file_list.txt my-bucket

# Preserve folder structure relative to a base path
python s3_uploader.py file_list.txt my-bucket --base-path /data/images

# Place every object under a destination prefix
python s3_uploader.py file_list.txt my-bucket --destination-prefix customer_uploads/batch1

# Raise the number of concurrent uploads
python s3_uploader.py file_list.txt my-bucket --max-concurrency 20

# Dry run with verbose logging (nothing is uploaded)
python s3_uploader.py file_list.txt my-bucket --dry-run --verbose

# Read paths from a CSV: skip the header row, take paths from column index 2
python s3_uploader.py image_paths.csv my-bucket --skip-header --path-column 2
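The last example assumes a CSV whose path lives in column index 2 (zero-based, per the defaults table below) and whose first row is a header. A made-up image_paths.csv matching that invocation:

```
id,customer,filepath
101,acme,/data/images/acme/img_101.jpg
102,acme,/data/images/acme/img_102.jpg
```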
Argument | Description | Default |
---|---|---|
`file_list` | Path to CSV or TXT file with files to upload | (Required) |
`bucket` | S3 bucket name | (Required) |
`--base-path` | Base path for calculating relative paths | None |
`--destination-prefix` | Prefix for all S3 object keys | None |
`--profile` | AWS profile name to use | Default profile |
`--region` | AWS region | us-east-1 |
`--storage-class` | S3 storage class | STANDARD |
`--path-column` | Column index in CSV with file paths | 0 |
`--skip-header` | Skip CSV file header | False |
`--max-concurrency` | Maximum concurrent uploads | 10 |
`--retry-count` | Number of retry attempts for failures | 3 |
`--dry-run` | Simulate without actual uploads | False |
`--verbose`, `-v` | Increase verbosity (can be used multiple times) | 0 |
`--no-progress` | Disable progress bar | False |
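To make `--base-path` and `--destination-prefix` concrete: each object's key is presumably the file's path relative to `--base-path`, with `--destination-prefix` prepended. Here is a rough sketch of that mapping (illustrative only, including a guessed fallback when no base path is given; the actual script may differ in details):

```python
import os
import posixpath


def build_s3_key(local_path, base_path=None, destination_prefix=None):
    """Sketch of how a local path could map to an S3 object key."""
    if base_path:
        relative = os.path.relpath(local_path, base_path)
    else:
        relative = os.path.basename(local_path)  # fallback is a guess
    key = relative.replace(os.sep, "/")          # S3 keys use forward slashes
    if destination_prefix:
        key = posixpath.join(destination_prefix, key)
    return key


print(build_s3_key("/data/images/00001/photo_0001.jpg",
                   base_path="/data/images",
                   destination_prefix="customer_uploads/batch1"))
# -> customer_uploads/batch1/00001/photo_0001.jpg
```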
After uploading more files than I care to count, here are some hard-won insights:
- Start Small: Test with a small subset before committing to a 400k-file upload
- Monitor Bandwidth: If you're running this in production, be mindful of your network capacity
- Appropriate Concurrency:
  - Low-powered machine: `--max-concurrency 5`
  - Average laptop/desktop: `--max-concurrency 10`
  - Beefy server with a great connection: `--max-concurrency 20`
- Storage Class Selection: Use `--storage-class STANDARD_IA` for less frequently accessed assets to save costs
- Failure Analysis: After completion, check the logs and failure reports to identify patterns in any failed uploads
- Resume Strategy: If a large upload job fails, you can:
  - Fix the failure report CSV (remove successfully uploaded files)
  - Use it as the new input file for a subsequent run
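For the resume strategy above: if the failure report is a CSV with the original local path in one of its columns (an assumption, so check the real report's layout first), turning it back into a fresh input list can be a few lines of Python:

```python
import csv

# Assumption: the failure report stores the original local path in column 0.
# Adjust PATH_COLUMN (and the filenames) to match the actual report.
PATH_COLUMN = 0

with open("failure_report.csv", newline="") as report, \
        open("retry_list.txt", "w") as retry_list:
    for row in csv.reader(report):
        if row:
            retry_list.write(row[PATH_COLUMN] + "\n")
```

Then point the uploader at the new list: `python s3_uploader.py retry_list.txt my-bucket`.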
Problem | Possible Solution |
---|---|
Slow uploads | Reduce concurrency, check network bandwidth |
Memory issues | Reduce concurrency to lower memory usage |
AWS credential errors | Verify credentials with `aws s3 ls` |
Files not found | Verify paths are accessible from the script's location |
Permission denied | Check file system permissions on source files |
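For the credential row above: besides `aws s3 ls`, a quick way to see which identity boto3 itself resolves (it can differ from the CLI when profiles are involved) is an STS call:

```python
import boto3

# Prints the account ID and ARN of whatever credentials boto3 resolves.
# Pass profile_name="your-profile" to mimic running with --profile.
session = boto3.Session()
print(session.client("sts").get_caller_identity())
```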
Based on extensive testing, here's what you can expect:
Network Speed | Files/second* | MB/second* |
---|---|---|
100 Mbps | ~5-10 | ~10 |
500 Mbps | ~15-30 | ~50 |
1 Gbps | ~25-50 | ~100 |
*Depends on file size, concurrency settings, and AWS region proximity
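As a rough sanity check on these figures: 1 Gbps is about 125 MB/s of raw bandwidth, so ~100 MB/s of sustained upload is roughly 80% utilization, and ~25-50 files per second at that rate implies average objects of about 2-4 MB, which lines up with image-heavy workloads like the one this tool was built for.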
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add some amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- The boto3 team for their excellent AWS SDK
- tqdm for the lovely progress bars
- Everyone who's ever had to upload thousands of files to S3 and felt the pain
"I spent so much time uploading files that I automated myself out of a job. Worth it." - The Developer