aws-s3-batch-uploader 🚀

A robust, high-performance Python tool for uploading massive amounts of files to Amazon S3 with confidence.


The Problem This Solves

Let's face it – uploading hundreds of thousands of files to S3 is a special kind of headache. You start a simple script, go grab coffee, return to find it crashed halfway through, and now you're playing the "which files actually made it?" detective game. Not anymore!

This tool was born from the battle scars of managing 50k+ directories containing 400k+ images. It handles the complexities so you don't have to.

✨ Key Features

  • Parallel Processing: Concurrent uploads to maximize throughput
  • Robust Error Handling: Automatic retries with exponential backoff (a minimal sketch of this pattern follows the list)
  • Structure Preservation: Maintains your folder hierarchy in S3
  • Comprehensive Logging: Detailed logs of everything that happens
  • Performance Metrics: Track speed, progress, and completion statistics
  • Selective Uploads: Upload only specific files from a provided list
  • Dry Run Mode: Test your setup without actually uploading anything
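
To make the first two bullets concrete, here is a minimal sketch of parallel uploads with exponential-backoff retries using boto3 and a thread pool. It illustrates the pattern only and is not the tool's actual implementation; the bucket name, key mapping, and worker count are placeholders.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

import boto3

s3 = boto3.client("s3")  # boto3 clients can be shared across threads


def upload_with_retries(local_path, bucket, key, retries=3):
    """Upload one file, retrying with exponential backoff on failure."""
    for attempt in range(retries + 1):
        try:
            s3.upload_file(local_path, bucket, key)
            return True
        except Exception:  # broad catch for the sketch; real code should catch specific boto3/botocore errors
            if attempt == retries:
                return False
            time.sleep(2 ** attempt)  # back off 1s, 2s, 4s, ...


def upload_all(files, bucket, max_concurrency=10):
    """Upload (local_path, s3_key) pairs concurrently and return the paths that failed."""
    failures = []
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        futures = {pool.submit(upload_with_retries, p, bucket, k): p for p, k in files}
        for future in as_completed(futures):
            if not future.result():
                failures.append(futures[future])
    return failures
```

The real tool layers logging, progress reporting, and the command-line flags described below on top of this core loop.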

🔧 Installation

# Clone the repository
git clone https://github.com/poacosta/aws-s3-batch-uploader.git
cd aws-s3-batch-uploader

# Create a virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

Dependencies

  • boto3
  • tqdm

⚙️ AWS Configuration for Optimal Performance

Properly configuring your AWS CLI can significantly improve S3 transfer speeds. Here's how to optimize it:

Run the standard configuration

aws configure

Then apply the performance settings:

# Set the region (basic configuration)
aws configure set region us-east-1

# Configure S3 performance parameters
aws configure set s3.max_concurrent_requests 100
aws configure set s3.max_queue_size 10000
aws configure set s3.multipart_threshold 64MB
aws configure set s3.multipart_chunksize 16MB
aws configure set s3.max_bandwidth 50MB/s
aws configure set s3.use_accelerate_endpoint true
aws configure set s3.addressing_style virtual
Performance Parameters Explained

| Parameter | Description | Recommendation |
| --- | --- | --- |
| max_concurrent_requests | Maximum concurrent API requests | 100 for high-bandwidth connections |
| max_queue_size | Maximum number of tasks in queue | 10000 for large batch operations |
| multipart_threshold | Size threshold for multipart uploads | 64MB balances performance and reliability |
| multipart_chunksize | Size of each multipart chunk | 16MB works well for most connections |
| max_bandwidth | Bandwidth limit | Set based on your network capacity |
| use_accelerate_endpoint | Use S3 Transfer Acceleration | Enable if the bucket has it configured |
| addressing_style | S3 endpoint addressing style | Virtual offers better compatibility |
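
The aws configure settings above tune the AWS CLI. If you want to set comparable multipart behavior directly in boto3 code, a TransferConfig along these lines is one option; the values simply mirror the recommendations above and are not this tool's defaults.

```python
import boto3
from boto3.s3.transfer import TransferConfig

# Values mirror the CLI recommendations above; tune them for your network.
transfer_config = TransferConfig(
    multipart_threshold=64 * 1024 * 1024,  # start multipart uploads at 64 MB
    multipart_chunksize=16 * 1024 * 1024,  # 16 MB parts
    max_concurrency=10,                    # threads used for a single upload
    use_threads=True,
)

s3 = boto3.client("s3")
s3.upload_file("local/file.jpg", "my-bucket", "prefix/file.jpg", Config=transfer_config)
```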

🚀 Quick Start

  1. Ensure your AWS credentials are configured (via environment variables, ~/.aws/credentials, or IAM role)
  2. Create a text file listing the paths to upload, one path per line (a sketch for generating this file appears below)
  3. Run the uploader:
python s3_uploader.py paths_to_upload.txt my-bucket-name

That's it! The tool will handle the rest, including logging, progress display, and error handling.
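
If you need to generate the path list, here is a minimal sketch using pathlib; the /data/images directory and the .jpg filter are placeholders for your own data.

```python
from pathlib import Path

# Collect every .jpg under /data/images and write one absolute path per line.
paths = sorted(Path("/data/images").rglob("*.jpg"))
Path("paths_to_upload.txt").write_text("\n".join(str(p) for p in paths) + "\n")
```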

📋 Usage Examples

Basic Upload

python s3_uploader.py file_list.txt my-bucket

Preserve Directory Structure

python s3_uploader.py file_list.txt my-bucket --base-path /data/images

Upload to a Specific Folder in S3

python s3_uploader.py file_list.txt my-bucket --destination-prefix customer_uploads/batch1
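
For intuition on how --base-path and --destination-prefix combine, the sketch below shows one plausible key derivation. It is an assumption based on the option descriptions, not the tool's exact code.

```python
from pathlib import Path, PurePosixPath


def derive_key(local_path, base_path=None, destination_prefix=None):
    """Sketch of how an S3 key could be derived from a local path."""
    p = Path(local_path)
    relative = p.relative_to(base_path) if base_path else Path(p.name)
    key = PurePosixPath(*relative.parts)  # S3 keys always use forward slashes
    if destination_prefix:
        key = PurePosixPath(destination_prefix) / key
    return str(key)


# /data/images/cats/01.jpg -> customer_uploads/batch1/cats/01.jpg
print(derive_key("/data/images/cats/01.jpg", "/data/images", "customer_uploads/batch1"))
```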

Maximum Performance

python s3_uploader.py file_list.txt my-bucket --max-concurrency 20

Thorough Testing (Dry Run)

python s3_uploader.py file_list.txt my-bucket --dry-run --verbose

CSV File with Headers

python s3_uploader.py image_paths.csv my-bucket --skip-header --path-column 2

🎮 Command-Line Options

| Argument | Description | Default |
| --- | --- | --- |
| file_list | Path to CSV or TXT file with files to upload | (Required) |
| bucket | S3 bucket name | (Required) |
| --base-path | Base path for calculating relative paths | None |
| --destination-prefix | Prefix for all S3 object keys | None |
| --profile | AWS profile name to use | Default profile |
| --region | AWS region | us-east-1 |
| --storage-class | S3 storage class | STANDARD |
| --path-column | Column index in CSV with file paths | 0 |
| --skip-header | Skip CSV file header | False |
| --max-concurrency | Maximum concurrent uploads | 10 |
| --retry-count | Number of retry attempts for failures | 3 |
| --dry-run | Simulate without actual uploads | False |
| --verbose, -v | Increase verbosity (can be used multiple times) | 0 |
| --no-progress | Disable progress bar | False |

💡 Pro Tips & Best Practices

After uploading more files than I care to count, I've collected some hard-won insights:

  1. Start Small: Test with a small subset before committing to a 400k file upload

  2. Monitor Bandwidth: If you're running this in production, be mindful of your network capacity

  3. Appropriate Concurrency:

    • Low-powered machine: --max-concurrency 5
    • Average laptop/desktop: --max-concurrency 10
    • Beefy server with great connection: --max-concurrency 20
  4. Storage Class Selection: Use --storage-class STANDARD_IA for less frequently accessed assets to save costs

  5. Failure Analysis: After completion, check logs and failure reports to identify patterns in any failed uploads

  6. Resume Strategy: If a large upload job fails, you can:

    • Edit the failure report CSV to remove files that were successfully uploaded
    • Use it as the new input file for a subsequent run (see the sketch after this list)
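
If hand-editing the failure report is impractical, another option is to rebuild the next input file by comparing your original list against what already landed in the bucket. The sketch below assumes keys are the paths relative to your --base-path value with no --destination-prefix; adjust the key derivation for your layout.

```python
from pathlib import Path

import boto3

BASE = Path("/data/images")  # placeholder: same value you pass to --base-path
BUCKET = "my-bucket"         # placeholder bucket name

# Gather every key already present in the bucket.
s3 = boto3.client("s3")
uploaded = set()
for page in s3.get_paginator("list_objects_v2").paginate(Bucket=BUCKET):
    uploaded.update(obj["Key"] for obj in page.get("Contents", []))

# Keep only the paths whose derived key is missing, and write a new input file.
original = [line.strip() for line in Path("paths_to_upload.txt").read_text().splitlines() if line.strip()]
remaining = [p for p in original if str(Path(p).relative_to(BASE)) not in uploaded]
Path("paths_remaining.txt").write_text("\n".join(remaining) + "\n")
```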

🛠️ Troubleshooting

| Problem | Possible Solution |
| --- | --- |
| Slow uploads | Reduce concurrency, check network bandwidth |
| Memory issues | Reduce concurrency to lower memory usage |
| AWS credential errors | Verify credentials with aws s3 ls |
| Files not found | Verify paths are accessible from the script's location |
| Permission denied | Check file system permissions on source files |

📊 Performance Analysis

Based on extensive testing, here's what you can expect:

| Network Speed | Files/second* | MB/second* |
| --- | --- | --- |
| 100 Mbps | ~5-10 | ~10 |
| 500 Mbps | ~15-30 | ~50 |
| 1 Gbps | ~25-50 | ~100 |

*Depends on file size, concurrency settings, and AWS region proximity

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add some amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

📜 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgements

  • The boto3 team for their excellent AWS SDK
  • tqdm for the lovely progress bars
  • Everyone who's ever had to upload thousands of files to S3 and felt the pain

"I spent so much time uploading files that I automated myself out of a job. Worth it." - The Developer
