Eliminate HEAD requests during downloads, especially for faster transfers of small files by crowecawcaw · Pull Request #363 · boto/s3transfer

crowecawcaw · 2025-11-21T21:57:54Z

This PR optimizes downloads through the TransferManager by removing the upfront HEAD request on an opt-in basis. Previously, every download issued a HEAD request to determine object size before starting the GET. This change eliminates that extra round-trip by extracting metadata from the first GET response instead, when the caller has configured response_checksum_validation='when_required' on the client. The default path still uses HEAD so that full-object checksum validation continues to work.

For small files, the download time is dominated by request latency, so eliminating one of the two requests results in a ~50% download time reduction. For large files, the effect is less noticeable because the download time is dominated by the transfer itself and there are multiple chunks. In both cases, we save the cost of the HEAD request.

What Changed

• No-HEAD download path added, gated on response_checksum_validation='when_required': Downloads on that path start immediately with a ranged GET request for the first chunk. The default path is unchanged.
• Dynamic size detection (no-HEAD path): Extract object size and ETag from the first GET response headers (ContentRange or ContentLength).
• Dynamic chunk scheduling (no-HEAD path): After the first chunk completes, schedule additional chunks only if the object is larger than the chunk size.
• ETag consistency for all chunks: When an ETag is available (either pre-provided via a subscriber, e.g. from a prior HEAD request by the AWS CLI, or extracted from the first GET response), all subsequent ranged GET requests include an IfMatch header with that ETag. This ensures S3 rejects the request if the object changes mid-download. The first GET request also includes IfMatch if an ETag was pre-provided before the download started.
• Zero-byte object handling (no-HEAD path): The first ranged GET on a 0-byte object returns InvalidRange. The new path handles this by returning an empty response instead of retrying with a non-ranged GET.

Testing

Unit, functional, and integ tests pass. I also added a new script to benchmark downloading many small files (scripts/performance/time-batch-download.py). For 1000 × 1 KB files on my laptop, total download duration dropped ~41% (from 15.0s to 8.9s) on the opt-in path.

Backward Compatibility

External API unchanged. All download methods have the same signatures. The default path (HEAD + size-branching GET) is preserved, so no behavior change for callers who don't set response_checksum_validation='when_required'.

Flow Diagrams

Default path (unchanged)

flowchart TD
    A[HEAD Request] --> C{Size < 8MB?}
    C -->|Yes| D[GET Request]
    C -->|No| E[Multiple GET Requests]
    D --> F[Complete]
    E --> F

No-HEAD path (opt-in via response_checksum_validation="when_required")

flowchart TD
    A[GET First Chunk] --> B{Size < 8MB?}
    B -->|Yes| C[Complete]
    B -->|No| D[GET Remaining Chunks]
    D --> C

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

aemous

I left 1 nit. Note that I am not the primary reviewer for this PR, and we are still pending the primary review.

aemous · 2026-03-06T15:21:43Z

Looks like downloads may be failing when the object has a size of 0 bytes. The error I'm seeing is

botocore.exceptions.ClientError: An error occurred (InvalidRange) when calling the GetObject operation: The requested range is not satisfiable

Can you confirm whether you are able to reproduce this, and look into resolving it?

InvalidRange occurs when the range we are requesting has no overlap with the object itself. For 0 byte objects, this will always be the case if we are requesting the first 0-x bytes, since there are no bytes at all. The only feasible patch I can imagine here would be a try-except clause that handles InvalidRange by performing a non-ranged GET, which may be undesirable. If you have a better alternative for resolving this that doesn't involve needing an extra request, that would be preferred.

crowecawcaw · 2026-03-06T16:19:52Z

I see the same issue. I added an integ test, fixed the unit test to mirror the same behavior, then fixed the issue. Instead of sending a second Get request for an object we know is empty, it just returns the empty content.

aemous · 2026-03-12T20:03:08Z

@crowecawcaw Can you push a revision that satisfies the 'Lint code' GH Action?

In case you cannot see the Action details, I pasted it here:

pre-commit hook(s) made changes.
If you are seeing this message in CI, reproduce locally with: `pre-commit run --all-files`.
To run `pre-commit` as part of git workflow, use `pre-commit install`.
All changes made by hooks:
diff --git a/scripts/performance/time-batch-download.py b/scripts/performance/time-batch-download.py
index 93e8a2a..1cf04fc 100755
--- a/scripts/performance/time-batch-download.py
+++ b/scripts/performance/time-batch-download.py
@@ -2,10 +2,12 @@
 """Direct timing of batch downloads without shell wrapper."""
 
 import argparse
-import tempfile
 import shutil
+import tempfile
 import time
+
 from botocore.session import get_session
+
 from s3transfer.manager import TransferManager
 
 
@@ -20,13 +22,13 @@ def main():
     parser.add_argument('--file-size', type=int, required=True)
     parser.add_argument('--s3-bucket', required=True)
     args = parser.parse_args()
-    
+
     session = get_session()
     client = session.create_client('s3')
-    
+
     tempdir = tempfile.mkdtemp()
     s3_keys = []
-    
+
     try:
         # Upload files
         print(f"Uploading {args.file_count} files...")
@@ -37,7 +39,7 @@ def main():
                 s3_key = f"perf_test_{i}"
                 manager.upload(file_path, args.s3_bucket, s3_key)
                 s3_keys.append(s3_key)
-        
+
         # Download files
         print(f"Downloading {args.file_count} files...")
         start_time = time.time()
@@ -46,9 +48,9 @@ def main():
                 download_path = f"{tempdir}/download_{i}"
                 manager.download(args.s3_bucket, s3_key, download_path)
         duration = time.time() - start_time
-        
+
         print(f"Download duration: {duration:.2f} seconds")
-        
+
         # Cleanup
         for s3_key in s3_keys:
             client.delete_object(Bucket=args.s3_bucket, Key=s3_key)

Signed-off-by: Stephen Crowe <6042774+crowecawcaw@users.noreply.github.com>

codecov-commenter · 2026-04-24T19:02:02Z

⚠️ Please install the to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 98.95833% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 81.14%. Comparing base (9281a7c) to head (62927b7).
⚠️ Report is 14 commits behind head on develop.

Files with missing lines	Patch %	Lines
s3transfer/download.py	98.95%	1 Missing ⚠️
❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files

@@             Coverage Diff             @@
##           develop     #363      +/-   ##
===========================================
+ Coverage    80.41%   81.14%   +0.72%     
===========================================
  Files           16       16              
  Lines         3013     3103      +90     
===========================================
+ Hits          2423     2518      +95     
+ Misses         590      585       -5

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

aemous

Loogs good overall. Left a few questions/comments.

Address reviewer feedback that the HEAD request is needed by default to enable full-object checksum validation on downloads. Only skip the HEAD when the caller has opted out via botocore's `response_checksum_validation = when_required` config. The prior HEAD + size-based GET logic is restored as the default path. Also: - Add TestDownloadWhenRequiredChecksumValidation covering the HEAD-less path for non-empty and 0-byte objects (InvalidRange handling) - Fix ruff import ordering / formatting in time-batch-download.py flagged by the Lint CI action

aemous

Looks good overall, 1 small change requested.

aemous · 2026-05-04T13:39:44Z

+    def __call__(self):
+        # Always check if we have a task and response first
+        assert self._task is not None, (
+            "set_task() must be called before the task is submitted"


Can we replace this assertion with raising a RuntimeError with a similar message wrapped? It would be more idiomatic with our codebase.

Replace the assert with a RuntimeError to match the codebase's idiomatic handling of programmer errors, per PR review feedback. Signed-off-by: Stephen Crowe <6042774+crowecawcaw@users.noreply.github.com>

crowecawcaw mentioned this pull request Dec 10, 2025

perf: avoid S3 HEAD requests when downloading job outputs aws-deadline/deadline-cloud#924

Closed

aemous reviewed Mar 2, 2026

View reviewed changes

Comment thread s3transfer/download.py Outdated

aemous reviewed Mar 4, 2026

View reviewed changes

Comment thread s3transfer/download.py

aemous reviewed Mar 4, 2026

View reviewed changes

Comment thread s3transfer/download.py

crowecawcaw added 4 commits April 23, 2026 08:43

Avoid HEAD requests for downloads

d019d12

Add benchmarking script for small files

c482815

address PR notes

eb31f44

Signed-off-by: Stephen Crowe <6042774+crowecawcaw@users.noreply.github.com>

handle 0 range files

9fee33f

Signed-off-by: Stephen Crowe <6042774+crowecawcaw@users.noreply.github.com>

crowecawcaw force-pushed the callback branch 5 times, most recently from b1fe8b8 to 5635ee3 Compare April 24, 2026 04:37

crowecawcaw force-pushed the callback branch 2 times, most recently from 9d25e90 to 65d4cde Compare April 30, 2026 20:54

aemous requested changes Apr 30, 2026

View reviewed changes

Comment thread .changes/next-release/feature-download-85557.json Outdated

Comment thread s3transfer/download.py Outdated

Comment thread scripts/performance/time-batch-download.py

Comment thread tests/unit/test_download.py Outdated

crowecawcaw force-pushed the callback branch 2 times, most recently from afcb032 to 65d4cde Compare May 1, 2026 18:42

crowecawcaw force-pushed the callback branch from 65d4cde to 62927b7 Compare May 1, 2026 22:37

crowecawcaw requested a review from aemous May 1, 2026 22:51

aemous requested changes May 4, 2026

View reviewed changes

Raise RuntimeError when first-chunk callback runs without a task

401cd1b

Replace the assert with a RuntimeError to match the codebase's idiomatic handling of programmer errors, per PR review feedback. Signed-off-by: Stephen Crowe <6042774+crowecawcaw@users.noreply.github.com>

crowecawcaw requested a review from aemous May 11, 2026 21:36

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Eliminate HEAD requests during downloads, especially for faster transfers of small files#363

Eliminate HEAD requests during downloads, especially for faster transfers of small files#363
crowecawcaw wants to merge 6 commits into
boto:developfrom
crowecawcaw:callback

crowecawcaw commented Nov 21, 2025 •

edited

Loading

Uh oh!

aemous left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

aemous commented Mar 6, 2026 •

edited

Loading

Uh oh!

crowecawcaw commented Mar 6, 2026

Uh oh!

aemous commented Mar 12, 2026

Uh oh!

codecov-commenter commented Apr 24, 2026 •

edited

Loading

Uh oh!

aemous left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

aemous left a comment •

edited

Loading

Uh oh!

aemous May 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

crowecawcaw commented Nov 21, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What Changed

Testing

Backward Compatibility

Flow Diagrams

Default path (unchanged)

No-HEAD path (opt-in via response_checksum_validation="when_required")

Uh oh!

aemous left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

aemous commented Mar 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

crowecawcaw commented Mar 6, 2026

Uh oh!

aemous commented Mar 12, 2026

Uh oh!

codecov-commenter commented Apr 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

aemous left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

aemous left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

aemous May 4, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

crowecawcaw commented Nov 21, 2025 •

edited

Loading

aemous commented Mar 6, 2026 •

edited

Loading

codecov-commenter commented Apr 24, 2026 •

edited

Loading

aemous left a comment •

edited

Loading