Eliminate HEAD requests during downloads, especially for faster transfers of small files#363
Eliminate HEAD requests during downloads, especially for faster transfers of small files#363crowecawcaw wants to merge 6 commits into
Conversation
aemous
left a comment
There was a problem hiding this comment.
I left 1 nit. Note that I am not the primary reviewer for this PR, and we are still pending the primary review.
|
Looks like downloads may be failing when the object has a size of 0 bytes. The error I'm seeing is Can you confirm whether you are able to reproduce this, and look into resolving it? InvalidRange occurs when the range we are requesting has no overlap with the object itself. For 0 byte objects, this will always be the case if we are requesting the first 0-x bytes, since there are no bytes at all. The only feasible patch I can imagine here would be a try-except clause that handles InvalidRange by performing a non-ranged GET, which may be undesirable. If you have a better alternative for resolving this that doesn't involve needing an extra request, that would be preferred. |
|
I see the same issue. I added an integ test, fixed the unit test to mirror the same behavior, then fixed the issue. Instead of sending a second Get request for an object we know is empty, it just returns the empty content. |
|
@crowecawcaw Can you push a revision that satisfies the 'Lint code' GH Action? In case you cannot see the Action details, I pasted it here: |
Signed-off-by: Stephen Crowe <6042774+crowecawcaw@users.noreply.github.com>
Signed-off-by: Stephen Crowe <6042774+crowecawcaw@users.noreply.github.com>
b1fe8b8 to
5635ee3
Compare
|
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## develop #363 +/- ##
===========================================
+ Coverage 80.41% 81.14% +0.72%
===========================================
Files 16 16
Lines 3013 3103 +90
===========================================
+ Hits 2423 2518 +95
+ Misses 590 585 -5 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
9d25e90 to
65d4cde
Compare
aemous
left a comment
There was a problem hiding this comment.
Loogs good overall. Left a few questions/comments.
afcb032 to
65d4cde
Compare
Address reviewer feedback that the HEAD request is needed by default to enable full-object checksum validation on downloads. Only skip the HEAD when the caller has opted out via botocore's `response_checksum_validation = when_required` config. The prior HEAD + size-based GET logic is restored as the default path. Also: - Add TestDownloadWhenRequiredChecksumValidation covering the HEAD-less path for non-empty and 0-byte objects (InvalidRange handling) - Fix ruff import ordering / formatting in time-batch-download.py flagged by the Lint CI action
| def __call__(self): | ||
| # Always check if we have a task and response first | ||
| assert self._task is not None, ( | ||
| "set_task() must be called before the task is submitted" |
There was a problem hiding this comment.
Can we replace this assertion with raising a RuntimeError with a similar message wrapped? It would be more idiomatic with our codebase.
Replace the assert with a RuntimeError to match the codebase's idiomatic handling of programmer errors, per PR review feedback. Signed-off-by: Stephen Crowe <6042774+crowecawcaw@users.noreply.github.com>
This PR optimizes downloads through the
TransferManagerby removing the upfront HEAD request on an opt-in basis. Previously, every download issued a HEAD request to determine object size before starting the GET. This change eliminates that extra round-trip by extracting metadata from the first GET response instead, when the caller has configuredresponse_checksum_validation='when_required'on the client. The default path still uses HEAD so that full-object checksum validation continues to work.For small files, the download time is dominated by request latency, so eliminating one of the two requests results in a ~50% download time reduction. For large files, the effect is less noticeable because the download time is dominated by the transfer itself and there are multiple chunks. In both cases, we save the cost of the HEAD request.
What Changed
• No-HEAD download path added, gated on
response_checksum_validation='when_required': Downloads on that path start immediately with a ranged GET request for the first chunk. The default path is unchanged.• Dynamic size detection (no-HEAD path): Extract object size and ETag from the first GET response headers (
ContentRangeorContentLength).• Dynamic chunk scheduling (no-HEAD path): After the first chunk completes, schedule additional chunks only if the object is larger than the chunk size.
• ETag consistency for all chunks: When an ETag is available (either pre-provided via a subscriber, e.g. from a prior HEAD request by the AWS CLI, or extracted from the first GET response), all subsequent ranged GET requests include an
IfMatchheader with that ETag. This ensures S3 rejects the request if the object changes mid-download. The first GET request also includesIfMatchif an ETag was pre-provided before the download started.• Zero-byte object handling (no-HEAD path): The first ranged GET on a 0-byte object returns
InvalidRange. The new path handles this by returning an empty response instead of retrying with a non-ranged GET.Testing
Unit, functional, and integ tests pass. I also added a new script to benchmark downloading many small files (
scripts/performance/time-batch-download.py). For 1000 × 1 KB files on my laptop, total download duration dropped ~41% (from 15.0s to 8.9s) on the opt-in path.Backward Compatibility
External API unchanged. All download methods have the same signatures. The default path (HEAD + size-branching GET) is preserved, so no behavior change for callers who don't set
response_checksum_validation='when_required'.Flow Diagrams
Default path (unchanged)
flowchart TD A[HEAD Request] --> C{Size < 8MB?} C -->|Yes| D[GET Request] C -->|No| E[Multiple GET Requests] D --> F[Complete] E --> FNo-HEAD path (opt-in via response_checksum_validation="when_required")
flowchart TD A[GET First Chunk] --> B{Size < 8MB?} B -->|Yes| C[Complete] B -->|No| D[GET Remaining Chunks] D --> CBy submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.