Adapt GetObject part size to the requested range #1584
Conversation
```rust
params
    .range
    .as_ref()
    .map(|range| range.end - range.start)
```
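For context, a minimal sketch of how this requested length could be combined with the client read part size; the function name and parameters below are illustrative, not the exact code in this PR:

```rust
use std::ops::Range;

// Sketch only: pick the part size for a GetObject request.
// Use the client read part size, or the requested range if smaller.
fn get_object_part_size(range: Option<&Range<u64>>, read_part_size: u64) -> u64 {
    range
        .map(|range| (range.end - range.start).min(read_part_size))
        .unwrap_or(read_part_size)
}

fn main() {
    // A 1 MiB ranged read uses a 1 MiB part size; an unbounded read keeps the default.
    assert_eq!(get_object_part_size(Some(&(0..1_048_576)), 8 * 1024 * 1024), 1_048_576);
    assert_eq!(get_object_part_size(None, 8 * 1024 * 1024), 8 * 1024 * 1024);
}
```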
This means the `first_request_stream` will always request the part size `min(INITIAL_READ_WINDOW_SIZE, object_size)`, right?
A few questions:
- Is it the right understanding that the new logic also makes the splitting of metarequests more sensible, given we always start the second metarequest with range [INITIAL_READ_WINDOW_SIZE, end of object] (and not [PART_SIZE, end of object]) even though the first request has caused CRT to return up to part size?
- I believe this should also reduce the ttfb for the first read?
- Do we have this risk of blocked memory in writes too, and do we intend to make any changes there to reduce default part-size etc.?
> This means the `first_request_stream` will always request the part size `min(INITIAL_READ_WINDOW_SIZE, object_size)`, right?

Right.

> A few questions:
> - Is it the right understanding that the new logic also makes the splitting of metarequests more sensible, given we always start the second metarequest with range [INITIAL_READ_WINDOW_SIZE, end of object] (and not [PART_SIZE, end of object]) even though the first request has caused CRT to return up to part size?

That is not changed. We are still requesting INITIAL_READ_WINDOW_SIZE bytes as before; the only difference is that we also tell the CRT to use INITIAL_READ_WINDOW_SIZE as the part size.

> - I believe this should also reduce the ttfb for the first read?

No changes either.

> - Do we have this risk of blocked memory in writes too, and do we intend to make any changes there to reduce default part-size etc.?

That's always been the case, but for writes we do not know the size in advance, so there is no trivial fix there.
> That is not changed. We are still requesting INITIAL_READ_WINDOW_SIZE bytes as before; the only difference is that we also tell the CRT to use INITIAL_READ_WINDOW_SIZE as the part size.

Right! But won't this also change the S3 request range (and hence, size) now that we're requesting a 1.125M part instead of an 8M part? Previously we would receive data up to the 8M offset but then re-request the 1.125M-8M range in the second meta request. Now we only receive 1.125M in the first go because it's the part size. So that is a change with this fix, no?

(^My comment about reduced ttfb was also based on this understanding, that a smaller part GET => smaller S3 latency)
We were already only requesting INITIAL_READ_WINDOW_SIZE (1.125 MiB) before the change. read_part_size only comes into play for the buffer reservation, not for the data that is downloaded.
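To make that distinction concrete, here is a small illustrative calculation; the constants are example values, not necessarily the client's defaults:

```rust
fn main() {
    // Example values only.
    let initial_read_window_size: u64 = 1_179_648; // 1.125 MiB
    let read_part_size: u64 = 8 * 1024 * 1024;     // 8 MiB

    // The backpressure read window bounds how many bytes the first meta request
    // actually downloads; that is unchanged by this PR.
    let bytes_downloaded = initial_read_window_size;

    // The part size only bounds the buffer the CRT reserves for that data:
    // previously a full read_part_size buffer, now capped by the requested range.
    let buffer_before = read_part_size;
    let buffer_after = bytes_downloaded.min(read_part_size);

    println!("downloaded {bytes_downloaded} bytes; buffer before: {buffer_before}, after: {buffer_after}");
}
```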
```rust
let mut options = message.into_options(S3Operation::GetObject);
options.part_size(self.inner.read_part_size as u64);

// Use the client read part size, or the requested range if smaller.
```
Can we please add a slightly more elaborate explanation of why we're doing this, for future context? The PR description is sufficient, but that sometimes gets lost when we refactor code for unrelated reasons :(
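Something along these lines could work as the expanded comment (a suggestion only, not the wording that ended up in the PR):

```rust
// Use the client read part size, or the requested range if smaller.
//
// When the requested range fits in a single part, the request is never split,
// so the part size only controls how much memory the CRT reserves for the
// response. Capping it at the range length avoids reserving a full part-size
// buffer for small reads, which inflated peak memory usage with the unified
// memory pool (see the PR description for details).
```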
```rust
// Set up a paged memory pool
let pool = PagedPool::new_with_candidate_sizes([
    args.cache_block_size_in_bytes() as usize,
    mountpoint_s3_fs::s3::config::INITIAL_READ_WINDOW_SIZE,
```
For my understanding, this implies that for objects with `INITIAL_READ_WINDOW_SIZE < object_size < read_part_size`, we will still be using a peak memory of `INITIAL_READ_WINDOW_SIZE + part_size`, right (because we advance the CRT backpressure window halfway through the first metarequest being read)?
Only if 2 * INITIAL_READ_WINDOW_SIZE <= object_size < read_part_size. What matters is the range on the second request, which will be object_size - INITIAL_READ_WINDOW_SIZE. If that is greater than INITIAL_READ_WINDOW_SIZE, it will require a full read_part_size buffer.
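A worked example of that condition, with illustrative numbers (assuming INITIAL_READ_WINDOW_SIZE = 1.125 MiB and read_part_size = 8 MiB, which are assumptions here, not quoted defaults):

```rust
fn main() {
    let initial_read_window_size: u64 = 1_179_648; // 1.125 MiB (example)
    let read_part_size: u64 = 8 * 1024 * 1024;     // 8 MiB (example)
    let object_size: u64 = 4 * 1024 * 1024;        // 4 MiB object

    // Condition from the comment above: 2 * INITIAL_READ_WINDOW_SIZE <= object_size < read_part_size.
    assert!(2 * initial_read_window_size <= object_size && object_size < read_part_size);

    // Range of the second meta request: everything after the initial window.
    let second_request_len = object_size - initial_read_window_size; // ~2.875 MiB

    // If that range exceeds INITIAL_READ_WINDOW_SIZE, the second request needs a
    // full read_part_size buffer, so peak memory is roughly
    // INITIAL_READ_WINDOW_SIZE + read_part_size.
    let needs_full_part_buffer = second_request_len > initial_read_window_size;
    println!("second request: {second_request_len} bytes, full part buffer: {needs_full_part_buffer}");
}
```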
Oh right, yes, makes sense
Closing this PR. The issue will be addressed on the CRT side (PR under review: awslabs/aws-c-s3#563).
Set the part size on GetObject requests to the length of the requested range, if it is smaller than the client read part size. Since in this case the request will not be split into parts, the change only affects the way the CRT internally reserves the memory for the data to return.
Addresses an issue where the Prefetcher could cause higher peak memory usage with the unified memory pool compared to the previous approach (internal CRT pool + copy into a new buffer), since small requests would consume part-size buffers for longer.
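As an illustration of that effect (the numbers are examples, and pool page rounding is ignored), compare the memory pinned by several concurrent small reads before and after the change:

```rust
fn main() {
    let read_part_size: u64 = 8 * 1024 * 1024; // 8 MiB (example default)
    let request_len: u64 = 256 * 1024;         // 256 KiB ranged GetObject
    let concurrent_requests: u64 = 16;

    // Before: each small request held a full part-size buffer from the pool
    // while it was being served.
    let peak_before = concurrent_requests * read_part_size;                  // 128 MiB
    // After: the part size is capped at the requested range, so the buffers shrink.
    let peak_after = concurrent_requests * request_len.min(read_part_size);  // 4 MiB

    println!("peak buffer usage before: {peak_before} bytes, after: {peak_after} bytes");
}
```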
Does this change impact existing behavior?
No functional changes.
Does this change need a changelog entry? Does it require a version change?
TODO: update client and fs
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license and I agree to the terms of the Developer Certificate of Origin (DCO).