Conversation

@passaro (Contributor) commented Aug 22, 2025

Set the part size on GetObject requests to the length of the requested range, if it is smaller than the client read part size. Since in this case the request will not be split into parts, the change only affects how the CRT internally reserves memory for the data to return.

Addresses an issue where the Prefetcher could cause higher peak memory usage with the unified memory pool than with the previous approach (internal CRT pool + copy into a new buffer), since small requests would consume part-size buffers for longer.
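In outline, the change caps the part size at the requested range length. A minimal sketch of that logic (illustrative only, not the actual diff; the `read_part_size` name and the range shape follow the hunks quoted later in this thread):

```rust
use std::ops::Range;

// Illustrative helper: cap the part size at the length of the requested range,
// so a small ranged GetObject reserves a buffer sized to the range instead of
// a full part-size buffer. `read_part_size` follows the hunks quoted below.
fn effective_part_size(read_part_size: u64, range: Option<&Range<u64>>) -> u64 {
    match range.map(|r| r.end - r.start) {
        Some(len) if len < read_part_size => len,
        _ => read_part_size,
    }
}

fn main() {
    // A 1 MiB ranged read against an 8 MiB part size reserves only 1 MiB.
    let range = 0u64..(1 << 20);
    assert_eq!(effective_part_size(8 << 20, Some(&range)), 1 << 20);
}
```

Such a request fits in a single part, so capping the part size changes only the size of the buffer the CRT reserves, not what is downloaded.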

Does this change impact existing behavior?

No functional changes.

Does this change need a changelog entry? Does it require a version change?

TODO: update client and fs


By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license and I agree to the terms of the Developer Certificate of Origin (DCO).

```rust
params
    .range
    .as_ref()
    .map(|range| range.end - range.start)
```
Contributor:

This means the first_request_stream will always request a part size of min(INITIAL_READ_WINDOW_SIZE, object_size), right?

A few questions:

  • Is it the right understanding that the new logic also makes the splitting of metarequests more sensible, given that we always start the second metarequest with range [INITIAL_READ_WINDOW_SIZE, end of object] (and not [PART_SIZE, end of object]), even though the first request has caused the CRT to return up to part size?
  • I believe this should also reduce the TTFB for the first read?
  • Do we have this risk of blocked memory in writes too, and do we intend to make any changes there to reduce the default part size, etc.?

Contributor Author:

> This means the first_request_stream will always request a part size of min(INITIAL_READ_WINDOW_SIZE, object_size), right?

Right.

> A few questions:
>
> • Is it the right understanding that the new logic also makes the splitting of metarequests more sensible, given that we always start the second metarequest with range [INITIAL_READ_WINDOW_SIZE, end of object] (and not [PART_SIZE, end of object]), even though the first request has caused the CRT to return up to part size?

That is not changed. We are still requesting INITIAL_READ_WINDOW_SIZE bytes as before; the only difference is that we also tell the CRT to use INITIAL_READ_WINDOW_SIZE as the part size.

> • I believe this should also reduce the TTFB for the first read?

No changes either.

> • Do we have this risk of blocked memory in writes too, and do we intend to make any changes there to reduce the default part size, etc.?

That has always been the case, but for writes we do not know the size in advance, so there is no trivial fix there.

Contributor:

> That is not changed. We are still requesting INITIAL_READ_WINDOW_SIZE bytes as before; the only difference is that we also tell the CRT to use INITIAL_READ_WINDOW_SIZE as the part size.

Right! But won't this also change the S3 request range (and hence, size) now that we're requesting a 1.125M part instead of an 8M part? Previously we would receive data up to the 8M offset but then re-request the 1.125M-8M range in the second meta request. Now we only receive 1.125M in the first go because it's the part size. So that is a change with this fix, no?

(My comment above about reduced TTFB was also based on this understanding, that a smaller part GET => smaller S3 latency.)

Contributor Author:

We were already requesting only INITIAL_READ_WINDOW_SIZE (1.125 MiB) before the change. read_part_size only comes into play for the buffer reservation, not for the data that is downloaded:

```rust
let mut options = message.into_options(S3Operation::GetObject);
options.part_size(self.inner.read_part_size as u64);
```

```rust
// Use the client read part size, or the requested range if smaller.
```
@mansi153 (Contributor) commented Aug 22, 2025:

Can we please add a slightly more elaborate explanation of why we're doing this, for future context? The PR description is sufficient, but that sometimes gets lost when we refactor code for unrelated reasons :(
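For future readers, a hypothetical expanded comment along those lines (wording illustrative, condensed from the PR description rather than taken from the diff):

```rust
// Cap the part size at the requested range length when the range is smaller
// than the client read part size. Such a request is never split into parts,
// so this only changes how much memory the CRT reserves up front: without the
// cap, a small ranged GetObject would pin a full part-size buffer from the
// unified memory pool for its whole lifetime, inflating peak memory.
```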

```rust
// Set up a paged memory pool
let pool = PagedPool::new_with_candidate_sizes([
    args.cache_block_size_in_bytes() as usize,
    mountpoint_s3_fs::s3::config::INITIAL_READ_WINDOW_SIZE,
    // ...
]);
```
Contributor:

For my understanding, this implies that for objects with INITIAL_READ_WINDOW_SIZE < object_size < read_part_size, we will still be using a peak memory of INITIAL_READ_WINDOW_SIZE + part_size, right (because we advance the CRT backpressure window halfway through the first metarequest being read)?

Contributor Author:

Only if 2 * INITIAL_READ_WINDOW_SIZE <= object_size < read_part_size. What matters is the range of the second request, which will be object_size - INITIAL_READ_WINDOW_SIZE. If that is greater than INITIAL_READ_WINDOW_SIZE, it will require a full read_part_size buffer.
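A quick sketch of that arithmetic (the page-selection model and the 8 MiB read part size are assumptions for this sketch; 1.125 MiB matches the figure quoted earlier in the thread):

```rust
// Illustrative model of the paged pool picking the smallest page that fits a
// reservation. Candidate sizes mirror the hunk above; values are assumed.
const MIB: u64 = 1024 * 1024;
const INITIAL_READ_WINDOW_SIZE: u64 = MIB + 128 * 1024; // 1.125 MiB, per this thread
const READ_PART_SIZE: u64 = 8 * MIB; // assumed client read part size

/// Smallest candidate page size that fits the reservation.
fn page_for(reservation: u64) -> u64 {
    [INITIAL_READ_WINDOW_SIZE, READ_PART_SIZE]
        .into_iter()
        .find(|&page| reservation <= page)
        .unwrap_or(READ_PART_SIZE)
}

fn main() {
    // 2 * INITIAL_READ_WINDOW_SIZE <= object_size < READ_PART_SIZE:
    let object_size = 3 * MIB;
    let second_range = object_size - INITIAL_READ_WINDOW_SIZE; // 1.875 MiB
    // The remainder exceeds INITIAL_READ_WINDOW_SIZE, so it lands in a full
    // READ_PART_SIZE page: peak memory is INITIAL_READ_WINDOW_SIZE + READ_PART_SIZE.
    assert_eq!(page_for(second_range), READ_PART_SIZE);
}
```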

Contributor:

Oh right, yes, makes sense.

@passaro (Contributor Author) commented Sep 23, 2025

Closing this PR. The issue will be addressed on the CRT side (PR under review awslabs/aws-c-s3#563).

@passaro closed this Sep 23, 2025