
Conversation

TomAugspurger (Contributor) commented Oct 30, 2025

Description

This adds a CLI flag to reset the memory pool between iterations in the PDSH benchmark.

We really only expect this to have an effect for the default memory resource, which uses UVM. We've observed some instability in memory usage, which this flag might help address.

The main decision point here is whether to reset the memory resource by default. IMO, we do want to reset it to improve the stability of the benchmark between iterations, but that would be a change from previous behavior, so I've left the default as not resetting the memory resource.
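For reference, a minimal sketch of what the flag does between iterations. This is illustrative only, not the actual benchmark code: run_query, make_memory_resource, and the flag wiring are hypothetical; only the rmm calls are real APIs.

import rmm


def make_memory_resource() -> rmm.mr.DeviceMemoryResource:
    # The benchmark's default MR: a prefetching pool over managed (UVM) memory.
    return rmm.mr.PrefetchResourceAdaptor(
        rmm.mr.PoolMemoryResource(rmm.mr.ManagedMemoryResource())
    )


def run_query() -> None:
    """Hypothetical stand-in for executing one PDSH query."""


def run_benchmark(iterations: int, reset_memory_resource: bool) -> None:
    rmm.mr.set_current_device_resource(make_memory_resource())
    for _ in range(iterations):
        run_query()
        if reset_memory_resource:
            # Replace the pool so the old one is destroyed and its large
            # managed allocation is returned to the driver.
            rmm.mr.set_current_device_resource(make_memory_resource())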

Here's a screenshot of an nsys profile for two iterations of query 2:

[Screenshot: nsys profiles showing "Managed memory usage" across two iterations, --reset-memory-resource on the left and --no-reset-memory-resource on the right.]

On the left is --reset-memory-resource. You can see "Managed memory usage" drop to zero after the first iteration as the memory resource is dropped and its memory freed.

On the right is --no-reset-memory-resource. The reference to the MR is cached, and so memory is never freed from the pool.

copy-pr-bot (bot) commented Oct 30, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

github-actions bot added the labels Python (Affects Python cuDF API) and cudf-polars (Issues specific to cudf-polars) on Oct 30, 2025
GPUtester moved this to In Progress in cuDF Python on Oct 30, 2025
TomAugspurger added the labels non-breaking (Non-breaking change) and improvement (Improvement / enhancement to an existing function) on Oct 30, 2025
TomAugspurger (Contributor, Author) commented:

Here are some numbers with --reset-memory-resource and --no-reset-memory-resource, taking the median of 3 runs:

Query        Median (reset)   Median (no-reset)
1            10.4514          9.2945
2            1.1276           1.0132
3            10.8834          10.5566
4            23.5367          23.3531
5            13.4738          12.8356
6            4.8805           4.6928
7            17.3064          13.8614
8            15.5725          14.8226
9            30.4728          30.6032
10           14.978           14.942
11           1.0839           1.0092
12           9.4655           9.6911
13           34.0298          33.8898
14           8.0887           8.5332
15           6.0896           6.5001
16           2.294            2.2463
17           43.2848          43.5608
18           96.909           90.3726
19           10.2506          10.6027
20           9.0296           8.5378
21           89.212           79.8668
22           1.2345           1.2196
Grand Total  10.98245         10.61455

Overall, most queries are a bit slower with --reset-memory-resource.

TomAugspurger marked this pull request as ready for review on November 3, 2025, 12:34
TomAugspurger requested a review from a team as a code owner on November 3, 2025, 12:34
TomAugspurger changed the title from "Add option to reset memory resource in PDSH" to "Add option to reset memory resource in PDSH benchmarks" on Nov 3, 2025
bdice (Contributor) commented Nov 3, 2025

@TomAugspurger This is expected behavior.

import rmm

# The benchmark's default memory resource: a prefetching suballocator pool
# backed by managed (UVM) memory.
mr = rmm.mr.PrefetchResourceAdaptor(
    rmm.mr.PoolMemoryResource(
        rmm.mr.ManagedMemoryResource(),
        # free_memory is computed elsewhere in the benchmark setup.
        initial_pool_size=free_memory,
    )
)

This memory resource is a suballocating pool: a large underlying block of managed memory is never freed until the memory resource is destroyed (by resetting, here). Suballocations are created and freed from that block over the pool's lifetime.

I think merging this comes down to whether you consider it fairer, as a benchmark, to start each iteration with "untouched" driver memory by resetting, or whether it's okay to have the pool pre-warmed, with the driver already knowing the memory is on device. Personally, I think a pre-allocated block of managed memory for the suballocating pool is fine, because any real-world query scenario runs batches of queries, not single queries. After a reset, the driver has to be told again that this managed memory is intended to be device memory, and it has to redo some work for its internal page tracking. Prefetching is much simpler when the memory already resides on device.
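To illustrate the lifetime behavior described above, a small sketch (sizes arbitrary): suballocations are returned to the pool when freed, but the pool's underlying managed block is only released when the memory resource itself is destroyed.

import rmm

# A suballocating pool over managed (UVM) memory, as in the benchmark's default MR.
pool = rmm.mr.PoolMemoryResource(
    rmm.mr.ManagedMemoryResource(), initial_pool_size=2**30
)
rmm.mr.set_current_device_resource(pool)

buf = rmm.DeviceBuffer(size=256 * 2**20)  # suballocated from the 1 GiB block
del buf  # returned to the pool; the 1 GiB managed block stays allocated

# Dropping the last reference to the pool is what frees the managed block,
# which is what --reset-memory-resource does between iterations.
rmm.mr.set_current_device_resource(rmm.mr.CudaMemoryResource())
del pool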

This behavior could change completely if you use the experimental async managed memory resource. With that MR, the driver has better knowledge of its allocations, which beats RMM's pool adaptor having knowledge that the driver lacks. We need to get numbers for that; I expect we will want to replace our default MR with that new CUDA 13 feature on systems running CUDA 13+.
