
Conversation

TomAugspurger (Contributor) commented Oct 30, 2025

Description

This adds a CLI flag to reset the memory pool between iterations in the PDSH benchmark.

We really only expect this to have an effect for the default memory resource, which uses UVM. We've observed some instability in memory usage, which this flag might help address.

The main decision point here is whether to reset the memory resource by default. IMO, we do want to reset it to improve the stability of the benchmark between iterations, but that would be a change from previous behavior, so I've left the default as not resetting the memory resource.
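For reference, a minimal sketch of what the flag does between iterations. This is illustrative only, not the actual benchmark code: run_query, make_memory_resource, and the flag wiring are hypothetical; only the rmm calls are real APIs.

import rmm


def make_memory_resource() -> rmm.mr.DeviceMemoryResource:
    # The benchmark's default MR: a prefetching pool over managed (UVM) memory.
    return rmm.mr.PrefetchResourceAdaptor(
        rmm.mr.PoolMemoryResource(rmm.mr.ManagedMemoryResource())
    )


def run_query() -> None:
    """Hypothetical stand-in for executing one PDSH query."""


def run_benchmark(iterations: int, reset_memory_resource: bool) -> None:
    rmm.mr.set_current_device_resource(make_memory_resource())
    for _ in range(iterations):
        run_query()
        if reset_memory_resource:
            # Replace the pool so the old one is destroyed and its large
            # managed allocation is returned to the driver.
            rmm.mr.set_current_device_resource(make_memory_resource())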

Here's a screenshot of an nsys profile for two iterations of query 2:

[Screenshot: nsys profiles showing "Managed memory usage" across two iterations, --reset-memory-resource on the left and --no-reset-memory-resource on the right.]

On the left is --reset-memory-resource. You can see "Managed memory usage" drop to zero after the first iteration as the memory resource is dropped and its memory freed.

On the right is --no-reset-memory-resource. The reference to the MR is cached, and so memory is never freed from the pool.

copy-pr-bot (bot) commented Oct 30, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

github-actions bot added the labels Python (Affects Python cuDF API) and cudf-polars (Issues specific to cudf-polars) on Oct 30, 2025
GPUtester moved this to In Progress in cuDF Python on Oct 30, 2025
TomAugspurger added the labels non-breaking (Non-breaking change) and improvement (Improvement / enhancement to an existing function) on Oct 30, 2025
TomAugspurger (Contributor, Author) commented:

Here are some numbers with --reset-memory-resource and --no-reset-memory-resource, taking the median of 3 runs:

Query        Median (reset)   Median (no-reset)
1            10.4514          9.2945
2            1.1276           1.0132
3            10.8834          10.5566
4            23.5367          23.3531
5            13.4738          12.8356
6            4.8805           4.6928
7            17.3064          13.8614
8            15.5725          14.8226
9            30.4728          30.6032
10           14.978           14.942
11           1.0839           1.0092
12           9.4655           9.6911
13           34.0298          33.8898
14           8.0887           8.5332
15           6.0896           6.5001
16           2.294            2.2463
17           43.2848          43.5608
18           96.909           90.3726
19           10.2506          10.6027
20           9.0296           8.5378
21           89.212           79.8668
22           1.2345           1.2196
Grand Total  10.98245         10.61455

Overall, most queries are a bit slower with --reset-memory-resource.

TomAugspurger marked this pull request as ready for review on November 3, 2025, 12:34
TomAugspurger requested a review from a team as a code owner on November 3, 2025, 12:34
TomAugspurger changed the title from "Add option to reset memory resource in PDSH" to "Add option to reset memory resource in PDSH benchmarks" on Nov 3, 2025
bdice (Contributor) commented Nov 3, 2025

@TomAugspurger This is expected behavior.

import rmm

# The benchmark's default memory resource: a prefetching suballocator pool
# backed by managed (UVM) memory.
mr = rmm.mr.PrefetchResourceAdaptor(
    rmm.mr.PoolMemoryResource(
        rmm.mr.ManagedMemoryResource(),
        # free_memory is computed elsewhere in the benchmark setup.
        initial_pool_size=free_memory,
    )
)

This memory resource is a suballocating pool: a large underlying block of managed memory is never freed until the memory resource is destroyed (by resetting, here). Suballocations are created and freed from that block over the pool's lifetime.

I think merging this comes down to whether you consider it fairer, as a benchmark, to start each iteration with "untouched" driver memory by resetting, or whether it's okay to have the pool pre-warmed, with the driver already knowing the memory is on device. Personally, I think a pre-allocated block of managed memory for the suballocating pool is fine, because any real-world query scenario runs batches of queries, not single queries. After a reset, the driver has to be told again that this managed memory is intended to be device memory, and it has to redo some work for its internal page tracking. Prefetching is much simpler when the memory already resides on device.
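To illustrate the lifetime behavior described above, a small sketch (sizes arbitrary): suballocations are returned to the pool when freed, but the pool's underlying managed block is only released when the memory resource itself is destroyed.

import rmm

# A suballocating pool over managed (UVM) memory, as in the benchmark's default MR.
pool = rmm.mr.PoolMemoryResource(
    rmm.mr.ManagedMemoryResource(), initial_pool_size=2**30
)
rmm.mr.set_current_device_resource(pool)

buf = rmm.DeviceBuffer(size=256 * 2**20)  # suballocated from the 1 GiB block
del buf  # returned to the pool; the 1 GiB managed block stays allocated

# Dropping the last reference to the pool is what frees the managed block,
# which is what --reset-memory-resource does between iterations.
rmm.mr.set_current_device_resource(rmm.mr.CudaMemoryResource())
del pool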

This behavior could change completely if you use the experimental async managed memory resource. With that MR, the driver has better knowledge of its allocations, which beats RMM's pool adaptor having knowledge that the driver lacks. We need to get numbers for that; I expect we will want to replace our default MR with that new CUDA 13 feature on systems running CUDA 13+.
