Add option to reset memory resource in PDSH benchmarks #20432
Conversation
This adds a CLI flag to reset the memory pool between iterations in the PDSH benchmark. The default is to reset the MR, which is a change from the previous behavior. We really only expect this to have an effect for the default memory resource, which uses UVM. We've observed some instability in memory usage, which this flag might help address.
Force-pushed from e7273cf to 5eb659a.
Here are some numbers with …

Overall, most are a bit slower with …
@TomAugspurger This is expected behavior.

(See cudf/python/cudf_polars/cudf_polars/callback.py, lines 78 to 83 at 5f8ee01.)

This memory resource is a suballocating pool. There is a large underlying block of managed memory that is never freed until the memory resource is destroyed (by resetting, here). Suballocations are created and freed from that pool during its lifetime.

I think merging this is a question of whether you think it is more fair as a benchmark to have "untouched" driver memory by resetting, or whether it's okay to have a pre-warmed pool of memory where the driver knows it is on device. Personally, I think it is okay to have a pre-allocated block of managed memory for the suballocating pool, because any real-world query scenario is run with batches and not single queries. When resetting, the driver has to be told again "this managed memory is intended to be device memory," and it has to do some work for its internal page tracking. Prefetching is much simpler when the memory already exists on device.

This behavior could change completely if you use the experimental async managed memory resource. The driver has better knowledge of its allocations with that MR, which is better than RMM's pool adaptor having knowledge that the driver lacks. We need to get numbers for that -- I expect we will want to replace our default MR with that new feature from CUDA 13 on systems with CUDA 13+.
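For illustration, here is a minimal sketch of the pool lifecycle described above, using RMM's public Python API. This is an assumption-level reconstruction, not the actual code in callback.py:

```python
import rmm

# Sketch: a suballocating pool over managed (UVM) memory. The pool grabs
# one large managed block up front; suballocations come and go, but the
# block itself is held for the lifetime of the resource.
pool = rmm.mr.PoolMemoryResource(
    rmm.mr.ManagedMemoryResource(),
    initial_pool_size=2**30,
)
rmm.mr.set_current_device_resource(pool)

buf = rmm.DeviceBuffer(size=2**20)  # suballocated from the pool
del buf  # returned to the pool; the managed block is NOT freed

# Only destroying the pool (the "reset") releases the underlying block,
# after which the driver must re-learn that new managed allocations are
# intended to live on device.
rmm.mr.set_current_device_resource(rmm.mr.CudaMemoryResource())
del pool
```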
    
Description
This adds a CLI flag to reset the memory pool between iterations in the PDSH benchmark.
We really only expect this to have an effect for the default memory resource, which uses UVM. We've observed some instability in memory usage, which this flag might help address.
The main decision point here is whether to reset the memory resource by default or not. IMO, we do want to reset it to improve the stability of the benchmark between iterations. But this would be a change from the previous behavior, so I've left the default as not resetting the memory resource.
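As a rough sketch of how such a flag could be wired up: the flag names come from this PR, but the argparse wiring and the `run_iteration` helper are hypothetical stand-ins for the benchmark script.

```python
import argparse

import rmm


def run_iteration(query: int) -> None:
    """Hypothetical stand-in for one PDSH benchmark iteration."""


def fresh_mr() -> rmm.mr.DeviceMemoryResource:
    # Matches the default described above: a pool of managed (UVM) memory.
    return rmm.mr.PoolMemoryResource(rmm.mr.ManagedMemoryResource())


parser = argparse.ArgumentParser()
parser.add_argument("--iterations", type=int, default=2)
# BooleanOptionalAction provides both --reset-memory-resource and
# --no-reset-memory-resource; the default here matches the PR (no reset).
parser.add_argument(
    "--reset-memory-resource",
    action=argparse.BooleanOptionalAction,
    default=False,
)
args = parser.parse_args()

rmm.mr.set_current_device_resource(fresh_mr())
for _ in range(args.iterations):
    run_iteration(query=2)
    if args.reset_memory_resource:
        # Installing a new pool drops the old one, freeing its managed block.
        rmm.mr.set_current_device_resource(fresh_mr())
```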
Here's a screenshot of an nsys profile for two iterations of query 2:
On the left is `--reset-memory-resource`. You can see "Managed memory usage" drop to zero after the first iteration as the memory resource is dropped and its memory freed.
On the right is `--no-reset-memory-resource`. The reference to the MR is cached, so memory is never freed from the pool.
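The caching behavior might look something like the following sketch. It assumes a `functools.cache`-style cache on the MR factory; the actual mechanism in cudf-polars may differ.

```python
from functools import cache

import rmm


@cache
def default_memory_resource() -> rmm.mr.DeviceMemoryResource:
    # The cached reference keeps the pool (and its managed block) alive
    # across iterations: the --no-reset-memory-resource behavior.
    return rmm.mr.PoolMemoryResource(rmm.mr.ManagedMemoryResource())


# --reset-memory-resource behavior: clearing the cache drops the last
# reference, destroying the pool and freeing its memory, which is why
# "Managed memory usage" falls to zero between iterations in the profile.
default_memory_resource.cache_clear()
```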