feat: force unregister shared pin memory buffer supported#62
feat: force unregister shared pin memory buffer supported#62blahgeek merged 2 commits intoMoonshotAI:mainfrom
Conversation
There was a problem hiding this comment.
Pull request overview
This PR adds support for force unregistration of shared pin memory buffer in the checkpoint engine. Previously, the engine could only mark the shared memory pool as unused but couldn't physically release it. The new force=True parameter in unregister_checkpoint() enables complete deallocation of the shared memory pool.
- Added
forceparameter tounregister_checkpoint()method to enable physical release of shared memory pool - Updated p2p_store registration/unregistration to use consistent naming for shared memory pool
- Enhanced test coverage with scenario testing force unregistration and subsequent re-registration
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| checkpoint_engine/ps.py | Added force parameter to unregister_checkpoint method and implemented logic to physically delete and reset shared memory pool when force=True; updated p2p_store naming consistency for shared pool |
| tests/test_pin_memory.py | Added test case for force unregistration of shared pool and re-registration with different checkpoint sizes |
Comments suppressed due to low confidence (1)
checkpoint_engine/ps.py:923
- The docstring for unregister_checkpoint should be updated to document the new 'force' parameter. The documentation should explain that when force=True, it physically releases the shared memory pool instead of just marking it as unused. This is important for API consumers to understand the difference between normal and force unregistration.
def unregister_checkpoint(self, checkpoint_name: str, force: bool = False) -> None:
"""
Unregister a checkpoint from the parameter server. This function will also unregister the checkpoint
from p2p store if p2p store is initialized.
"""
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| ps.register_checkpoint( | ||
| "test_checkpoint_shared3", named_tensors=checkpoint_shared3, use_shared_memory_pool=True | ||
| ) |
There was a problem hiding this comment.
The test registers checkpoint_shared3 with different tensor shapes (4096x2048 and 4096) than checkpoint_shared2 (1024x1024, 1024, 2048x1024, 2048) after force unregistering the shared pool. However, according to the register_checkpoint docstring (line 875-877 in ps.py), "The pool's shape is fixed on first use and cannot accommodate checkpoints with different memory requirements." This test should verify that registering a checkpoint with different memory requirements after a force unregister works correctly, or it should assert/test for failure if the fixed shape restriction still applies after the shared pool is cleared.
Current checkpoint engine doesn't support any approach to release the memory in
__shared_memory_pool__. We need an interface to do so