Skip to content

feat: force unregister shared pin memory buffer supported#62

Merged
blahgeek merged 2 commits intoMoonshotAI:mainfrom
specture724:feat/force-unregister
Dec 11, 2025
Merged

feat: force unregister shared pin memory buffer supported#62
blahgeek merged 2 commits intoMoonshotAI:mainfrom
specture724:feat/force-unregister

Conversation

@specture724
Copy link
Collaborator

Current checkpoint engine doesn't support any approach to release the memory in __shared_memory_pool__. We need an interface to do so

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds support for force unregistration of shared pin memory buffer in the checkpoint engine. Previously, the engine could only mark the shared memory pool as unused but couldn't physically release it. The new force=True parameter in unregister_checkpoint() enables complete deallocation of the shared memory pool.

  • Added force parameter to unregister_checkpoint() method to enable physical release of shared memory pool
  • Updated p2p_store registration/unregistration to use consistent naming for shared memory pool
  • Enhanced test coverage with scenario testing force unregistration and subsequent re-registration

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
checkpoint_engine/ps.py Added force parameter to unregister_checkpoint method and implemented logic to physically delete and reset shared memory pool when force=True; updated p2p_store naming consistency for shared pool
tests/test_pin_memory.py Added test case for force unregistration of shared pool and re-registration with different checkpoint sizes
Comments suppressed due to low confidence (1)

checkpoint_engine/ps.py:923

  • The docstring for unregister_checkpoint should be updated to document the new 'force' parameter. The documentation should explain that when force=True, it physically releases the shared memory pool instead of just marking it as unused. This is important for API consumers to understand the difference between normal and force unregistration.
    def unregister_checkpoint(self, checkpoint_name: str, force: bool = False) -> None:
        """
        Unregister a checkpoint from the parameter server. This function will also unregister the checkpoint
        from p2p store if p2p store is initialized.
        """

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +69 to +71
ps.register_checkpoint(
"test_checkpoint_shared3", named_tensors=checkpoint_shared3, use_shared_memory_pool=True
)
Copy link

Copilot AI Dec 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test registers checkpoint_shared3 with different tensor shapes (4096x2048 and 4096) than checkpoint_shared2 (1024x1024, 1024, 2048x1024, 2048) after force unregistering the shared pool. However, according to the register_checkpoint docstring (line 875-877 in ps.py), "The pool's shape is fixed on first use and cannot accommodate checkpoints with different memory requirements." This test should verify that registering a checkpoint with different memory requirements after a force unregister works correctly, or it should assert/test for failure if the fixed shape restriction still applies after the shared pool is cleared.

Copilot uses AI. Check for mistakes.
@blahgeek blahgeek merged commit f69e116 into MoonshotAI:main Dec 11, 2025
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants