misc: split ps.py file into multiple files #64

Closed
specture724 wants to merge 8 commits into MoonshotAI:main from specture724:misc/split-files

specture724 (Collaborator) commented Dec 11, 2025

Most of the functions and classes live in ps.py, which has made the file very large. We need to split it up.

  • Move code related to RDMA devices and P2PStore into p2p_store.py
  • Move all data structures into data_types.py
  • Move code for reading files and registering pinned memory into pin_memory.py
  • Move FastAPI code into api.py
  • Move run_from_cli to __main__.py
  • Export all public interfaces in __init__.py
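
As a rough illustration of the last bullet, the sketch below builds a throwaway package on disk and re-exports a name from a submodule through __init__.py, so that callers can keep importing from the package root after a split. The package name demo_engine and the body of BucketRange are placeholders invented for the example; only the file names data_types.py and __init__.py come from the list above.

```python
import importlib
import sys
import tempfile
from pathlib import Path

PKG = "demo_engine"  # hypothetical package name, not the real project


def build_demo_package(root: Path) -> None:
    """Write a two-file package: one submodule plus a re-exporting __init__.py."""
    pkg_dir = root / PKG
    pkg_dir.mkdir()
    # Submodule holding code that was split out of a larger file.
    # The class body is a placeholder, not the project's real definition.
    (pkg_dir / "data_types.py").write_text(
        "class BucketRange:\n"
        "    def __init__(self, start, stop):\n"
        "        self.start, self.stop = start, stop\n"
    )
    # __init__.py re-exports the public name and lists it in __all__.
    (pkg_dir / "__init__.py").write_text(
        "from .data_types import BucketRange\n"
        "__all__ = ['BucketRange']\n"
    )


def load_demo_package():
    """Import the demo package from a temporary directory."""
    root = Path(tempfile.mkdtemp())
    build_demo_package(root)
    sys.path.insert(0, str(root))
    try:
        return importlib.import_module(PKG)
    finally:
        sys.path.remove(str(root))


pkg = load_demo_package()
# The class is reachable from the package root even though it is
# defined in the data_types submodule.
bucket = pkg.BucketRange(0, 4)
```

With this layout, `from demo_engine import BucketRange` keeps working no matter which submodule ends up owning the class.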

Copilot AI (Contributor) left a comment

Pull request overview

This PR refactors the large ps.py file into multiple smaller, more focused modules to improve code organization and maintainability. The core functionality remains unchanged; the refactoring only reorganizes existing code into logical units based on their responsibilities.

Key changes:

  • Extracted RDMA device handling and P2P store functionality into p2p_store.py
  • Moved all Pydantic data models and type definitions into data_types.py
  • Separated checkpoint loading and pin memory operations into pin_memory.py
  • Isolated FastAPI endpoints and HTTP communication into api.py
  • Created __main__.py as the CLI entry point
  • Updated __init__.py to expose all public interfaces
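
For readers unfamiliar with the __main__.py convention, here is a minimal sketch of how a package-level entry point is picked up when the package is run as a module. The package name demo_cli_pkg and the --port flag are made up for the example, and the sketch does not start a real server; only the function name run_from_cli is taken from the PR.

```python
import runpy
import sys
import tempfile
from pathlib import Path

PKG = "demo_cli_pkg"  # hypothetical package name

root = Path(tempfile.mkdtemp())
(root / PKG).mkdir()
(root / PKG / "__init__.py").write_text("")
(root / PKG / "__main__.py").write_text(
    "import argparse\n"
    "\n"
    "def run_from_cli(argv=None):\n"
    "    parser = argparse.ArgumentParser(prog='demo_cli_pkg')\n"
    "    # --port is an invented flag, used only for illustration.\n"
    "    parser.add_argument('--port', type=int, default=8000)\n"
    "    args = parser.parse_args(argv)\n"
    "    # A real entry point would start the server here.\n"
    "    return args.port\n"
    "\n"
    "if __name__ == '__main__':\n"
    "    RESULT = run_from_cli([])\n"
)

sys.path.insert(0, str(root))
try:
    # Roughly what `python -m demo_cli_pkg` does: locate the package's
    # __main__.py and execute it with __name__ set to "__main__".
    module_globals = runpy.run_module(PKG, run_name="__main__")
finally:
    sys.path.remove(str(root))

port = module_globals["RESULT"]
```

Keeping the entry point in __main__.py means that importing the package itself does not run any CLI parsing code.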

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| checkpoint_engine/data_types.py | New file containing all Pydantic models and type definitions (ParameterMeta, BucketRange, H2DBucket, etc.) with validators and serializers |
| checkpoint_engine/p2p_store.py | New file with RDMA device detection/parsing functions and the P2PStore class for managing peer-to-peer memory transfers |
| checkpoint_engine/pin_memory.py | New file with checkpoint loading functions (_load_checkpoint_file, _register_checkpoint) and memory pinning operations |
| checkpoint_engine/api.py | New file containing FastAPI application initialization (_init_api) and HTTP endpoint definitions for checkpoint management |
| checkpoint_engine/__main__.py | New file with CLI entry point (run_from_cli) for starting the parameter server via uvicorn |
| checkpoint_engine/__init__.py | Updated to export all public interfaces from the new modules with an explicit __all__ list |
| checkpoint_engine/ps.py | Significantly reduced file that now imports from the new modules and focuses solely on the ParameterServer class implementation |
| tests/test_rdma_parser.py | Updated imports and patch paths to reference the new checkpoint_engine.p2p_store module location |
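
The patch-path update in tests/test_rdma_parser.py reflects a general rule of unittest.mock: a patch has to target the module where the name is actually looked up at call time. The sketch below shows this with two in-memory stand-in modules; demo_ps, demo_p2p_store, _list_devices and pick_device are all hypothetical names and are not taken from the project.

```python
import sys
import types
from unittest import mock

# New home of the code after the split (hypothetical module and functions).
new_mod = types.ModuleType("demo_p2p_store")
exec(
    "def _list_devices():\n"
    "    return ['real0']\n"
    "\n"
    "def pick_device():\n"
    "    # Looks up _list_devices in this module's globals at call time.\n"
    "    return _list_devices()[0]\n",
    new_mod.__dict__,
)
sys.modules["demo_p2p_store"] = new_mod

# Old module that now only re-imports the moved names.
old_mod = types.ModuleType("demo_ps")
old_mod._list_devices = new_mod._list_devices
old_mod.pick_device = new_mod.pick_device
sys.modules["demo_ps"] = old_mod

# Patching the old location rebinds a name that pick_device never reads,
# so the real helper still runs.
with mock.patch("demo_ps._list_devices", return_value=["fake0"]):
    via_old_path = old_mod.pick_device()

# Patching the new location replaces the name pick_device actually uses.
with mock.patch("demo_p2p_store._list_devices", return_value=["fake0"]):
    via_new_path = new_mod.pick_device()
```

In this sketch the first patch has no effect and the second one does, which is why tests that patch internals usually need their target strings updated when code moves between modules.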

specture724 force-pushed the misc/split-files branch 2 times, most recently from 008a747 to a9f3642, December 11, 2025 10:13
specture724 force-pushed the misc/split-files branch 2 times, most recently from 093f569 to c4a1e7c, December 22, 2025 11:10

blahgeek (Collaborator) commented Jan 4, 2026

merged in #75

blahgeek closed this Jan 4, 2026