-
Notifications
You must be signed in to change notification settings - Fork 473
Refactored save, load, and from_checkpoint to support zip format by d… #1185
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…efault Signed-off-by: Charlelie Laurent <[email protected]>
|
/blossom-ci |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Greptile Overview
Greptile Summary
This PR refactors the checkpoint persistence mechanism in Module.save, Module.load, and Module.from_checkpoint to use zip archives instead of tar archives. The new implementation writes checkpoint data (model state dict, args, metadata) directly to zip archives using in-memory buffers, eliminating the need for temporary directory creation and reducing I/O overhead. A new _detect_checkpoint_format() static method enables automatic format detection, allowing the load methods to handle both new zip-based checkpoints and legacy tar-based checkpoints transparently. This change integrates with the existing checkpoint infrastructure that uses fsspec for filesystem abstraction and torch.save/torch.load for serialization, maintaining the same public API while improving performance for large model checkpoints.
PR Description Notes:
- Typo in PR body: "checkooints" should be "checkpoints" (second bullet point)
Important Files Changed
| Filename | Score | Overview |
|---|---|---|
| physicsnemo/models/module.py | 4/5 | Refactored checkpoint save/load from tar to zip format with backward compatibility detection and in-memory buffer-based writing |
Confidence score: 4/5
- This PR is safe to merge with minimal risk, as it maintains backward compatibility and follows established patterns
- Score reflects well-structured implementation with proper error handling and format detection, but deducted one point due to lack of explicit testing for edge cases like corrupted zip files, partial writes, or concurrent access scenarios
- Pay close attention to the zip file writing logic in the
save()method (lines 527-546) to ensure proper buffer management and the format detection logic in_detect_checkpoint_format()to verify it handles malformed archives gracefully
Sequence Diagram
sequenceDiagram
participant User
participant Module
participant FileSystem
participant ZipFile
participant TarFile
participant Torch
Note over User,Torch: Save Checkpoint Flow (New ZIP Format)
User->>Module: save(file_name)
Module->>Module: _save_process(recursively serialize nested modules)
Module->>ZipFile: Create temporary zip file
Module->>Torch: torch.save(state_dict)
Torch-->>Module: state_dict bytes
Module->>ZipFile: writestr("model.pt", state_dict)
Module->>ZipFile: writestr("args.json", args)
Module->>ZipFile: writestr("metadata.json", metadata)
Module->>FileSystem: Upload zip to destination
FileSystem-->>User: Checkpoint saved
Note over User,Torch: Load Checkpoint Flow (ZIP or TAR)
User->>Module: load(file_name) or from_checkpoint(file_name)
Module->>FileSystem: _download_cached(file_name)
FileSystem-->>Module: cached_file_name
Module->>Module: _detect_checkpoint_format(cached_file_name)
alt ZIP Format
Module->>ZipFile: Open zip archive
ZipFile->>Module: Read args.json
ZipFile->>Module: Read metadata.json
Module->>Module: _from_checkpoint_process(recursively instantiate)
ZipFile->>Module: Read model.pt
Module->>Torch: torch.load(model.pt)
Torch-->>Module: state_dict
Module->>Module: _load_state_dict_with_logging(state_dict)
else TAR Format (Backward Compatibility)
Module->>TarFile: Extract to temporary directory
TarFile-->>Module: Extracted files
Module->>Module: Read args.json from disk
Module->>Module: Read metadata.json from disk
Module->>Module: _from_checkpoint_process(recursively instantiate)
Module->>Torch: torch.load(model.pt from disk)
Torch-->>Module: state_dict
Module->>Module: _load_state_dict_with_logging(state_dict)
end
Module-->>User: Model loaded
1 file reviewed, 3 comments
Signed-off-by: Charlelie Laurent <[email protected]>
Signed-off-by: Charlelie Laurent <[email protected]>
|
/blossom-ci |
|
/blossom-ci |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Greptile Overview
Greptile Summary
This PR successfully refactors checkpoint serialization from tar to zip format, delivering the performance improvements described in issue #1170. The changes eliminate double-writing of checkpoint data and enable direct in-memory operations.
Key Changes:
save()now writes directly to zip archives usingio.BytesIObuffers, removing the need for temporary directoriesload()andfrom_checkpoint()read directly from zip files without extraction to disk- Backward compatibility maintained by detecting checkpoint format (tar vs zip) with tar checked first
- Improved docstrings with comprehensive examples for all three methods
- New
_detect_checkpoint_format()helper method for format detection
Architecture Benefits:
- Eliminates temporary directory overhead during saves
- Enables direct reading from zip archives via file-like objects
- Zip's central directory structure allows quick metadata inspection
- Matches common Python packaging standards (
.whl,.pthfiles)
The implementation correctly handles the migration path, allowing the codebase to read both legacy tar checkpoints and new zip checkpoints seamlessly.
Confidence Score: 4/5
- This PR is safe to merge with minor style improvements suggested
- The refactoring is well-structured with proper backward compatibility. Previous reviewers identified three style-level resource management improvements (exception handling specificity, context manager scope optimization, and temp file cleanup timing). No critical logical errors or security vulnerabilities were found. The implementation follows Python best practices for zip file handling and maintains API compatibility.
- No files require special attention - the suggested improvements are minor style enhancements
Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| physicsnemo/models/module.py | 4/5 | Refactored checkpoint save/load from tar to zip format with backward compatibility; previous style comments flagged resource management issues |
Sequence Diagram
sequenceDiagram
participant User
participant Module
participant ZipFile
participant FileSystem
participant TorchLoad
Note over User,TorchLoad: Save Operation (New Zip Format)
User->>Module: save(file_name)
Module->>Module: Prepare metadata & args
Module->>FileSystem: _get_fs(file_name)
Module->>Module: Create temp file (delete=False)
Module->>ZipFile: Open zip for writing
Module->>Module: state_dict() -> BytesIO buffer
Module->>ZipFile: writestr("model.pt", buffer)
Module->>ZipFile: writestr("args.json", json)
Module->>ZipFile: writestr("metadata.json", json)
ZipFile-->>Module: Close archive
Module->>FileSystem: put(tmp_path, file_name)
Module->>Module: Clean up temp file
Note over User,TorchLoad: Load Operation (Zip Format)
User->>Module: load(file_name)
Module->>Module: _download_cached(file_name)
Module->>Module: _detect_checkpoint_format()
alt Zip Format
Module->>ZipFile: Open zip for reading
Module->>ZipFile: Check expected files exist
Module->>ZipFile: open("model.pt")
Module->>TorchLoad: torch.load(file_obj, device)
TorchLoad-->>Module: state_dict
ZipFile-->>Module: Close archive
else Tar Format (Backward Compatibility)
Module->>Module: Create temp directory
Module->>Module: Extract tar to temp dir
Module->>Module: _check_checkpoint()
Module->>TorchLoad: torch.load(path, device)
TorchLoad-->>Module: state_dict
Module->>Module: Clean up temp directory
end
Module->>Module: _load_state_dict_with_logging()
Note over User,TorchLoad: From Checkpoint Operation
User->>Module: from_checkpoint(file_name)
Module->>Module: _download_cached(file_name)
Module->>Module: _detect_checkpoint_format()
alt Zip Format
Module->>ZipFile: Open zip for reading
Module->>ZipFile: Load args.json and metadata.json
Module->>Module: _from_checkpoint_process() - instantiate model
Module->>ZipFile: open("model.pt")
Module->>TorchLoad: torch.load(file_obj, device)
TorchLoad-->>Module: state_dict
ZipFile-->>Module: Close archive
else Tar Format
Module->>Module: Extract and process tar checkpoint
Module->>Module: _from_checkpoint_process() - instantiate model
Module->>TorchLoad: Load state_dict from extracted files
end
Module->>Module: _load_state_dict_with_logging()
Module-->>User: Return instantiated model
1 file reviewed, no comments
|
@CharlelieLrt, Sorry if I am out of the loop but what was the ask for this? |
@loliverhennigh it's an ask from @nbren12 in #1170 . |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking through the code nothing jumps out at me. It would be nice to have a specific unit test for this but I know we already have a lot of unit tests for saving and loading models. I can't think of any case that will not come out in the unit cases for this.
|
I would also say it would be nice to be able to still save old tar style checkpoints. Probably not really needed though. |
Signed-off-by: Charlelie Laurent <[email protected]>
|
/blossom-ci |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Greptile Overview
Greptile Summary
This PR successfully refactors the checkpoint format from tar to zip, improving I/O performance by eliminating redundant disk writes and enabling efficient metadata access.
Key Changes:
- Default checkpoint format changed from tar to zip using
zipfile.ZipFile - Added
legacy_formatparameter tosave()method to support tar format when needed - Implemented
_detect_checkpoint_format()for automatic format detection during load - Zip format writes data directly using
BytesIObuffers, avoiding temporary directories - Both
load()andfrom_checkpoint()support both formats with automatic detection - Enhanced docstrings with detailed examples and parameter descriptions
Backward Compatibility:
- Existing tar-based checkpoints can still be loaded automatically
- Format detection checks tar first (since old checkpoints may be both zip and tar)
- Legacy tar format available via
legacy_format=Trueparameter
Performance Benefits:
- Eliminates double-write overhead (data written once directly to zip)
- No temporary directory extraction needed for zip format
- Faster metadata access due to zip's central directory structure
Confidence Score: 4/5
- This PR is safe to merge with some minor concerns about error handling
- The implementation is solid with good backward compatibility. The zip format change is well-implemented with proper resource management. Previous review comments identified valid issues with exception handling and temp file cleanup that should be addressed, but these don't block functionality. The lack of explicit tests for legacy_format parameter and format detection is a minor concern, though existing tests will exercise the default zip path.
- No files require special attention - the issues identified in previous comments (exception handling at line 620, state_dict loading context at line 710, temp file cleanup at line 540) are worth addressing but not critical
Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| physicsnemo/models/module.py | 4/5 | Refactors checkpoint save/load from tar to zip format for better performance, maintains backward compatibility with tar format, improves docstrings with examples |
Sequence Diagram
sequenceDiagram
participant User
participant Module
participant ZipFile
participant FileSystem
participant TempFile
Note over User,TempFile: Save Operation (Zip Format - Default)
User->>Module: save(file_name, legacy_format=False)
Module->>Module: Prepare metadata & args
Module->>TempFile: Create temp .zip file
Module->>ZipFile: Open temp file for writing
Module->>ZipFile: Write model.pt (via BytesIO)
Module->>ZipFile: Write args.json
Module->>ZipFile: Write metadata.json
ZipFile-->>Module: Close archive
Module->>FileSystem: Upload to destination
Module->>TempFile: Clean up temp file
Module-->>User: Success
Note over User,TempFile: Load Operation (Auto-detect Format)
User->>Module: load(file_name)
Module->>FileSystem: Download & cache checkpoint
Module->>Module: _detect_checkpoint_format()
alt Zip Format
Module->>ZipFile: Open checkpoint
Module->>ZipFile: Validate expected files
Module->>ZipFile: Read model.pt
ZipFile-->>Module: Return state_dict
ZipFile-->>Module: Close archive
else Tar Format (Legacy)
Module->>TempFile: Create temp directory
Module->>Module: Extract tar to temp dir
Module->>TempFile: Read model.pt from disk
TempFile-->>Module: Return state_dict
Module->>TempFile: Cleanup temp directory
end
Module->>Module: Load state_dict into model
Module-->>User: Success
1 file reviewed, no comments
Good point. I added the option to Regarding the tests, the reason why I didn't add any new one is because we have already a lot of tests that fall under these 2 categories:
|
|
/blossom-ci |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Greptile Overview
Greptile Summary
This PR successfully migrates checkpoint format from tar to zip archives, eliminating redundant I/O operations and temporary directory usage. The implementation maintains full backward compatibility with existing tar-based checkpoints through automatic format detection.
Key changes:
- New default zip format writes directly to in-memory buffers before creating archive, reducing disk I/O by half
legacy_formatparameter added tosave()method to optionally use tar format_detect_checkpoint_format()static method intelligently determines checkpoint type (checks tar first since old checkpoints are both)- Both
load()andfrom_checkpoint()handle zip and tar formats transparently - Enhanced docstrings with detailed parameter descriptions and usage examples
Confidence Score: 4/5
- This PR is safe to merge with only minor style improvements recommended
- The implementation correctly handles the migration from tar to zip format with proper backward compatibility. The code follows good practices including proper resource cleanup with try/finally blocks, format auto-detection, and comprehensive docstrings. Previous reviewers already identified the main style issues (catch-all exception handling, resource management patterns). No critical logic errors or security vulnerabilities were found. The confidence score of 4/5 reflects that while the implementation is solid, the style issues should be addressed.
- No files require special attention - the style issues flagged in previous comments are minor
Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| physicsnemo/models/module.py | 4/5 | Refactored checkpoint format from tar to zip by default with backward compatibility for tar format. Implementation is sound but has minor style issues already flagged. |
Sequence Diagram
sequenceDiagram
participant User
participant Module
participant ZipFile
participant TarFile
participant FileSystem as fs
alt Save (zip format - default)
User->>Module: save(file_name)
Module->>Module: Prepare metadata & args
Module->>ZipFile: Create temp zip file
Module->>ZipFile: Write state_dict to BytesIO
ZipFile->>ZipFile: writestr("model.pt")
ZipFile->>ZipFile: writestr("args.json")
ZipFile->>ZipFile: writestr("metadata.json")
Module->>FileSystem: put(tmp_path, file_name)
Module->>Module: cleanup temp file
end
alt Save (legacy tar format)
User->>Module: save(file_name, legacy_format=True)
Module->>Module: Prepare metadata & args
Module->>TarFile: Create temp directory
TarFile->>TarFile: Write model.pt to disk
TarFile->>TarFile: Write args.json to disk
TarFile->>TarFile: Write metadata.json to disk
TarFile->>TarFile: Create tar from files
Module->>FileSystem: put(tar_path, file_name)
end
alt Load/from_checkpoint (zip format)
User->>Module: load(file_name) or from_checkpoint(file_name)
Module->>FileSystem: _download_cached(file_name)
Module->>Module: _detect_checkpoint_format()
Module->>ZipFile: Open zip archive
ZipFile->>ZipFile: Validate expected files present
ZipFile->>Module: Read args.json & metadata.json
Module->>Module: _from_checkpoint_process() [if from_checkpoint]
ZipFile->>Module: torch.load(model.pt)
Module->>Module: load_state_dict()
end
alt Load/from_checkpoint (tar format - backward compat)
User->>Module: load(file_name) or from_checkpoint(file_name)
Module->>FileSystem: _download_cached(file_name)
Module->>Module: _detect_checkpoint_format()
Module->>TarFile: Extract to temp directory
TarFile->>Module: Read files from disk
Module->>Module: _from_checkpoint_process() [if from_checkpoint]
Module->>Module: torch.load(model.pt)
Module->>Module: load_state_dict()
end
1 file reviewed, no comments
|
/blossom-ci |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Greptile Overview
Greptile Summary
Refactored checkpoint saving/loading from tar to zip format to eliminate double I/O overhead and enable direct file access without extraction. The implementation maintains backward compatibility by detecting and handling both tar and zip formats during load operations.
Key Changes:
save()now writes directly to zip archives using in-memory buffers, avoiding temporary directory creation- Added
legacy_formatparameter tosave()for optional tar-based checkpoints load()andfrom_checkpoint()auto-detect format and handle both zip (direct access) and tar (extraction-based) checkpoints- New
_detect_checkpoint_format()helper method for format detection
Issues Found:
- Resource leak risk in
save()- temp file created without immediate try block protection (line 540-541) - Performance issue in
load()andfrom_checkpoint()- zip archives remain open during potentially slowtorch.loadoperations
Confidence Score: 3/5
- This PR is moderately safe to merge with some resource management concerns
- The refactoring from tar to zip format is well-implemented with proper backward compatibility, but contains a critical resource leak risk in the save() method where temporary files could be leaked if exceptions occur during zip file creation. Additionally, zip archives are held open during torch.load operations which could cause performance issues with large models. The backward compatibility logic appears sound and the format detection is properly ordered.
- physicsnemo/models/module.py - Pay attention to the temporary file handling in save() method (lines 540-563) and the resource management in load/from_checkpoint methods
Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| physicsnemo/models/module.py | 3/5 | Refactored checkpoint format from tar to zip with backward compatibility. Found resource management issue in save() (temp file leak risk) and performance issues in load/from_checkpoint (zip file held open during torch.load) |
Sequence Diagram
sequenceDiagram
participant User
participant Module
participant ZipFile
participant TarFile
participant FileSystem
participant TempFile
Note over User,TempFile: Save Operation (New Zip Format)
User->>Module: save(file_name, legacy_format=False)
Module->>TempFile: Create NamedTemporaryFile(.zip)
Module->>ZipFile: Open zip archive (tmp_path)
Module->>ZipFile: writestr("model.pt", state_dict)
Module->>ZipFile: writestr("args.json", args)
Module->>ZipFile: writestr("metadata.json", metadata)
ZipFile-->>Module: Close archive
Module->>FileSystem: put(tmp_path, file_name)
Module->>TempFile: Remove temp file
Module-->>User: Checkpoint saved
Note over User,TempFile: Load Operation (Zip Format)
User->>Module: load(file_name)
Module->>FileSystem: Download and cache checkpoint
Module->>Module: _detect_checkpoint_format()
alt Zip Format
Module->>ZipFile: Open cached_file_name
Module->>ZipFile: Check expected files exist
Module->>ZipFile: open("model.pt")
Module->>Module: torch.load(from zip)
ZipFile-->>Module: Close archive
else Tar Format (Backward Compatibility)
Module->>TempFile: Create TemporaryDirectory
Module->>TarFile: Open and extract to temp_dir
Module->>Module: torch.load(from temp_dir/model.pt)
Module->>TempFile: Cleanup temp directory
end
Module->>Module: load_state_dict()
Module-->>User: Weights loaded
Note over User,TempFile: From Checkpoint (Zip Format)
User->>Module: from_checkpoint(file_name)
Module->>FileSystem: Download and cache checkpoint
Module->>Module: _detect_checkpoint_format()
Module->>ZipFile: Open cached_file_name
Module->>ZipFile: open("args.json")
Module->>ZipFile: open("metadata.json")
Module->>Module: Instantiate model from args
Module->>ZipFile: open("model.pt")
Module->>Module: torch.load(from zip)
ZipFile-->>Module: Close archive
Module->>Module: load_state_dict()
Module-->>User: Return instantiated model
1 file reviewed, 3 comments
…ive + safer temporaruy file creation Signed-off-by: Charlelie Laurent <[email protected]>
…/modulus into checkpoint-zip-file
|
/blossom-ci |
|
/blossom-ci |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Everything seems reasonable to me.
|
/blossom-ci |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Greptile Overview
Greptile Summary
This PR refactors checkpoint format from tar to zip for improved I/O performance and eliminates double-writing overhead. The implementation successfully adds backward compatibility for reading legacy tar checkpoints.
Key Changes
save()now uses zip format by default, writes directly to archive without temporary directoriesload()andfrom_checkpoint()detect format automatically and handle both zip and tar- Improved resource management: model data loaded into memory before closing zip archives
- Enhanced docstrings with clear examples and parameter descriptions
Issues Found
- Critical bug in
save()method: Variabletmp_pathreferenced infinallyblock can be undefined if exception occurs during temp file creation, causingNameError(line 540-563)
Notes
- The comment on line 608-610 states "older checkpoints will be both zip and tar archives" which seems unclear - tar files are typically not also zip files
- Exception handling in
_detect_checkpoint_format()appropriately re-raises with context preservation
Confidence Score: 2/5
- Critical bug in exception handling makes this unsafe to merge without fix
- The
save()method has a logic error wheretmp_pathcan be undefined in thefinallyblock, causing runtimeNameError. This will break checkpoint saving if any exception occurs during temp file creation. The rest of the implementation is solid with good backward compatibility and improved resource management. physicsnemo/models/module.pylines 540-563 - fix undefined variable in exception handler
Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| physicsnemo/models/module.py | 2/5 | Refactored checkpoint format from tar to zip with backward compatibility. Critical bug: undefined variable in exception handling (line 540-563). Good resource management improvements in load methods. |
Sequence Diagram
sequenceDiagram
participant User
participant Module
participant ZipFile
participant FileSystem
participant TempFile
Note over User,FileSystem: Save Operation (Zip Format)
User->>Module: save(file_name)
Module->>TempFile: Create NamedTemporaryFile
TempFile-->>Module: tmp_path
Module->>ZipFile: Create archive at tmp_path
Module->>Module: torch.save(state_dict) to BytesIO
Module->>ZipFile: writestr("model.pt", buffer)
Module->>ZipFile: writestr("args.json", json)
Module->>ZipFile: writestr("metadata.json", json)
Module->>ZipFile: Close archive
Module->>FileSystem: put(tmp_path, file_name)
Module->>TempFile: Remove temporary file
Note over User,FileSystem: Load Operation (Zip Format)
User->>Module: load(file_name)
Module->>FileSystem: Download and cache file
FileSystem-->>Module: cached_file_name
Module->>Module: _detect_checkpoint_format()
Module->>ZipFile: Open archive (read mode)
Module->>ZipFile: Validate expected files present
Module->>ZipFile: read("model.pt") into memory
Module->>ZipFile: Close archive
Module->>Module: torch.load(BytesIO(model_bytes))
Module->>Module: Load state_dict into model
Note over User,FileSystem: from_checkpoint Operation (Zip Format)
User->>Module: from_checkpoint(file_name)
Module->>FileSystem: Download and cache file
FileSystem-->>Module: cached_file_name
Module->>Module: _detect_checkpoint_format()
Module->>ZipFile: Open archive (read mode)
Module->>ZipFile: Validate expected files present
Module->>ZipFile: open("args.json") and read
Module->>ZipFile: open("metadata.json") and read
Module->>Module: Instantiate model from args
Module->>ZipFile: read("model.pt") into memory
Module->>ZipFile: Close archive
Module->>Module: torch.load(BytesIO(model_bytes))
Module->>Module: Load state_dict into model
Module-->>User: Return model instance
1 file reviewed, 1 comment
|
/blossom-ci |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Greptile Overview
Greptile Summary
This PR refactors the checkpoint saving/loading mechanism from tar to zip format to improve I/O performance and eliminate redundant disk writes.
Key Changes:
Module.save()now uses zip format by default (withlegacy_formatflag for backward compatibility)- Added
_detect_checkpoint_format()method to auto-detect checkpoint format Module.load()andModule.from_checkpoint()support both zip and tar formats- Zip format writes data directly to archive in memory (via
BytesIO), eliminating temporary directories - Enhanced docstrings with examples and clearer parameter descriptions
Benefits:
- Removes double-write overhead (temp dir + tar archive)
- Faster metadata access (zip stores directory at end of file)
- Maintains full backward compatibility with existing tar checkpoints
- Aligns with industry standards (Python wheels, PyTorch
.pthfiles use zip)
Testing: Existing test suite should cover both formats through the auto-detection mechanism.
Confidence Score: 4/5
- This PR is safe to merge with the existing style suggestions addressed
- The implementation is technically sound with proper backward compatibility. Previous reviewers identified several style improvements around resource management and error handling that should be addressed, but no critical logical bugs. The core refactoring from tar to zip is well-designed and follows best practices.
- No files require special attention beyond addressing the existing style comments
Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| physicsnemo/models/module.py | 4/5 | Refactors checkpoint format from tar to zip, improving I/O performance. Previous comments addressed major issues with temp file handling and resource management. Implementation is sound with backward compatibility for tar format. |
Sequence Diagram
sequenceDiagram
participant User
participant Module
participant ZipFile
participant TempFile
participant FileSystem
Note over User,FileSystem: Save Operation (New Zip Format)
User->>Module: save(file_name)
Module->>Module: Prepare metadata & args
Module->>TempFile: Create temp .zip file
Module->>ZipFile: Open zip archive (write mode)
Module->>ZipFile: Write model.pt (state_dict in memory)
Module->>ZipFile: Write args.json
Module->>ZipFile: Write metadata.json
ZipFile-->>Module: Close archive
Module->>FileSystem: Upload temp file to destination
Module->>TempFile: Clean up temp file
Note over User,FileSystem: Load Operation (Auto-detect Format)
User->>Module: load(file_name) or from_checkpoint(file_name)
Module->>FileSystem: Download & cache checkpoint
Module->>Module: _detect_checkpoint_format()
alt Zip Format (New)
Module->>ZipFile: Open zip archive (read mode)
Module->>ZipFile: Validate expected files exist
Module->>ZipFile: Read model.pt into memory
ZipFile-->>Module: Close archive
Module->>Module: torch.load from BytesIO
else Tar Format (Legacy)
Module->>TempFile: Create temp directory
Module->>TarFile: Extract to temp directory
Module->>Module: Load from extracted files
Module->>TempFile: Clean up temp directory
end
Module->>User: Return loaded model
1 file reviewed, no comments
PhysicsNeMo Pull Request
Description
This PR modifies the format backend for saving/loading model checkpoints:
Module.savenow saves.mdluscheckpoints using zip format rather thantar.Module.loadandModule.from_checkpointcan not only readzip-based checkpoint files but still support readingtar-based checkooints for backward compatibility.Closes #1170 .
Checklist
Dependencies