Skip to content

Conversation

@CharlelieLrt
Copy link
Collaborator

PhysicsNeMo Pull Request

Description

This PR modifies the format backend for saving/loading model checkpoints:

  • The method Module.save now saves .mdlus checkpoints using zip format rather than tar.
  • The methods Module.load and Module.from_checkpoint can not only read zip-based checkpoint files but still support reading tar-based checkooints for backward compatibility.
  • Minor improvements in the docstrings of these three methods are made.

Closes #1170 .

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
  • The CHANGELOG.md is up to date with these changes.
  • An issue is linked to this pull request.

Dependencies

@CharlelieLrt CharlelieLrt self-assigned this Oct 25, 2025
@CharlelieLrt CharlelieLrt added the ! - Release PRs or Issues releating to a release label Oct 25, 2025
@CharlelieLrt
Copy link
Collaborator Author

/blossom-ci

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Greptile Overview

Greptile Summary

This PR refactors the checkpoint persistence mechanism in Module.save, Module.load, and Module.from_checkpoint to use zip archives instead of tar archives. The new implementation writes checkpoint data (model state dict, args, metadata) directly to zip archives using in-memory buffers, eliminating the need for temporary directory creation and reducing I/O overhead. A new _detect_checkpoint_format() static method enables automatic format detection, allowing the load methods to handle both new zip-based checkpoints and legacy tar-based checkpoints transparently. This change integrates with the existing checkpoint infrastructure that uses fsspec for filesystem abstraction and torch.save/torch.load for serialization, maintaining the same public API while improving performance for large model checkpoints.

PR Description Notes:

  • Typo in PR body: "checkooints" should be "checkpoints" (second bullet point)

Important Files Changed

Filename Score Overview
physicsnemo/models/module.py 4/5 Refactored checkpoint save/load from tar to zip format with backward compatibility detection and in-memory buffer-based writing

Confidence score: 4/5

  • This PR is safe to merge with minimal risk, as it maintains backward compatibility and follows established patterns
  • Score reflects well-structured implementation with proper error handling and format detection, but deducted one point due to lack of explicit testing for edge cases like corrupted zip files, partial writes, or concurrent access scenarios
  • Pay close attention to the zip file writing logic in the save() method (lines 527-546) to ensure proper buffer management and the format detection logic in _detect_checkpoint_format() to verify it handles malformed archives gracefully

Sequence Diagram

sequenceDiagram
    participant User
    participant Module
    participant FileSystem
    participant ZipFile
    participant TarFile
    participant Torch

    Note over User,Torch: Save Checkpoint Flow (New ZIP Format)
    User->>Module: save(file_name)
    Module->>Module: _save_process(recursively serialize nested modules)
    Module->>ZipFile: Create temporary zip file
    Module->>Torch: torch.save(state_dict)
    Torch-->>Module: state_dict bytes
    Module->>ZipFile: writestr("model.pt", state_dict)
    Module->>ZipFile: writestr("args.json", args)
    Module->>ZipFile: writestr("metadata.json", metadata)
    Module->>FileSystem: Upload zip to destination
    FileSystem-->>User: Checkpoint saved

    Note over User,Torch: Load Checkpoint Flow (ZIP or TAR)
    User->>Module: load(file_name) or from_checkpoint(file_name)
    Module->>FileSystem: _download_cached(file_name)
    FileSystem-->>Module: cached_file_name
    Module->>Module: _detect_checkpoint_format(cached_file_name)
    alt ZIP Format
        Module->>ZipFile: Open zip archive
        ZipFile->>Module: Read args.json
        ZipFile->>Module: Read metadata.json
        Module->>Module: _from_checkpoint_process(recursively instantiate)
        ZipFile->>Module: Read model.pt
        Module->>Torch: torch.load(model.pt)
        Torch-->>Module: state_dict
        Module->>Module: _load_state_dict_with_logging(state_dict)
    else TAR Format (Backward Compatibility)
        Module->>TarFile: Extract to temporary directory
        TarFile-->>Module: Extracted files
        Module->>Module: Read args.json from disk
        Module->>Module: Read metadata.json from disk
        Module->>Module: _from_checkpoint_process(recursively instantiate)
        Module->>Torch: torch.load(model.pt from disk)
        Torch-->>Module: state_dict
        Module->>Module: _load_state_dict_with_logging(state_dict)
    end
    Module-->>User: Model loaded
Loading

1 file reviewed, 3 comments

Edit Code Review Agent Settings | Greptile

@CharlelieLrt
Copy link
Collaborator Author

/blossom-ci

@loliverhennigh
Copy link
Collaborator

/blossom-ci

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Greptile Overview

Greptile Summary

This PR successfully refactors checkpoint serialization from tar to zip format, delivering the performance improvements described in issue #1170. The changes eliminate double-writing of checkpoint data and enable direct in-memory operations.

Key Changes:

  • save() now writes directly to zip archives using io.BytesIO buffers, removing the need for temporary directories
  • load() and from_checkpoint() read directly from zip files without extraction to disk
  • Backward compatibility maintained by detecting checkpoint format (tar vs zip) with tar checked first
  • Improved docstrings with comprehensive examples for all three methods
  • New _detect_checkpoint_format() helper method for format detection

Architecture Benefits:

  • Eliminates temporary directory overhead during saves
  • Enables direct reading from zip archives via file-like objects
  • Zip's central directory structure allows quick metadata inspection
  • Matches common Python packaging standards (.whl, .pth files)

The implementation correctly handles the migration path, allowing the codebase to read both legacy tar checkpoints and new zip checkpoints seamlessly.

Confidence Score: 4/5

  • This PR is safe to merge with minor style improvements suggested
  • The refactoring is well-structured with proper backward compatibility. Previous reviewers identified three style-level resource management improvements (exception handling specificity, context manager scope optimization, and temp file cleanup timing). No critical logical errors or security vulnerabilities were found. The implementation follows Python best practices for zip file handling and maintains API compatibility.
  • No files require special attention - the suggested improvements are minor style enhancements

Important Files Changed

File Analysis

Filename Score Overview
physicsnemo/models/module.py 4/5 Refactored checkpoint save/load from tar to zip format with backward compatibility; previous style comments flagged resource management issues

Sequence Diagram

sequenceDiagram
    participant User
    participant Module
    participant ZipFile
    participant FileSystem
    participant TorchLoad

    Note over User,TorchLoad: Save Operation (New Zip Format)
    User->>Module: save(file_name)
    Module->>Module: Prepare metadata & args
    Module->>FileSystem: _get_fs(file_name)
    Module->>Module: Create temp file (delete=False)
    Module->>ZipFile: Open zip for writing
    Module->>Module: state_dict() -> BytesIO buffer
    Module->>ZipFile: writestr("model.pt", buffer)
    Module->>ZipFile: writestr("args.json", json)
    Module->>ZipFile: writestr("metadata.json", json)
    ZipFile-->>Module: Close archive
    Module->>FileSystem: put(tmp_path, file_name)
    Module->>Module: Clean up temp file

    Note over User,TorchLoad: Load Operation (Zip Format)
    User->>Module: load(file_name)
    Module->>Module: _download_cached(file_name)
    Module->>Module: _detect_checkpoint_format()
    alt Zip Format
        Module->>ZipFile: Open zip for reading
        Module->>ZipFile: Check expected files exist
        Module->>ZipFile: open("model.pt")
        Module->>TorchLoad: torch.load(file_obj, device)
        TorchLoad-->>Module: state_dict
        ZipFile-->>Module: Close archive
    else Tar Format (Backward Compatibility)
        Module->>Module: Create temp directory
        Module->>Module: Extract tar to temp dir
        Module->>Module: _check_checkpoint()
        Module->>TorchLoad: torch.load(path, device)
        TorchLoad-->>Module: state_dict
        Module->>Module: Clean up temp directory
    end
    Module->>Module: _load_state_dict_with_logging()

    Note over User,TorchLoad: From Checkpoint Operation
    User->>Module: from_checkpoint(file_name)
    Module->>Module: _download_cached(file_name)
    Module->>Module: _detect_checkpoint_format()
    alt Zip Format
        Module->>ZipFile: Open zip for reading
        Module->>ZipFile: Load args.json and metadata.json
        Module->>Module: _from_checkpoint_process() - instantiate model
        Module->>ZipFile: open("model.pt")
        Module->>TorchLoad: torch.load(file_obj, device)
        TorchLoad-->>Module: state_dict
        ZipFile-->>Module: Close archive
    else Tar Format
        Module->>Module: Extract and process tar checkpoint
        Module->>Module: _from_checkpoint_process() - instantiate model
        Module->>TorchLoad: Load state_dict from extracted files
    end
    Module->>Module: _load_state_dict_with_logging()
    Module-->>User: Return instantiated model
Loading

1 file reviewed, no comments

Edit Code Review Agent Settings | Greptile

@loliverhennigh
Copy link
Collaborator

@CharlelieLrt, Sorry if I am out of the loop but what was the ask for this?

@CharlelieLrt
Copy link
Collaborator Author

@CharlelieLrt, Sorry if I am out of the loop but what was the ask for this?

@loliverhennigh it's an ask from @nbren12 in #1170 .

Copy link
Collaborator

@loliverhennigh loliverhennigh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking through the code nothing jumps out at me. It would be nice to have a specific unit test for this but I know we already have a lot of unit tests for saving and loading models. I can't think of any case that will not come out in the unit cases for this.

@loliverhennigh
Copy link
Collaborator

I would also say it would be nice to be able to still save old tar style checkpoints. Probably not really needed though.

@CharlelieLrt
Copy link
Collaborator Author

/blossom-ci

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Greptile Overview

Greptile Summary

This PR successfully refactors the checkpoint format from tar to zip, improving I/O performance by eliminating redundant disk writes and enabling efficient metadata access.

Key Changes:

  • Default checkpoint format changed from tar to zip using zipfile.ZipFile
  • Added legacy_format parameter to save() method to support tar format when needed
  • Implemented _detect_checkpoint_format() for automatic format detection during load
  • Zip format writes data directly using BytesIO buffers, avoiding temporary directories
  • Both load() and from_checkpoint() support both formats with automatic detection
  • Enhanced docstrings with detailed examples and parameter descriptions

Backward Compatibility:

  • Existing tar-based checkpoints can still be loaded automatically
  • Format detection checks tar first (since old checkpoints may be both zip and tar)
  • Legacy tar format available via legacy_format=True parameter

Performance Benefits:

  • Eliminates double-write overhead (data written once directly to zip)
  • No temporary directory extraction needed for zip format
  • Faster metadata access due to zip's central directory structure

Confidence Score: 4/5

  • This PR is safe to merge with some minor concerns about error handling
  • The implementation is solid with good backward compatibility. The zip format change is well-implemented with proper resource management. Previous review comments identified valid issues with exception handling and temp file cleanup that should be addressed, but these don't block functionality. The lack of explicit tests for legacy_format parameter and format detection is a minor concern, though existing tests will exercise the default zip path.
  • No files require special attention - the issues identified in previous comments (exception handling at line 620, state_dict loading context at line 710, temp file cleanup at line 540) are worth addressing but not critical

Important Files Changed

File Analysis

Filename Score Overview
physicsnemo/models/module.py 4/5 Refactors checkpoint save/load from tar to zip format for better performance, maintains backward compatibility with tar format, improves docstrings with examples

Sequence Diagram

sequenceDiagram
    participant User
    participant Module
    participant ZipFile
    participant FileSystem
    participant TempFile
    
    Note over User,TempFile: Save Operation (Zip Format - Default)
    User->>Module: save(file_name, legacy_format=False)
    Module->>Module: Prepare metadata & args
    Module->>TempFile: Create temp .zip file
    Module->>ZipFile: Open temp file for writing
    Module->>ZipFile: Write model.pt (via BytesIO)
    Module->>ZipFile: Write args.json
    Module->>ZipFile: Write metadata.json
    ZipFile-->>Module: Close archive
    Module->>FileSystem: Upload to destination
    Module->>TempFile: Clean up temp file
    Module-->>User: Success
    
    Note over User,TempFile: Load Operation (Auto-detect Format)
    User->>Module: load(file_name)
    Module->>FileSystem: Download & cache checkpoint
    Module->>Module: _detect_checkpoint_format()
    alt Zip Format
        Module->>ZipFile: Open checkpoint
        Module->>ZipFile: Validate expected files
        Module->>ZipFile: Read model.pt
        ZipFile-->>Module: Return state_dict
        ZipFile-->>Module: Close archive
    else Tar Format (Legacy)
        Module->>TempFile: Create temp directory
        Module->>Module: Extract tar to temp dir
        Module->>TempFile: Read model.pt from disk
        TempFile-->>Module: Return state_dict
        Module->>TempFile: Cleanup temp directory
    end
    Module->>Module: Load state_dict into model
    Module-->>User: Success
Loading

1 file reviewed, no comments

Edit Code Review Agent Settings | Greptile

@CharlelieLrt
Copy link
Collaborator Author

@loliverhennigh

I would also say it would be nice to be able to still save old tar style checkpoints. Probably not really needed though.

Good point. I added the option to Module.save with a legacy_format kwarg.

Regarding the tests, the reason why I didn't add any new one is because we have already a lot of tests that fall under these 2 categories:

  • Non-regression tests that load a reference checkpoint. Those ensure that we can still load the legacy tar-based checkpoints.

  • Tests that instantiate a model, save a checkpoint and reload it. Those ensure that we are able to save and load the new zip-based format.

@CharlelieLrt
Copy link
Collaborator Author

/blossom-ci

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Greptile Overview

Greptile Summary

This PR successfully migrates checkpoint format from tar to zip archives, eliminating redundant I/O operations and temporary directory usage. The implementation maintains full backward compatibility with existing tar-based checkpoints through automatic format detection.

Key changes:

  • New default zip format writes directly to in-memory buffers before creating archive, reducing disk I/O by half
  • legacy_format parameter added to save() method to optionally use tar format
  • _detect_checkpoint_format() static method intelligently determines checkpoint type (checks tar first since old checkpoints are both)
  • Both load() and from_checkpoint() handle zip and tar formats transparently
  • Enhanced docstrings with detailed parameter descriptions and usage examples

Confidence Score: 4/5

  • This PR is safe to merge with only minor style improvements recommended
  • The implementation correctly handles the migration from tar to zip format with proper backward compatibility. The code follows good practices including proper resource cleanup with try/finally blocks, format auto-detection, and comprehensive docstrings. Previous reviewers already identified the main style issues (catch-all exception handling, resource management patterns). No critical logic errors or security vulnerabilities were found. The confidence score of 4/5 reflects that while the implementation is solid, the style issues should be addressed.
  • No files require special attention - the style issues flagged in previous comments are minor

Important Files Changed

File Analysis

Filename Score Overview
physicsnemo/models/module.py 4/5 Refactored checkpoint format from tar to zip by default with backward compatibility for tar format. Implementation is sound but has minor style issues already flagged.

Sequence Diagram

sequenceDiagram
    participant User
    participant Module
    participant ZipFile
    participant TarFile
    participant FileSystem as fs

    alt Save (zip format - default)
        User->>Module: save(file_name)
        Module->>Module: Prepare metadata & args
        Module->>ZipFile: Create temp zip file
        Module->>ZipFile: Write state_dict to BytesIO
        ZipFile->>ZipFile: writestr("model.pt")
        ZipFile->>ZipFile: writestr("args.json")
        ZipFile->>ZipFile: writestr("metadata.json")
        Module->>FileSystem: put(tmp_path, file_name)
        Module->>Module: cleanup temp file
    end

    alt Save (legacy tar format)
        User->>Module: save(file_name, legacy_format=True)
        Module->>Module: Prepare metadata & args
        Module->>TarFile: Create temp directory
        TarFile->>TarFile: Write model.pt to disk
        TarFile->>TarFile: Write args.json to disk
        TarFile->>TarFile: Write metadata.json to disk
        TarFile->>TarFile: Create tar from files
        Module->>FileSystem: put(tar_path, file_name)
    end

    alt Load/from_checkpoint (zip format)
        User->>Module: load(file_name) or from_checkpoint(file_name)
        Module->>FileSystem: _download_cached(file_name)
        Module->>Module: _detect_checkpoint_format()
        Module->>ZipFile: Open zip archive
        ZipFile->>ZipFile: Validate expected files present
        ZipFile->>Module: Read args.json & metadata.json
        Module->>Module: _from_checkpoint_process() [if from_checkpoint]
        ZipFile->>Module: torch.load(model.pt)
        Module->>Module: load_state_dict()
    end

    alt Load/from_checkpoint (tar format - backward compat)
        User->>Module: load(file_name) or from_checkpoint(file_name)
        Module->>FileSystem: _download_cached(file_name)
        Module->>Module: _detect_checkpoint_format()
        Module->>TarFile: Extract to temp directory
        TarFile->>Module: Read files from disk
        Module->>Module: _from_checkpoint_process() [if from_checkpoint]
        Module->>Module: torch.load(model.pt)
        Module->>Module: load_state_dict()
    end
Loading

1 file reviewed, no comments

Edit Code Review Agent Settings | Greptile

@CharlelieLrt
Copy link
Collaborator Author

/blossom-ci

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Greptile Overview

Greptile Summary

Refactored checkpoint saving/loading from tar to zip format to eliminate double I/O overhead and enable direct file access without extraction. The implementation maintains backward compatibility by detecting and handling both tar and zip formats during load operations.

Key Changes:

  • save() now writes directly to zip archives using in-memory buffers, avoiding temporary directory creation
  • Added legacy_format parameter to save() for optional tar-based checkpoints
  • load() and from_checkpoint() auto-detect format and handle both zip (direct access) and tar (extraction-based) checkpoints
  • New _detect_checkpoint_format() helper method for format detection

Issues Found:

  • Resource leak risk in save() - temp file created without immediate try block protection (line 540-541)
  • Performance issue in load() and from_checkpoint() - zip archives remain open during potentially slow torch.load operations

Confidence Score: 3/5

  • This PR is moderately safe to merge with some resource management concerns
  • The refactoring from tar to zip format is well-implemented with proper backward compatibility, but contains a critical resource leak risk in the save() method where temporary files could be leaked if exceptions occur during zip file creation. Additionally, zip archives are held open during torch.load operations which could cause performance issues with large models. The backward compatibility logic appears sound and the format detection is properly ordered.
  • physicsnemo/models/module.py - Pay attention to the temporary file handling in save() method (lines 540-563) and the resource management in load/from_checkpoint methods

Important Files Changed

File Analysis

Filename Score Overview
physicsnemo/models/module.py 3/5 Refactored checkpoint format from tar to zip with backward compatibility. Found resource management issue in save() (temp file leak risk) and performance issues in load/from_checkpoint (zip file held open during torch.load)

Sequence Diagram

sequenceDiagram
    participant User
    participant Module
    participant ZipFile
    participant TarFile
    participant FileSystem
    participant TempFile
    
    Note over User,TempFile: Save Operation (New Zip Format)
    User->>Module: save(file_name, legacy_format=False)
    Module->>TempFile: Create NamedTemporaryFile(.zip)
    Module->>ZipFile: Open zip archive (tmp_path)
    Module->>ZipFile: writestr("model.pt", state_dict)
    Module->>ZipFile: writestr("args.json", args)
    Module->>ZipFile: writestr("metadata.json", metadata)
    ZipFile-->>Module: Close archive
    Module->>FileSystem: put(tmp_path, file_name)
    Module->>TempFile: Remove temp file
    Module-->>User: Checkpoint saved
    
    Note over User,TempFile: Load Operation (Zip Format)
    User->>Module: load(file_name)
    Module->>FileSystem: Download and cache checkpoint
    Module->>Module: _detect_checkpoint_format()
    alt Zip Format
        Module->>ZipFile: Open cached_file_name
        Module->>ZipFile: Check expected files exist
        Module->>ZipFile: open("model.pt")
        Module->>Module: torch.load(from zip)
        ZipFile-->>Module: Close archive
    else Tar Format (Backward Compatibility)
        Module->>TempFile: Create TemporaryDirectory
        Module->>TarFile: Open and extract to temp_dir
        Module->>Module: torch.load(from temp_dir/model.pt)
        Module->>TempFile: Cleanup temp directory
    end
    Module->>Module: load_state_dict()
    Module-->>User: Weights loaded
    
    Note over User,TempFile: From Checkpoint (Zip Format)
    User->>Module: from_checkpoint(file_name)
    Module->>FileSystem: Download and cache checkpoint
    Module->>Module: _detect_checkpoint_format()
    Module->>ZipFile: Open cached_file_name
    Module->>ZipFile: open("args.json")
    Module->>ZipFile: open("metadata.json")
    Module->>Module: Instantiate model from args
    Module->>ZipFile: open("model.pt")
    Module->>Module: torch.load(from zip)
    ZipFile-->>Module: Close archive
    Module->>Module: load_state_dict()
    Module-->>User: Return instantiated model
Loading

1 file reviewed, 3 comments

Edit Code Review Agent Settings | Greptile

@CharlelieLrt
Copy link
Collaborator Author

/blossom-ci

@CharlelieLrt CharlelieLrt removed the ! - Release PRs or Issues releating to a release label Oct 31, 2025
@coreyjadams
Copy link
Collaborator

/blossom-ci

Copy link
Collaborator

@coreyjadams coreyjadams left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everything seems reasonable to me.

@CharlelieLrt
Copy link
Collaborator Author

/blossom-ci

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Greptile Overview

Greptile Summary

This PR refactors checkpoint format from tar to zip for improved I/O performance and eliminates double-writing overhead. The implementation successfully adds backward compatibility for reading legacy tar checkpoints.

Key Changes

  • save() now uses zip format by default, writes directly to archive without temporary directories
  • load() and from_checkpoint() detect format automatically and handle both zip and tar
  • Improved resource management: model data loaded into memory before closing zip archives
  • Enhanced docstrings with clear examples and parameter descriptions

Issues Found

  • Critical bug in save() method: Variable tmp_path referenced in finally block can be undefined if exception occurs during temp file creation, causing NameError (line 540-563)

Notes

  • The comment on line 608-610 states "older checkpoints will be both zip and tar archives" which seems unclear - tar files are typically not also zip files
  • Exception handling in _detect_checkpoint_format() appropriately re-raises with context preservation

Confidence Score: 2/5

  • Critical bug in exception handling makes this unsafe to merge without fix
  • The save() method has a logic error where tmp_path can be undefined in the finally block, causing runtime NameError. This will break checkpoint saving if any exception occurs during temp file creation. The rest of the implementation is solid with good backward compatibility and improved resource management.
  • physicsnemo/models/module.py lines 540-563 - fix undefined variable in exception handler

Important Files Changed

File Analysis

Filename Score Overview
physicsnemo/models/module.py 2/5 Refactored checkpoint format from tar to zip with backward compatibility. Critical bug: undefined variable in exception handling (line 540-563). Good resource management improvements in load methods.

Sequence Diagram

sequenceDiagram
    participant User
    participant Module
    participant ZipFile
    participant FileSystem
    participant TempFile
    
    Note over User,FileSystem: Save Operation (Zip Format)
    User->>Module: save(file_name)
    Module->>TempFile: Create NamedTemporaryFile
    TempFile-->>Module: tmp_path
    Module->>ZipFile: Create archive at tmp_path
    Module->>Module: torch.save(state_dict) to BytesIO
    Module->>ZipFile: writestr("model.pt", buffer)
    Module->>ZipFile: writestr("args.json", json)
    Module->>ZipFile: writestr("metadata.json", json)
    Module->>ZipFile: Close archive
    Module->>FileSystem: put(tmp_path, file_name)
    Module->>TempFile: Remove temporary file
    
    Note over User,FileSystem: Load Operation (Zip Format)
    User->>Module: load(file_name)
    Module->>FileSystem: Download and cache file
    FileSystem-->>Module: cached_file_name
    Module->>Module: _detect_checkpoint_format()
    Module->>ZipFile: Open archive (read mode)
    Module->>ZipFile: Validate expected files present
    Module->>ZipFile: read("model.pt") into memory
    Module->>ZipFile: Close archive
    Module->>Module: torch.load(BytesIO(model_bytes))
    Module->>Module: Load state_dict into model
    
    Note over User,FileSystem: from_checkpoint Operation (Zip Format)
    User->>Module: from_checkpoint(file_name)
    Module->>FileSystem: Download and cache file
    FileSystem-->>Module: cached_file_name
    Module->>Module: _detect_checkpoint_format()
    Module->>ZipFile: Open archive (read mode)
    Module->>ZipFile: Validate expected files present
    Module->>ZipFile: open("args.json") and read
    Module->>ZipFile: open("metadata.json") and read
    Module->>Module: Instantiate model from args
    Module->>ZipFile: read("model.pt") into memory
    Module->>ZipFile: Close archive
    Module->>Module: torch.load(BytesIO(model_bytes))
    Module->>Module: Load state_dict into model
    Module-->>User: Return model instance
Loading

1 file reviewed, 1 comment

Edit Code Review Agent Settings | Greptile

@CharlelieLrt
Copy link
Collaborator Author

/blossom-ci

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Greptile Overview

Greptile Summary

This PR refactors the checkpoint saving/loading mechanism from tar to zip format to improve I/O performance and eliminate redundant disk writes.

Key Changes:

  • Module.save() now uses zip format by default (with legacy_format flag for backward compatibility)
  • Added _detect_checkpoint_format() method to auto-detect checkpoint format
  • Module.load() and Module.from_checkpoint() support both zip and tar formats
  • Zip format writes data directly to archive in memory (via BytesIO), eliminating temporary directories
  • Enhanced docstrings with examples and clearer parameter descriptions

Benefits:

  • Removes double-write overhead (temp dir + tar archive)
  • Faster metadata access (zip stores directory at end of file)
  • Maintains full backward compatibility with existing tar checkpoints
  • Aligns with industry standards (Python wheels, PyTorch .pth files use zip)

Testing: Existing test suite should cover both formats through the auto-detection mechanism.

Confidence Score: 4/5

  • This PR is safe to merge with the existing style suggestions addressed
  • The implementation is technically sound with proper backward compatibility. Previous reviewers identified several style improvements around resource management and error handling that should be addressed, but no critical logical bugs. The core refactoring from tar to zip is well-designed and follows best practices.
  • No files require special attention beyond addressing the existing style comments

Important Files Changed

File Analysis

Filename Score Overview
physicsnemo/models/module.py 4/5 Refactors checkpoint format from tar to zip, improving I/O performance. Previous comments addressed major issues with temp file handling and resource management. Implementation is sound with backward compatibility for tar format.

Sequence Diagram

sequenceDiagram
    participant User
    participant Module
    participant ZipFile
    participant TempFile
    participant FileSystem

    Note over User,FileSystem: Save Operation (New Zip Format)
    User->>Module: save(file_name)
    Module->>Module: Prepare metadata & args
    Module->>TempFile: Create temp .zip file
    Module->>ZipFile: Open zip archive (write mode)
    Module->>ZipFile: Write model.pt (state_dict in memory)
    Module->>ZipFile: Write args.json
    Module->>ZipFile: Write metadata.json
    ZipFile-->>Module: Close archive
    Module->>FileSystem: Upload temp file to destination
    Module->>TempFile: Clean up temp file

    Note over User,FileSystem: Load Operation (Auto-detect Format)
    User->>Module: load(file_name) or from_checkpoint(file_name)
    Module->>FileSystem: Download & cache checkpoint
    Module->>Module: _detect_checkpoint_format()
    alt Zip Format (New)
        Module->>ZipFile: Open zip archive (read mode)
        Module->>ZipFile: Validate expected files exist
        Module->>ZipFile: Read model.pt into memory
        ZipFile-->>Module: Close archive
        Module->>Module: torch.load from BytesIO
    else Tar Format (Legacy)
        Module->>TempFile: Create temp directory
        Module->>TarFile: Extract to temp directory
        Module->>Module: Load from extracted files
        Module->>TempFile: Clean up temp directory
    end
    Module->>User: Return loaded model
Loading

1 file reviewed, no comments

Edit Code Review Agent Settings | Greptile

@CharlelieLrt CharlelieLrt merged commit 04d5fe9 into NVIDIA:main Oct 31, 2025
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

🚀[FEA]: Checkpointing. Use zip instead of tar.

3 participants