A memory-first, lightning-fast, ready-to-use ML checkpointing library.
Adapters for PyTorch DCP, Megatron-LM and NeMo 2.0 are readily available for seamless integration. They are built on top of the core checkpointing APIs, which can also be used directly for custom integrations.
If you are interested in a native integration with another framework, please let us know by creating a feature request or upvoting an existing one!
To learn more about using the library and its performance, check out the user documentation. Development instructions for contributors follow below.
This library defines core dependencies, as well as additional optional dependencies for specific adapters, to avoid polluting consumers with unnecessary dependencies.
See the adapter installation commands below for examples of the available options, and `pyproject.toml` as the source of truth for all available adapters.
```shell
pip install -e .
```

To avoid building C++ tests (and pulling in test dependencies), such as when installing for production use:

```shell
pip install -e . --config-settings=cmake.define.BUILD_TESTING=OFF
```

NOTE: Currently the C++ binaries are expected to be in the package alongside the code, so editable mode (`-e`) is used.
```shell
# PyTorch
pip install -e .[pytorch]

# Megatron-LM
pip install -e .[megatron]

# Multiple
pip install -e .[pytorch,megatron]
```

Ensure you have the correct Python version. As of this writing, the project uses Python 3.10, due to limitations in NeMo's dependencies.
To confirm, check which versions come up when tab-completing `python` in your shell.
You can install pyenv to manage multiple Python versions (https://github.com/pyenv/pyenv?tab=readme-ov-file#installation), and then install the desired version with it, e.g. `pyenv install 3.10`.

NOTE: If you already have a `.venv` for a different Python version in this repository, run `rm -rf .venv` first.
To set up the development environment, run (at the project root):

```shell
# Create and activate a virtual environment (only needed once, but safe to rerun)
python3.10 -m venv .venv
source .venv/bin/activate

# Install the package in editable mode with development dependencies
pip install -e .[dev]
```

All code changes must be accompanied by comprehensive unit tests, and by integration tests where feasible. With AI coding tools, there is no good reason to cut corners or omit tests: you can prompt your coding tool to "create a comprehensive test plan for X, covering edge cases and corner cases" and then review the result.
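As a sketch of what "covering edge cases and corner cases" can look like in practice, the example below enumerates boundary inputs with `pytest.mark.parametrize`. The `chunk` helper is purely hypothetical (not part of this library's API) and is defined inline so the example is self-contained; in a real test it would be imported from the package.

```python
import pytest


# Hypothetical subject under test, defined inline for illustration only.
def chunk(data: bytes, size: int) -> list[bytes]:
    """Split `data` into consecutive chunks of at most `size` bytes."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [data[i:i + size] for i in range(0, len(data), size)]


@pytest.mark.parametrize(
    ("data", "size", "expected"),
    [
        (b"", 4, []),                    # edge case: empty input
        (b"abc", 4, [b"abc"]),           # input smaller than one chunk
        (b"abcd", 4, [b"abcd"]),         # input exactly one chunk
        (b"abcde", 4, [b"abcd", b"e"]),  # trailing remainder chunk
        (b"ab", 1, [b"a", b"b"]),        # minimal chunk size
    ],
)
def test_chunk(data, size, expected):
    assert chunk(data, size) == expected


def test_chunk_rejects_non_positive_size():
    with pytest.raises(ValueError):
        chunk(b"abc", 0)
```

Parametrized tests keep each edge case visible as a single row, which also makes the review of a generated test plan easier.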
- Build C++ Components: The C++ components are built automatically when you run one of the `pip install` commands above.

- Python Format: To apply automated fixes, run (with caution):

  ```shell
  ruff check --fix .
  ruff format .
  ```

  NOTE: This may also modify lines that do not violate the lint rules, so use it cautiously!

- Python Lint: To check for code style violations, run:

  ```shell
  ruff check .
  ```

- C++ Format: To apply automated fixes, run:

  ```shell
  # Install clang-format-18
  sudo apt-get update && sudo apt-get install -y clang-format-18

  # Format all C++ files
  find src -name '*.cpp' -o -name '*.h' | xargs clang-format-18 -i
  ```

- C++ Lint: To check for style violations, run:

  ```shell
  find src -name '*.cpp' -o -name '*.h' | xargs clang-format-18 --dry-run --Werror
  ```

- GitHub Actions Lint: To pin action versions, first install `ratchet`; one way is via `go install`:

  ```shell
  go install github.com/sethvargo/ratchet@latest
  ```

  Then run it on the workflow YAML file:

  ```shell
  ratchet pin .github/workflows/build-and-test.yml
  # Or, if installed to a location not on your PATH, something like:
  ~/go/bin/ratchet pin ./.github/workflows/build-and-test.yml
  ```

- Test: To run all tests (Python and C++), run:

  ```shell
  pytest
  ```

  - Python tests should be in the `tests` directory, in a package matching the subject under test, and test files should start with `test_`.
  - C++ tests should be in a `test` directory next to the subject under test (so within the `src` directory), and test files should end with `_test.cpp`.
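As an illustration of the Python test layout convention, a test file mirroring a hypothetical subject under test might look like the following. The module path in the comment and the `ReplicationManager` class are invented for this sketch (the class is defined inline so the example is self-contained, standing in for an import from the package):

```python
# File: tests/replication/test_manager.py
# (hypothetical path: the `tests` package mirrors the subject under test,
# and the file name starts with `test_`)
import pytest


# Stand-in for the real subject under test, e.g.
# `from ml_flashpoint.replication.manager import ReplicationManager`.
class ReplicationManager:
    def __init__(self):
        self.replicas = []

    def add_replica(self, node: str) -> None:
        if node in self.replicas:
            raise ValueError(f"duplicate replica: {node}")
        self.replicas.append(node)


def test_add_replica_registers_node():
    manager = ReplicationManager()
    manager.add_replica("node-0")
    assert manager.replicas == ["node-0"]


def test_add_replica_rejects_duplicates():
    manager = ReplicationManager()
    manager.add_replica("node-0")
    with pytest.raises(ValueError):
        manager.add_replica("node-0")
```

With this layout, `pytest` discovers the file automatically, and the directory structure makes it obvious which package each test exercises.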
To calculate code coverage, run `./run_coverage.sh` from the project root.
It will activate the venv located at `.venv`, remove build files, re-install the project, and produce coverage reports.
This project uses conventional commits, and the commit message should complete the sentence: "This change will...".
Specifying a scope for commits is optional, but highly recommended.
Typically, the scope matches the package the change relates to, and can use `/` for sub-packages, e.g.:

```
chore(replication): add the ReplicationManager skeleton class
feat(adapter/nemo): implement the callback to trigger MLFlashpoint checkpoints
```
We use release tags of the form `vX.Y.Z` for production releases, following semver (currently in the 0.y.z "zerover" range).
Releases should be created as GitHub Releases, from the repository's Releases page.
The helper script `create_release.py` will generate release notes that can be added to the Release: run `./scripts/create_release.py` (add `-h` for help).
Requirements:

- These release tags MUST be immutable: they cannot be modified or deleted after they are created.
- These release tags MUST be created from an approved and merged commit, typically on the `main` branch. They MUST NOT be created from unapproved, unmerged commits, such as a feature branch or patchset. The commit used to create the release tag must always remain accessible and must not be temporary.
All user documentation is maintained in the `docs/` directory and is generated using mkdocs-material.
See the `.example-syntax.md` file for guidance on certain supported syntax.
When making changes, you can preview them locally via `mkdocs serve`.
Once changes are merged to `main`, they are automatically deployed to the documentation site at https://google.github.io/ml-flashpoint.