google/ml-flashpoint

Overview

A memory-first, lightning-fast, ready-to-use ML checkpointing library.

Adapters for PyTorch DCP, Megatron-LM and NeMo 2.0 are readily available for seamless integration. They are built on top of the core checkpointing APIs, which can also be used directly for custom integrations.

If you are interested in a native integration with another framework, please let us know by creating a feature request or upvoting an existing one!

To learn more about using the library and its performance, check out the user documentation. Development instructions for contributors follow below.

Installation

This library defines core dependencies, as well as additional optional dependencies for specific adapters, to avoid polluting consumers with unnecessary dependencies. See the adapters installation commands below for examples of the available options, and the pyproject.toml as the source of truth for all available adapters.

Core Library

pip install -e .

To avoid building the C++ tests (and pulling in test dependencies), for example in a production install:

pip install -e . --config-settings=cmake.define.BUILD_TESTING=OFF

NOTE: Currently C++ binaries are expected to be in the package alongside the code, so editable mode (-e) is used.

With Adapters

# PyTorch (the quotes keep shells such as zsh from treating the brackets as a glob)
pip install -e ".[pytorch]"

# Megatron-LM
pip install -e ".[megatron]"

# Multiple
pip install -e ".[pytorch,megatron]"

Development

Python version

Ensure you have the correct Python version. As of this writing, the project uses Python 3.10, due to limitations in NeMo's dependencies.

To confirm, see which versions of python come up when tab-completing python in your shell.
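The same check can be made from inside the interpreter. The snippet below is a minimal sketch: the helper name `is_supported_python` is hypothetical and not part of this project, and the 3.10 requirement is taken from the paragraph above.

```python
import sys

# Version stated in this README; adjust if the project's requirement changes.
REQUIRED = (3, 10)


def is_supported_python(version_info=sys.version_info, required=REQUIRED):
    """Return True when the major.minor version matches `required`."""
    return tuple(version_info[:2]) == tuple(required)


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:3])
    status = "OK" if is_supported_python() else "mismatch"
    print(f"Python {current}: {status} (expected {REQUIRED[0]}.{REQUIRED[1]})")
```

Running this with the interpreter you intend to use for the virtual environment shows whether it matches before you create `.venv`.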

You could install pyenv to manage different Python versions: https://github.com/pyenv/pyenv?tab=readme-ov-file#installation.

Then install the desired Python version with it, e.g. pyenv install 3.10.

Build and Installation

NOTE: If you already have a .venv in this repository that was created with a different Python version, run rm -rf .venv first.

To set up the development environment, run (at the project root):

# Create and activate a virtual environment (e.g., using venv) in your local env (only needed once, but is safe to rerun)
python3.10 -m venv .venv
source .venv/bin/activate

# Install the package in editable mode with development dependencies
pip install -e ".[dev]"

Linting and Testing

All code changes must be accompanied by comprehensive unit tests, and integration tests where feasible. With AI coding tools, there's no good reason to cut corners or omit tests. You can prompt your coding tool to "create a comprehensive test plan for X, covering edge cases and corner cases" that you can review.

  • Build C++ Components: The C++ components are built automatically when you run one of the pip install commands from above.

  • Python Format: To apply automated fixes, run the following. NOTE: This may also modify lines that do not violate the lint rules, so use with caution!

    ruff check --fix .
    ruff format .
  • Python Lint: To check for code style violations, run:

    ruff check .
  • C++ Format: To apply automated fixes, run:

    # install clang-format-18
    sudo apt-get update && sudo apt-get install -y clang-format-18
    
    # format all C++ files
    find src -name '*.cpp' -o -name '*.h' | xargs clang-format-18 -i
  • C++ Lint: To check for style violations, run:

    find src -name '*.cpp' -o -name '*.h' | xargs clang-format-18 --dry-run --Werror
  • GitHub Actions Lint: To pin action versions, first install ratchet. One way is via go install:

    go install github.com/sethvargo/ratchet@latest

    Then run it on the workflow yaml file:

    ratchet pin .github/workflows/build-and-test.yml
    # Or if installed to a specific location not in your path, something like:
    ~/go/bin/ratchet pin ./.github/workflows/build-and-test.yml
  • Test: To run all tests (Python and C++), run:

    pytest
    • Python tests should be in the tests directory, in a package matching the subject-under-test, and the test files should start with test_.
    • C++ tests should be in a test directory next to the subject-under-test (so within the src directory), and should end with _test.cpp.
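The file naming rules above can be expressed as a small check. This is an illustrative sketch only: `follows_test_naming` is a hypothetical helper, not something shipped with the project, and it looks only at the file name, not at which directory the file lives in.

```python
from pathlib import PurePosixPath


def follows_test_naming(path: str) -> bool:
    """Check a test file name against the naming rules described above."""
    name = PurePosixPath(path).name
    if name.endswith(".py"):
        # Python test files start with "test_".
        return name.startswith("test_")
    if name.endswith(".cpp"):
        # C++ test files end with "_test.cpp".
        return name.endswith("_test.cpp")
    # Anything else is not a recognized test file.
    return False
```

For example, with made-up paths, `follows_test_naming("tests/core/test_buffer.py")` and `follows_test_naming("src/core/test/buffer_test.cpp")` both return `True`, while `follows_test_naming("tests/core/buffer_tests.py")` returns `False`.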

Code Coverage

To calculate code coverage, run ./run_coverage.sh from the project root. It will activate the venv located at .venv, remove build files, re-install the project, and produce coverage reports.

Conventional Commits

This project uses conventional commits, and the commit message should complete the sentence: "This change will...". Specifying the scope for commits is optional, but highly recommended. Typically, the scope will match the package the change relates to, and can use / for sub-packages, e.g.:

chore(replication): add the ReplicationManager skeleton class

feat(adapter/nemo): implement the callback to trigger MLFlashpoint checkpoints
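To make the expected header shape concrete, here is a minimal sketch of a parser for the first line of a commit message. The helper `parse_commit_header` and its regular expression are assumptions for illustration: they are not part of this project, and they accept any lowercase type word rather than a fixed list of types.

```python
import re

# type, optional (scope) with "/" separating sub-packages, optional "!",
# then ": " and a non-empty description.
HEADER_RE = re.compile(
    r"^(?P<type>[a-z]+)"
    r"(?:\((?P<scope>[\w-]+(?:/[\w-]+)*)\))?"
    r"(?P<breaking>!)?"
    r": (?P<description>\S.*)$"
)


def parse_commit_header(header: str):
    """Return (type, scope, description), or None if the header does not match."""
    match = HEADER_RE.match(header)
    if match is None:
        return None
    return match.group("type"), match.group("scope"), match.group("description")
```

Applied to the second example above, this returns `("feat", "adapter/nemo", "implement the callback to trigger MLFlashpoint checkpoints")`. A header without a scope, such as `fix: handle empty input`, yields `None` for the scope.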

Releases

We use release tags of the form vX.Y.Z for production releases, following semver, starting with zerover.
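As a rough illustration of the tag format, the sketch below checks that a tag has the form vX.Y.Z with semver-style numeric parts (no leading zeros). `parse_release_tag` is a hypothetical helper, not part of this project, and it assumes that pre-release or build suffixes are not used, since the text above only mentions vX.Y.Z.

```python
import re

# "v" followed by three dot-separated non-negative integers without leading zeros.
TAG_RE = re.compile(r"v(0|[1-9]\d*)\.(0|[1-9]\d*)\.(0|[1-9]\d*)")


def parse_release_tag(tag: str):
    """Return (major, minor, patch) as ints, or None if the tag is malformed."""
    match = TAG_RE.fullmatch(tag)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())
```

For instance, `parse_release_tag("v0.1.0")` returns `(0, 1, 0)`, where a major version of 0 corresponds to the zerover phase, while `parse_release_tag("0.1.0")` and `parse_release_tag("v0.1")` both return `None`.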

Releases should be created as GitHub Releases, from the repository's Releases page.

The helper script create_release.py will generate release notes that can be added to the Release.

Command: ./scripts/create_release.py. Add -h for help.

Requirements:

  • These release tags MUST be immutable - they cannot be modified or deleted after they are created.
  • These release tags MUST be created from an approved and merged commit, typically from the main branch.
  • They MUST NOT be created from unapproved, unmerged commits, such as a feature branch or patchset. The commit used to create the release tag must always be accessible and not temporary.

User Documentation Site

User documentation is all maintained in the docs/ directory, and is generated using mkdocs-material. See the .example-syntax.md file for guidance on certain supported syntax.

When making changes, you can view them locally via mkdocs serve.

Once changes are merged to main, they are automatically deployed to the documentation site, available at https://google.github.io/ml-flashpoint.
