[WIP] A test case to show how to use DeviceMesh API to create the customized PG #17

Closed
31 commits
aadf67c
[WIP] A test case to show how to use DeviceMesh API to create the cus…
fegin Nov 26, 2024
4ad6f1b
docs: add sphinx documentation and add missing documentation (#18)
d4l3k Nov 27, 2024
63ee40c
manager: expand API to include errors, participant information and nu…
d4l3k Nov 28, 2024
03f5350
docs: add legal info + fix jinja2 security warning (#20)
d4l3k Dec 3, 2024
4e86676
process_group: wrapper updates and ErrorSwallowingProcessGroup (#21)
d4l3k Dec 4, 2024
63c82c1
process_group: added ManagedProcessGroup
d4l3k Dec 6, 2024
b626d17
lintrunner: added black,isort,rustfmt (#22)
d4l3k Dec 6, 2024
1d5464d
lintrunner: enable pyre (#23)
d4l3k Dec 6, 2024
ab66c7c
manager: added FIXED_WITH_SPARES mode (#24)
d4l3k Dec 6, 2024
ddbc3c9
manager: added E2E tests and support getting lighthouse and manager a…
d4l3k Dec 7, 2024
7b93da7
manager_integ_tests: added Python integration test with lighthouse (#27)
d4l3k Dec 7, 2024
9878980
manager_integ_tests: added recovery test (#28)
d4l3k Dec 8, 2024
e4c8e5a
pyre strict (#29)
d4l3k Dec 9, 2024
6ceee88
Update README.md (#31)
d4l3k Dec 10, 2024
fdcfe5d
Update README.md (#33)
d4l3k Dec 11, 2024
c0acce1
Update README.md (#34)
d4l3k Dec 11, 2024
a52d746
manager: add CPU timeouts on allreduce/future calls (#38)
d4l3k Dec 13, 2024
58a436d
manager_integ_tests: added multi rank recovery and sync tests (#40)
d4l3k Dec 14, 2024
8a22dc8
Update README with protobuf related installation (#42)
H-Huang Dec 16, 2024
78c5721
manager: rename step to start_step + small shutdown fix (#44)
d4l3k Dec 18, 2024
a484e4f
manager: rename start_step to start_quorum and move step changes to s…
d4l3k Dec 18, 2024
6d6e9a4
local_sgd: initial version of fault tolerant LocalSGD (#47)
d4l3k Dec 18, 2024
49d2aec
Add _test_pg helper (#45)
H-Huang Dec 18, 2024
8247faf
lighthouse, manager: support multiple quorum rooms (#48)
d4l3k Dec 19, 2024
2bb8873
[Chore] Couple of small chores (#49)
Jackmin801 Dec 19, 2024
551fc4e
.github: add nightly wheel builds (#50)
d4l3k Dec 19, 2024
168b363
add pypi badge to README (#51)
d4l3k Dec 19, 2024
a32f807
compile -> compile_protos (#53)
Jackmin801 Dec 20, 2024
f31d3b1
manager_integ_tests: added LocalSGD integration test (#55)
d4l3k Dec 20, 2024
5e2466a
[WIP] A test case to show how to use DeviceMesh API to create the cus…
fegin Nov 26, 2024
ff8ce53
Merge remote-tracking branch 'origin/chienchin/ft_init_device_mesh' i…
fegin Dec 23, 2024
54 changes: 54 additions & 0 deletions .github/workflows/docs.yaml
@@ -0,0 +1,54 @@
name: Docs

on:
  push:
    branches:
      - main
  pull_request:

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - name: Setup Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.10"
          architecture: x64
      - name: Checkout
        uses: actions/checkout@v4
      - name: Install Dependencies
        run: |
          set -eux

          sudo apt-get install -y protobuf-compiler

          pip install .[dev] -v

          pip install -r docs/requirements.txt
      - name: Build Sphinx Docs
        working-directory: docs
        run: |
          set -eux

          make html
      - name: Upload static files as artifact
        id: deployment
        uses: actions/upload-pages-artifact@v3
        with:
          path: docs/build/html/

  deploy:
    runs-on: ubuntu-latest
    needs: build
    if: ${{ github.ref == 'refs/heads/main' }}
    permissions:
      pages: write # to deploy to Pages
      id-token: write # to verify the deployment originates from an appropriate source
    environment:
      name: github-pages
      url: ${{ steps.deployment.outputs.page_url }}
    steps:
      - name: Deploy to GitHub Pages
        id: deployment
        uses: actions/deploy-pages@v4
18 changes: 13 additions & 5 deletions .github/workflows/lint.yaml
@@ -8,27 +8,35 @@ on:

jobs:
lint:
runs-on: ubuntu-20.04
runs-on: ubuntu-latest
steps:
- name: Setup Python
uses: actions/setup-python@v3
uses: actions/setup-python@v5
with:
python-version: "3.10"
architecture: x64
- name: Checkout
uses: actions/checkout@v3
uses: actions/checkout@v4
- name: Install Dependencies
run: |
set -eux

sudo apt-get install -y protobuf-compiler

pip install lintrunner lintrunner-adapters
lintrunner init

pip install .[dev] -v
- name: Run Python Lint
- name: Run lintrunner
run: |
set -eux

lintrunner --skip PYRE --force-color --all-files
- name: Run pyre
run: |
set -eux

black --check .
pyre check
- name: Run Rust Lint
run: |
set -eux
62 changes: 62 additions & 0 deletions .github/workflows/nightly.yaml
@@ -0,0 +1,62 @@
name: Nightly Push

on:
  # run every day at 11:15am UTC (GitHub Actions evaluates cron schedules in UTC)
  schedule:
    - cron: '15 11 * * *'

jobs:
  nightly:
    runs-on: ubuntu-latest
    steps:
      - name: Setup Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.10"
          architecture: x64
      - name: Checkout
        uses: actions/checkout@v4
      - name: Install Dependencies
        run: |
          set -eux

          pip install -U twine toml
      - name: Build Docker
        run: |
          set -eux

          docker build --progress=plain -t torchft-maturin .

      - name: Set Nightly Version
        run: |
          set -eux

          python scripts/patch_nightly_version.py

          cat Cargo.toml
          cat pyproject.toml

      - name: Build Wheels
        run: |
          set -eux

          VERSIONS=(
            "3.9"
            "3.10"
            "3.11"
            "3.12"
            "3.13"
          )

          for version in "${VERSIONS[@]}"; do
            docker run --rm -v "$(pwd)":/io -t torchft-maturin build --release --out dist --interpreter "$version"
          done

      - name: Twine Check
        run: twine check --strict dist/*

      - name: Upload to Pypi
        run: twine upload --skip-existing dist/*
        env:
          TWINE_USERNAME: __token__
          TWINE_PASSWORD: ${{ secrets.NIGHTLY_PYPI_TOKEN }}
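
The `Set Nightly Version` step in the workflow above runs `scripts/patch_nightly_version.py`, but the script itself is not part of this diff. The sketch below is only a guess at the general idea: derive a date-based version and rewrite the `version = "..."` line of a TOML file such as `Cargo.toml` or `pyproject.toml`. The function names `nightly_version` and `patch_version` and the `YYYY.M.D` scheme are assumptions made for illustration, not the script's confirmed behavior.

```python
import re
from datetime import date

# Matches the first line of the form: version = "..."
_VERSION_LINE = re.compile(r'^version\s*=\s*"[^"]*"', flags=re.MULTILINE)


def nightly_version(today: date) -> str:
    """Build a date-based version string (hypothetical scheme: YYYY.M.D).

    Month and day are not zero-padded, so each component has no leading zero.
    """
    return f"{today.year}.{today.month}.{today.day}"


def patch_version(toml_text: str, version: str) -> str:
    """Return toml_text with its first `version = "..."` line replaced."""
    patched, count = _VERSION_LINE.subn(
        lambda _match: f'version = "{version}"', toml_text, count=1
    )
    if count != 1:
        raise ValueError("no version line found")
    return patched


if __name__ == "__main__":
    cargo = '[package]\nname = "torchft"\nversion = "0.1.1"\nedition = "2021"\n'
    print(patch_version(cargo, nightly_version(date.today())))
```

A regex keeps the rest of the file byte-for-byte intact, which a parse-and-dump round trip through a TOML library would not guarantee. It only touches the first matching line, so a file with several `version` keys would need a more careful approach.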
1 change: 0 additions & 1 deletion .github/workflows/unittest.yaml
@@ -40,4 +40,3 @@ jobs:

pytest -v
cargo test -v

1 change: 1 addition & 0 deletions .gitignore
@@ -26,6 +26,7 @@ __pycache__
*.so
**/*.stderr
.pyre
dist/

# Torch
cifar/
69 changes: 69 additions & 0 deletions .lintrunner.toml
@@ -0,0 +1,69 @@
[[linter]]
code = 'BLACK-ISORT'
include_patterns = [
'*.py',
'**/*.py',
]
exclude_patterns = []
command = [
'python',
'-m',
'lintrunner_adapters',
'run',
'black_isort_linter',
'--fast',
'--',
'@{{PATHSFILE}}',
]
init_command = [
'python',
'-m',
'lintrunner_adapters',
'run',
'pip_init',
'--dry-run={{DRYRUN}}',
'black==24.10.0', # Use 24.x when ruff styles are updated
'isort==5.13.2',
]
is_formatter = true

[[linter]]
code = 'RUSTFMT'
include_patterns = [
'**/*.rs',
]
command = [
'python',
'-m',
'lintrunner_adapters',
'run',
'rustfmt_linter',
'--binary=rustfmt',
'--config-path=.rustfmt.toml',
'--',
'@{{PATHSFILE}}',
]

[[linter]]
code = 'PYRE'
include_patterns = [
'*.py',
'**/*.py',
'**/*.pyi',
]
command = [
'python3',
'tools/linter/adapters/pyre_linter.py',
'--',
'@{{PATHSFILE}}'
]
init_command = [
'python',
'-m',
'lintrunner_adapters',
'run',
'pip_init',
'--dry-run={{DRYRUN}}',
'pyre-check==0.9.23',
]
is_formatter = false
1 change: 1 addition & 0 deletions .pyre_configuration
@@ -1,4 +1,5 @@
{
"strict": true,
"site_package_search_strategy": "pep561",
"source_directories": [
{
1 change: 1 addition & 0 deletions .rustfmt.toml
@@ -0,0 +1 @@
edition = "2021"
8 changes: 8 additions & 0 deletions .watchmanconfig
@@ -0,0 +1,8 @@
{
"root_files": [
"torchft",
"*.py",
".pyre_configuration",
".watchmanconfig"
]
}
29 changes: 28 additions & 1 deletion CONTRIBUTING.md
@@ -60,15 +60,42 @@ We actively welcome your pull requests.
`torchft` enforces a fairly strict code format with tools such as cargo fmt and black.

```shell
scripts/lint.sh
pip install lintrunner lintrunner-adapters
lintrunner init
lintrunner -a
```

### Tests

We use `pytest` as our testing framework. To execute a specific test, use the following command:

```sh
pytest torchft/process_group_test.py -k test_device_mesh
```

To run the Rust tests run:

```sh
cargo test
```

To run the entire suite of tests:

```sh
$ scripts/test.sh
```

### Build Docs
To build the docs run:
```sh
pip install -r docs/requirements.txt
cd docs
make livehtml
```

The docs will be built in the `docs/build/html` directory and served at http://localhost:8000.
The page will be automatically re-built as long as the process is kept running.

## Contributor License Agreement ("CLA")

In order to accept your pull request, we need you to submit a CLA. You only need to do this once to work on any of
2 changes: 1 addition & 1 deletion Cargo.toml
@@ -1,6 +1,6 @@
 [package]
 name = "torchft"
-version = "0.1.0"
+version = "0.1.1"
 edition = "2021"

 [dependencies]
57 changes: 48 additions & 9 deletions README.md
@@ -1,10 +1,30 @@
# torchft -- per step fault tolerance for any PyTorch job

> ⚠️ WARNING: This is a prototype for PyTorch fault tolerance and may have bugs
> or breaking changes as this is actively under development. This is a public
> repo to encourage collaboration and contributions are welcome. There currently
> are no plans to make this a stable component of PyTorch Distributed and may be
> abandonded at any time if better approachs arise.
<p align="center">
<picture>
<source media="(prefers-color-scheme: dark)" srcset="https://github.com/user-attachments/assets/ab57f551-7a66-4e5f-a3e6-c4d033a5863d">
<img width="55%" src="https://github.com/user-attachments/assets/9cd7fef9-cfff-409f-a033-d53811f3a99c" alt="torchft">
</picture>
</p>

<h3 align="center">
Easy Per Step Fault Tolerance for PyTorch
</h3>

<p align="center">
| <a href="https://pytorch-labs.github.io/torchft/"><b>Documentation</b></a>
| <a href="https://github.com/pytorch-labs/torchft/blob/main/media/fault_tolerance_poster.pdf"><b>Poster</b></a>
| <a href="https://docs.google.com/document/d/1OZsOsz34gRDSxYXiKkj4WqcD9x0lP9TcsfBeu_SsOY4/edit"><b>Design Doc</b></a>
|
</p>
<p align="center">
<a href="https://pypi.org/project/torchft-nightly/"><img alt="PyPI - Version" src="https://img.shields.io/pypi/v/torchft-nightly"></a>
</p>

---

> ⚠️ WARNING: This is an alpha prototype for PyTorch fault tolerance and may have bugs
> or breaking changes as this is actively under development. We'd love to collaborate
> and contributions are welcome. Please reach out if you're interested in torchft
> or want to discuss fault tolerance in PyTorch.

This repository implements techniques for doing a per-step fault tolerance so
you can keep training if errors occur without interrupting the entire training
Expand All @@ -28,13 +48,32 @@ greatly improve efficiency by avoiding stop the world training on errors.

![](./media/torchft-overview.png)

## Installation
## Prerequisites

Before proceeding, ensure you have the following installed:

- Rust (with necessary dependencies)
- `protobuf-compiler` and the corresponding development package for Protobuf.

Before proceeding, ensure you have Rust installed on your system. Note that the Rust versions available in many conda environments may be outdated. To install the latest version of Rust, we recommend downloading it directly from the official website as shown in the below command:
Note that the Rust versions available in many conda environments may be outdated. To install the latest version of Rust, we recommend downloading it directly from the official website as shown in the below command:
```sh
$ curl --proto '=https' --tlsv1.2 https://sh.rustup.rs -sSf | sh
```

To install the required packages on a Debian-based system (such as Ubuntu) using apt, run:

```sh
sudo apt install protobuf-compiler libprotobuf-dev
```

or for a Red Hat-based system, run:

```sh
sudo dnf install protobuf-compiler protobuf-devel
```
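
The prerequisites above can be checked before running `pip install .`. The following is a minimal sketch, assuming the Rust toolchain exposes `rustc` and `cargo` on `PATH` and that the Protobuf compiler package installs a `protoc` binary; the helper name `missing_tools` is made up for this example. It does not check for the Protobuf development headers.

```python
import shutil
from typing import Callable, Iterable, List, Optional

# Executables assumed to be provided by the prerequisites listed above.
REQUIRED_TOOLS = ("rustc", "cargo", "protoc")


def missing_tools(
    tools: Iterable[str] = REQUIRED_TOOLS,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the tools that cannot be found on PATH, in the order given."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing prerequisites: " + ", ".join(missing))
    else:
        print("All prerequisite executables were found on PATH.")
```

The `which` parameter exists only so the lookup can be replaced in tests; in normal use the default `shutil.which` is enough.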

## Installation

```sh
$ pip install .
```
4 changes: 3 additions & 1 deletion build.rs
@@ -5,6 +5,8 @@
 // LICENSE file in the root directory of this source tree.

 fn main() -> Result<(), Box<dyn std::error::Error>> {
-    tonic_build::compile_protos("proto/torchft.proto")?;
+    tonic_build::configure()
+        .protoc_arg("--experimental_allow_proto3_optional")
+        .compile_protos(&["proto/torchft.proto"], &["proto"])?;
     Ok(())
 }
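
The `build.rs` change passes `--experimental_allow_proto3_optional` to `protoc`. That flag only matters for older compilers: as best understood here, proto3 `optional` fields were experimental starting with protoc 3.12 and enabled by default from 3.15, and both thresholds are worth checking against the protobuf release notes. The sketch below classifies a compiler by its version string. It assumes `protoc --version` prints a line such as `libprotoc 3.12.4`, and the function name `proto3_optional_support` is hypothetical.

```python
import re


def proto3_optional_support(version_output: str) -> str:
    """Classify `protoc --version` output by proto3 `optional` support.

    Returns "unsupported", "needs_flag", or "default".
    The 3.12 and 3.15 thresholds are assumptions stated in the text above.
    """
    match = re.search(r"libprotoc (\d+)\.(\d+)", version_output)
    if match is None:
        raise ValueError(f"unrecognized protoc version output: {version_output!r}")
    version = (int(match.group(1)), int(match.group(2)))
    if version < (3, 12):
        return "unsupported"
    if version < (3, 15):
        return "needs_flag"
    # Newer releases, including those numbered like 21.x or 25.x, land here.
    return "default"
```

To classify the locally installed compiler, the input can come from `subprocess.run(["protoc", "--version"], capture_output=True, text=True).stdout`.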