Commit d815452

Initial commit
0 parents  commit d815452

File tree

138 files changed

+25251
-0
lines changed

.gitignore

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
1+
build
2+
dist
3+
*.egg-info
4+
__pycache__
5+
cupti_module.*.so

.pre-commit-config.yaml

Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
1+
default_language_version:
2+
python: python3
3+
4+
repos:
5+
6+
- repo: https://github.com/PyCQA/isort
7+
rev: 5.13.2
8+
hooks:
9+
- id: isort
10+
exclude: docs/
11+
12+
- repo: https://github.com/psf/black-pre-commit-mirror
13+
rev: 24.10.0
14+
hooks:
15+
- id: black
16+
language_version: python3.10
17+
18+
- repo: https://github.com/astral-sh/ruff-pre-commit
19+
rev: v0.6.9
20+
hooks:
21+
- id: ruff

CONTRIBUTING.md

Lines changed: 124 additions & 0 deletions
@@ -0,0 +1,124 @@
1+
2+
## Nvidia Resiliency Extension (NVRx) OSS Contribution Rules
3+
4+
#### Issue Tracking
5+
6+
* All enhancement, bugfix, or change requests must begin with the creation of an [NVRx Issue Request](TBD).
7+
* The issue request must be reviewed by NVRx engineers and approved prior to code review.
8+
9+
10+
#### Coding Guidelines
11+
12+
- All source code contributions must follow the existing conventions in the relevant file, submodule, module, and project when you add new code or when you extend/fix existing functionality.
13+
14+
- Avoid introducing unnecessary complexity into existing code so that maintainability and readability are preserved.
15+
16+
- Try to keep pull requests (PRs) as concise as possible:
17+
- Avoid committing commented-out code.
18+
- Wherever possible, each PR should address a single concern. If there are several otherwise-unrelated things that should be fixed to reach a desired endpoint, our recommendation is to open several PRs and indicate the dependencies in the description. The more complex the changes are in a single PR, the more time it will take to review those changes.
19+
20+
- To ensure code consistency and maintainability across the project, please format and lint your code using the following tools before committing any changes:
21+
- We use black to automatically format Python code. It enforces a consistent style by reformatting code according to a set of rules.
22+
- To format your code, run:
23+
```
24+
black .
25+
```
26+
- isort is used to sort and format import statements automatically. Ensure that your imports are ordered correctly by running:
27+
```
28+
isort .
29+
```
30+
- ruff is a fast Python linter that helps catch common issues. Please run ruff to check for and fix linting problems:
31+
```
32+
ruff check .
33+
```
34+
35+
- Write commit titles using imperative mood and [these rules](https://chris.beams.io/posts/git-commit/), and reference the Issue number corresponding to the PR. Following is the recommended format for commit texts:
36+
```
37+
#<Issue Number> - <Commit Title>
38+
39+
<Commit Body>
40+
```
41+
42+
- Ensure that the build log is clean, meaning no warnings or errors should be present.
43+
44+
- Ensure that all unit tests pass prior to submitting your code.
45+
46+
- All OSS components must contain accompanying documentation (READMEs) describing the functionality, dependencies, and known issues.
47+
48+
- See `README.md` for existing samples and plugins for reference.
49+
50+
- All OSS components must have an accompanying test.
51+
52+
- If introducing a new component, such as a plugin, provide a test sample to verify the functionality.
53+
54+
- Make sure that you can contribute your work to open source (no license and/or patent conflict is introduced by your code). You will need to [`sign`](#signing-your-work) your commit.
55+
56+
- Thanks in advance for your patience as we review your contributions; we do appreciate them!
57+
58+
59+
#### Pull Requests
60+
Developer workflow for code contributions is as follows:
61+
62+
1. Developers must first [fork](https://help.github.com/en/articles/fork-a-repo) the [upstream](TBD) NVRx OSS repository.
63+
64+
2. Git clone the forked repository and push changes to the personal fork.
65+
66+
```bash
67+
git clone https://github.com/YOUR_USERNAME/YOUR_FORK.git NVRx
68+
# Checkout the targeted branch and commit changes
69+
# Push the commits to a branch on the fork (remote).
70+
git push -u origin <local-branch>:<remote-branch>
71+
```
72+
73+
3. Once the code changes are staged on the fork and ready for review, a [Pull Request](https://help.github.com/en/articles/about-pull-requests) (PR) can be [requested](https://help.github.com/en/articles/creating-a-pull-request) to merge the changes from a branch of the fork into a selected branch of upstream.
74+
* Exercise caution when selecting the source and target branches for the PR.
75+
Note that versioned releases of NVRx OSS are posted to `release/` branches of the upstream repo.
76+
* Creation of a PR kicks off the code review process.
77+
* At least one NVRx engineer will be assigned to the review.
78+
* While under review, mark your PRs as work-in-progress by prefixing the PR title with [WIP].
79+
80+
4. Since there is no CI/CD process in place yet, the PR will be accepted and the corresponding issue closed only after adequate testing has been completed, manually, by the developer and/or NVRx engineer reviewing the code.
81+
82+
83+
#### Signing Your Work
84+
85+
* We require that all contributors "sign-off" on their commits. This certifies that the contribution is your original work, or you have rights to submit it under the same license, or a compatible license.
86+
87+
* Any contribution which contains commits that are not Signed-Off will not be accepted.
88+
89+
* To sign off on a commit you simply use the `--signoff` (or `-s`) option when committing your changes:
90+
```bash
91+
$ git commit -s -m "Add cool feature."
92+
```
93+
This will append the following to your commit message:
94+
```
95+
Signed-off-by: Your Name <your@email.com>
96+
```
97+
98+
* Full text of the DCO:
99+
100+
```
101+
Developer Certificate of Origin
102+
Version 1.1
103+
104+
Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
105+
1 Letterman Drive
106+
Suite D4700
107+
San Francisco, CA, 94129
108+
109+
Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed.
110+
```
111+
112+
```
113+
Developer's Certificate of Origin 1.1
114+
115+
By making a contribution to this project, I certify that:
116+
117+
(a) The contribution was created in whole or in part by me and I have the right to submit it under the open source license indicated in the file; or
118+
119+
(b) The contribution is based upon previous work that, to the best of my knowledge, is covered under an appropriate open source license and I have the right under that license to submit that work with modifications, whether created in whole or in part by me, under the same open source license (unless I am permitted to submit under a different license), as indicated in the file; or
120+
121+
(c) The contribution was provided directly to me by some other person who certified (a), (b) or (c) and I have not modified it.
122+
123+
(d) I understand and agree that this project and the contribution are public and that a record of the contribution (including all personal information I submit with it, including my sign-off) is maintained indefinitely and may be redistributed consistent with this project or the open source license(s) involved.
124+
```

Dockerfile.builder

Lines changed: 43 additions & 0 deletions
@@ -0,0 +1,43 @@
1+
# The purpose of this image is to build "nvidia_resiliency_ext" wheels using different Python versions.
2+
# Python 3.10, 3.11 and 3.12 are installed.
3+
# Base image is CUDA, as Straggler Detection package uses CUPTI.
4+
# Wheel for Python3.10 can be created with "python3.10 -m build --wheel" etc.
5+
6+
# Choose a base CUDA image from NVIDIA
7+
# nvidia/cuda:11.8.0-cudnn8-devel-ubuntu20.04, nvidia/cuda:12.4.1-cudnn-devel-ubuntu20.04 etc.
8+
ARG BASE_CUDA_IMG=nvidia/cuda:11.8.0-cudnn8-devel-ubuntu20.04
9+
FROM ${BASE_CUDA_IMG}
10+
11+
# Set environment variables to non-interactive to avoid prompts during package installation
12+
ENV DEBIAN_FRONTEND=noninteractive
13+
14+
# Repo with Pythons
15+
RUN apt update && apt install -y software-properties-common && add-apt-repository ppa:deadsnakes/ppa
16+
17+
# Install common dependencies
18+
RUN apt-get update && apt-get install -y \
19+
python3.10 python3.10-dev python3.10-distutils \
20+
python3.11 python3.11-dev python3.11-distutils \
21+
python3.12 python3.12-dev python3.12-distutils \
22+
wget curl build-essential gcc-10 g++-10 \
23+
&& rm -rf /var/lib/apt/lists/*
24+
25+
# Install pip for each Python version
26+
RUN curl -sS https://bootstrap.pypa.io/get-pip.py | python3.10 && \
27+
curl -sS https://bootstrap.pypa.io/get-pip.py | python3.11 && \
28+
curl -sS https://bootstrap.pypa.io/get-pip.py | python3.12
29+
30+
# Install deps,
31+
# FIXME: for some reason six needs to be manually updated
32+
# otherwise wheel building fails with: ModuleNotFoundError: No module named 'six'
33+
RUN python3.10 -m pip install build poetry && \
34+
python3.11 -m pip install build poetry && \
35+
python3.12 -m pip install build poetry && \
36+
python3.10 -m pip install -U six && \
37+
python3.11 -m pip install -U six && \
38+
python3.12 -m pip install -U six
39+
40+
# Set the working directory
41+
WORKDIR /workspace
42+
43+
ENTRYPOINT ["/bin/bash", "-c"]

LICENSE.txt

Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
1+
# SPDX-FileCopyrightText: Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
2+
# SPDX-License-Identifier: Apache-2.0
3+
#
4+
# Licensed under the Apache License, Version 2.0 (the "License");
5+
# you may not use this file except in compliance with the License.
6+
# You may obtain a copy of the License at
7+
#
8+
# http://www.apache.org/licenses/LICENSE-2.0
9+
#
10+
# Unless required by applicable law or agreed to in writing, software
11+
# distributed under the License is distributed on an "AS IS" BASIS,
12+
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13+
# See the License for the specific language governing permissions and
14+
# limitations under the License.

README.md

Lines changed: 174 additions & 0 deletions
@@ -0,0 +1,174 @@
1+
# Nvidia Resiliency Extension
2+
3+
This project combines multiple resiliency-related solutions.
4+
- Fault Tolerance package
5+
- Straggler Detection package
6+
- PyTorch Lightning callbacks
7+
8+
9+
## Installation:
10+
11+
### From sources
12+
- `git clone --recursive <this repo URL>`
13+
- `cd <repo>`
14+
- `pip install .`
15+
16+
Requirements:
17+
- Python >= 3.10
18+
- gcc >= 8.0
19+
- CUDA >= 11.8
20+
21+
## Fault Tolerance integration guide
22+
23+
This section describes Fault Tolerance callback integration with a PTL-based workload (e.g. NeMo).
24+
25+
Let's define some terms used in this section:
26+
- `PTL` is PyTorch Lightning
27+
- `Fault Tolerance`, `FT` is the `fault_tolerance` package, included in `nvidia_resiliency_ext`.
28+
- `FT callback`, `FaultToleranceCallback` is a PTL callback defined in `ptl_resiliency` package, included in `nvidia_resiliency_ext`.
29+
- `ft_launcher` is a launcher tool, based on `torchrun`, that is included in FT.
30+
- `heartbeat` is a lightweight message sent from a rank to its rank monitor that indicates that a rank is alive.
31+
- `rank monitor` is a special side-process started by `ft_launcher` that monitors heartbeats from its rank.
32+
- `timeouts` are time intervals used by a rank monitor to detect that a rank is not alive.
33+
There are two separate timeouts: one for the initial heartbeat and one for subsequent heartbeats.
34+
- `launcher script` is a bash script that invokes `ft_launcher`.
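
To make the relationship between heartbeats and the two timeouts concrete, here is a minimal sketch of the check a rank monitor could perform. It is an illustration only, not the FT implementation: the class name `RankMonitorSketch`, its fields, and the explicit `now` timestamps are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RankMonitorSketch:
    """Hypothetical model of a rank monitor; not the real FT class."""

    initial_timeout: float  # seconds allowed before the first heartbeat
    subsequent_timeout: float  # seconds allowed between later heartbeats
    started_at: float = 0.0  # time at which monitoring began
    last_heartbeat_at: Optional[float] = None  # None until a heartbeat arrives

    def on_heartbeat(self, now: float) -> None:
        # Record the arrival time of the most recent heartbeat.
        self.last_heartbeat_at = now

    def is_rank_alive(self, now: float) -> bool:
        # Before the first heartbeat, the initial timeout applies.
        if self.last_heartbeat_at is None:
            return (now - self.started_at) <= self.initial_timeout
        # Afterwards, the gap since the last heartbeat is checked.
        return (now - self.last_heartbeat_at) <= self.subsequent_timeout
```

The initial timeout is typically the larger of the two in a sketch like this, because the first heartbeat is only sent after startup work has finished.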
35+
36+
### 0. Use `ft_launcher` to start the workload
37+
38+
`ft_launcher` is similar to `torchrun` but it starts a rank monitor for each started rank.
39+
`ft_launcher` takes the FT configuration in a YAML file (`--fault-tol-cfg-path`) or via CLI args (`--ft-param-...`).
40+
FT configuration items are described in `FaultToleranceConfig` docstring.
41+
42+
### 1. Add FT callback to the trainer
43+
44+
Add FT callback to PTL callbacks.
45+
46+
```
47+
fault_tol_cb = FaultToleranceCallback(
48+
autoresume=True,
49+
calculate_timeouts=True,
50+
logger_name="test_logger",
51+
exp_dir=tmp_path,
52+
)
53+
54+
trainer = pl.Trainer(
55+
...
56+
callbacks=[..., fault_tol_cb],
57+
)
58+
```
59+
60+
61+
Core FT callback functionality is:
62+
- Establishing a connection with a rank monitor
63+
- Sending heartbeats during training and evaluation steps
64+
- Disconnecting from a rank monitor
65+
66+
Optionally, it can also:
67+
- Compute timeouts that will be used instead of timeouts defined in the FT config
68+
- Create a flag file when the training is completed
69+
70+
FT callback initialization params:
71+
```
72+
def __init__(
73+
self,
74+
autoresume: bool,
75+
calculate_timeouts: bool,
76+
simulated_fault_params: Optional[Any] = None,
77+
exp_dir: Union[str, pathlib.Path, None] = None,
78+
logger_name: Optional[str] = "nemo_logger.FaultToleranceCallback",
79+
):
80+
"""
81+
Initialize callback instance.
82+
83+
This is a lightweight initialization. Most of the initialization is conducted in the 'setup' hook.
84+
85+
Args:
86+
autoresume (bool): Set to `True` if the FT auto-resume feature is used (e.g., there are multiple training jobs to be run).
87+
calculate_timeouts (bool): Set to `True` if FT timeouts should be calculated based on observed heartbeat intervals.
88+
Calculated timeouts overwrite the timeouts from the FT config.
89+
Timeouts are computed at the end of a training job, if there was checkpoint loading and saving.
90+
For example, for training started from scratch, the timeouts are computed at the end of the second job.
91+
simulated_fault_params (Optional[Any], optional): Simulated fault spec. It's for debugging only. Defaults to None.
92+
exp_dir (Union[str, pathlib.Path, None], optional): Directory where the FT state should be saved.
93+
Must be available for all training jobs. NOTE: Beware that PTL/NeMo can move files written directly to `trainer.log_dir`.
94+
Defaults to None, in which case it defaults to `trainer.log_dir/ft_state/`.
95+
logger_name (Optional[str], optional): Logger name to be used.
96+
Defaults to "nemo_logger.FaultToleranceCallback".
97+
"""
98+
```
99+
100+
### 2. Implementing auto-resume
101+
102+
Auto-resume is a feature that simplifies running a training consisting of multiple subsequent training jobs.
103+
104+
NOTE: Auto-resume is not a part of the FT package. It is entirely implemented in a launcher script and the `FaultToleranceCallback`.
105+
106+
`FaultToleranceCallback` exposes an "interface" that allows implementing an auto-resume launcher script.
107+
Specifically, if `autoresume=True` the FT callback creates a special marker file when a training is completed.
108+
The marker file location is expected to be set in the `FAULT_TOL_FINISHED_FLAG_FILE` environment variable.
109+
110+
The following mechanism can be used to implement an auto-resuming launcher script:
111+
- Launcher script starts ranks with `ft_launcher`
112+
- `FAULT_TOL_FINISHED_FLAG_FILE` should be passed to rank processes
113+
- When `ft_launcher` exits, the launcher script checks whether the `FAULT_TOL_FINISHED_FLAG_FILE` file was created.
114+
- If `FAULT_TOL_FINISHED_FLAG_FILE` exists, the auto-resume loop can be broken, as the training is completed.
115+
- If `FAULT_TOL_FINISHED_FLAG_FILE` does not exist, the continuation job can be issued
116+
(other conditions can be checked e.g. if the maximum number of failures is not reached).
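
The mechanism above can be sketched as a small loop. The launcher script is described here as a bash script; Python is used below purely for illustration. The helper names `subprocess_launcher` and `run_with_autoresume` and the `max_jobs` limit are assumptions of this sketch, not part of the package. Only the `FAULT_TOL_FINISHED_FLAG_FILE` variable comes from the text above.

```python
import os
import subprocess
from pathlib import Path
from typing import Callable, Dict, Sequence

FLAG_ENV_VAR = "FAULT_TOL_FINISHED_FLAG_FILE"


def subprocess_launcher(cmd: Sequence[str]) -> Callable[[Dict[str, str]], int]:
    """Wrap a command (e.g. one that invokes ft_launcher) as a launch callable."""

    def launch(env: Dict[str, str]) -> int:
        # Run one training job and report its exit code.
        return subprocess.run(list(cmd), env=env).returncode

    return launch


def run_with_autoresume(
    launch: Callable[[Dict[str, str]], int],
    flag_file: Path,
    max_jobs: int = 5,  # hypothetical cap on the number of jobs
) -> bool:
    """Return True if the flag file appeared within max_jobs launches."""
    # Pass the flag file location to the rank processes via the environment.
    env = dict(os.environ)
    env[FLAG_ENV_VAR] = str(flag_file)
    for job_idx in range(1, max_jobs + 1):
        exit_code = launch(env)
        if flag_file.exists():
            # The FT callback created the marker: training is completed.
            return True
        print(f"job {job_idx} exited with code {exit_code}; issuing a continuation job")
    return False
```

In real use, `launch` would be `subprocess_launcher(...)` built from whatever command line you normally use to invoke `ft_launcher`. Taking the launch step as a callable keeps the loop easy to test without starting a training job.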
117+
118+
## Straggler Detection integration guide
119+
120+
### Include `ptl_resiliency.StragglerDetectionCallback` in the PTL trainer callbacks
121+
122+
```
123+
straggler_cb_args = dict(
124+
report_time_interval=300.0,
125+
calc_relative_gpu_perf=True,
126+
calc_individual_gpu_perf=True,
127+
num_gpu_perf_scores_to_log=3,
128+
gpu_relative_perf_threshold=0.7,
129+
gpu_individual_perf_threshold=0.7,
130+
stop_if_detected=False,
131+
logger_name="test_logger",
132+
)
133+
134+
straggler_det_cb = StragglerDetectionCallback(**straggler_cb_args)
135+
136+
trainer = pl.Trainer(
137+
...
138+
callbacks=[..., straggler_det_cb],
139+
)
140+
```
141+
142+
`StragglerDetectionCallback` initialization params:
143+
144+
```
145+
def __init__(
146+
self,
147+
report_time_interval: float,
148+
calc_relative_gpu_perf: bool,
149+
calc_individual_gpu_perf: bool,
150+
num_gpu_perf_scores_to_log: int,
151+
gpu_relative_perf_threshold: float,
152+
gpu_individual_perf_threshold: float,
153+
stop_if_detected: bool,
154+
logger_name: Optional[str] = "nemo_logger.StragglerDetectionCallback",
155+
):
156+
"""
157+
Initialize straggler detection callback instance.
158+
159+
Args:
160+
report_time_interval (float): Interval [seconds] of the straggler check
161+
calc_relative_gpu_perf (bool): Calculate relative GPU performance
162+
calc_individual_gpu_perf (bool): Calculate individual GPU performance
163+
num_gpu_perf_scores_to_log (int): How many best and worst scores to log (0 - does not log periodically, but only if stragglers are detected)
164+
gpu_relative_perf_threshold (float): Threshold for relative GPU performance scores
165+
gpu_individual_perf_threshold (float): Threshold for individual GPU performance scores
166+
stop_if_detected (bool): Set to True to terminate the workload if stragglers are detected
167+
logger_name (Optional[str], optional): Defaults to "nemo_logger.StragglerDetectionCallback".
168+
169+
Raises:
170+
ValueError: If invalid config was provided.
171+
"""
172+
```
173+
174+
More info on straggler detection can be found in the straggler package's README.
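
As a rough illustration of how a threshold such as `gpu_relative_perf_threshold=0.7` might be applied, here is a minimal sketch. The scoring formula (fastest step time divided by each rank's step time) and the function names `relative_scores` and `find_stragglers` are assumptions for this example. The package's actual scoring is described in the straggler package's README and may differ.

```python
from typing import Dict, List


def relative_scores(step_times: Dict[int, float]) -> Dict[int, float]:
    """Assumed scoring: 1.0 for the fastest rank, lower for slower ranks."""
    fastest = min(step_times.values())
    return {rank: fastest / t for rank, t in step_times.items()}


def find_stragglers(scores: Dict[int, float], threshold: float) -> List[int]:
    """Return the ranks whose score falls below the threshold, sorted by rank."""
    return sorted(rank for rank, score in scores.items() if score < threshold)


# Hypothetical per-rank step times in seconds.
scores = relative_scores({0: 1.0, 1: 1.1, 2: 2.0})
stragglers = find_stragglers(scores, threshold=0.7)
```

With these made-up timings, rank 2 takes twice as long as the fastest rank, so its score of 0.5 falls below the 0.7 threshold.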
