# Nvidia Resiliency Extension

This project combines multiple resiliency-related solutions:
- Fault Tolerance package
- Straggler Detection package
- PyTorch Lightning callbacks


## Installation

### From sources
- `git clone --recursive <this repo URL>`
- `cd <repo>`
- `pip install .`

Requirements:
- Python >= 3.10
- gcc >= 8.0
- CUDA >= 11.8
## Fault Tolerance integration guide

This section describes Fault Tolerance callback integration with a PTL-based workload (e.g., NeMo).

Let's define some terms used in this section:
- `PTL` is PyTorch Lightning.
- `Fault Tolerance`, `FT` is the `fault_tolerance` package, included in `nvidia_resiliency_ext`.
- `FT callback`, `FaultToleranceCallback` is a PTL callback defined in the `ptl_resiliency` package, included in `nvidia_resiliency_ext`.
- `ft_launcher` is a launcher tool included in FT, which is based on `torchrun`.
- `heartbeat` is a lightweight message sent from a rank to its rank monitor, indicating that the rank is alive.
- `rank monitor` is a special side process started by `ft_launcher` that monitors heartbeats from its rank.
- `timeouts` are time intervals used by a rank monitor to detect that a rank is not alive.
  There are two separate timeouts: one for the initial heartbeat and one for subsequent heartbeats.
- `launcher script` is a bash script that invokes `ft_launcher`.

### 0. Use `ft_launcher` to start the workload

`ft_launcher` is similar to `torchrun`, but it starts a rank monitor for each rank it launches.
`ft_launcher` takes the FT configuration in a YAML file (`--fault-tol-cfg-path`) or via CLI args (`--ft-param-...`).
FT configuration items are described in the `FaultToleranceConfig` docstring.

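With those flags, an invocation might look like the following sketch. The script name `train.py`, the config file `ft_cfg.yaml`, and the torchrun-style `--nproc-per-node` value are illustrative assumptions, not values taken from this project:

```
ft_launcher --nproc-per-node=8 --fault-tol-cfg-path=ft_cfg.yaml train.py
```
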
### 1. Add FT callback to the trainer

Add the FT callback to the PTL trainer callbacks.

```
import pytorch_lightning as pl

from nvidia_resiliency_ext.ptl_resiliency import FaultToleranceCallback

fault_tol_cb = FaultToleranceCallback(
    autoresume=True,
    calculate_timeouts=True,
    logger_name="test_logger",
    exp_dir=tmp_path,
)

trainer = pl.Trainer(
    ...
    callbacks=[..., fault_tol_cb],
)
```

Core FT callback functionality is:
- Establishing a connection with a rank monitor
- Sending heartbeats during training and evaluation steps
- Disconnecting from a rank monitor

Optionally, it can also:
- Compute timeouts that will be used instead of the timeouts defined in the FT config
- Create a flag file when the training is completed

FT callback initialization params:
```
def __init__(
    self,
    autoresume: bool,
    calculate_timeouts: bool,
    simulated_fault_params: Optional[Any] = None,
    exp_dir: Union[str, pathlib.Path, None] = None,
    logger_name: Optional[str] = "nemo_logger.FaultToleranceCallback",
):
    """
    Initialize callback instance.

    This is a lightweight initialization. Most of the initialization is conducted in the 'setup' hook.

    Args:
        autoresume (bool): Set to `True` if the FT auto-resume feature is used (e.g., there are multiple training jobs to be run).
        calculate_timeouts (bool): Set to `True` if FT timeouts should be calculated based on observed heartbeat intervals.
            Calculated timeouts overwrite the timeouts from the FT config.
            Timeouts are computed at the end of a training job, if there was checkpoint loading and saving.
            For example, for training started from scratch, the timeouts are computed at the end of the second job.
        simulated_fault_params (Optional[Any], optional): Simulated fault spec. It's for debugging only. Defaults to None.
        exp_dir (Union[str, pathlib.Path, None], optional): Directory where the FT state should be saved.
            Must be available for all training jobs. NOTE: Beware that PTL/NeMo can move files written directly to `trainer.log_dir`.
            Defaults to None, in which case it defaults to `trainer.log_dir/ft_state/`.
        logger_name (Optional[str], optional): Logger name to be used.
            Defaults to "nemo_logger.FaultToleranceCallback".
    """
```

### 2. Implementing auto-resume

Auto-resume is a feature that simplifies running training that consists of multiple subsequent training jobs.

NOTE: Auto-resume is not part of the FT package. It is implemented entirely in the launcher script and the `FaultToleranceCallback`.

`FaultToleranceCallback` exposes an "interface" that allows implementing an auto-resume launcher script.
Specifically, if `autoresume=True`, the FT callback creates a special marker file when the training is completed.
The marker file location is expected to be set in the `FAULT_TOL_FINISHED_FLAG_FILE` environment variable.

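The flag-file contract can be illustrated with a short Python sketch. The helper names below are hypothetical; marker-file creation is handled inside `FaultToleranceCallback`, and the existence check is normally done by the launcher script:

```python
import os
import pathlib


def mark_training_finished() -> pathlib.Path:
    """Create the marker file, as the FT callback does when training completes
    (hypothetical helper; the real callback handles this internally)."""
    flag_path = pathlib.Path(os.environ["FAULT_TOL_FINISHED_FLAG_FILE"])
    flag_path.touch()
    return flag_path


def training_finished() -> bool:
    """Check for the marker file, as an auto-resume launcher script would."""
    return pathlib.Path(os.environ["FAULT_TOL_FINISHED_FLAG_FILE"]).exists()
```

Both sides must observe the same `FAULT_TOL_FINISHED_FLAG_FILE` value, which is why the variable should be set before the launcher script starts the ranks.
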
The following mechanism can be used to implement an auto-resuming launcher script:
- The launcher script starts the ranks with `ft_launcher`.
- `FAULT_TOL_FINISHED_FLAG_FILE` should be passed to the rank processes.
- When `ft_launcher` exits, the launcher script checks whether the `FAULT_TOL_FINISHED_FLAG_FILE` file was created:
  - If `FAULT_TOL_FINISHED_FLAG_FILE` exists, the auto-resume loop can be broken, as the training is completed.
  - If `FAULT_TOL_FINISHED_FLAG_FILE` does not exist, a continuation job can be issued
    (other conditions can be checked, e.g., whether the maximum number of failures has been reached).

## Straggler Detection integration guide

### Include `ptl_resiliency.StragglerDetectionCallback` in the PTL trainer callbacks

```
import pytorch_lightning as pl

from nvidia_resiliency_ext.ptl_resiliency import StragglerDetectionCallback

straggler_cb_args = dict(
    report_time_interval=300.0,
    calc_relative_gpu_perf=True,
    calc_individual_gpu_perf=True,
    num_gpu_perf_scores_to_log=3,
    gpu_relative_perf_threshold=0.7,
    gpu_individual_perf_threshold=0.7,
    stop_if_detected=False,
    logger_name="test_logger",
)

straggler_det_cb = StragglerDetectionCallback(**straggler_cb_args)

trainer = pl.Trainer(
    ...
    callbacks=[..., straggler_det_cb],
)
```

`StragglerDetectionCallback` initialization params:

```
def __init__(
    self,
    report_time_interval: float,
    calc_relative_gpu_perf: bool,
    calc_individual_gpu_perf: bool,
    num_gpu_perf_scores_to_log: int,
    gpu_relative_perf_threshold: float,
    gpu_individual_perf_threshold: float,
    stop_if_detected: bool,
    logger_name: Optional[str] = "nemo_logger.StragglerDetectionCallback",
):
    """
    Initialize straggler detection callback instance.

    Args:
        report_time_interval (float): Interval [seconds] of the straggler check.
        calc_relative_gpu_perf (bool): Calculate relative GPU performance.
        calc_individual_gpu_perf (bool): Calculate individual GPU performance.
        num_gpu_perf_scores_to_log (int): How many best and worst scores to log (0 = do not log periodically, only when stragglers are detected).
        gpu_relative_perf_threshold (float): Threshold for relative GPU performance scores.
        gpu_individual_perf_threshold (float): Threshold for individual GPU performance scores.
        stop_if_detected (bool): Set to `True` to terminate the workload if stragglers are detected.
        logger_name (Optional[str], optional): Defaults to "nemo_logger.StragglerDetectionCallback".

    Raises:
        ValueError: If an invalid config was provided.
    """
```

More info on straggler detection can be found in the straggler package's README.