|
13 | 13 | See the License for the specific language governing permissions and |
14 | 14 | limitations under the License. |
15 | 15 | --> |
| 16 | +# ML Goodput Measurement |
16 | 17 |
|
17 | | - # Overview |
| 18 | +## Overview |
18 | 19 |
|
19 | | - Cloud TPU Goodput is a library intended to be used with Cloud TPU to log the |
| 20 | + ML Goodput Measurement is a library intended to be used with Cloud TPU to log the |
20 | 21 | necessary information and query a job's Goodput. It can be pip installed to |
21 | 22 | import its modules, and retrieve information about a training job's overall |
22 | 23 | productive Goodput. The package exposes API interfaces to log useful |
23 | 24 | information from the user application and query Goodput for the job run, gain |
24 | 25 | insight into the productivity of ML workloads and utilization of compute |
25 | | - resources. |
| 26 | + resources. |
| 27 | + |
| 28 | +## Components |
| 29 | + |
| 30 | + |
| 31 | + The ML Goodput Measurement library consists of two main components: |
| 32 | + the `GoodputRecorder` and the `GoodputCalculator`. The `GoodputRecorder` |
| 33 | + exposes APIs to the client to export key timestamps while a training job makes |
| 34 | + progress, namely APIs that allow logging of productive step time and total job |
| 35 | + run time. The library will serialize and store this data in Google Cloud |
| 36 | + Logging. The `GoodputCalculator` exposes APIs to compute Goodput based on the |
| 37 | + recorded data. Cloud Logging handles its internal operations asynchronously. |
| 38 | + The recommended way to compute Goodput is to run an analysis program separate |
| 39 | + from the training application, either on a CPU instance or on the user's |
| 40 | + development machine. |
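
To make the recorded quantities concrete, the sketch below shows one simplified way Goodput could be derived from the timestamps described above. It is an illustration only, not the library's actual algorithm: the function name `estimate_goodput` is hypothetical, and it assumes each step lasts until the next step starts (the final step lasts until the job ends) and that no restarts or repeated steps occur.

```python
from datetime import datetime, timedelta


def estimate_goodput(job_start, job_end, step_starts):
    """Estimate Goodput (%) as productive step time over total job time.

    Simplifying assumption (not the library's method): step i lasts until
    step i+1 starts, and the final step lasts until the job ends. Restarts
    and repeated steps are not handled.
    """
    total_seconds = (job_end - job_start).total_seconds()
    if total_seconds <= 0 or not step_starts:
        return 0.0
    # Consecutive boundaries delimit the productive intervals.
    boundaries = sorted(step_starts) + [job_end]
    productive_seconds = sum(
        (end - start).total_seconds()
        for start, end in zip(boundaries, boundaries[1:])
    )
    return 100.0 * productive_seconds / total_seconds


job_start = datetime(2024, 1, 1, 12, 0, 0)
# 60 s of setup, then four 30 s training steps.
step_starts = [job_start + timedelta(seconds=60 + 30 * i) for i in range(4)]
job_end = job_start + timedelta(seconds=180)
print(f"Estimated goodput: {estimate_goodput(job_start, job_end, step_starts):.2f}%")
```

In this toy run, 120 of the 180 seconds are spent in training steps, so the estimate is about 66.67%.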
| 41 | + |
| 42 | +## Installation |
| 43 | + |
| 44 | + To install the ML Goodput Measurement package, run the following command on your TPU VM: |
| 45 | + |
| 46 | + ```bash |
| 47 | + pip install ml-goodput-measurement |
| 48 | + ``` |
| 49 | + |
| 50 | +## Usage |
| 51 | + |
| 52 | +Using this package requires a Google Cloud project with billing enabled, |
| 53 | +because it stores data in Google Cloud Logging. If you don't have a Google |
| 54 | +Cloud project, or if you don't have billing enabled for your Google Cloud |
| 55 | +project, then do the following: |
| 56 | + |
| 57 | +1. In the Google Cloud console, on the project selector page, |
| 58 | + [select or create a Google Cloud project](https://cloud.google.com/resource-manager/docs/creating-managing-projects). |
| 59 | + |
| 60 | +2. Make sure that billing is enabled for your Google Cloud project. Instructions can be found [here](https://cloud.google.com/billing/docs/how-to/verify-billing-enabled#console).
| 61 | + |
| 62 | +To run your training on Cloud TPU, set up the Cloud TPU environment by following |
| 63 | +instructions [here](https://cloud.google.com/tpu/docs/setup-gcp-account). |
| 64 | + |
| 65 | +To learn more about Google Cloud Logging, visit this [page](https://cloud.google.com/logging/docs). |
| 66 | + |
| 67 | + |
| 68 | +### Import |
| 69 | + |
| 70 | + To use this package, import the `goodput` module: |
| 71 | + |
| 72 | + |
| 73 | + ```python |
| 74 | + from ml_goodput_measurement import goodput |
| 75 | + ``` |
| 76 | + |
| 77 | +### Define the name of the Google Cloud Logging logger bucket |
| 78 | + |
| 79 | + Create a run-specific logger bucket where Cloud Logging entries can be written to and read from.
| 80 | + |
| 81 | + For example: |
| 82 | + |
| 83 | + ```python |
| 84 | + goodput_logger_name = f'goodput_{config.run_name}' |
| 85 | + ``` |
| 86 | + |
| 87 | +### Create a `GoodputRecorder` object |
| 88 | + |
| 89 | + Next, create a recorder object with the following parameters: |
| 90 | + |
| 91 | + 1. `job_name`: The full run name of the job. |
| 92 | + 2. `logger_name`: The name of the Cloud Logging logger object (created in the previous step). |
| 93 | + 3. `logging_enabled`: Whether or not this process has Cloud Logging enabled. |
| 94 | + |
| 95 | + |
| 96 | + |
| 97 | + > **_NOTE:_** For a multi-worker setup, please ensure that only one worker |
| 98 | + writes the logs to avoid duplicate entries. In JAX, for example, the check |
| 99 | + could be `if jax.process_index() == 0`. |
| 100 | + |
| 101 | + |
| 102 | + > **_NOTE:_** `logging_enabled` defaults to `False` and Goodput computations cannot be completed if no logs are ever written. |
| 103 | +
|
| 104 | + For example: |
| 105 | + |
| 106 | + |
| 107 | + ```python |
| 108 | + goodput_recorder = goodput.GoodputRecorder(job_name=config.run_name, logger_name=goodput_logger_name, logging_enabled=(jax.process_index() == 0)) |
| 109 | + ``` |
| 110 | + |
| 111 | + |
| 112 | +### Record Data with `GoodputRecorder` |
| 113 | + |
| 114 | +#### Record Job Start and End Time |
| 115 | + |
| 116 | + Use the recorder object to record the job's overall start and end time. |
| 117 | + |
| 118 | + For example: |
| 119 | + |
| 120 | + ```python |
| 121 | + def main(argv: Sequence[str]) -> None: |
| 122 | +     # Initialize configs… |
| 123 | +     goodput_recorder.record_job_start_time(datetime.datetime.now()) |
| 124 | +     # TPU Initialization and device scanning… |
| 125 | +     # Set up other things for the main training loop… |
| 126 | +     # Main training loop |
| 127 | +     train_loop(config) |
| 128 | +     goodput_recorder.record_job_end_time(datetime.datetime.now()) |
| 129 | + ``` |
| 130 | + |
| 131 | + |
| 132 | +#### Record Step Time |
| 133 | + |
| 134 | + Use the recorder object to record a step's start time using `record_step_start_time(step_count)`: |
| 135 | + |
| 136 | +For example: |
| 137 | + |
| 138 | + ```python |
| 139 | + def train_loop(config, state=None): |
| 140 | +     # Set up mesh, model, state, checkpoint manager… |
| 141 | + |
| 142 | +     # Initialize functional train arguments and model parameters… |
| 143 | + |
| 144 | +     # Define the compilation |
| 145 | + |
| 146 | +     for step in np.arange(start_step, config.steps): |
| 147 | +         goodput_recorder.record_step_start_time(step) |
| 148 | +         # Training step… |
| 149 | + |
| 150 | +     return state |
| 151 | + ``` |
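
When trying out the call pattern above without a Google Cloud project, a small in-memory stand-in can be handy. The class below is hypothetical and not part of `ml_goodput_measurement`; it only mirrors the three method names shown in this README and assumes the step timestamp is taken at call time, which may differ from what the real `GoodputRecorder` does.

```python
import datetime


class InMemoryRecorder:
    """Hypothetical stand-in that keeps timestamps in memory, not in Cloud Logging."""

    def __init__(self):
        self.job_start_time = None
        self.job_end_time = None
        self.step_start_times = {}

    def record_job_start_time(self, start_time):
        self.job_start_time = start_time

    def record_job_end_time(self, end_time):
        self.job_end_time = end_time

    def record_step_start_time(self, step):
        # Assumption: the timestamp is taken when the call is made.
        # A re-run step overwrites its earlier timestamp.
        self.step_start_times[int(step)] = datetime.datetime.now()


recorder = InMemoryRecorder()
recorder.record_job_start_time(datetime.datetime.now())
for step in range(3):
    recorder.record_step_start_time(step)
    # Training step would run here.
recorder.record_job_end_time(datetime.datetime.now())
print(sorted(recorder.step_start_times))
```

Swapping this object in for `goodput_recorder` exercises the same calls as the training loop above while writing nothing to Cloud Logging.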
| 152 | + |
| 153 | +### Retrieve Goodput with `GoodputCalculator` |
| 154 | + |
| 155 | +To retrieve the Goodput of a job run, instantiate a `GoodputCalculator` |
| 156 | +object with the job's run name and the Cloud Logging logger name used to |
| 157 | +record data for that job run. Then call the `get_job_goodput` API to get the |
| 158 | +computed Goodput for the job run. |
| 159 | + |
| 160 | +It is recommended to make the `get_job_goodput` calls for a job run from an |
| 161 | +instance separate from your training machine. |
| 162 | + |
| 163 | + |
| 164 | +#### Create a `GoodputCalculator` object |
| 165 | + |
| 166 | +Create the calculator object: |
| 167 | + |
| 168 | +```python |
| 169 | +goodput_logger_name = f'goodput_{config.run_name}' # Must match the logger name used when recording. |
| 170 | +goodput_calculator = goodput.GoodputCalculator(job_name=config.run_name, logger_name=goodput_logger_name) |
| 171 | +``` |
| 172 | + |
| 173 | +#### Retrieve Goodput |
| 174 | + |
| 175 | +Finally, call the `get_job_goodput` API to retrieve Goodput for the entire job run. |
| 176 | + |
| 177 | +```python |
| 178 | +total_goodput = goodput_calculator.get_job_goodput() |
| 179 | +print(f"Total job goodput: {total_goodput:.2f}%") |
| 180 | +``` |
| 181 | + |