Commit 2ad4df0

Googler authored and copybara-github committed
Update Goodput library documentation.
PiperOrigin-RevId: 610537927
1 parent 32ba75b commit 2ad4df0

1 file changed: +159 -3 lines changed

README.md

Lines changed: 159 additions & 3 deletions
@@ -13,13 +13,169 @@
See the License for the specific language governing permissions and
limitations under the License.
-->
# ML Goodput Measurement

## Overview

ML Goodput Measurement is a library intended to be used with Cloud TPU to log the
necessary information and query a job's Goodput. It can be installed with pip,
and its modules can be imported to retrieve information about a training job's
overall productive Goodput. The package exposes APIs to log useful information
from the user application and to query Goodput for the job run, giving insight
into the productivity of ML workloads and the utilization of compute resources.

## Components

The ML Goodput Measurement library consists of two main components:
the `GoodputRecorder` and the `GoodputCalculator`. The `GoodputRecorder`
exposes APIs that let the client export key timestamps as a training job makes
progress, namely APIs for logging productive step time and total job run time.
The library serializes this data and stores it in Google Cloud Logging. The
`GoodputCalculator` exposes APIs to compute Goodput from the recorded data.
Cloud Logging handles its internal operations asynchronously. The recommended
way to compute Goodput is to run an analysis program separate from the training
application, either on a CPU instance or on the user's development machine.

42+
## Installation
43+
44+
To install the ML Goodput Measurement package, run the following command on TPU VM:
45+
46+
```bash
47+
pip install ml-goodput-measurement
48+
```
49+
## Usage

Using this package requires a Google Cloud project with billing enabled so that
Google Cloud Logging works properly. If you don't have a Google Cloud project,
or if billing is not enabled for your project, do the following:

1. In the Google Cloud console, on the project selector page,
   [select or create a Google Cloud project](https://cloud.google.com/resource-manager/docs/creating-managing-projects).

2. Make sure that billing is enabled for your Google Cloud project. For
   instructions, see the
   [billing verification guide](https://cloud.google.com/billing/docs/how-to/verify-billing-enabled#console).

To run your training on Cloud TPU, set up the Cloud TPU environment by following
the [setup instructions](https://cloud.google.com/tpu/docs/setup-gcp-account).

To learn more about Google Cloud Logging, see the
[Cloud Logging documentation](https://cloud.google.com/logging/docs).

### Import

To use this package, import the `goodput` module:

```python
from ml_goodput_measurement import goodput
```

### Define the name of the Google Cloud Logging logger bucket

Define a run-specific logger name. Cloud Logging entries for the run are written
to and read from this logger.

For example:

```python
goodput_logger_name = f'goodput_{config.run_name}'
```

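If a run name contains spaces or other punctuation, it may be convenient to normalize it before building the logger name. A minimal sketch of one way to do that follows. The helper `make_logger_name` and the choice of allowed characters (letters, digits, `_`, `.`, `-`) are assumptions made for illustration, not requirements stated by this library.

```python
import re


def make_logger_name(run_name: str, prefix: str = "goodput") -> str:
    """Build a logger name, replacing characters outside a conservative set.

    Assumption: keeping only letters, digits, '_', '.', and '-' is a safe
    choice; the library itself does not document such a restriction here.
    """
    cleaned = re.sub(r"[^A-Za-z0-9_.-]", "_", run_name)
    return f"{prefix}_{cleaned}"


# Hypothetical run name containing a space, a slash, and a hash sign.
example_logger_name = make_logger_name("llama 7b/run#3")
print(example_logger_name)
```

Whatever naming scheme is used, the recorder and the calculator need to be given the same logger name for a run.
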
### Create a `GoodputRecorder` object

Next, create a recorder object with the following parameters:

1. `job_name`: The full run name of the job.
2. `logger_name`: The name of the Cloud Logging logger (defined in the previous step).
3. `logging_enabled`: Whether or not this process has Cloud Logging enabled.

> **_NOTE:_** For a multi-worker setup, ensure that only one worker
> writes the logs to avoid duplicate entries. In JAX, for example, the check
> could be `if jax.process_index() == 0`.

> **_NOTE:_** `logging_enabled` defaults to `False`, and Goodput computations cannot be completed if no logs are ever written.

For example:

```python
goodput_recorder = goodput.GoodputRecorder(
    job_name=config.run_name,
    logger_name=goodput_logger_name,
    logging_enabled=(jax.process_index() == 0),
)
```

### Record Data with `GoodputRecorder`

#### Record Job Start and End Time

Use the recorder object to record the job's overall start and end time.

For example:

```python
import datetime
from typing import Sequence


def main(argv: Sequence[str]) -> None:
    # Initialize configs…
    goodput_recorder.record_job_start_time(datetime.datetime.now())
    # TPU Initialization and device scanning…
    # Set up other things for the main training loop…
    # Main training loop
    train_loop(config)
    goodput_recorder.record_job_end_time(datetime.datetime.now())
```

#### Record Step Time

Use the recorder object to record each step's start time by calling `record_step_start_time(step_count)`.

For example:

```python
import numpy as np


def train_loop(config, state=None):
    # Set up mesh, model, state, checkpoint manager…

    # Initialize functional train arguments and model parameters…

    # Define the compilation

    for step in np.arange(start_step, config.steps):
        goodput_recorder.record_step_start_time(step)
        # Training step…

    return state
```

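The call pattern from the two examples above can be exercised without Cloud Logging by substituting an in-memory stand-in. The sketch below is a hypothetical test double, `InMemoryRecorder`, that only mirrors the three method names shown in this README. Its storage format and the `run_job` helper are assumptions made for illustration and say nothing about how `GoodputRecorder` is implemented.

```python
import datetime


class InMemoryRecorder:
    """Hypothetical stand-in that keeps entries in a list instead of Cloud Logging."""

    def __init__(self, logging_enabled=False):
        self.logging_enabled = logging_enabled
        self.entries = []

    def _record(self, kind, **fields):
        # Mirror the documented behavior: nothing is written unless logging is enabled.
        if self.logging_enabled:
            self.entries.append({"kind": kind, **fields})

    def record_job_start_time(self, start_time):
        self._record("job_start", time=start_time)

    def record_step_start_time(self, step):
        self._record("step_start", step=int(step), time=datetime.datetime.now())

    def record_job_end_time(self, end_time):
        self._record("job_end", time=end_time)


def run_job(recorder, num_steps):
    """Record the job start, each step start, and the job end, in that order."""
    recorder.record_job_start_time(datetime.datetime.now())
    for step in range(num_steps):
        recorder.record_step_start_time(step)
        # The training step would run here.
    recorder.record_job_end_time(datetime.datetime.now())


writer = InMemoryRecorder(logging_enabled=True)  # e.g. the worker with process index 0
silent = InMemoryRecorder()                      # every other worker
run_job(writer, num_steps=3)
run_job(silent, num_steps=3)

recorded_kinds = [entry["kind"] for entry in writer.entries]
print(recorded_kinds)
print(len(silent.entries))
```

Only the recorder created with `logging_enabled=True` ends up with entries, which is the behavior the single-writer note above relies on.
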
### Retrieve Goodput with `GoodputCalculator`

To retrieve the Goodput of a job run, instantiate a `GoodputCalculator` object
with the job's run name and the Cloud Logging logger name used to record data
for that job run. Then call the `get_job_goodput` API to get the computed
Goodput for the job run.

It is recommended to make the `get_job_goodput` calls for a job run from an
instance separate from your training machine.

#### Create a `GoodputCalculator` object

Create the calculator object:

```python
# Use the same logger name that was used to record data for this job run.
goodput_logger_name = f'goodput_{config.run_name}'
goodput_calculator = goodput.GoodputCalculator(
    job_name=config.run_name, logger_name=goodput_logger_name
)
```

#### Retrieve Goodput

Finally, call the `get_job_goodput` API to retrieve Goodput for the entire job run.

```python
total_goodput = goodput_calculator.get_job_goodput()
print(f"Total job goodput: {total_goodput:.2f}%")
```

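Because Cloud Logging handles its operations asynchronously, an analysis script may run before any data is available. One possible way to handle that is a small retry wrapper, sketched below. The helper name `poll_goodput`, its parameters, and the assumption that a failed query raises an exception are all illustrative; only `get_job_goodput()` comes from this README.

```python
import time


def poll_goodput(calculator, attempts=3, delay_seconds=30.0, sleep=time.sleep):
    """Call calculator.get_job_goodput(), retrying when it raises.

    Assumption: a query that cannot be completed raises an exception.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return calculator.get_job_goodput()
        except Exception as error:  # Broad on purpose: error types are not documented here.
            last_error = error
            if attempt + 1 < attempts:
                sleep(delay_seconds)
    raise RuntimeError(f"Goodput unavailable after {attempts} attempts") from last_error


class FlakyCalculator:
    """Hypothetical stand-in that fails twice before returning a value."""

    def __init__(self):
        self.calls = 0

    def get_job_goodput(self):
        self.calls += 1
        if self.calls < 3:
            raise ValueError("no log entries yet")
        return 87.5


flaky_calculator = FlakyCalculator()
# Pass a no-op sleep so the demonstration does not actually wait.
polled_goodput = poll_goodput(flaky_calculator, sleep=lambda seconds: None)
print(f"Total job goodput: {polled_goodput:.2f}%")
```

In a real analysis script, a `GoodputCalculator` instance would take the place of the stand-in, and the default `time.sleep` would provide the delay between attempts.
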
