
Commit 59ca9da

docs(site): fix perf section formatting; add code package links (#15)
1 parent 89f4cf1 commit 59ca9da

File tree: 2 files changed (+6, -5 lines)


docs/README.md (3 additions, 2 deletions)

@@ -27,15 +27,16 @@ We performed some tests on a [Vertex AI Training Cluster](https://docs.cloud.goo
 These tests were conducted using ML Flashpoint _alongside_ NeMo's recommended checkpointing (as you would in production), where NeMo's default checkpointing used a 7-10 TB [Filestore](https://cloud.google.com/filestore) instance.
 
 Observations when comparing the hybrid of ML Flashpoint (every 5 steps) and NeMo checkpointing (every 50 steps) to just NeMo's regular checkpointing (every 10 steps):
+
 * Data write times that are up to 20-30x faster, with little to no optimization.
 This is expected to further improve with additional optimizations.
 * Total checkpoint recovery times that are ~7-10x faster (includes the time it takes to do checkpoint detection, cross-node coordination, replication, read into model state and be ready to resume training).
 * For _async_ checkpointing: improvements averaging **3-6%** for _overall job time_, with peaks of **5-10%** improvements.
 These improvements only account for checkpoint save efficiency, representing a "worst case" in the sense that checkpointing purely adds overhead and isn't actually used.
 Any job interruptions will also benefit from the improved checkpoint recovery times.
 
-While [ML runtime goodput](https://cloud.google.com/blog/products/ai-machine-learning/goodput-metric-as-measure-of-ml-productivity) is important, we focus on overall job time as an end-to-end metric, as it is most transparent and accounts for actual cost.
-Goodput can be misleading if improvements to unproductive time actually worsen productive time.
+While [ML runtime goodput](https://cloud.google.com/blog/products/ai-machine-learning/goodput-metric-as-measure-of-ml-productivity) is important, we focus on overall job time as an end-to-end metric, as it is simpler and allows for straightforward _total_ cost comparisons.
+Runtime goodput alone can be misleading if improvements to unproductive time actually worsen productive (active training) time, and the change in total evaluation period (job time) is not taken into account.
 
 ## Design Philosophy
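The goodput caveat added in the README change above can be made concrete with a small numeric sketch. The hours below are hypothetical illustration values, not measurements from the commit; goodput is taken as productive time divided by total job time, per the linked Google Cloud definition:

```python
# Hypothetical numbers showing why runtime goodput alone can mislead:
# goodput can improve while the overall job takes longer.

def goodput(productive_h: float, unproductive_h: float) -> float:
    """Fraction of total job time spent on productive (active training) work."""
    return productive_h / (productive_h + unproductive_h)

# Baseline: 90 h of training plus 10 h of checkpoint/restart overhead.
base = goodput(90.0, 10.0)   # goodput 0.90, total job time 100 h

# "Optimized": overhead cut to 2 h, but the change slows training to 103 h.
opt = goodput(103.0, 2.0)    # goodput ~0.98, total job time 105 h

assert opt > base                    # goodput improved...
assert 103.0 + 2.0 > 90.0 + 10.0     # ...yet the job got slower end to end
```

This is why the reworded paragraph focuses on overall job time: it captures the change in the total evaluation period that goodput, on its own, can hide.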

docs/user-guide.md (3 additions, 3 deletions)

@@ -28,7 +28,7 @@ See the project's [README](http://cs/h/cloud-mlnet/ml-flashpoint/+/main:README.m
 
 ### NeMo 2.0 & Pytorch Lightning
 
-Code: See the `ml_flashpoint.adapter.nemo` package.
+Code: See the [`ml_flashpoint.adapter.nemo`](https://github.com/google/ml-flashpoint/tree/main/src/ml_flashpoint/adapter/nemo) package.
 
 !!! note
 
@@ -107,7 +107,7 @@ This reduces blocking time by avoiding duplicate work, at the cost of having a l
 
 ### Megatron-LM
 
-Code: See the `ml_flashpoint.adapter.megatron` package.
+Code: See the [`ml_flashpoint.adapter.megatron`](https://github.com/google/ml-flashpoint/tree/main/src/ml_flashpoint/adapter/megatron) package.
 
 The Megatron strategies depend on the PyTorch DCP implementations.
 Below are instructions for setting up ML Flashpoint checkpointing, which you should configure alongside regular checkpointing to long-term storage.
@@ -195,4 +195,4 @@ else:
 
 ### PyTorch DCP
 
-Code: See the `ml_flashpoint.adapter.pytorch` package.
+Code: See the [`ml_flashpoint.adapter.pytorch`](https://github.com/google/ml-flashpoint/tree/main/src/ml_flashpoint/adapter/pytorch) package.
