Commit a1e1846

docs(site): fix perf section formatting; add code package links
1 parent 89f4cf1

2 files changed (+6, -5 lines)


docs/README.md

Lines changed: 3 additions & 2 deletions
@@ -27,15 +27,16 @@ We performed some tests on a [Vertex AI Training Cluster](https://docs.cloud.goo
 These tests were conducted using ML Flashpoint _alongside_ NeMo's recommended checkpointing (as you would in production), where NeMo's default checkpointing used a 7-10 TB [Filestore](https://cloud.google.com/filestore) instance.
 
 Observations when comparing the hybrid of ML Flashpoint (every 5 steps) and NeMo checkpointing (every 50 steps) to just NeMo's regular checkpointing (every 10 steps):
+
 * Data write times that are up to 20-30x faster, with little to no optimization.
   This is expected to further improve with additional optimizations.
 * Total checkpoint recovery times that are ~7-10x faster (includes the time it takes to do checkpoint detection, cross-node coordination, replication, read into model state and be ready to resume training).
 * For _async_ checkpointing: improvements averaging **3-6%** for _overall job time_, with peaks of **5-10%** improvements.
   These improvements only account for checkpoint save efficiency, representing a "worst case" in the sense that checkpointing purely adds overhead and isn't actually used.
   Any job interruptions will also benefit from the improved checkpoint recovery times.
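To make the "worst case" framing above concrete, here is a back-of-envelope sketch of checkpointing as pure added job time. All step times, save times, and schedules below are invented assumptions for illustration, not measurements from these tests:

```python
# Back-of-envelope model of checkpointing as pure overhead ("worst case":
# no failure occurs, so the saved checkpoints are never read back).
# All timings below are invented assumptions, not measured values.
def total_job_time(steps, step_time_s, ckpt_every, ckpt_block_s):
    """Wall-clock time = compute time + blocking time of periodic saves."""
    return steps * step_time_s + (steps // ckpt_every) * ckpt_block_s

STEPS, STEP_S = 1_000, 10.0  # assumed job shape

# Baseline: regular checkpointing every 10 steps, blocking ~120 s per save.
baseline = total_job_time(STEPS, STEP_S, ckpt_every=10, ckpt_block_s=120.0)

# Hybrid: fast local saves every 5 steps (assumed ~5 s each, reflecting much
# faster writes) plus regular saves every 50 steps (~120 s each).
hybrid = (total_job_time(STEPS, STEP_S, ckpt_every=5, ckpt_block_s=5.0)
          + (STEPS // 50) * 120.0)

print(f"baseline: {baseline:.0f} s, hybrid: {hybrid:.0f} s")
print(f"overhead cut: {(baseline - hybrid) / baseline:.1%} of total job time")
```

The actual improvement depends entirely on the assumed step and save times (and shrinks further with async saves); the point is only that faster, more frequent local saves can add less total overhead than slower, less frequent ones.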

-While [ML runtime goodput](https://cloud.google.com/blog/products/ai-machine-learning/goodput-metric-as-measure-of-ml-productivity) is important, we focus on overall job time as an end-to-end metric, as it is most transparent and accounts for actual cost.
-Goodput can be misleading if improvements to unproductive time actually worsen productive time.
+While [ML runtime goodput](https://cloud.google.com/blog/products/ai-machine-learning/goodput-metric-as-measure-of-ml-productivity) is important, we focus on overall job time as an end-to-end metric, as it is simpler, more transparent, and accounts for actual cost.
+Goodput can be misleading if improvements to unproductive time actually worsen productive time, and the change in total evaluation period (job time) is not taken into account.
 
 ## Design Philosophy

docs/user-guide.md

Lines changed: 3 additions & 3 deletions
@@ -28,7 +28,7 @@ See the project's [README](http://cs/h/cloud-mlnet/ml-flashpoint/+/main:README.m
 
 ### NeMo 2.0 & Pytorch Lightning
 
-Code: See the `ml_flashpoint.adapter.nemo` package.
+Code: See the [`ml_flashpoint.adapter.nemo`](https://github.com/google/ml-flashpoint/tree/main/src/ml_flashpoint/adapter/nemo) package.
 
 !!! note
 
@@ -107,7 +107,7 @@ This reduces blocking time by avoiding duplicate work, at the cost of having a l
 
 ### Megatron-LM
 
-Code: See the `ml_flashpoint.adapter.megatron` package.
+Code: See the [`ml_flashpoint.adapter.megatron`](https://github.com/google/ml-flashpoint/tree/main/src/ml_flashpoint/adapter/megatron) package.
 
 The Megatron strategies depend on the PyTorch DCP implementations.
 Below are instructions for setting up ML Flashpoint checkpointing, which you should configure alongside regular checkpointing to long-term storage.
@@ -195,4 +195,4 @@ else:
 
 ### PyTorch DCP
 
-Code: See the `ml_flashpoint.adapter.pytorch` package.
+Code: See the [`ml_flashpoint.adapter.pytorch`](https://github.com/google/ml-flashpoint/tree/main/src/ml_flashpoint/adapter/pytorch) package.
