docs/README.md (10 additions, 7 deletions)
@@ -1,6 +1,7 @@
# ML Flashpoint
ML Flashpoint is a memory-first, lightning-fast, ready-to-use ML checkpointing library.
It is infrastructure- and scheduler-agnostic, with native integrations for certain frameworks and a core library for custom use cases.
Check out the [User Guide](user-guide.md) to get started.
@@ -21,7 +22,7 @@ ML Flashpoint saves checkpoints to shared memory, to be able to recover when the
Replication has not been observed to have any meaningful negative impact on ongoing training or overall job time.
See the [overview](overview.md) for more detail.
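As a rough illustration of the memory-first idea, a checkpoint can be staged in POSIX shared memory so that a restarted training process on the same node can recover it without touching disk. This is a conceptual sketch only; the function names are hypothetical and do not reflect ML Flashpoint's actual API:

```python
# Conceptual sketch: stage checkpoint bytes in shared memory so a restarted
# process on the same node can recover them without a disk read.
# NOT ML Flashpoint's API -- names here are illustrative only.
from multiprocessing import shared_memory
import pickle


def save_to_shm(name: str, state: dict) -> None:
    payload = pickle.dumps(state)
    try:
        shm = shared_memory.SharedMemory(name=name, create=True, size=len(payload))
    except FileExistsError:
        # Overwrite a stale segment from a previous run.
        shared_memory.SharedMemory(name=name).unlink()
        shm = shared_memory.SharedMemory(name=name, create=True, size=len(payload))
    shm.buf[: len(payload)] = payload
    shm.close()  # detach this handle; the segment itself stays alive


def load_from_shm(name: str) -> dict:
    shm = shared_memory.SharedMemory(name=name)
    state = pickle.loads(bytes(shm.buf))  # pickle stops at its STOP opcode
    shm.close()
    return state


save_to_shm("ckpt_step_100", {"step": 100, "loss": 0.42})
print(load_from_shm("ckpt_step_100")["step"])  # prints 100
```

A real implementation must additionally handle cross-node replication and coordination, which is where the library's value lies; the sketch above only shows why the in-memory read path avoids storage latency entirely.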
## Performance
We observe meaningful improvements even in small-scale tests, spanning just 300 training steps with 4 [A3-Mega](https://docs.cloud.google.com/compute/docs/accelerator-optimized-machines#a3-mega-vms) nodes, for Gemma 27B and Llama 70B pre-training.
We executed such tests on a [Vertex AI Training Cluster](https://docs.cloud.google.com/vertex-ai/docs/training/training-clusters/overview) and obtained the speedups listed below.
@@ -33,14 +34,16 @@ When comparing
1. the hybrid of ML Flashpoint (every 5 steps) and NeMo checkpointing (every 50 steps), to
1. NeMo's regular checkpointing (every 10 steps, i.e. half as often)
We observe:
* Data write times that are up to 20-30x faster for ML Flashpoint specifically, with little to no optimization.
This is expected to further improve with additional optimizations.
* Total checkpoint recovery times that are ~7-10x faster for ML Flashpoint specifically (includes the time it takes to do checkpoint detection, cross-node coordination, replication, read into model state and be ready to resume training).
* For _async_ checkpointing:
  * Improvements averaging **3%** (Gemma 27B) & **6%** (Llama 70B) for _overall job time_ in the hybrid approach.
  * Improvements reach **5%** (Gemma 27B) & **10%** (Llama 70B) when NeMo checkpointing is deferred to the end (300th step) instead of being done every 50 steps.
* These improvements only account for checkpoint _save_ efficiency, representing a "lower bound" value as it doesn't account for the speedups in _recovery_ time.
* Any job interruptions would also benefit from ML Flashpoint's recovery performance gains.
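For concreteness, the checkpoint cadences in the comparison above work out to the following save counts over a 300-step run. This is plain arithmetic from the stated intervals, not additional measured data:

```python
# Checkpoint counts implied by the cadences in the 300-step comparison above.
steps = 300

# Hybrid: ML Flashpoint every 5 steps plus NeMo every 50 steps.
hybrid_flashpoint = steps // 5   # 60 in-memory saves
hybrid_nemo = steps // 50        # 6 NeMo saves

# Baseline: NeMo alone, every 10 steps.
baseline_nemo = steps // 10      # 30 NeMo saves

print(hybrid_flashpoint, hybrid_nemo, baseline_nemo)  # prints: 60 6 30
```

The hybrid run thus checkpoints far more frequently overall (66 saves vs. 30) while issuing only a fifth as many NeMo saves, which is what makes the job-time improvements above possible despite the higher checkpoint frequency.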
docs/user-guide.md (4 additions, 0 deletions)
@@ -1,6 +1,10 @@
# User Guide
Below are instructions for using ML Flashpoint with the different frameworks supported.
For finer-grained control, use the [core](https://github.com/google/ml-flashpoint/tree/main/src/ml_flashpoint/core) library APIs, which the framework adapters build on top of.
The adapters also provide a good working example of how to use the core library.
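The core-plus-adapters layering described above can be pictured roughly as follows. All names here are illustrative stand-ins, not ml-flashpoint's real API; the sketch only shows the shape of the relationship, in which an adapter maps a framework's state convention onto a framework-agnostic core:

```python
# Illustrative only: how a framework adapter might wrap a framework-agnostic
# core checkpointing layer. Names are hypothetical, NOT ml-flashpoint's API.
from typing import Any, Callable


class CoreCheckpointer:
    """Stand-in for the low-level, framework-agnostic core library."""

    def __init__(self) -> None:
        self._store: dict[str, Any] = {}

    def save(self, key: str, state: Any) -> None:
        self._store[key] = state

    def restore(self, key: str) -> Any:
        return self._store[key]


class FrameworkAdapter:
    """Adapts one framework's state_dict convention onto the core API."""

    def __init__(self, core: CoreCheckpointer, get_state: Callable[[], dict]):
        self._core = core
        self._get_state = get_state

    def save_step(self, step: int) -> None:
        self._core.save(f"step-{step}", self._get_state())

    def restore_step(self, step: int) -> dict:
        return self._core.restore(f"step-{step}")


adapter = FrameworkAdapter(CoreCheckpointer(), get_state=lambda: {"step": 7})
adapter.save_step(7)
print(adapter.restore_step(7))  # prints: {'step': 7}
```

Using the core APIs directly, as the real adapters do, gives finer-grained control over when and how state is captured.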
If you are interested in a native integration with another framework, please let us know by creating a [feature request](https://github.com/google/ml-flashpoint/issues/new?template=feature_request.md) or upvoting an [existing one](https://github.com/google/ml-flashpoint/issues?q=is%3Aissue%20state%3Aopen%20label%3Aenhancement).