
Commit 2bdbe5b

g-husam authored and Leahlijuan committed

docs(site): clarify perf points (#24)

Clarifying exactly where the improved read/write speedups apply. Also adding a note on infrastructure-agnosticism, and removing the WIP label from the doc site.

1 parent 3c08710 · commit 2bdbe5b

File tree

3 files changed: +15 −8 lines changed

docs/README.md

Lines changed: 10 additions & 7 deletions

@@ -1,6 +1,7 @@
 # ML Flashpoint
 
 ML Flashpoint is a memory-first, lightning-fast, ready-to-use ML checkpointing library.
+It is infrastructure and scheduler agnostic, with native integrations for certain frameworks, and a core library for custom use cases.
 
 Check out the [User Guide](user-guide.md) to get started.
 
@@ -21,7 +22,7 @@ ML Flashpoint saves checkpoints to shared memory, to be able to recover when the
 Replication has not been observed to have any meaningful negative impact on ongoing training or overall job time.
 See the [overview](overview.md) for more detail.
 
-### Performance
+## Performance
 
 We observe meaningful improvements even in small-scale tests, spanning just 300 training steps with 4 [A3-Mega](https://docs.cloud.google.com/compute/docs/accelerator-optimized-machines#a3-mega-vms) nodes, for Gemma 27B and Llama 70B pre-training.
 We executed such tests on a [Vertex AI Training Cluster](https://docs.cloud.google.com/vertex-ai/docs/training/training-clusters/overview) and obtained the speedups listed below.
@@ -33,14 +34,16 @@ When comparing
 1. the hybrid of ML Flashpoint (every 5 steps) and NeMo checkpointing (every 50 steps), to
 1. NeMo's regular checkpointing (every 10 steps - so half as often)
 
-the hybrid approach resulted in:
+We observe:
 
-* Data write times that are up to 20-30x faster, with little to no optimization.
+* Data write times that are up to 20-30x faster for ML Flashpoint specifically, with little to no optimization.
   This is expected to further improve with additional optimizations.
-* Total checkpoint recovery times that are ~7-10x faster (includes the time it takes to do checkpoint detection, cross-node coordination, replication, read into model state and be ready to resume training).
-* For _async_ checkpointing: improvements averaging **3-6%** for _overall job time_, and reaching **5-10%** when NeMo checkpointing is deferred to the end (300th step) instead of being done every 50 steps.
-  These improvements only account for checkpoint _save_ efficiency, representing a "lower bound" value as it doesn't account for the speedups in _recovery_ time.
-  Any job interruptions would also benefit from ML Flashpoint's recovery performance gains.
+* Total checkpoint recovery times that are ~7-10x faster for ML Flashpoint specifically (includes the time it takes to do checkpoint detection, cross-node coordination, replication, read into model state and be ready to resume training).
+* For _async_ checkpointing:
+    * Improvements averaging **3%** (Gemma 27B) & **6%** (Llama 70B) for _overall job time_ in the hybrid approach.
+    * Improvements reach **5%** (Gemma 27B) & **10%** (Llama 70B) when NeMo checkpointing is deferred to the end (300th step) instead of being done every 50 steps.
+    * These improvements only account for checkpoint _save_ efficiency, representing a "lower bound" value as it doesn't account for the speedups in _recovery_ time.
+    * Any job interruptions would also benefit from ML Flashpoint's recovery performance gains.
 
 !!! info
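As a side note, the checkpoint cadences compared in the README diff above (ML Flashpoint every 5 steps plus NeMo every 50, versus NeMo alone every 10, over a 300-step job) can be sketched with simple arithmetic. This is a hypothetical illustration only; the helper name is invented and the step counts come from the diff text, not from the ML Flashpoint API:

```python
def checkpoint_steps(total_steps: int, interval: int) -> list[int]:
    """Steps at which a checkpoint is taken for a fixed interval."""
    return list(range(interval, total_steps + 1, interval))

TOTAL = 300  # training steps in the small-scale test described above

# Hybrid: ML Flashpoint (in-memory) every 5 steps, NeMo (persistent) every 50.
mlf_saves = checkpoint_steps(TOTAL, 5)      # 60 in-memory saves
nemo_hybrid = checkpoint_steps(TOTAL, 50)   # 6 persistent saves

# Baseline: NeMo alone every 10 steps (half as often as ML Flashpoint).
nemo_baseline = checkpoint_steps(TOTAL, 10)  # 30 persistent saves

print(len(mlf_saves), len(nemo_hybrid), len(nemo_baseline))  # 60 6 30
```

The point of the hybrid setup is that it takes persistent checkpoints far less often (6 vs. 30) while checkpointing to memory twice as frequently, which is what the save- and recovery-time comparisons above are measuring.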

docs/user-guide.md

Lines changed: 4 additions & 0 deletions

@@ -1,6 +1,10 @@
 # User Guide
 
 Below are instructions for using ML Flashpoint with the different frameworks supported.
+For finer-grained control, use the [core](https://github.com/google/ml-flashpoint/tree/main/src/ml_flashpoint/core) library APIs, which the framework adapters build on top of.
+The adapters also provide a good working example of how to use the core library.
+
+If interested in a native integration with another framework, please let us know by creating a [feature request](https://github.com/google/ml-flashpoint/issues/new?template=feature_request.md) or upvoting an [existing one](https://github.com/google/ml-flashpoint/issues?q=is%3Aissue%20state%3Aopen%20label%3Aenhancement).
 
 ## Install

mkdocs.yml

Lines changed: 1 addition & 1 deletion

@@ -14,7 +14,7 @@
 
 # yaml-language-server: $schema=https://squidfunk.github.io/mkdocs-material/schema.json
 
-site_name: ML Flashpoint Docs [WIP]
+site_name: ML Flashpoint Docs
 site_url: https://google.github.io/ml-flashpoint
 
 nav:

0 commit comments
