Commit d15e7bb (parent: 85d780d)

docs(site): update perf numbers (#46)
1 file changed: docs/README.md (5 additions, 5 deletions)
@@ -36,15 +36,16 @@ When comparing
 
 We observe:
 
-* Data write times that are up to 20-30x faster for ML Flashpoint specifically, with little to no optimization.
-  This is expected to further improve with additional optimizations.
-* Total checkpoint recovery times that are ~7-10x faster for ML Flashpoint specifically (includes the time it takes to do checkpoint detection, cross-node coordination, replication, read into model state and be ready to resume training).
+* Data write times that are up to 120x faster for ML Flashpoint specifically, currently reaching up to ~30 GB/s/node write throughput (scales linearly with cluster size).
+* Total checkpoint recovery times that are ~7-12x faster for ML Flashpoint specifically, depending on the number of nodes lost (includes the time it takes to do checkpoint detection, cross-node coordination, replication, read into model state and be ready to resume training).
 * For _async_ checkpointing:
     * Improvements averaging **3%** (Gemma 27B) & **6%** (Llama 70B) for _overall job time_ in the hybrid approach.
     * Improvements reach **5%** (Gemma 27B) & **10%** (Llama 70B) when NeMo checkpointing is deferred to the end (300th step) instead of being done every 50 steps.
     * These improvements only account for checkpoint _save_ efficiency, representing a "lower bound" value as it doesn't account for the speedups in _recovery_ time.
     * Any job interruptions would also benefit from ML Flashpoint's recovery performance gains.
 
+Stay tuned and watch the [repository](https://github.com/google/ml-flashpoint) for updates on future improvements!
+
 !!! info
 
     While [ML runtime goodput](https://cloud.google.com/blog/products/ai-machine-learning/goodput-metric-as-measure-of-ml-productivity) is important, we focus on overall job time as an end-to-end metric, as it is simpler and allows for straightforward _total_ cost comparisons.
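A quick illustration of the linear-scaling claim in the diff above. Only the ~30 GB/s/node figure comes from the text; the cluster size here is a hypothetical example, not a measured configuration:

```shell
# Illustrative aggregate write throughput under linear scaling.
# NODES is an assumed cluster size, not a number from the docs.
PER_NODE_GBPS=30
NODES=16
AGGREGATE_GBPS=$((PER_NODE_GBPS * NODES))
echo "~${AGGREGATE_GBPS} GB/s aggregate write throughput across ${NODES} nodes"
```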
@@ -69,8 +70,7 @@ To use ML Flashpoint, the basic requirements for the training environment are:
     * This is enforced so that the pairwise strategy doesn't put a higher memory burden on one node than the others, and so the general capacity requirements are roughly consistent across nodes.
 1. A `tmpfs` mount, separate from `/dev/shm`, is strongly recommended for the container base path.
     E.g. a `/tmp` mount, which can be added to `/etc/fstab` on Linux machines to mount it persistently (A3-Mega example):
-    1. `tmpfs /tmp tmpfs rw,nosuid,nodev,size=1024G,mode=1777,noswap,huge=within_size 0 0`
-    1. `huge=within_size` is recommended to use huge pages for any files large enough, since checkpoint data is on the order of many GBs.
+    1. `tmpfs /tmp tmpfs rw,nosuid,nodev,size=1024G,mode=1777,noswap 0 0`
     1. `noswap` is recommended to avoid degrading performance.
         This can be omitted if you prefer to allow transparent disk swapping to accommodate more checkpoint storage than can fit in memory, at the cost of poorer checkpointing performance.
 1. The amount of memory needed is at least equal to the checkpoint size per node x 4, to account for replicas and in-progress checkpoints.
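A back-of-the-envelope check of the "checkpoint size per node x 4" sizing rule above. The per-node checkpoint size used here is an assumed example value, not a number from the docs:

```shell
# Hypothetical sizing check for the "checkpoint size per node x 4" rule.
# CKPT_GB is an assumed per-node checkpoint size, not from the docs.
CKPT_GB=200
REQUIRED_GB=$((CKPT_GB * 4))   # accounts for replicas and in-progress checkpoints
echo "tmpfs budget per node: at least ${REQUIRED_GB} GB"
```

For this example, the tmpfs `size=` option (and available RAM) would need to accommodate at least 800 GB per node.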
