Commit c991bfb

docs(site): clarify perf points
1 parent 6e20be2 commit c991bfb

1 file changed, with 4 additions and 4 deletions
docs/README.md

Lines changed: 4 additions & 4 deletions
@@ -33,12 +33,12 @@ When comparing
 1. the hybrid of ML Flashpoint (every 5 steps) and NeMo checkpointing (every 50 steps), to
 1. NeMo's regular checkpointing (every 10 steps - so half as often)
 
-the hybrid approach resulted in:
+We observe:
 
-* Data write times that are up to 20-30x faster, with little to no optimization.
+* Data write times that are up to 20-30x faster for ML Flashpoint, with little to no optimization.
 This is expected to further improve with additional optimizations.
-* Total checkpoint recovery times that are ~7-10x faster (includes the time it takes to do checkpoint detection, cross-node coordination, replication, read into model state and be ready to resume training).
-* For _async_ checkpointing: improvements averaging **3-6%** for _overall job time_, and reaching **5-10%** when NeMo checkpointing is deferred to the end (300th step) instead of being done every 50 steps.
+* Total checkpoint recovery times that are ~7-10x faster for ML Flashpoint (includes the time it takes to do checkpoint detection, cross-node coordination, replication, read into model state and be ready to resume training).
+* For _async_ checkpointing: improvements averaging **3-6%** for _overall job time_ in the hybrid approach, and reaching **5-10%** when NeMo checkpointing is deferred to the end (300th step) instead of being done every 50 steps.
 These improvements only account for checkpoint _save_ efficiency, representing a "lower bound" value as it doesn't account for the speedups in _recovery_ time.
 Any job interruptions would also benefit from ML Flashpoint's recovery performance gains.
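
To make the "half as often" comparison in the diff concrete, here is a minimal sketch that counts how many checkpoints each schedule takes. It assumes a 300-step job (taken from the "300th step" mentioned above) and assumes a checkpoint fires whenever the 1-based step number is a multiple of the interval. The helper `checkpoint_steps` and the variable names are hypothetical and are not part of ML Flashpoint or NeMo.

```python
def checkpoint_steps(interval: int, total_steps: int) -> list[int]:
    """Return the 1-based steps at which a checkpoint is taken.

    Assumption: a checkpoint fires on every step divisible by `interval`.
    """
    return list(range(interval, total_steps + 1, interval))


TOTAL_STEPS = 300  # assumption: job length implied by "the end (300th step)"

# Hybrid approach: ML Flashpoint every 5 steps plus NeMo every 50 steps.
hybrid_flashpoint = checkpoint_steps(5, TOTAL_STEPS)
hybrid_nemo = checkpoint_steps(50, TOTAL_STEPS)

# Baseline: NeMo's regular checkpointing every 10 steps.
baseline_nemo = checkpoint_steps(10, TOTAL_STEPS)

print("hybrid, ML Flashpoint saves:", len(hybrid_flashpoint))
print("hybrid, NeMo saves:", len(hybrid_nemo))
print("baseline, NeMo saves:", len(baseline_nemo))
```

Under these assumptions the baseline saves half as often as ML Flashpoint does in the hybrid setup, which matches the note in the README text.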
