**README.md** (+4 −3)
```diff
@@ -4,17 +4,18 @@
 A memory-first, lightning-fast, ready-to-use ML checkpointing library.
 
-PyTorch DCP, Megatron-LM and NeMo 2.0 adapters are readily available for seamless integration, on top of our general core checkpointing library that can be used for custom integrations.
+Adapters for PyTorch DCP, Megatron-LM and NeMo 2.0 are readily available for seamless integration.
+They are built on top of the core checkpointing APIs, which can also be used directly for custom integrations.
 
-If interested in direct integration support with another framework, please let us know by creating a [feature request](https://github.com/google/ml-flashpoint/issues/new?template=feature_request.md) or upvoting an existing one!
+If interested in a native integration with another framework, please let us know by creating a [feature request](https://github.com/google/ml-flashpoint/issues/new?template=feature_request.md) or upvoting an [existing one](https://github.com/google/ml-flashpoint/issues?q=is%3Aissue%20state%3Aopen%20label%3Aenhancement)!
 
 For learning more about using the library and its performance, check out the [user documentation](https://google.github.io/ml-flashpoint).
 Below you will find development instructions for contributors.
 
 ## Installation
 
 This library defines core dependencies, as well as additional optional dependencies for specific adapters, to avoid polluting consumers with unnecessary dependencies.
-See the adapters installation commands for examples of the available adapters.
+See the adapters installation commands below for examples of the available options, and the [`pyproject.toml`](./pyproject.toml) as the source of truth for all available adapters.
```
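The optional-dependency scheme this hunk describes is typically declared via packaging extras. A hypothetical sketch follows; the extra names and dependency lists here are assumptions for illustration, and the repository's `pyproject.toml` remains the source of truth:

```toml
# Hypothetical sketch of optional adapter extras (names assumed,
# not taken from the actual project file).
[project]
name = "ml-flashpoint"
dependencies = [
    # core-only dependencies go here
]

[project.optional-dependencies]
# Each adapter pulls in its framework only when explicitly requested,
# e.g. `pip install "ml-flashpoint[nemo]"`.
torch-dcp = ["torch"]
nemo = ["nemo-toolkit"]
```

Consumers who only need the core APIs then avoid installing any framework-specific dependencies at all.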
**docs/README.md** (+18 −8)
```diff
@@ -18,25 +18,35 @@ The goal is to ultimately improve your ML runtime (total time and goodput), by a
 1. Free up your long-term storage bandwidth for other use cases.
 
 ML Flashpoint saves checkpoints to shared memory, to be able to recover when the node is not lost, and automatically replicates them asynchronously to peer(s) in the training cluster, to improve resilience during node losses.
-Replication has not been observed to have any negative impact on ongoing training or overall job time.
+Replication has not been observed to have any meaningful negative impact on ongoing training or overall job time.
 
 See the [overview](overview.md) for more detail.
```
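The save-then-replicate pattern described in this hunk can be illustrated with a minimal sketch. This is not the ML Flashpoint API; the function and paths below are hypothetical stand-ins for "write to node-local shared memory, then copy to a peer off the critical path":

```python
# Minimal sketch (NOT the ML Flashpoint API) of memory-first checkpointing:
# a fast, blocking save to local shared memory, followed by asynchronous
# replication to a peer so ongoing training is not held up.
import shutil
import tempfile
import threading
from pathlib import Path


def save_checkpoint(state: bytes, shm_dir: Path, peer_dir: Path) -> threading.Thread:
    """Write `state` locally, then replicate to a peer in the background."""
    local = shm_dir / "ckpt.bin"
    local.write_bytes(state)  # fast local save (e.g. /dev/shm in practice)
    t = threading.Thread(     # replication runs off the critical path
        target=shutil.copy2, args=(local, peer_dir / "ckpt.bin"), daemon=True
    )
    t.start()
    return t                  # caller may join() before shutdown


# Stand-ins for this node's shared memory and a peer's storage.
shm, peer = Path(tempfile.mkdtemp()), Path(tempfile.mkdtemp())
save_checkpoint(b"step-300-weights", shm, peer).join()
assert (peer / "ckpt.bin").read_bytes() == b"step-300-weights"
```

The local copy enables fast recovery when the node survives, while the peer copy covers node loss, matching the two failure modes the paragraph distinguishes.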
```diff
 ### Performance
 
-We performed some tests on a [Vertex AI Training Cluster](https://docs.cloud.google.com/vertex-ai/docs/training/training-clusters/overview) with 4 [A3-Mega](https://docs.cloud.google.com/compute/docs/accelerator-optimized-machines#a3-mega-vms) nodes for Gemma 27B and Llama 70B pre-training over just 300 steps and observed the improvements listed below.
+We observe meaningful improvements even in small-scale tests, spanning just 300 training steps with 4 [A3-Mega](https://docs.cloud.google.com/compute/docs/accelerator-optimized-machines#a3-mega-vms) nodes, for Gemma 27B and Llama 70B pre-training.
+We executed such tests on a [Vertex AI Training Cluster](https://docs.cloud.google.com/vertex-ai/docs/training/training-clusters/overview) and obtained the speedups listed below.
 These tests were conducted using ML Flashpoint _alongside_ NeMo's recommended checkpointing (as you would in production), where NeMo's default checkpointing used a 7-10 TB [Filestore](https://cloud.google.com/filestore) instance.
 
-Observations when comparing the hybrid of ML Flashpoint (every 5 steps) and NeMo checkpointing (every 50 steps) to just NeMo's regular checkpointing (every 10 steps):
+[¶](#perf-summary){ #perf-summary }
+When comparing
+
+1. the hybrid of ML Flashpoint (every 5 steps) and NeMo checkpointing (every 50 steps), to
+1. NeMo's regular checkpointing (every 10 steps - so half as often)
+
+the hybrid approach resulted in:
 
 * Data write times that are up to 20-30x faster, with little to no optimization.
   This is expected to further improve with additional optimizations.
 * Total checkpoint recovery times that are ~7-10x faster (includes the time it takes to do checkpoint detection, cross-node coordination, replication, read into model state and be ready to resume training).
-* For _async_ checkpointing: improvements averaging **3-6%** for _overall job time_, with peaks of **5-10%** improvements.
-  These improvements only account for checkpoint save efficiency, representing a "worst case" in the sense that checkpointing purely adds overhead and isn't actually used.
-  Any job interruptions will also benefit from the improved checkpoint recovery times.
+* For _async_ checkpointing: improvements averaging **3-6%** for _overall job time_, and reaching **5-10%** when NeMo checkpointing is deferred to the end (300th step) instead of being done every 50 steps.
+  These improvements only account for checkpoint _save_ efficiency, representing a "lower bound" value as it doesn't account for the speedups in _recovery_ time.
+  Any job interruptions would also benefit from ML Flashpoint's recovery performance gains.
```
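For concreteness, the checkpoint frequencies being compared imply the following save counts over a 300-step run. The counts are derived purely from the frequencies quoted in the hunk; no timings are assumed:

```python
# Checkpoint-save counts implied by the 300-step comparison above.
steps = 300

# Hybrid: ML Flashpoint to shared memory every 5 steps,
# plus NeMo checkpointing to long-term storage every 50 steps.
flashpoint_saves = steps // 5    # 60 fast shared-memory saves
hybrid_nemo_saves = steps // 50  # 6 long-term-storage saves

# Baseline: NeMo checkpointing alone, every 10 steps.
baseline_nemo_saves = steps // 10  # 30 long-term-storage saves

# The hybrid checkpoints more than twice as often overall (66 vs 30),
# yet touches long-term storage 5x less often than the baseline.
print(flashpoint_saves, hybrid_nemo_saves, baseline_nemo_saves)  # → 60 6 30
```

This is why the hybrid can both tighten the recovery point (a checkpoint every 5 steps instead of every 10) and reduce load on long-term storage at the same time.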
```diff
+
+!!! info
 
-While [ML runtime goodput](https://cloud.google.com/blog/products/ai-machine-learning/goodput-metric-as-measure-of-ml-productivity) is important, we focus on overall job time as an end-to-end metric, as it is simpler and allows for straightforward _total_ cost comparisons.
-Runtime goodput alone can be misleading if improvements to unproductive time actually worsen productive (active training) time, and the change in total evaluation period (job time) is not taken into account.
+    While [ML runtime goodput](https://cloud.google.com/blog/products/ai-machine-learning/goodput-metric-as-measure-of-ml-productivity) is important, we focus on overall job time as an end-to-end metric, as it is simpler and allows for straightforward _total_ cost comparisons.
+
+    Runtime goodput alone can be misleading if improvements to unproductive (non-training) time actually worsen productive (active training) time, and the change in total evaluation period (job time) is not taken into account.
```