Commit c12258d

Author: royfa
docs(fabric): add all-process note for save/load checkpoints
1 parent 8d86b24 commit c12258d

File tree: 3 files changed, +13 -0 lines changed

docs/source-fabric/api/fabric.rst

Lines changed: 5 additions & 0 deletions
@@ -16,3 +16,8 @@ Fabric
     :template: classtemplate.rst

     Fabric
+
+.. note::
+
+   In distributed training, :meth:`~lightning.fabric.fabric.Fabric.save` and
+   :meth:`~lightning.fabric.fabric.Fabric.load` must be called on all processes.
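To illustrate the new note, a minimal sketch of the intended usage, assuming a simple DDP run with a placeholder model, optimizer, and checkpoint path (none of these names come from the commit itself): every process makes the ``save`` and ``load`` calls, and neither call is guarded by a rank check.

    import torch
    from lightning.fabric import Fabric

    fabric = Fabric(accelerator="cpu", devices=2, strategy="ddp")
    fabric.launch()

    model = torch.nn.Linear(32, 2)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    model, optimizer = fabric.setup(model, optimizer)

    state = {"model": model, "optimizer": optimizer, "step": 0}

    # Collective call: every process runs it; the strategy decides which rank(s) write.
    fabric.save("path/to/checkpoint.ckpt", state)

    # Also collective; do not wrap it in `if fabric.global_rank == 0:`.
    fabric.load("path/to/checkpoint.ckpt", state)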

docs/source-fabric/guide/checkpoint/checkpoint.rst

Lines changed: 4 additions & 0 deletions
@@ -43,6 +43,8 @@ To save the state to the filesystem, pass it to the :meth:`~lightning.fabric.fab

     fabric.save("path/to/checkpoint.ckpt", state)

+This method must be called on all processes.
+
 This will unwrap your model and optimizer and automatically convert their ``state_dict`` for you.
 Fabric and the underlying strategy will decide in which format your checkpoint gets saved.
 For example, ``strategy="ddp"`` saves a single file on rank 0, while ``strategy="fsdp"`` :doc:`saves multiple files from all ranks <distributed_checkpoint>`.
@@ -64,6 +66,8 @@ You can restore the state by loading a saved checkpoint back with :meth:`~lightn

     fabric.load("path/to/checkpoint.ckpt", state)

+This method must be called on all processes.
+
 Fabric will replace the state of your objects in-place.
 You can also request only to restore a portion of the checkpoint.
 For example, you want only to restore the model weights in your inference script:
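The hunk ends right before the guide's partial-restore example. As a rough sketch of that idea, assuming the checkpoint was written with a ``"model"`` key as in the save snippet above, and that ``Fabric.load`` returns whatever entries were not requested:

    import torch
    from lightning.fabric import Fabric

    fabric = Fabric(accelerator="cpu", devices=2, strategy="ddp")
    fabric.launch()

    model = fabric.setup(torch.nn.Linear(32, 2))

    # Restore only the model weights; this is still called on every process.
    # Checkpoint entries that were not requested are returned for inspection.
    remainder = fabric.load("path/to/checkpoint.ckpt", {"model": model})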

docs/source-fabric/guide/checkpoint/distributed_checkpoint.rst

Lines changed: 4 additions & 0 deletions
@@ -45,6 +45,8 @@ The distributed checkpoint format is the default when you train with the :doc:`F
     # DON'T do this (inefficient):
     # torch.save("path/to/checkpoint/file", state)

+This method must be called on all processes.
+
 With ``state_dict_type="sharded"``, each process/GPU will save its own file into a folder at the given path.
 This reduces memory peaks and speeds up the saving to disk.
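As a sketch of the sharded save described above, assuming an FSDP setup with placeholder module sizes and paths (not taken verbatim from the guide):

    import torch
    from lightning.fabric import Fabric
    from lightning.fabric.strategies import FSDPStrategy

    fabric = Fabric(accelerator="cuda", devices=2, strategy=FSDPStrategy(state_dict_type="sharded"))
    fabric.launch()

    model = fabric.setup_module(torch.nn.Linear(1024, 1024))
    optimizer = torch.optim.AdamW(model.parameters())
    optimizer = fabric.setup_optimizers(optimizer)

    state = {"model": model, "optimizer": optimizer}

    # Every rank makes this call and writes its own shard into the folder.
    fabric.save("path/to/checkpoint", state)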

@@ -138,6 +140,8 @@ You can easily load a distributed checkpoint in Fabric if your script uses :doc:
     # DON'T do this (inefficient):
     # model.load_state_dict(torch.load("path/to/checkpoint/file"))

+This method must be called on all processes.
+
 Note that you can load the distributed checkpoint even if the world size has changed, i.e., you are running on a different number of GPUs than when you saved the checkpoint.

 .. collapse:: Full example
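And a matching sketch for the load side, under the same assumptions: the call is again collective on every process, and, as the hunk above notes, the device count does not have to match the run that wrote the checkpoint.

    import torch
    from lightning.fabric import Fabric
    from lightning.fabric.strategies import FSDPStrategy

    # For example, resume on 4 devices even if the checkpoint was saved from 2.
    fabric = Fabric(accelerator="cuda", devices=4, strategy=FSDPStrategy(state_dict_type="sharded"))
    fabric.launch()

    model = fabric.setup_module(torch.nn.Linear(1024, 1024))
    optimizer = torch.optim.AdamW(model.parameters())
    optimizer = fabric.setup_optimizers(optimizer)

    state = {"model": model, "optimizer": optimizer}

    # Called on every process, not only on rank 0.
    fabric.load("path/to/checkpoint", state)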
