Netflix
diff --git a/docs/index.md b/docs/index.md
Lines changed: 2 additions & 1 deletion
diff --git a/docs/scaling/checkpoint/checkpoint-ml-libraries.md b/docs/scaling/checkpoint/checkpoint-ml-libraries.md
Lines changed: 164 additions & 0 deletions
diff --git a/docs/scaling/checkpoint/introduction.md b/docs/scaling/checkpoint/introduction.md
Lines changed: 123 additions & 0 deletions
@@ -40,9 +40,10 @@ Metaflow makes it easy to build and manage real-life data science, AI, and ML pr
 - [Computing at Scale](scaling/remote-tasks/introduction) 
 - [Managing Dependencies](scaling/dependencies) 
 - [Dealing with Failures](scaling/failures)
+- [Checkpointing Progress](scaling/checkpoint/introduction) ✨*New*✨
 - [Loading and Storing Data](scaling/data)
-- [Accessing Secrets](scaling/secrets)
 - [Organizing Results](scaling/tagging)
+- [Accessing Secrets](scaling/secrets)
 
 ## III. Deploying to Production
 
 
@@ -0,0 +1,164 @@
+
+# Checkpoints in ML/AI libraries
+
+Let's explore how `@checkpoint` works in a real-world scenario when checkpointing training progress with popular ML
+libraries.
+
+## Checkpointing XGBoost
+
+Like many other ML libraries, [XGBoost](https://xgboost.readthedocs.io/en/stable/) allows you to define custom callbacks
+that are called periodically during training. We can create a custom checkpointer that saves the model to a file, using
+`pickle`, [as recommended by XGBoost](https://xgboost.readthedocs.io/en/stable/tutorials/saving_model.html), and calls
+`current.checkpoint.save()` to persist it.
+
+Save this snippet in a file, `xgboost_checkpointer.py`:
+
+```python
+import os
+import pickle
+from metaflow import current
+import xgboost
+
+class Checkpointer(xgboost.callback.TrainingCallback):
+
+    @classmethod
+    def _path(cls):
+        return os.path.join(current.checkpoint.directory, 'xgb_cp.pkl')
+
+    def __init__(self, interval=10):
+        self._interval = interval
+
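+    def after_iteration(self, model, epoch, evals_log):
+        # Save a checkpoint every `interval` boosting rounds
+        if epoch > 0 and epoch % self._interval == 0:
+            with open(self._path(), 'wb') as f:
+                pickle.dump(model, f)
+            current.checkpoint.save()
+        # Returning False tells XGBoost to continue training
+        return False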
+
+    @classmethod
+    def load(cls):
+        with open(cls._path(), 'rb') as f:
+            return pickle.load(f)  
+```
+
+:::tip
+Make sure that the checkpoint directory doesn't accumulate files across invocations, which would make the `save`
+operation become slower over time. Either overwrite the same files or clean up the directory between checkpoints.
+The `save` call will create a uniquely named checkpoint directory automatically, so you can keep overwriting files
+across iterations.
+:::
+
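As a rough illustration of the second option in the tip above (cleaning up the directory between checkpoints), here is a minimal sketch. The helper `clear_directory` is hypothetical, not part of `metaflow-checkpoint`, and a temporary directory stands in for `current.checkpoint.directory` so the snippet runs without Metaflow.

```python
import os
import shutil
import tempfile


def clear_directory(path):
    """Remove everything inside `path` but keep the directory itself."""
    for name in os.listdir(path):
        full = os.path.join(path, name)
        if os.path.isdir(full) and not os.path.islink(full):
            shutil.rmtree(full)
        else:
            os.remove(full)


# Assumption: a temp dir stands in for current.checkpoint.directory.
staging = tempfile.mkdtemp()

for epoch in (10, 20):
    # Start from an empty staging area so old files are not saved again.
    clear_directory(staging)
    with open(os.path.join(staging, f"model_{epoch}.pkl"), "wb") as f:
        f.write(b"placeholder")
    # In a real step, current.checkpoint.save() would be called here.

print(sorted(os.listdir(staging)))  # ['model_20.pkl']
```

If every checkpoint writes to the same file name, as `Checkpointer` does with `xgb_cp.pkl`, this cleanup is unnecessary because the file is simply overwritten.
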
+We can then train an XGBoost model using `Checkpointer`:
+
+```python
+from metaflow import FlowSpec, step, current, Flow,\
+                     Parameter, conda, retry, checkpoint, card, timeout
+
+class CheckpointXGBoost(FlowSpec):
+    rounds = Parameter("rounds", help="number of boosting rounds", default=128)
+
+    @conda(packages={"scikit-learn": "1.6.1"})
+    @step
+    def start(self):
+        from sklearn.datasets import load_breast_cancer
+
+        self.X, self.y = load_breast_cancer(return_X_y=True)
+        self.next(self.train)
+
+    @timeout(seconds=15)
+    @conda(packages={"xgboost": "2.1.4"})
+    @card
+    @retry
+    @checkpoint
+    @step
+    def train(self):
+        import xgboost as xgb
+        from xgboost_checkpointer import Checkpointer
+
+        if current.checkpoint.is_loaded:
+            cp_model = Checkpointer.load()
+            cp_rounds = cp_model.num_boosted_rounds()
+            print(f"Checkpoint was trained for {cp_rounds} rounds")
+        else:
+            cp_model = None
+            cp_rounds = 0
+
+        model = xgb.XGBClassifier(
+            n_estimators=self.rounds - cp_rounds,
+            eval_metric="logloss",
+            callbacks=[Checkpointer()])
+        model.fit(self.X, self.y, eval_set=[(self.X, self.y)], xgb_model=cp_model)
+
+        assert model.get_booster().num_boosted_rounds() == self.rounds
+        print("Training completed!")
+        self.next(self.end)
+
+    @step
+    def end(self):
+        pass
+
+if __name__ == "__main__":
+    CheckpointXGBoost()
+```
+
+You can run the flow, saved to `xgboostflow.py`, as usual:
+
+```
+python xgboostflow.py --environment=conda run
+```
+
+To demonstrate checkpoints in action, [the `@timeout`
+decorator](/scaling/failures#timing-out-with-the-timeout-decorator) interrupts each training attempt after 15 seconds.
+You can adjust the time depending on how fast the training progresses on your workstation. The `@retry` decorator
+will then start the task again, allowing `@checkpoint` to load the latest checkpoint and resume training.
+
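To build intuition for how the timeout, retry, and checkpoint interact, the following plain-Python simulation may help. It is a simplified sketch, not Metaflow or XGBoost code: `simulate_attempts` and its parameters are hypothetical, a "timeout" is modeled as a fixed number of rounds per attempt, and a checkpoint is assumed to be taken whenever the number of completed rounds is a multiple of the interval.

```python
def simulate_attempts(total_rounds, interval, rounds_per_attempt):
    """Return (attempts, rounds_done) for a job that resumes from checkpoints.

    Each attempt completes at most `rounds_per_attempt` rounds before it is
    interrupted. Progress made after the latest checkpoint is lost.
    """
    if rounds_per_attempt < interval:
        raise ValueError("an attempt must be long enough to reach a checkpoint")

    checkpointed = 0  # rounds captured in the latest checkpoint
    attempts = 0
    while True:
        attempts += 1
        done = checkpointed  # resume from the latest checkpoint
        remaining = total_rounds - checkpointed
        for i in range(remaining):
            if i == rounds_per_attempt:
                break  # simulated timeout, the retry starts a new attempt
            done += 1
            if done % interval == 0:
                checkpointed = done
        else:
            # The loop finished without a timeout: training is complete.
            return attempts, done


print(simulate_attempts(total_rounds=128, interval=10, rounds_per_attempt=35))
# (5, 128)
```

In this toy model each retry repeats at most `interval - 1` rounds of work, which is the trade-off behind choosing a checkpoint interval: shorter intervals lose less progress but save more often.
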
+## Checkpointing PyTorch
+
+Using `@checkpoint` with [PyTorch](https://pytorch.org/) is straightforward. Within your training loop, periodically
+serialize the model and use `current.checkpoint.save()` to create a checkpoint, along these lines:
+
+```python
+model_path = os.path.join(current.checkpoint.directory, 'model')
+torch.save(model.state_dict(), model_path)
+current.checkpoint.save()
+```
+
+Before starting training, check for an available checkpoint and load the model from it if found:
+
+```python
+model_path = os.path.join(current.checkpoint.directory, 'model')
+if current.checkpoint.is_loaded:
+    model.load_state_dict(torch.load(model_path))
+```
+
+Take a look at [this reference repository for a complete
+example](https://github.com/outerbounds/metaflow-checkpoint-examples/tree/master/mnist_torch_vanilla) showing this pattern in action, in addition to examples for many other frameworks.
+
+## Checkpointing GenAI/LLM fine-tuning
+
+Fine-tuning large language models and other large foundation models for generative AI can easily take hours, running on expensive GPU instances. Take a look at the following examples to learn how `@checkpoint` can be applied to various fine-tuning use cases:
+
+- [Fine-tuning a LoRA from a model downloaded from
+HuggingFace](https://github.com/outerbounds/metaflow-checkpoint-examples/tree/master/lora_huggingface)
+
+- [Fine-tuning an LLM using LLaMA
+Factory](https://github.com/outerbounds/metaflow-checkpoint-examples/tree/master/llama_factory)
+
+- [Fine-tuning an LLM and serving it with NVIDIA
+NIM](https://github.com/outerbounds/metaflow-checkpoint-examples/tree/master/nim_lora)
+
+## Checkpointing distributed workloads
+
+[Metaflow supports distributed training](/scaling/remote-tasks/distributed-computing) and other distributed workloads
+which execute across multiple instances in a cluster. When training large models over extended periods across multiple
+instances, which greatly increases the likelihood of hitting spurious failures, checkpointing becomes essential to
+ensure efficient recovery.
+
+Checkpointing works smoothly when only the control node in a training cluster is designated to handle it, preventing
+race conditions that could arise from multiple instances attempting to save progress simultaneously. For reference,
+[take a look at this
+example](https://github.com/outerbounds/metaflow-checkpoint-examples/tree/master/cifar_distributed) that uses [PyTorch Distributed Data Parallel (DDP)](https://pytorch.org/tutorials/intermediate/ddp_tutorial.html) mode to train a vision model on the CIFAR-10 dataset, checkpointing progress with `@checkpoint`.
+
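One way to restrict checkpointing to a single node could look like the sketch below. It assumes the launcher exposes each process's global rank in the `RANK` environment variable, as `torchrun` does, and treats rank 0 as the control node. The helpers `is_control_node` and `save_on_control_node` are hypothetical names, and the linked example may organize this differently.

```python
import os


def is_control_node(environ=None):
    """Treat the process with global rank 0 as the control node."""
    env = os.environ if environ is None else environ
    return int(env.get("RANK", "0")) == 0


def save_on_control_node(save_fn, environ=None):
    """Call `save_fn` only on the control node; return whether it ran."""
    if is_control_node(environ):
        save_fn()
        return True
    return False


# Simulate four workers; only rank 0 should record a save.
saved = []
for rank in range(4):
    save_on_control_node(lambda: saved.append(rank), {"RANK": str(rank)})

print(saved)  # [0]
```

In a real step, `save_fn` would write the model to `current.checkpoint.directory` and call `current.checkpoint.save()`.
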
+:::info
+Large-scale distributed computing can be challenging. If you need help setting up `@checkpoint` in distributed setups, don’t hesitate to [ask for guidance on Metaflow Slack](http://slack.outerbounds.co).
+:::
+
@@ -0,0 +1,123 @@
+# Checkpointing Progress
+
+Metaflow artifacts are used to persist models, dataframes, and other Python objects upon task completion. They
+checkpoint the flow's state at step boundaries, enabling you to inspect results of a task with
+[the Client API](/metaflow/client) and [`resume` execution from any
+step](/metaflow/debugging#how-to-use-the-resume-command).
+
+In some cases, a task may require a long time to execute. For example, training a model on an expensive GPU instance
+(or across a cluster) may take several hours or even days. In such situations, persisting the final model only upon
+task completion is not sufficient. Instead, it is advisable to checkpoint progress periodically while the task is
+executing, so you won’t lose hours of work in the event of a failure.
+
+You can use a Metaflow extension, `metaflow-checkpoint`, to create and use in-task checkpoints easily: Just add
+`@checkpoint` and call `current.checkpoint.save` to checkpoint progress periodically. A major benefit of `@checkpoint`
+is that it keeps checkpoints organized automatically alongside Metaflow tasks, so you don’t have to deal with saving,
+loading, organizing, and keeping track of checkpoint files manually.
+
+Notably, `@checkpoint` integrates seamlessly with popular AI and ML frameworks such as XGBoost, PyTorch, and others, as
+described below. For more background, read [the announcement blog post for
+`@checkpoint`](https://outerbounds.com/blog/indestructible-training-with-checkpoint).
+
+:::info
+The `@checkpoint` decorator is not a built-in part of core Metaflow yet, so you have to install it separately as
+described below. Also, its APIs may change in the future, in contrast to the APIs of core Metaflow, which are
+guaranteed to stay backwards compatible. Please share your feedback on
+[Metaflow Slack](http://slack.outerbounds.co)!
+:::
+
+## Installing `@checkpoint`
+
+To use the `@checkpoint` extension, install it with
+```
+pip install metaflow-checkpoint
+```
+in the environments where
+you develop and deploy Metaflow code. Metaflow packages extensions for remote execution automatically, so you don’t
+need to include it in container images used to run tasks remotely.
+
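If you want to confirm that the extension is present in a given environment, one option is to query the installed package metadata. This is a small sketch using only the standard library. The helper `installed_version` is a hypothetical name, and the distribution name is assumed to match the `pip install` command above.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if missing."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("metaflow-checkpoint")
if version is None:
    print("metaflow-checkpoint is not installed in this environment")
else:
    print(f"metaflow-checkpoint {version} is installed")
```
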
+## Using `@checkpoint`
+
+The `@checkpoint` decorator operates by persisting files in a local directory to the Metaflow datastore. This makes it
+directly compatible with many popular ML and AI frameworks that support persisting checkpoints on disk natively.
+
+Let’s demonstrate the functionality with this simple flow, which increments a counter in a loop where each iteration
+fails with a 20% probability. Thanks to `@checkpoint` and `@retry`, the `flaky_count` step recovers from exceptions and continues counting
+from the latest checkpoint, succeeding eventually:
+
+```python
+import os
+import random
+from metaflow import FlowSpec, current, step, retry, checkpoint
+
+class CheckpointCounterFlow(FlowSpec):
+    @step
+    def start(self):
+        self.counter = 0
+        self.next(self.flaky_count)
+
+    @checkpoint
+    @retry
+    @step
+    def flaky_count(self):
+        cp_path = os.path.join(current.checkpoint.directory, "counter")
+
+        def _save_counter():
+            print(f"Checkpointing counter value {self.counter}")
+            with open(cp_path, "w") as f:
+                f.write(str(self.counter))
+            self.latest_checkpoint = current.checkpoint.save()
+
+        def _load_counter():
+            if current.checkpoint.is_loaded:
+                with open(cp_path) as f:
+                    self.counter = int(f.read())
+                print("Checkpoint loaded!")
+
+        _load_counter()
+        print("Counter is now", self.counter)
+
+        while self.counter < 10:
+            self.counter += 1
+            if self.counter % 2 == 0:
+                _save_counter()
+
+            if random.random() < 0.2:
+                raise Exception("Bad luck! Try again!")
+
+        self.next(self.end)
+
+    @step
+    def end(self):
+        print("Final counter", self.counter)
+
+if __name__ == "__main__":
+    CheckpointCounterFlow()
+```
+
+After installing the `metaflow-checkpoint` extension, you can run the flow as usual:
+```
+python checkpoint_counter.py run
+```
+The flow demonstrates typical usage of `@checkpoint`:
+
+- `@checkpoint` initializes a temporary directory, `current.checkpoint.directory`, which you can use as a staging area for data to be checkpointed.
+
+- By default, `@checkpoint` loads the latest task-specific checkpoint in the directory automatically. If a checkpoint is found, `current.checkpoint.is_loaded` is set to `True`, so you can initialize processing with previously stored data, like loading the latest value of `counter` in this case.
+
+- Periodically during processing, you can save any data required to resume processing in the staging directory and call `current.checkpoint.save()` to persist it in the datastore.
+
+- We save a reference to the latest checkpoint in an artifact, `latest_checkpoint`, which allows us to find and load particular checkpoints, as explained later in this document.
+
+Behind the scenes, besides loading and storing data efficiently, `@checkpoint` takes care of scoping the checkpoint data to specific tasks. You can use `@checkpoint` in many parallel tasks, even in a foreach, knowing that `@checkpoint` will automatically load checkpoints specific to each branch. It also makes it possible to use checkpoints across runs, as described in [Deciding what checkpoint to use](/scaling/checkpoint/selecting-checkpoints).
+
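The staging-and-save pattern described above can be mimicked with a toy stand-in that runs without Metaflow. To be clear, `ToyCheckpointStore` is a hypothetical class written only for illustration: it copies a staging directory into numbered snapshot directories on local disk and says nothing about how `metaflow-checkpoint` stores or scopes data internally.

```python
import os
import shutil
import tempfile


class ToyCheckpointStore:
    """Toy model: each save() snapshots the staging directory."""

    def __init__(self, root):
        self.root = root
        self.count = 0

    def save(self, staging_dir):
        # Every save gets its own uniquely named snapshot directory.
        self.count += 1
        target = os.path.join(self.root, f"checkpoint-{self.count:04d}")
        shutil.copytree(staging_dir, target)
        return target

    def load_latest(self, staging_dir):
        # Copy the newest snapshot back into the staging directory.
        names = sorted(os.listdir(self.root))
        if not names:
            return False
        latest = os.path.join(self.root, names[-1])
        shutil.copytree(latest, staging_dir, dirs_exist_ok=True)
        return True


store = ToyCheckpointStore(tempfile.mkdtemp())
staging = tempfile.mkdtemp()

# First "attempt": overwrite the same file and snapshot it twice.
for value in (2, 4):
    with open(os.path.join(staging, "counter"), "w") as f:
        f.write(str(value))
    store.save(staging)

# Second "attempt": a fresh staging directory is populated from the latest snapshot.
fresh = tempfile.mkdtemp()
loaded = store.load_latest(fresh)
with open(os.path.join(fresh, "counter")) as f:
    restored = int(f.read())

print(loaded, restored)  # True 4
```

The sketch mirrors the flow above: the staging file is overwritten in place, each save produces a separate snapshot, and a retry starts from the most recent one.
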
+## Observing `@checkpoint` through cards
+
+Try running the above flow with [the default Metaflow
+`@card`](/metaflow/visualizing-results/effortless-task-inspection-with-default-cards):
+```
+python checkpoint_counter.py run --with card
+```
+If a step decorated with `@checkpoint` has a card enabled, it will add information about the checkpoints loaded and stored to the card. For instance, the screenshot below shows a card associated with the second attempt (`[Attempt: 1]` at the top of the card), which loaded a checkpoint produced by the first attempt and stored four checkpoints at two-second intervals:
+
+![](/assets/checkpoint_card.png)