Introduce optuna.artifacts to the PyTorch checkpoint example #280
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@nabenabe0928 Could you review this PR?
This pull request has not seen any recent activity.
not522 left a comment:
Thank you for your PR! Could you check my comments?
pytorch/pytorch_checkpoint.py (outdated)
```python
checkpoint = torch.load(checkpoint_path)
if trial_number is not None:
    study = optuna.load_study(study_name="pytorch_checkpoint", storage="sqlite:///example.db")
    artifact_id = study.trials[trial_number].user_attrs["artifact_id"]
```
If the process is terminated before the first checkpoint, the artifact will not be saved, so check if it exists.
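A minimal sketch of that guard, reusing the names from the diff above; `dict.get` returns None when the attribute was never written:

```python
# Trials that were killed before their first upload have no "artifact_id" user attr.
artifact_id = study.trials[trial_number].user_attrs.get("artifact_id")
if artifact_id is None:
    print(f"Trial {trial_number} has no saved checkpoint; starting from scratch.")
else:
    ...  # download the artifact and torch.load it as in the diff above
```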
pytorch/pytorch_checkpoint.py (outdated)
| "accuracy": accuracy, | ||
| }, | ||
| tmp_checkpoint_path, | ||
| "./tmp_model.pt", |
Could you change the checkpoint path for each trial? If we run this script with multiple processes, the saved models can be overwritten by other processes.
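One way to satisfy this, and the fix the PR adopts in a later commit, is to key the temporary path on the trial number:

```python
# Each trial writes its checkpoint to its own file, so concurrent worker
# processes cannot clobber one another's saved models.
tmp_checkpoint_path = f"./tmp_model_{trial.number}.pt"
torch.save(
    {"accuracy": accuracy},  # the real checkpoint holds more fields; only this one appears in the diff
    tmp_checkpoint_path,
)
```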
This pull request has not seen any recent activity.
Co-authored-by: Naoto Mizuno <[email protected]>
The fix could be like this.
Thank you for your review! I have fixed it according to your suggestion.
not522 left a comment:
Thank you for your update. It's almost LGTM. Could you check my comment?
```python
    file_path=f"./tmp_model_{trial.number}.pt",
    artifact_id=artifact_id,
)
checkpoint = torch.load(f"./tmp_model_{trial.number}.pt")
```
Could you remove the temporary file here?
```python
os.remove(f"./tmp_model_{trial.number}.pt")
```
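Put together with the diff above, the resume path would then read roughly as follows; the `artifact_store` variable is assumed to be defined earlier in the script:

```python
local_path = f"./tmp_model_{trial.number}.pt"
download_artifact(artifact_store=artifact_store, artifact_id=artifact_id, file_path=local_path)
checkpoint = torch.load(local_path)
os.remove(local_path)  # delete the temporary file once the checkpoint is in memory
```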
Thank you for your comment. I have fixed this.
not522 left a comment:
LGTM!
Motivation
Currently, the PyTorch checkpoint example uses the local file system to save and manage checkpoints, and does not yet reflect the recent optuna.artifacts functionality.
Description of the changes
Update the example to save and restore checkpoints with optuna.artifacts.
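For readers skimming the thread, here is a minimal end-to-end sketch of the pattern this PR introduces. The store location, the toy model, and the `load_checkpoint` helper are illustrative assumptions; the study name, storage URL, per-trial paths, and artifact calls follow the diffs above:

```python
import os

import optuna
import torch
from optuna.artifacts import FileSystemArtifactStore, download_artifact, upload_artifact

base_path = "./artifacts"  # assumed location; any writable directory works
os.makedirs(base_path, exist_ok=True)
artifact_store = FileSystemArtifactStore(base_path=base_path)


def objective(trial):
    model = torch.nn.Linear(4, 1)  # stand-in for the example's real network
    checkpoint_path = f"./tmp_model_{trial.number}.pt"  # per-trial path, per the review
    accuracy = 0.0
    for _ in range(3):  # stand-in training loop
        # ... train and evaluate here, updating `accuracy` ...
        torch.save({"model_state_dict": model.state_dict(), "accuracy": accuracy}, checkpoint_path)
        # Upload this epoch's checkpoint and record its id so it can be found later.
        artifact_id = upload_artifact(trial, checkpoint_path, artifact_store)
        trial.set_user_attr("artifact_id", artifact_id)
    os.remove(checkpoint_path)  # per the review: do not leave per-trial files behind
    return accuracy


def load_checkpoint(trial_number):
    # Mirrors the resume path discussed above: look up the artifact id in storage,
    # download it to a per-trial temporary file, load it, then delete the file.
    study = optuna.load_study(study_name="pytorch_checkpoint", storage="sqlite:///example.db")
    artifact_id = study.trials[trial_number].user_attrs.get("artifact_id")
    if artifact_id is None:  # the trial died before its first checkpoint was uploaded
        return None
    path = f"./tmp_model_{trial_number}.pt"
    download_artifact(artifact_store=artifact_store, artifact_id=artifact_id, file_path=path)
    checkpoint = torch.load(path)
    os.remove(path)
    return checkpoint
```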