This repository has been archived by the owner on Nov 3, 2023. It is now read-only.

Deterministic mode is not set on remote worker #217

Status: Open · wants to merge 2 commits into base: main
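For context on the change below: PyTorch Lightning's Trainer(deterministic=True) flag configures deterministic execution in the process where it is applied, but the Ray launcher runs _wrapping_function in separate worker processes, so the setting has to be re-applied there. The following is an illustrative sketch of roughly what deterministic mode toggles on a worker, assuming the standard PyTorch switches; enable_deterministic_mode is a hypothetical helper name, not the library's internal _init_deterministic.

import os
import torch

def enable_deterministic_mode() -> None:
    # Hypothetical helper for illustration; Lightning applies equivalent
    # settings through its accelerator connector.
    # Needed for deterministic cuBLAS kernels on CUDA >= 10.2.
    os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")
    # Raise an error on any op that has no deterministic implementation.
    torch.use_deterministic_algorithms(True)
    # Force cuDNN onto deterministic kernels and disable autotuning.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False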
28 changes: 17 additions & 11 deletions ray_lightning/launchers/ray_launcher.py
@@ -298,16 +298,20 @@ def _wrapping_function(
trainer.strategy.local_rank = self._strategy.local_rank
set_cuda_device_if_used(trainer.strategy)

# Set operations to deterministic in this worker when required
if trainer._accelerator_connector.deterministic:
trainer._accelerator_connector._init_deterministic(True)

results = function(*args, **kwargs)

if trainer is not None:
Review comment (Contributor Author): Why is this check needed? Is there a case where trainer can be None?

return self._collect_rank_zero_results(trainer, results)
else:
return None
results = self._collect_rank_zero_results(trainer, results)

if results is None:
trainer._teardown()
Review comment (Contributor Author): Why do we need to tear down the trainer only when local_rank or global_rank is != 0?

trainer._call_teardown_hook()

trainer._teardown()
trainer._call_teardown_hook()
return None
return results

def _collect_rank_zero_results(self, trainer: "pl.Trainer",
results: Any) -> Optional["_RayOutput"]:
@@ -316,18 +320,20 @@ def _collect_rank_zero_results(self, trainer: "pl.Trainer",
This function is run on the worker process.
"""
rank_zero_debug("Finalizing the Ray launcher environment.")
if trainer.strategy.global_rank != 0:
Review comment (Contributor Author): Is it safe to use trainer.strategy instead of self._strategy?

return None

if trainer.strategy.local_rank != 0:
return None

checkpoint_callback = trainer.checkpoint_callback
best_model_path = checkpoint_callback.best_model_path \
if checkpoint_callback else None

state_dict = trainer.lightning_module.state_dict()

if self._strategy.global_rank != 0:
return None

# Move state_dict to cpu before converting it to model state stream
if trainer.strategy.local_rank == 0:
state_dict = move_data_to_device(state_dict, "cpu")
state_dict = move_data_to_device(state_dict, "cpu")

# PyTorch Lightning saves the model weights in a temp file and
# loads it back on the driver.
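For a sense of where the patched code path is exercised, below is a minimal end-to-end sketch assuming ray_lightning's public RayStrategy API; TinyModel and the random dataset are placeholders invented for the example. With this patch, deterministic=True would also take effect inside each Ray worker instead of only on the driver process.

import torch
from torch.utils.data import DataLoader, TensorDataset

import pytorch_lightning as pl
from ray_lightning import RayStrategy


class TinyModel(pl.LightningModule):
    # Placeholder model, just enough to exercise the Ray launcher.
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(4, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)


data = DataLoader(TensorDataset(torch.randn(32, 4), torch.randn(32, 1)), batch_size=8)

trainer = pl.Trainer(
    strategy=RayStrategy(num_workers=2, use_gpu=False),
    deterministic=True,  # the flag this PR propagates into the remote workers
    max_epochs=1,
    enable_progress_bar=False,
)
trainer.fit(TinyModel(), data)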