This repository has been archived by the owner on Nov 3, 2023. It is now read-only.

Deterministic mode is not set on remote worker #217

Open
wants to merge 2 commits into base: main

Conversation

MarkusSpanring
Contributor

Fix related to #213

This PR should also fix an unreachable code segment introduced in my previous PR #208

@@ -316,18 +320,20 @@ def _collect_rank_zero_results(self, trainer: "pl.Trainer",
This function is run on the worker process.
"""
rank_zero_debug("Finalizing the Ray launcher environment.")
if trainer.strategy.global_rank != 0:
Contributor Author

Is it safe to use trainer.strategy instead of self._strategy?

# Set operations to deterministic in this worker when required
if trainer._accelerator_connector.deterministic:
trainer._accelerator_connector._init_deterministic(True)

results = function(*args, **kwargs)

if trainer is not None:
Contributor Author

Why is this check needed? Is there a case when trainer can be None?

results = self._collect_rank_zero_results(trainer, results)

if results is None:
trainer._teardown()
Contributor Author

Why do we need to tear down the trainer only when local_rank or global_rank is != 0?
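
For reference, the worker-side setup that the deterministic block in the diff enables roughly corresponds to the plain-PyTorch flags below. This is only a sketch of what deterministic mode is generally understood to configure; the actual internals of _accelerator_connector._init_deterministic may differ between Lightning versions.

```python
import os

import torch


def enable_deterministic_mode() -> None:
    """Sketch of the worker-side setup that deterministic mode implies.

    Not the Lightning implementation; shown only to clarify what the
    _init_deterministic(True) call in the diff is expected to achieve.
    """
    # Use deterministic implementations of PyTorch operations where available.
    torch.use_deterministic_algorithms(True)
    # Disable cuDNN autotuning and request deterministic cuDNN kernels.
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True
    # Some CUDA ops (e.g. cuBLAS matmuls) additionally require this workspace
    # setting when deterministic algorithms are enabled.
    os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")
```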

@MarkusSpanring MarkusSpanring marked this pull request as ready for review September 23, 2022 10:03
@MarkusSpanring
Contributor Author

@JiahaoYao if you have time, could you check if _init_deterministic(True) is sufficient to replicate Trainer(deterministic=True) on all workers?
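
One way to sanity-check this would be to query the relevant flags from inside each worker process and compare them with what the driver expects. A minimal sketch, assuming an initialized Ray cluster; report_deterministic_flags is a hypothetical helper for illustration, not part of ray_lightning:

```python
import ray
import torch

ray.init()  # assumes a local or already-configured Ray cluster


@ray.remote
def report_deterministic_flags() -> dict:
    """Return the deterministic-related settings as seen by one Ray worker."""
    return {
        "deterministic_algorithms": torch.are_deterministic_algorithms_enabled(),
        "cudnn_deterministic": torch.backends.cudnn.deterministic,
        "cudnn_benchmark": torch.backends.cudnn.benchmark,
    }


# Compare what a few workers report against what the driver expects.
print(ray.get([report_deterministic_flags.remote() for _ in range(4)]))
```

Note that a bare Ray task like this only shows the default state of a fresh worker process; to actually confirm the fix, the same checks would need to run inside the function executed by the launcher, after _init_deterministic(True) has been called.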
