This repository has been archived by the owner on Nov 3, 2023. It is now read-only.

Support mixed precision with CPU driver #118

Open · wants to merge 45 commits into base: main
Conversation

@amogkam (Collaborator) commented Jan 19, 2022

Closes #99. GPU tests were run manually and all are passing.

In PTL, 16-bit precision only works on GPU. You tell your Trainer that you want GPUs by setting gpus=1. However, if you do this with Ray Lightning, the driver process also requires a GPU. This prevents you from using Ray Lightning with Ray Client, and if you are using Tune, it forces an extra GPU to be reserved but never actually used. To fix this, we previously implemented a "hack" that swaps the accelerator out for a custom accelerator so the driver doesn't require a GPU: #67.

However, this swap only takes place if the initial accelerator is a CPUAccelerator. That prevents mixed precision from being used, since PTL complains about 16-bit precision with a CPUAccelerator. This PR instead performs the swap regardless of which accelerator is initially set.
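
For illustration, a minimal sketch of the swap described above, written as an excerpt from inside the plugin's setup path. `DelayedGPUAccelerator` comes from #67; the `_swap_accelerator` helper is hypothetical and stands in for however the swap is actually wired up, so treat this as an outline of the behavior rather than the literal diff:

```python
from pytorch_lightning.accelerators import CPUAccelerator

# Sketch only -- `_swap_accelerator` is a hypothetical helper, and
# `DelayedGPUAccelerator` is the custom accelerator introduced in #67.

# Before this PR: only swap when PTL resolved a CPUAccelerator, which rules
# out mixed precision because PTL rejects 16-bit precision on a CPUAccelerator.
current_accelerator = self.lightning_module.trainer.accelerator
if self.use_gpu and isinstance(current_accelerator, CPUAccelerator):
    self._swap_accelerator(DelayedGPUAccelerator)

# After this PR: swap whenever GPU workers are requested, regardless of which
# accelerator the Trainer initially resolved, so the driver never needs a GPU.
if self.use_gpu:
    self._swap_accelerator(DelayedGPUAccelerator)
```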

amogkam and others added 30 commits September 9, 2021 15:47
Bumps [pytorch-lightning](https://github.com/PyTorchLightning/pytorch-lightning) from 1.4.7 to 1.5.2.
- [Release notes](https://github.com/PyTorchLightning/pytorch-lightning/releases)
- [Changelog](https://github.com/PyTorchLightning/pytorch-lightning/blob/master/CHANGELOG.md)
- [Commits](Lightning-AI/pytorch-lightning@1.4.7...1.5.2)

---
updated-dependencies:
- dependency-name: pytorch-lightning
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>
@matthewdeng (Contributor) left a comment

Thanks for the thorough description! Unfortunately my understanding of PTL is quite limited so I still have to ask a clarifying question below 😅

# Swap out the accelerator if necessary.
# This is needed to support CPU head with GPU workers or Ray Client.
current_accelerator = self.lightning_module.trainer.accelerator
if self.use_gpu and isinstance(current_accelerator, CPUAccelerator):
@matthewdeng (Contributor) commented:

I'm not quite following this logic - by removing isinstance(current_accelerator, CPUAccelerator), what scenario does this solve? Wouldn't the problem case (CPUAccelerator) be changed to DelayedGPUAccelerator both before and after this PR?
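
For context, a minimal sketch of the user-facing scenario in question; the Trainer arguments and RayPlugin construction below are illustrative assumptions, not code from this diff:

```python
import pytorch_lightning as pl
from ray_lightning import RayPlugin

# With mixed precision the user has to request GPUs, so PTL resolves a
# GPUAccelerator (not a CPUAccelerator) on the driver.
plugin = RayPlugin(num_workers=2, use_gpu=True)  # illustrative arguments
trainer = pl.Trainer(gpus=1, precision=16, plugins=[plugin])

# Under the old check, isinstance(current_accelerator, CPUAccelerator) is
# False here, so the swap to DelayedGPUAccelerator never happened and the
# driver process itself ended up requiring a GPU.
```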

@@ -25,17 +25,23 @@ def ray_start_4_cpus_4_gpus():
    ray.shutdown()


-def train_func(dir, plugin, callbacks=None):
+def train_func(dir, plugin, callbacks=None, amp=False):
@matthewdeng (Contributor) commented:

Should there be a test with amp=True?
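
For reference, a hedged sketch of what such a test could look like, reusing the train_func signature and the ray_start_4_cpus_4_gpus fixture from the diff above; the test name and the RayPlugin arguments are assumptions, not this PR's actual test code:

```python
from ray_lightning import RayPlugin  # assumed import; the test module's real imports may differ


def test_train_amp(tmpdir, ray_start_4_cpus_4_gpus):
    """Exercise the new amp=True path with GPU workers (illustrative arguments)."""
    plugin = RayPlugin(num_workers=2, use_gpu=True)
    train_func(tmpdir, plugin, amp=True)
```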

Successfully merging this pull request may close these issues.

Error when using 16 bit precision