Support mixed precision with CPU driver #118
base: main
Conversation
Bumps [pytorch-lightning](https://github.com/PyTorchLightning/pytorch-lightning) from 1.4.7 to 1.5.2. - [Release notes](https://github.com/PyTorchLightning/pytorch-lightning/releases) - [Changelog](https://github.com/PyTorchLightning/pytorch-lightning/blob/master/CHANGELOG.md) - [Commits](Lightning-AI/pytorch-lightning@1.4.7...1.5.2) --- updated-dependencies: - dependency-name: pytorch-lightning dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <[email protected]>
Thanks for the thorough description! Unfortunately my understanding of PTL is quite limited so I still have to ask a clarifying question below 😅
# Swap out the accelerator if necessary.
# This is needed to support CPU head with GPU workers or Ray Client.
current_accelerator = self.lightning_module.trainer.accelerator
if self.use_gpu and isinstance(current_accelerator, CPUAccelerator):
I'm not quite following this logic - by removing `isinstance(current_accelerator, CPUAccelerator)`, what scenario does this solve? Wouldn't the problem case (`CPUAccelerator`) be changed to `DelayedGPUAccelerator` both before and after this PR?
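For reference, here is a minimal sketch of the condition change being asked about. The predicate names are illustrative only (not from the codebase), and `accelerator` / `use_gpu` stand in for `self.lightning_module.trainer.accelerator` and the plugin's `use_gpu` flag from the snippet above:

```python
from pytorch_lightning.accelerators import CPUAccelerator


def should_swap_before(accelerator, use_gpu: bool) -> bool:
    # Before this PR: swap only when the driver starts with a CPUAccelerator.
    return use_gpu and isinstance(accelerator, CPUAccelerator)


def should_swap_after(accelerator, use_gpu: bool) -> bool:
    # After this PR: swap whenever GPU workers are requested, regardless of
    # which accelerator PTL selected for the driver process.
    return use_gpu
```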
@@ -25,17 +25,23 @@ def ray_start_4_cpus_4_gpus():
     ray.shutdown()


-def train_func(dir, plugin, callbacks=None):
+def train_func(dir, plugin, callbacks=None, amp=False):
Should there be a test with `amp=True`?
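For reference, a hypothetical sketch of what such a test could look like, assuming it sits next to `train_func` and the `ray_start_4_cpus_4_gpus` fixture in the existing test module; the test name and `RayPlugin` arguments are illustrative, not part of this PR:

```python
import pytest
import torch

from ray_lightning import RayPlugin


@pytest.mark.skipif(
    torch.cuda.device_count() < 4, reason="test requires multi-GPU machine")
def test_train_amp(tmpdir, ray_start_4_cpus_4_gpus):
    """Runs training with 16-bit precision, GPU workers, and a CPU driver."""
    plugin = RayPlugin(num_workers=2, use_gpu=True)
    train_func(tmpdir, plugin, amp=True)
```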
Closes #99. GPU tests were run manually and all are passing.

In PTL, 16-bit precision only works on GPU. You tell the Trainer that you want GPUs by setting `gpus=1`. However, if you do this with Ray Lightning, the driver process also requires a GPU. This prevents you from using Ray Lightning with Ray Client, and if you are using Tune, it forces an extra GPU to be reserved but never actually used. To fix this, we previously implemented a "hack" that swaps out the accelerator for a custom accelerator so the driver doesn't require GPUs: #67.

However, this swap only takes place if the initial accelerator is a `CPUAccelerator`. That prevents mixed precision from being used, since PTL complains about 16-bit precision with a `CPUAccelerator`. Instead, this PR performs the swap regardless of which initial accelerator is set.
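For reference, a minimal usage sketch of the workflow this PR enables (16-bit precision with GPU workers while the driver stays CPU-only). `ToyModel`, the random data, and the `RayPlugin` arguments are placeholders for illustration, not code from this repo:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl

from ray_lightning import RayPlugin


# Minimal placeholder LightningModule so the sketch is self-contained.
class ToyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(8, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)


if __name__ == "__main__":
    data = DataLoader(
        TensorDataset(torch.randn(64, 8), torch.randn(64, 1)), batch_size=8)

    # GPU workers are managed by Ray; the driver process needs no GPU.
    plugin = RayPlugin(num_workers=2, use_gpu=True)

    trainer = pl.Trainer(
        gpus=1,         # GPUs per worker; the driver itself stays CPU-only
        precision=16,   # mixed precision, previously rejected with a CPU driver
        max_epochs=1,
        plugins=[plugin])
    trainer.fit(ToyModel(), data)
```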