This repository has been archived by the owner on Nov 3, 2023. It is now read-only.

Support mixed precision with CPU driver #118

Open · wants to merge 45 commits into base: main
Conversation

@amogkam (Collaborator) commented Jan 19, 2022

Closes #99. GPU tests were run manually and all are passing.

In PTL, 16-bit precision only works on GPU. You tell your Trainer that you want GPUs by setting gpus=1. However, if you do this with Ray Lightning, the driver process also requires a GPU. This prevents you from using Ray Lightning with Ray Client, and if you are using Tune, it forces an extra GPU to be reserved but never actually used. To fix this, we previously implemented a "hack" that swaps the accelerator out for a custom accelerator so the driver doesn't require a GPU: #67.

However, this swap only takes place if the initial accelerator is a CPUAccelerator. That prevents mixed precision from being used, since PTL complains about 16-bit precision with a CPUAccelerator. This PR instead performs the swap regardless of which accelerator is initially set.
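
For illustration, a minimal sketch of the swap described above, written as an excerpt from inside the plugin's setup path. `DelayedGPUAccelerator` comes from #67; the `_swap_accelerator` helper is hypothetical and stands in for however the swap is actually wired up, so treat this as an outline of the behavior rather than the literal diff:

```python
from pytorch_lightning.accelerators import CPUAccelerator

# Sketch only -- `_swap_accelerator` is a hypothetical helper, and
# `DelayedGPUAccelerator` is the custom accelerator introduced in #67.

# Before this PR: only swap when PTL resolved a CPUAccelerator, which rules
# out mixed precision because PTL rejects 16-bit precision on a CPUAccelerator.
current_accelerator = self.lightning_module.trainer.accelerator
if self.use_gpu and isinstance(current_accelerator, CPUAccelerator):
    self._swap_accelerator(DelayedGPUAccelerator)

# After this PR: swap whenever GPU workers are requested, regardless of which
# accelerator the Trainer initially resolved, so the driver never needs a GPU.
if self.use_gpu:
    self._swap_accelerator(DelayedGPUAccelerator)
```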

amogkam and others added 30 commits September 9, 2021 15:47
Bumps [pytorch-lightning](https://github.com/PyTorchLightning/pytorch-lightning) from 1.4.7 to 1.5.2.
- [Release notes](https://github.com/PyTorchLightning/pytorch-lightning/releases)
- [Changelog](https://github.com/PyTorchLightning/pytorch-lightning/blob/master/CHANGELOG.md)
- [Commits](Lightning-AI/pytorch-lightning@1.4.7...1.5.2)

---
updated-dependencies:
- dependency-name: pytorch-lightning
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>
@matthewdeng (Contributor) left a comment

Thanks for the thorough description! Unfortunately my understanding of PTL is quite limited so I still have to ask a clarifying question below 😅

# Swap out the accelerator if necessary.
# This is needed to support CPU head with GPU workers or Ray Client.
current_accelerator = self.lightning_module.trainer.accelerator
if self.use_gpu and isinstance(current_accelerator, CPUAccelerator):
@matthewdeng (Contributor) commented:

I'm not quite following this logic - by removing isinstance(current_accelerator, CPUAccelerator), what scenario does this solve? Wouldn't the problem case (CPUAccelerator) be changed to DelayedGPUAccelerator both before and after this PR?
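
For context, a minimal sketch of the user-facing scenario in question; the Trainer arguments and RayPlugin construction below are illustrative assumptions, not code from this diff:

```python
import pytorch_lightning as pl
from ray_lightning import RayPlugin

# With mixed precision the user has to request GPUs, so PTL resolves a
# GPUAccelerator (not a CPUAccelerator) on the driver.
plugin = RayPlugin(num_workers=2, use_gpu=True)  # illustrative arguments
trainer = pl.Trainer(gpus=1, precision=16, plugins=[plugin])

# Under the old check, isinstance(current_accelerator, CPUAccelerator) is
# False here, so the swap to DelayedGPUAccelerator never happened and the
# driver process itself ended up requiring a GPU.
```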

@@ -25,17 +25,23 @@ def ray_start_4_cpus_4_gpus():
    ray.shutdown()


-def train_func(dir, plugin, callbacks=None):
+def train_func(dir, plugin, callbacks=None, amp=False):
@matthewdeng (Contributor) commented:

Should there be a test with amp=True?
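
For reference, a hedged sketch of what such a test could look like, reusing the train_func signature and the ray_start_4_cpus_4_gpus fixture from the diff above; the test name and the RayPlugin arguments are assumptions, not this PR's actual test code:

```python
from ray_lightning import RayPlugin  # assumed import; the test module's real imports may differ


def test_train_amp(tmpdir, ray_start_4_cpus_4_gpus):
    """Exercise the new amp=True path with GPU workers (illustrative arguments)."""
    plugin = RayPlugin(num_workers=2, use_gpu=True)
    train_func(tmpdir, plugin, amp=True)
```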

Successfully merging this pull request may close these issues.

Error when using 16 bit precision