Multiprocessing error during validation in LocalTorch compute context #298

@atc3

Description

Describe the bug

When running cosem_example.ipynb on a local workstation with GPUs, the validation step during training throws the following error:

...
RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
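
For what it's worth, the error itself reproduces outside of dacapo in a few lines of PyTorch. This is just my sketch of the underlying behavior (that the daisy workers are forked from a CUDA-initialized parent is an assumption on my part, based on the message):

```python
import multiprocessing as mp

import torch

def worker():
    # Any CUDA call in a forked child after the parent has initialized CUDA
    # raises: "Cannot re-initialize CUDA in forked subprocess. ..."
    torch.zeros(1, device="cuda")

if __name__ == "__main__":
    torch.zeros(1, device="cuda")  # initialize CUDA in the parent
    ctx = mp.get_context("fork")   # fork, not spawn
    p = ctx.Process(target=worker)
    p.start()
    p.join()
```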

If I directly call validate_run outside of train_run, I get the same error:

from dacapo import validate_run

validate_run("cosem_distance_run_4nm", 2000)
Creating FileConfigStore:
	path: /home/[email protected]/dacapo/configs
Creating local weights store in directory /home/[email protected]/dacapo
Retrieving weights for run cosem_distance_run_4nm, iteration 2000
Validating run cosem_distance_run_4nm at iteration 2000...
Creating FileStatsStore:
	path    : /home/[email protected]/dacapo/stats
Validating run cosem_distance_run_4nm on dataset jrc_hela-2_recon-1/labels/groundtruth/crop6/[mito]_gt_jrc_hela-2_recon-1/labels/groundtruth/crop6/mito_s1_uint8_None_4nm
validation inputs already copied!
Predicting with input size (2304, 2304, 2304), output size (848, 848, 848)
Total input ROI: [11272:13728, 872:3328, 11352:13808] (2456, 2456, 2456), output ROI: [12000:13000, 1600:2600, 12080:13080] (1000, 1000, 1000)
Running blockwise prediction with worker_file:  /home/[email protected]/dacapo-ml/dacapo/blockwise/predict_worker.py
Running blockwise with worker_file:  /home/[email protected]/dacapo-ml/dacapo/blockwise/predict_worker.py
Using compute context: LocalTorch(distribute_workers=False, _device=None, oom_limit=4.2)
ERROR:daisy.worker:worker (hostname=10.101.50.108:port=35859:task_id=predict_worker2024-09-25_16-08-03:worker_id=2) received exception: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method

Happy to provide a full stack trace if it helps.

I tried to fix this by explicitly setting the torch multiprocessing start method to "spawn", but that produced a different error, and I decided not to go further down that rabbit hole. I then worked around the original error by enabling distribute_workers in the LocalTorch compute context, which somehow fixes the issue. Sketches of both attempts are below.
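
The spawn attempt, roughly (set_start_method is the standard torch.multiprocessing call; the point is that it has to run before anything touches CUDA):

```python
import torch.multiprocessing as mp

if __name__ == "__main__":
    # Must run before CUDA is initialized anywhere in the process.
    mp.set_start_method("spawn", force=True)
    # ... then run training/validation as in cosem_example.ipynb
```

And the workaround that does work for me. The constructor argument is taken from the "Using compute context: LocalTorch(...)" line in the log above; the import path is from memory and may differ between dacapo versions (I actually set this through my dacapo config rather than constructing it in code):

```python
from dacapo.compute_context import LocalTorch

# distribute_workers=True launches prediction workers as separate processes
# instead of forking them from the CUDA-initialized parent.
context = LocalTorch(distribute_workers=True)
```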

To Reproduce

Just run cosem_example.ipynb on any local workstation with a GPU.

Versions:

  • OS: Ubuntu 22.04
  • CUDA Version: 12.2
  • GPUs: 3 × NVIDIA RTX A5000, 24 GB memory each
