Multiprocessing error during validation in LocalTorch compute context #298

@atc3

Description

Describe the bug

When running cosem_example.ipynb on a local workstation with GPUs, the validation step during training throws the following error:

...
RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
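
For what it's worth, the error itself reproduces outside of dacapo in a few lines of PyTorch. This is just my sketch of the underlying behavior (that the daisy workers are forked from a CUDA-initialized parent is an assumption on my part, based on the message):

```python
import multiprocessing as mp

import torch

def worker():
    # Any CUDA call in a forked child after the parent has initialized CUDA
    # raises: "Cannot re-initialize CUDA in forked subprocess. ..."
    torch.zeros(1, device="cuda")

if __name__ == "__main__":
    torch.zeros(1, device="cuda")  # initialize CUDA in the parent
    ctx = mp.get_context("fork")   # fork, not spawn
    p = ctx.Process(target=worker)
    p.start()
    p.join()
```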

If I directly call validate_run outside of train_run, I get the same error:

from dacapo import validate_run

validate_run("cosem_distance_run_4nm", 2000)
Creating FileConfigStore:
	path: /home/[email protected]/dacapo/configs
Creating local weights store in directory /home/[email protected]/dacapo
Retrieving weights for run cosem_distance_run_4nm, iteration 2000
Validating run cosem_distance_run_4nm at iteration 2000...
Creating FileStatsStore:
	path    : /home/[email protected]/dacapo/stats
Validating run cosem_distance_run_4nm on dataset jrc_hela-2_recon-1/labels/groundtruth/crop6/[mito]_gt_jrc_hela-2_recon-1/labels/groundtruth/crop6/mito_s1_uint8_None_4nm
validation inputs already copied!
Predicting with input size (2304, 2304, 2304), output size (848, 848, 848)
Total input ROI: [11272:13728, 872:3328, 11352:13808] (2456, 2456, 2456), output ROI: [12000:13000, 1600:2600, 12080:13080] (1000, 1000, 1000)
Running blockwise prediction with worker_file:  /home/[email protected]/dacapo-ml/dacapo/blockwise/predict_worker.py
Running blockwise with worker_file:  /home/[email protected]/dacapo-ml/dacapo/blockwise/predict_worker.py
Using compute context: LocalTorch(distribute_workers=False, _device=None, oom_limit=4.2)
ERROR:daisy.worker:worker (hostname=10.101.50.108:port=35859:task_id=predict_worker2024-09-25_16-08-03:worker_id=2) received exception: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method

Happy to provide a full stack trace if it helps.

I tried to fix this by explicitly setting the torch multiprocessing start method to "spawn", but that produced a different error, and I decided not to go further down that rabbit hole. I then worked around the original error by enabling distribute_workers in the LocalTorch compute context, which somehow fixes the issue. Sketches of both attempts are below.
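
The spawn attempt, roughly (set_start_method is the standard torch.multiprocessing call; the point is that it has to run before anything touches CUDA):

```python
import torch.multiprocessing as mp

if __name__ == "__main__":
    # Must run before CUDA is initialized anywhere in the process.
    mp.set_start_method("spawn", force=True)
    # ... then run training/validation as in cosem_example.ipynb
```

And the workaround that does work for me. The constructor argument is taken from the "Using compute context: LocalTorch(...)" line in the log above; the import path is from memory and may differ between dacapo versions (I actually set this through my dacapo config rather than constructing it in code):

```python
from dacapo.compute_context import LocalTorch

# distribute_workers=True launches prediction workers as separate processes
# instead of forking them from the CUDA-initialized parent.
context = LocalTorch(distribute_workers=True)
```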

To Reproduce

Just run cosem_example.ipynb on any local workstation with a GPU.

Versions:

  • OS: Ubuntu 22.04
  • CUDA Version: 12.2
  • GPUs: 3 × NVIDIA RTX A5000, 24 GB memory each
