RuntimeError during train_npe_model

Hello Devs,

I have been trying to get a toy example of cryoSBI to run, but I am experiencing crashes that I cannot make sense of. I am not sure whether this is really a bug, but I am hoping that you can help me understand why the program is crashing. The training runs for many epochs (i.e., many hours) but eventually crashes. The full output looks like this:

```
Training neural netowrk:
 65%|██████████████████████████████████████████████████████████████████████████▍                                       | 98/150 [18:47:34<9:58:18, 690.35s/epoch, loss=-1.01]
Traceback (most recent call last):
  File "/home/paq/programs/micromamba/envs/cryoSBI/bin/train_npe_model", line 8, in <module>
    sys.exit(cl_npe_train_no_saving())
  File "/home/paq/programs/micromamba/envs/cryoSBI/lib/python3.10/site-packages/cryo_sbi/inference/command_line_tools.py", line 50, in cl_npe_train_no_saving
    npe_train_no_saving(
  File "/home/paq/programs/micromamba/envs/cryoSBI/lib/python3.10/site-packages/cryo_sbi/inference/train_npe_model.py", line 159, in npe_train_no_saving
    step(
  File "/home/paq/programs/micromamba/envs/cryoSBI/lib/python3.10/site-packages/lampe/utils.py", line 36, in __call__
    if loss.isfinite().all():
RuntimeError: CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```
Have you encountered this error before? Can you give me any hints as to what went wrong?
________________________________________________________________________________

My input was:
```
export CUDA_VISIBLE_DEVICES=1
train_npe_model \
    --image_config_file 2_simulation/_sim_config.json \
    --train_config_file 3_train/_train_config.json \
    --epochs 150 \
    --estimator_file 3_train/posterior.estimator \
    --loss_file 3_train/posterior.loss \
    --n_workers 4 \
    --simulation_batch_size 5120 \
    --train_device cuda
```
I am running the program on an NVIDIA GeForce RTX 4080, with CUDA 12.3. 
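
As the error message suggests, a rerun with `CUDA_LAUNCH_BLOCKING=1` should make the traceback point at the kernel that actually fails rather than at a later API call. Below is a minimal sketch of such a rerun, assuming the same arguments as in the command above. The Python wrapper and the `debug_env` helper are purely illustrative and not part of cryoSBI:

```python
import os
import shutil
import subprocess

# Same arguments as the original command; only the environment changes.
TRAIN_CMD = [
    "train_npe_model",
    "--image_config_file", "2_simulation/_sim_config.json",
    "--train_config_file", "3_train/_train_config.json",
    "--epochs", "150",
    "--estimator_file", "3_train/posterior.estimator",
    "--loss_file", "3_train/posterior.loss",
    "--n_workers", "4",
    "--simulation_batch_size", "5120",
    "--train_device", "cuda",
]


def debug_env(base_env=None):
    """Hypothetical helper: copy the environment and enable synchronous CUDA launches."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = "1"  # same GPU as in the original run
    env["CUDA_LAUNCH_BLOCKING"] = "1"  # report kernel errors at the failing call
    return env


if __name__ == "__main__":
    if shutil.which(TRAIN_CMD[0]) is None:
        print("train_npe_model is not on PATH; activate the cryoSBI environment first")
    else:
        # A non-zero exit code is printed instead of raising an exception.
        result = subprocess.run(TRAIN_CMD, env=debug_env())
        print("train_npe_model exit code:", result.returncode)
```

Synchronous launches usually slow training down, so reaching the failing epoch may take longer than the roughly 19 hours of the original run.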

________________________________________________________________________________
If you need any additional information, or if I can help with any further debugging, please let me know.

Best Regards,
Patrick
