RuntimeError during train_npe_model #66

@Clownshift

Hello Devs,

I have been trying to get a toy example of cryoSBI to run, but I am experiencing crashes that I cannot make sense of. I am not sure whether this is really a bug, but I hope you can help me understand why the program is crashing. The training runs for many epochs (i.e., many hours) but eventually crashes; in the run below it failed at epoch 98 of 150, after almost 19 hours. The full output looks like this:

Training neural netowrk:
 65%|██████████████████████████████████████████████████████████████████████████▍                                       | 98/150 [18:47:34<9:58:18, 690.35s/epoch, loss=-1.01]
Traceback (most recent call last):
  File "/home/paq/programs/micromamba/envs/cryoSBI/bin/train_npe_model", line 8, in <module>
    sys.exit(cl_npe_train_no_saving())
  File "/home/paq/programs/micromamba/envs/cryoSBI/lib/python3.10/site-packages/cryo_sbi/inference/command_line_tools.py", line 50, in cl_npe_train_no_saving
    npe_train_no_saving(
  File "/home/paq/programs/micromamba/envs/cryoSBI/lib/python3.10/site-packages/cryo_sbi/inference/train_npe_model.py", line 159, in npe_train_no_saving
    step(
  File "/home/paq/programs/micromamba/envs/cryoSBI/lib/python3.10/site-packages/lampe/utils.py", line 36, in __call__
    if loss.isfinite().all():
RuntimeError: CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Have you encountered this error before? Can you give me any hints as to what went wrong?


My input was:

export CUDA_VISIBLE_DEVICES=1
train_npe_model \
    --image_config_file 2_simulation/_sim_config.json \
    --train_config_file 3_train/_train_config.json \
    --epochs 150 \
    --estimator_file 3_train/posterior.estimator \
    --loss_file 3_train/posterior.loss \
    --n_workers 4 \
    --simulation_batch_size 5120 \
    --train_device cuda

I am running the program on an NVIDIA GeForce RTX 4080, with CUDA 12.3.
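
Following the hint in the error message, a rerun with synchronous kernel launches could look like the sketch below. It reuses the same flags as my command above and only adds `CUDA_LAUNCH_BLOCKING=1`, so that the traceback should point at the call that actually fails. The log file name `3_train/debug_run.log` is a hypothetical choice, and I have not yet confirmed that this changes the outcome. Synchronous launches will probably slow training down.

```shell
# Select the same GPU as in the original run.
export CUDA_VISIBLE_DEVICES=1
# Make CUDA kernel launches synchronous, as suggested by the error message,
# so errors are reported at the call that caused them.
export CUDA_LAUNCH_BLOCKING=1

# Same command as before; output is also written to a log file
# (3_train/debug_run.log is a hypothetical name, not part of cryoSBI).
if command -v train_npe_model >/dev/null 2>&1; then
    train_npe_model \
        --image_config_file 2_simulation/_sim_config.json \
        --train_config_file 3_train/_train_config.json \
        --epochs 150 \
        --estimator_file 3_train/posterior.estimator \
        --loss_file 3_train/posterior.loss \
        --n_workers 4 \
        --simulation_batch_size 5120 \
        --train_device cuda 2>&1 | tee 3_train/debug_run.log
else
    echo "train_npe_model not found on PATH; activate the cryoSBI environment first" >&2
fi
```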


If you need any additional information, or if I can help with debugging, please let me know.

Best Regards,
Patrick
