Hello Devs,
I have been trying to get a toy example of cryoSBI to run, but I am experiencing crashes that I cannot make sense of. I am not sure whether this is really a bug, but I am hoping you can help me understand why the program is crashing. Training runs for many epochs (i.e., many hours) but eventually crashes. The whole output looks like this:
Training neural netowrk:
65%|██████████████████████████████████████████████████████████████████████████▍ | 98/150 [18:47:34<9:58:18, 690.35s/epoch, loss=-1.01]
Traceback (most recent call last):
  File "/home/paq/programs/micromamba/envs/cryoSBI/bin/train_npe_model", line 8, in <module>
    sys.exit(cl_npe_train_no_saving())
  File "/home/paq/programs/micromamba/envs/cryoSBI/lib/python3.10/site-packages/cryo_sbi/inference/command_line_tools.py", line 50, in cl_npe_train_no_saving
    npe_train_no_saving(
  File "/home/paq/programs/micromamba/envs/cryoSBI/lib/python3.10/site-packages/cryo_sbi/inference/train_npe_model.py", line 159, in npe_train_no_saving
    step(
  File "/home/paq/programs/micromamba/envs/cryoSBI/lib/python3.10/site-packages/lampe/utils.py", line 36, in __call__
    if loss.isfinite().all():
RuntimeError: CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Have you encountered this error before? Can you give me any hints as to what went wrong?
My input was:
export CUDA_VISIBLE_DEVICES=1
train_npe_model \
--image_config_file 2_simulation/_sim_config.json \
--train_config_file 3_train/_train_config.json \
--epochs 150 \
--estimator_file 3_train/posterior.estimator \
--loss_file 3_train/posterior.loss \
--n_workers 4 \
--simulation_batch_size 5120 \
--train_device cuda
I am running the program on an NVIDIA GeForce RTX 4080, with CUDA 12.3.
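Following the hint in the error message, a rerun with `CUDA_LAUNCH_BLOCKING=1` might yield a stack trace that points at the call that actually failed, probably at the cost of slower training. Below is a minimal sketch of such a rerun as a small Python wrapper around the same command. The wrapper, `build_debug_env`, and `COMMAND` are illustrative names of my own and not part of cryoSBI; the arguments are simply copied from the command above.

```python
import os
import shutil
import subprocess


def build_debug_env(base_env=None):
    """Return a copy of the environment with synchronous CUDA launches enabled."""
    env = dict(os.environ if base_env is None else base_env)
    # Make CUDA kernel launches synchronous so errors surface at the failing call.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    # Same GPU selection as in the original run.
    env["CUDA_VISIBLE_DEVICES"] = "1"
    return env


# Same arguments as the original invocation (hypothetical wrapper, not a cryoSBI API).
COMMAND = [
    "train_npe_model",
    "--image_config_file", "2_simulation/_sim_config.json",
    "--train_config_file", "3_train/_train_config.json",
    "--epochs", "150",
    "--estimator_file", "3_train/posterior.estimator",
    "--loss_file", "3_train/posterior.loss",
    "--n_workers", "4",
    "--simulation_batch_size", "5120",
    "--train_device", "cuda",
]

if __name__ == "__main__":
    if shutil.which(COMMAND[0]) is None:
        # The cryoSBI environment is not active, so there is nothing to run.
        print(f"{COMMAND[0]} not found on PATH; activate the cryoSBI environment first.")
    else:
        subprocess.run(COMMAND, env=build_debug_env(), check=True)
```

The same effect should be achievable in the shell by adding `export CUDA_LAUNCH_BLOCKING=1` before the original command.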
If you need any additional information, or if I can help with debugging, please let me know.
Best Regards,
Patrick