Open
Description
With the latest versions of EDDL (1.2.0) and ECVL (1.1.0), I get a CUDA error when training the model using a single GPU. I have no problems when using 2 or 4 GPUs. The error occurs systematically at the beginning of the third epoch and does not seem to depend on the batch size. It does not depend on the memory consumption parameter (“full_mem”, “mid_mem” or “low_mem”), I tried all of them. The GPU is a nVidia V100. With previous versions of the libraries, this error did not occur (but I was using a different GPU).
.Traceback (most recent call last):
File "C01_2_rec_mod_edll.py", line 98, in <module>
fire.Fire({
File "/root/miniconda3/envs/eddl/lib/python3.8/site-packages/fire/core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/root/miniconda3/envs/eddl/lib/python3.8/site-packages/fire/core.py", line 466, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/root/miniconda3/envs/eddl/lib/python3.8/site-packages/fire/core.py", line 681, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "C01_2_rec_mod_edll.py", line 46, in train
rec_mod.train()
File "/mnt/datasets/uc5/UC5_pipeline_forked/src/eddl_lib/recurrent_module.py", line 289, in train
eddl.train_batch(rnn, [cnn_visual, thresholded], [Y])
File "/root/miniconda3/envs/eddl/lib/python3.8/site-packages/pyeddl/eddl.py", line 435, in train_batch
return _eddl.train_batch(net, in_, out)
RuntimeError: [CUDA ERROR]: invalid argument (1) raised in delete_tensor | (check_cuda)
The code is not yet available on the repository, please let me know what details I can add.
Metadata
Assignees
Labels
No labels