I was running the code on A6000 GPUs with distributed training, using a large-scale outdoor point cloud dataset. With `batch_size` set to 2 per GPU, the error below occurred at the 25th epoch; after increasing `batch_size` to 3, it occurred at the 63rd epoch.
"/mnt/sdb/anaconda3/envs/pointcept/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/mnt/sdb/anaconda3/envs/pointcept/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/mnt/sdb/anaconda3/envs/pointcept/lib/python3.8/site-packages/torchsparse-2.1.0-py3.8-linux-x86_64.egg/torchsparse/nn/modules/conv.py", line 98, in forward
return F.conv3d(
File "/mnt/sdb/anaconda3/envs/pointcept/lib/python3.8/site-packages/torchsparse-2.1.0-py3.8-linux-x86_64.egg/torchsparse/nn/functional/conv/conv.py", line 92, in conv3d
kmap = F.build_kernel_map(
File "/mnt/sdb/anaconda3/envs/pointcept/lib/python3.8/site-packages/torchsparse-2.1.0-py3.8-linux-x86_64.egg/torchsparse/nn/functional/conv/kmap/build_kmap.py", line 193, in build_kernel_map
out_in_map_bwd = F.convert_transposed_out_in_map(
File "/mnt/sdb/anaconda3/envs/pointcept/lib/python3.8/site-packages/torchsparse-2.1.0-py3.8-linux-x86_64.egg/torchsparse/nn/functional/conv/hash/query.py", line 48, in convert_transposed_out_in_map
out_in_map_t = torch.full(
RuntimeError: CUDA error: invalid configuration argument
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
Originally posted by @lihc-cz in #16 (comment)
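
As the message at the bottom of the trace suggests, a first step is to rerun with synchronous kernel launches so the Python traceback points at the kernel that actually failed. A minimal sketch, assuming the training entry point is an ordinary Python script (the variable must be set before `torch` is imported, e.g. at the very top of the script):

```python
# Debugging sketch: force synchronous CUDA kernel launches so errors are
# raised at the failing call site instead of at a later, unrelated API call.
import os

os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # must run before CUDA is initialized

import torch  # noqa: E402  (deliberately imported after setting the variable)
```

"invalid configuration argument" is a kernel-launch error, so one hypothesis (an assumption on my part, not something the trace confirms) is that the size reaching `torch.full` in `convert_transposed_out_in_map` is degenerate for this batch, e.g. negative or implausibly large. A hedged way to check without editing torchsparse is to wrap `torch.full` and log the sizes passed to it:

```python
import torch

_orig_full = torch.full

def _logged_full(size, fill_value, **kwargs):
    # Print every CUDA torch.full size; a negative or huge dimension here
    # would explain an invalid kernel launch configuration.
    device = kwargs.get("device")
    if device is not None and "cuda" in str(device):
        print("torch.full size:", size, flush=True)
    return _orig_full(size, fill_value, **kwargs)

torch.full = _logged_full  # apply before training starts
```

Note that blocking launches slow training down considerably, so both snippets are only for reproducing the crash, not for regular runs.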