Question about the loss function #12

@stepwhal

Description

Hi, thanks for sharing the codebase for your work. I am trying to train the network on custom data, but training fails with the following error:

```
/pytorch/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [1,0,0], thread: [46,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [1,0,0], thread: [47,0,0] Assertion `input_val >= zero && input_val <= one` failed.
[... the same assertion repeats for threads [48,0,0] through [62,0,0] ...]
/pytorch/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [1,0,0], thread: [63,0,0] Assertion `input_val >= zero && input_val <= one` failed.
  0%| | 1/15356 [00:01<5:05:31, 1.19s/it]
Traceback (most recent call last):
  File "/media/home/C/NgeNet/train.py", line 224, in <module>
    main()
  File "/media/home/C/NgeNet/train.py", line 124, in main
    loss_dict = model_loss(coords_src=coords_src,
  File "/home/anaconda3/envs/Negnet/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/media/home/C/NgeNet/losses/loss.py", line 130, in forward
    overlap_loss_v = 0.5 * self.overlap_loss(ol_scores_src, ol_gt_src) +
  File "/media/home/C/NgeNet/losses/loss.py", line 65, in overlap_loss
    weights[ol_gt > 0.5] = 1 - ratio
RuntimeError: CUDA error: device-side assert triggered

Process finished with exit code 1
```
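For context, this assertion comes from `binary_cross_entropy`, which requires every input value to be a probability in [0, 1]; a NaN (or any out-of-range score) fails the check. Since device-side asserts are reported asynchronously, the traceback above may not point at the exact offending line; rerunning with `CUDA_LAUNCH_BLOCKING=1` gives an accurate location. A minimal sketch (my own illustration, not NgeNet code) that reproduces the same assert:

```python
import torch
import torch.nn.functional as F

# binary_cross_entropy requires its input to lie in [0, 1]; a NaN fails
# the `input_val >= zero && input_val <= one` check on the GPU.
scores = torch.tensor([0.3, float('nan')], device='cuda')
targets = torch.tensor([0.0, 1.0], device='cuda')
loss = F.binary_cross_entropy(scores, targets)
loss.item()  # synchronizes -> RuntimeError: CUDA error: device-side assert triggered
```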

It seems that during training the weight parameters grow too large, which makes the variable `q_feats_local` blow up; the leaky_relu output then becomes NaN, the NaN score reaches the BCE loss (whose input must lie in [0, 1]), and the loss computation and back-propagation fail with the assert above. Have you ever encountered this situation in your experiments?
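In case it is useful, here is a sketch of the workarounds I am trying (names follow the traceback above; whether sanitizing the scores or clipping gradients is the right fix depends on the root cause):

```python
import torch

def sanitize_scores(scores: torch.Tensor) -> torch.Tensor:
    """Replace NaNs and clamp into [0, 1] so the BCE range assert
    cannot fire. A band-aid, not a root-cause fix."""
    scores = torch.where(torch.isnan(scores), torch.zeros_like(scores), scores)
    return scores.clamp(0.0, 1.0)

# Hypothetical use inside the training step, with gradient clipping to
# keep the weights (and hence q_feats_local) from exploding:
#   loss = criterion(sanitize_scores(ol_scores_src), ol_gt_src)
#   loss.backward()
#   torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
#   optimizer.step()
```

Alternatively, if the overlap head can output raw logits, `torch.nn.BCEWithLogitsLoss` is more numerically stable than a sigmoid followed by `binary_cross_entropy`.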
