I have found that using -np.inf in the inter-attention module (the attend part) often leads to NaN loss values, even with gradient clipping or very low learning rates. Replacing it with a large negative value like -1e18 fixes the problem in my case.
Could this be caused by an error in the masking before the attention scores are computed?
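For reference, the usual way this happens is when the mask ends up excluding every position in a row: softmax over a row that is entirely -inf is 0/0 and returns NaN, which then propagates into the loss no matter how aggressively you clip gradients or lower the learning rate. Here is a minimal sketch of that failure mode (the tensor shapes and the `mask` convention are illustrative, not taken from the actual module):

```python
import torch
import torch.nn.functional as F

# Toy attention scores: one query attending over 4 keys.
scores = torch.randn(1, 4)

# Suppose a buggy padding mask ends up masking out *every* key
# for this query (convention here: True means "keep").
mask = torch.zeros(1, 4, dtype=torch.bool)

# With -inf: softmax over an all -inf row is 0/0 -> NaN,
# and the NaN propagates into the loss and the gradients.
print(F.softmax(scores.masked_fill(~mask, float("-inf")), dim=-1))
# tensor([[nan, nan, nan, nan]])

# With a large finite negative value: the same row degenerates to a
# uniform distribution instead of NaN, so training keeps running.
print(F.softmax(scores.masked_fill(~mask, -1e18), dim=-1))
# tensor([[0.2500, 0.2500, 0.2500, 0.2500]])
```

If that is what is happening, switching to -1e18 only hides the symptom; the underlying bug would be the mask (or the sequence lengths feeding it) producing fully-masked rows. Note also that -1e18 overflows to -inf in float16, so a smaller value like -1e4 is safer if you ever train in half precision.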