I have found that using -np.inf in the inter-attention module (the attend part) often leads to NaN loss values, even with gradient clipping or very low learning rates. Replacing it with a large negative value like -1e18 fixes the problem in my case.
Could this be caused by an error in the masking before the attention scores are computed?
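For reference, the usual way this happens is when the mask ends up excluding every position in a row: softmax over a row that is entirely -inf is 0/0 and returns NaN, which then propagates into the loss no matter how aggressively you clip gradients or lower the learning rate. Here is a minimal sketch of that failure mode (the tensor shapes and the `mask` convention are illustrative, not taken from the actual module):

```python
import torch
import torch.nn.functional as F

# Toy attention scores: one query attending over 4 keys.
scores = torch.randn(1, 4)

# Suppose a buggy padding mask ends up masking out *every* key
# for this query (convention here: True means "keep").
mask = torch.zeros(1, 4, dtype=torch.bool)

# With -inf: softmax over an all -inf row is 0/0 -> NaN,
# and the NaN propagates into the loss and the gradients.
print(F.softmax(scores.masked_fill(~mask, float("-inf")), dim=-1))
# tensor([[nan, nan, nan, nan]])

# With a large finite negative value: the same row degenerates to a
# uniform distribution instead of NaN, so training keeps running.
print(F.softmax(scores.masked_fill(~mask, -1e18), dim=-1))
# tensor([[0.2500, 0.2500, 0.2500, 0.2500]])
```

If that is what is happening, switching to -1e18 only hides the symptom; the underlying bug would be the mask (or the sequence lengths feeding it) producing fully-masked rows. Note also that -1e18 overflows to -inf in float16, so a smaller value like -1e4 is safer if you ever train in half precision.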