I'm getting a lot of NaN values in the loss. The training script skips the optimizer step whenever a NaN appears in the loss tensor. What causes these NaN values?
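
For reference, the skip-on-NaN logic I mean looks roughly like this (a minimal sketch assuming PyTorch; `model`, `optimizer`, and `loss_fn` are placeholders for the actual objects in the script):

```python
import torch

def train_step(model, optimizer, loss_fn, inputs, targets):
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)

    # Skip the optimizer step when the loss is NaN (or Inf),
    # so one bad batch does not corrupt the weights.
    if not torch.isfinite(loss):
        return None

    loss.backward()
    optimizer.step()
    return loss.item()
```

The guard keeps training alive, but the underlying question remains: why does the loss become NaN in the first place?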