@aguevara22
Hi Ziming!

I noticed that using `update_grid_from_samples`, which really improves training on CPU, leads to NaNs in the network weights and gradients when run on CUDA. The problem seems to come from the fit of the B-spline coefficients in spline.py:

```python
coef = torch.linalg.lstsq(mat.to(device), y_eval.unsqueeze(dim=2).to(device),
                          driver='gelsy' if device == 'cpu' else 'gels').solution[:, :, 0]
```

Here `mat` holds the B-spline basis functions evaluated at the samples, and depending on the samples it may not be full rank. The `gels` driver (the only one available on CUDA) cannot handle rank-deficient matrices, while `gelsy` can, so I just moved that operation to the CPU, which makes `gelsy` available and handles the degenerate case. Perhaps there is a better solution, but I'm committing it just in case, since it worked for me on both CUDA and MPS.
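For reference, a minimal sketch of the workaround described above. `fit_coef_cpu` is a hypothetical helper name, not from the pykan source; it runs the least-squares solve on CPU with the rank-revealing `gelsy` driver and then moves the coefficients back to the original device:

```python
import torch

def fit_coef_cpu(mat, y_eval, device):
    # Hypothetical helper illustrating the workaround: solve the
    # least-squares fit on CPU with 'gelsy', which tolerates
    # rank-deficient B-spline matrices, then move the result to `device`.
    # mat: (batch, n_samples, n_coef), y_eval: (batch, n_samples)
    sol = torch.linalg.lstsq(
        mat.cpu(), y_eval.unsqueeze(dim=2).cpu(), driver='gelsy'
    ).solution[:, :, 0]
    return sol.to(device)
```

Even when `mat` has duplicated columns (exactly rank-deficient), `gelsy` returns a finite minimum-norm-style solution instead of the NaNs that `gels` produces on CUDA.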
