Open
Description
We're seeing this warning about 5 minutes into training:
WARNING:distributed_shampoo.utils.matrix_functions:Failed to compute eigendecomposition in torch.float32 precision with exception linalg.eigh: The algorithm failed to converge because the input matrix is ill-conditioned or has too many repeated eigenvalues (error code: 1).! Retrying in double precision...
Any ideas on how we can fix or avoid this?
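For context, here is a rough sketch of the retry pattern the warning seems to describe, written against plain PyTorch. It is not the library's actual code: `eigh_with_fallback` and its `epsilon` argument are hypothetical names made up for illustration. The finite-value check and the diagonal shift are guesses at things worth ruling out or trying, not a confirmed fix.

```python
import torch


def eigh_with_fallback(matrix: torch.Tensor, epsilon: float = 0.0):
    """Hypothetical helper: symmetric eigendecomposition with a float64 retry."""
    # NaN/Inf entries are one thing worth ruling out before blaming conditioning.
    if not torch.isfinite(matrix).all():
        raise ValueError("matrix contains NaN or Inf entries")

    # Symmetrize to remove small asymmetries from accumulated rounding error.
    matrix = 0.5 * (matrix + matrix.T)

    if epsilon > 0.0:
        # Assumed mitigation: a diagonal shift raises every eigenvalue by epsilon.
        eye = torch.eye(matrix.shape[0], dtype=matrix.dtype, device=matrix.device)
        matrix = matrix + epsilon * eye

    try:
        # First attempt in the input precision (e.g. float32).
        return torch.linalg.eigh(matrix)
    except RuntimeError:
        # torch.linalg.eigh raises on non-convergence; retry in float64
        # and cast the results back to the original dtype.
        eigenvalues, eigenvectors = torch.linalg.eigh(matrix.double())
        return eigenvalues.to(matrix.dtype), eigenvectors.to(matrix.dtype)


if __name__ == "__main__":
    a = torch.tensor([[2.0, 0.0], [0.0, 3.0]])
    eigenvalues, eigenvectors = eigh_with_fallback(a, epsilon=1e-6)
    print(eigenvalues)
```

If the float64 retry succeeds, the warning may be harmless apart from the extra compute, but that is an assumption on our side and we would appreciate confirmation.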