Aimnet2 failing when training with forces #390

@chrisiacovella

@MarshallYan ran into an issue with aimnet2 failing during training: the checks in the representation function raise an error about NaN values in the embedding. We tracked it down to including forces in the loss function. Interestingly, the loss weight for the forces was set to zero, so the forces' contribution to the loss does not seem to be the direct cause; more likely it is the computation of the forces themselves.

I'll note that changing learning rate or the gradient clipping had no impact on this.
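One possible explanation for why a zero loss weight does not help (an assumption on my part, not something confirmed against the aimnet2 code): in the backward pass a zero upstream gradient multiplied by an infinite local derivative gives NaN. The toy example below uses `torch.sqrt` at zero as a stand-in; `zero_weighted_grad` is a hypothetical name used only for illustration.

```python
import torch

def zero_weighted_grad():
    # Toy input that sits exactly at zero (stand-in for a zero vector feature).
    x = torch.zeros(1, requires_grad=True)
    # The derivative of sqrt(x) is 1 / (2 * sqrt(x)), which is infinite at x = 0.
    term = torch.sqrt(x)
    # Weighting the term by zero makes the forward value exactly 0 ...
    loss = 0.0 * term.sum()
    # ... but the backward pass still evaluates 0 / 0 for the sqrt derivative.
    loss.backward()
    return loss.item(), x.grad

loss_value, grad = zero_weighted_grad()
print(loss_value, grad)  # the loss is 0.0, the gradient is NaN
```

This would also be consistent with the learning rate and gradient clipping having no effect, since both act only after the gradients have been computed.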

After a bit of investigation, I think I tracked it down to where we take the norm over the vector components in the interaction module.

Line 437:

        # Compute the norm over the last dimension (vector components)
        vector_contributions = torch.norm(
            avf_v_sum, dim=-1
        )  # Shape: (number_of_atoms, H)

Adding a small epsilon value (e.g., 1e-8) to avf_v_sum appears to fix this, which suggests the underlying problem is a division by zero. I'm going to do some more testing before pushing this as the fix, but so far all tests have been promising.
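A minimal sketch of the epsilon workaround in a force-training setting, for illustration only and not the actual aimnet2 code. Only `avf_v_sum` and `vector_contributions` come from the snippet above; `norm_with_eps`, `force_loss_grads`, the toy `positions` and `weight` tensors, the shapes, and the use of double precision are all assumptions.

```python
import torch

def norm_with_eps(v, eps=1e-8):
    # Hypothetical helper: shift the vector slightly away from zero before
    # taking the norm over the last dimension (vector components).
    return torch.norm(v + eps, dim=-1)

def force_loss_grads(eps=1e-8):
    # Toy stand-ins (assumptions): 4 atoms at the origin and one learnable
    # weight vector, chosen so that avf_v_sum is exactly zero.
    positions = torch.zeros(4, 3, dtype=torch.float64, requires_grad=True)
    weight = torch.ones(3, dtype=torch.float64, requires_grad=True)

    avf_v_sum = (positions * weight).unsqueeze(1).expand(-1, 2, -1)  # (atoms, H, 3)
    vector_contributions = norm_with_eps(avf_v_sum, eps)  # (atoms, H)
    energy = vector_contributions.sum()

    # Forces are the negative gradient of the energy with respect to positions.
    # create_graph=True keeps the graph so a loss on the forces can be
    # backpropagated to the parameters (a second-order derivative).
    forces = -torch.autograd.grad(energy, positions, create_graph=True)[0]
    loss = (forces ** 2).sum()
    loss.backward()
    return weight.grad, positions.grad

weight_grad, position_grad = force_loss_grads()
print(torch.isfinite(weight_grad).all(), torch.isfinite(position_grad).all())
```

Note that shifting the vector this way means a zero vector reports a norm of roughly sqrt(3) * eps instead of exactly zero, which should be negligible at eps = 1e-8 but is worth keeping in mind when testing.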
