Aimnet2 failing when training with forces #390

@MarshallYan ran into an issue with aimnet2 failing during training: the checks in the representation function raise an error about NaN in the embedding. We tracked it down to including forces in the loss function. Interestingly, the loss weight for the forces was set to zero, so the problem does not seem to be the forces' contribution to the loss itself, but more likely the computation of the forces themselves.

I'll note that changing the learning rate or the gradient clipping had no impact on this.

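On the zero force weight: a minimal sketch of why a zero weight would not necessarily hide a bad force term, assuming the total loss is a weighted sum of per-term losses. The names `energy_loss`, `force_loss`, and `weights` are hypothetical and not taken from the code base; whether the NaN here first shows up in the forces or in their gradients is not confirmed.

```python
import math

# Hypothetical per-term losses and weights; the names are illustrative only.
energy_loss = 0.25
force_loss = float("nan")  # stands in for a force term that has gone NaN
weights = {"energy": 1.0, "force": 0.0}

# Weighted sum of the two terms, with the force term weighted by zero.
total_loss = weights["energy"] * energy_loss + weights["force"] * force_loss

# IEEE 754 defines 0.0 * NaN as NaN, so the zero weight does not mask it.
print(math.isnan(total_loss))
```

The same arithmetic applies to gradients, so a NaN produced while differentiating the forces could still reach the parameters even though the force term is weighted by zero.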

After a bit of investigation, I think I tracked it down to where we take the norm over the vector components in the interaction module.

Line 437:
```
        # Compute the norm over the last dimension (vector components)
        vector_contributions = torch.norm(
            avf_v_sum, dim=-1
        )  # Shape: (number_of_atoms, H)
```

Adding a small epsilon (e.g., 1e-8) to `avf_v_sum` appears to fix the issue, which suggests a division by zero, most likely in the gradient of the norm when all vector components are zero. I'm going to do some more testing before pushing this as the fix, but so far all tests have been promising.

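A minimal sketch of the suspected failure mode and one way to stabilise it, under a few assumptions. The norm is written out as `sqrt(sum(v^2))` for illustration, and `safe_norm` is a hypothetical helper, not part of the code base. It places the epsilon inside the square root, which is a variant of the fix described above (adding epsilon to `avf_v_sum` directly). How `torch.norm` itself treats a zero vector in its backward pass may depend on the PyTorch version and was not checked here.

```python
import torch


def safe_norm(v: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """L2 norm over the last dimension with eps inside the square root.

    Hypothetical helper for illustration, not part of the code base.
    """
    return torch.sqrt(torch.sum(v * v, dim=-1) + eps)


# Stand-in for avf_v_sum with shape (number_of_atoms, H, 3). All-zero vector
# components mimic the case suspected of triggering the NaN.
avf_v_sum = torch.zeros(2, 4, 3, requires_grad=True)

# Norm written out explicitly: the backward pass evaluates 1 / (2 * sqrt(0)),
# which is inf, and multiplies it by 2 * v = 0, which gives NaN.
naive = torch.sqrt(torch.sum(avf_v_sum * avf_v_sum, dim=-1))
(naive_grad,) = torch.autograd.grad(naive.sum(), avf_v_sum)

# Same input through the eps-stabilised norm: the gradient stays finite.
safe = safe_norm(avf_v_sum)
(safe_grad,) = torch.autograd.grad(safe.sum(), avf_v_sum)

print(torch.isnan(naive_grad).any().item())
print(torch.isfinite(safe_grad).all().item())
```

With the epsilon inside the square root the result is biased by at most `sqrt(eps)` for very small vectors, which is one reason to keep `eps` small. Either placement of the epsilon would need to be checked against the existing tests.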