
Conversation

@forklady42
Collaborator

In an early training run on the original grid data, I ran into NaN losses.

After adding gradient logging locally, I found major spikes in the gradient norms, which suggests exploding gradients are the likely cause of the NaNs.
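The logging boils down to tracking the global gradient norm each step and flagging outliers. A minimal sketch in plain Python (a stand-in for the actual PyTorch hooks; `global_grad_norm`, `check_for_spike`, and the threshold are illustrative, not the names used in the repo):

```python
import math

def global_grad_norm(grads):
    """Global L2 norm across all parameter gradients.

    Each entry in `grads` is one parameter's gradient, flattened to a
    list of floats (what you'd get from grad.flatten().tolist() in torch).
    """
    return math.sqrt(sum(g * g for grad in grads for g in grad))

def check_for_spike(norm, threshold=100.0):
    """Flag gradient norms large enough to risk overflowing into NaN losses."""
    return norm > threshold
```

In practice this kind of check lives in a training hook (e.g. Lightning's `on_after_backward`) so every optimizer step gets logged.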

(Screenshots: gradient logs showing the spikes, captured Dec 10, 2025 at 1:41 PM and 2:09 PM.)

Indeed, after adding gradient clipping to avoid the large spikes, I ran 10 epochs on the original gridded data without any NaN losses. The clip value will be another hyperparameter for us to tune.

I suspect smarter weight initialization, learning rate adjustments, and revisiting normalization will help stabilize the gradients, but in the meantime, gradient clipping will prevent them from spiraling out of control.
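For anyone unfamiliar with what the clipping does: when the global gradient norm exceeds the clip value, all gradients are rescaled so the norm equals the clip value, leaving their direction unchanged. A small illustrative sketch (plain Python mirroring the behavior of `torch.nn.utils.clip_grad_norm_`; the function name is mine):

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale gradients so their global L2 norm is at most max_norm.

    `grads` is a list of per-parameter gradients, each a flat list of
    floats. Returns (possibly rescaled gradients, pre-clip global norm).
    """
    total = math.sqrt(sum(g * g for grad in grads for g in grad))
    if total > max_norm:
        scale = max_norm / total  # shrink uniformly; direction is preserved
        grads = [[g * scale for g in grad] for grad in grads]
    return grads, total
```

A spike that would have sent the loss to NaN is reduced to a bounded update, which is why the 10-epoch run above stayed finite.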

Note: I'm basing this PR on hanaol/setup-ptlightning since this configuration change is being passed into the Lightning trainer. However, there's no need to block merging #26 on this PR. I can rebase onto main once the PyTorch Lightning refactor is merged.

@forklady42 forklady42 requested a review from hanaol December 10, 2025 19:21
@hanaol hanaol force-pushed the hanaol/setup-ptlightning branch from d02539c to d11a763 Compare December 11, 2025 17:38
@forklady42 forklady42 changed the base branch from hanaol/setup-ptlightning to main December 11, 2025 21:02
Collaborator

@hanaol hanaol left a comment


@forklady42, what `gradient_clip_val` did you use in your runs? I'll add it to the configuration file and include it in this PR.

@forklady42
Collaborator Author

I used 1.0, the same as the default here, but honestly I think that's more aggressive than necessary. Let's start with 20.0; that should head off any particularly large gradients without being overly restrictive.

I'll go ahead and add it to the MP config.
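For reference, the knob in question is the Lightning Trainer's `gradient_clip_val` argument. A sketch of the change under discussion (the `max_epochs` value and explicit `gradient_clip_algorithm` are illustrative; "norm" is Lightning's default):

```python
import pytorch_lightning as pl

# Clip gradients by global L2 norm at 20.0, per the discussion above.
# In this repo the value would come from the config file rather than
# being hard-coded here.
trainer = pl.Trainer(
    max_epochs=10,
    gradient_clip_val=20.0,
    gradient_clip_algorithm="norm",
)
```

Setting `gradient_clip_val=None` (the default) disables clipping entirely, so the config entry is easy to sweep during tuning.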
