
Detect vanishing gradients #55

Open
@ddobbelaere

Description


It might be desirable to monitor/detect vanishing gradients during training. Note that "gradient" here of course means the stochastic gradient, as estimated from the training samples used in the current epoch (the current batch size may be too small to excite all king/piece positions, so preferably track the mean or max absolute value over a window of multiple epochs).
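
For reference, a minimal sketch of such a monitor in PyTorch (the `GradientMonitor` class, the `window_size`, and the `1e-8` threshold are purely illustrative, nothing like this exists in the trainer yet):

```python
import torch
from collections import defaultdict, deque


class GradientMonitor:
    """Track per-parameter gradient statistics over a rolling window of steps."""

    def __init__(self, model: torch.nn.Module, window_size: int = 100):
        self.model = model
        # One rolling window of max-abs gradient values per parameter tensor.
        self.history = defaultdict(lambda: deque(maxlen=window_size))

    def record(self):
        # Call after loss.backward(), before optimizer.step().
        for name, p in self.model.named_parameters():
            if p.grad is not None:
                self.history[name].append(p.grad.detach().abs().max().item())

    def report(self, threshold: float = 1e-8):
        # Flag parameters whose max-abs gradient stayed below the threshold
        # over the whole window: candidates for vanishing/dead gradients.
        for name, values in self.history.items():
            if len(values) == values.maxlen and max(values) < threshold:
                print(f"possible vanishing gradient in {name}: max |grad| = {max(values):.3e}")
```

A single scalar per tensor is too coarse for the case in #53, though: only some columns (king positions) of the input-layer weight were dead, so for that tensor one would track per-feature statistics instead, e.g. something like `p.grad.abs().amax(dim=0)` if the features lie along the second dimension of the weight.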

This would have detected the anomalies in the input layer (dead weights for some king positions) in vondele's run84run3 (see #53).

Note that with GC (gradient centralization), we cannot simply inspect the difference between two checkpoints: the centralized gradient by definition contains a contribution equal to the mean of the gradient vectors over all neurons of a layer, so even weights whose raw gradient is zero keep moving (see equation (1) of https://arxiv.org/pdf/2004.01461v2).
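
A small illustration of why this hides dead input features (a sketch assuming GC is applied as in the common Ranger implementation, i.e. the gradient is centered by subtracting its mean over all dimensions except the first):

```python
import torch

# Raw gradient of an input-layer weight [out_features, in_features];
# columns 2 and 3 stand in for "dead" king/piece positions with zero gradient.
grad = torch.randn(4, 6)
grad[:, 2:4] = 0.0

# Gradient centralization: subtract one mean per output neuron (per row).
gc_grad = grad - grad.mean(dim=tuple(range(1, grad.dim())), keepdim=True)

print(gc_grad[:, 2:4])  # generally non-zero: the dead columns still get updated
```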

As a "work-around", I think continuing training without GC (use_gc=False in Ranger) from a checkpoint and then comparing/visualizing the difference with a later checkpoint should also do the trick.
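
A rough sketch of that comparison (assuming two plain state_dicts; `ckpt_a.pt`/`ckpt_b.pt` are placeholder names, and Lightning checkpoints would first need their nested `"state_dict"` entry extracted):

```python
import torch

# Placeholder paths; adapt to the actual checkpoint format of the trainer.
state_a = torch.load("ckpt_a.pt", map_location="cpu")
state_b = torch.load("ckpt_b.pt", map_location="cpu")

for name, wa in state_a.items():
    if not torch.is_floating_point(wa):
        continue
    diff = (state_b[name] - wa).abs()
    print(f"{name}: mean |Δw| = {diff.mean().item():.3e}, max |Δw| = {diff.max().item():.3e}")
    # For the input-layer weight, diff.amax(dim=0) gives a per-feature view:
    # entries that stay ~0 correspond to king/piece positions that never moved.
```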

See also https://discuss.pytorch.org/t/how-to-check-for-vanishing-exploding-gradients/9019
