
Detect vanishing gradients #55

Open
@ddobbelaere

Description


It might be desirable to monitor/detect vanishing gradients during training. Note that "gradient" here of course means the stochastic gradient, as estimated from the training samples used in the current epoch (the current batch size may be too small to excite all king/piece positions, so preferably track the mean or max absolute value over a window of multiple epochs).
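
For reference, a minimal sketch of such a monitor in PyTorch (the `GradientMonitor` class, the `window_size`, and the `1e-8` threshold are purely illustrative, nothing like this exists in the trainer yet):

```python
import torch
from collections import defaultdict, deque


class GradientMonitor:
    """Track per-parameter gradient statistics over a rolling window of steps."""

    def __init__(self, model: torch.nn.Module, window_size: int = 100):
        self.model = model
        # One rolling window of max-abs gradient values per parameter tensor.
        self.history = defaultdict(lambda: deque(maxlen=window_size))

    def record(self):
        # Call after loss.backward(), before optimizer.step().
        for name, p in self.model.named_parameters():
            if p.grad is not None:
                self.history[name].append(p.grad.detach().abs().max().item())

    def report(self, threshold: float = 1e-8):
        # Flag parameters whose max-abs gradient stayed below the threshold
        # over the whole window: candidates for vanishing/dead gradients.
        for name, values in self.history.items():
            if len(values) == values.maxlen and max(values) < threshold:
                print(f"possible vanishing gradient in {name}: max |grad| = {max(values):.3e}")
```

A single scalar per tensor is too coarse for the case in #53, though: only some columns (king positions) of the input-layer weight were dead, so for that tensor one would track per-feature statistics instead, e.g. something like `p.grad.abs().amax(dim=0)` if the features lie along the second dimension of the weight.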

This would have detected the anomalies in the input layer (dead weights for some king positions) in vondele's run84run3 (see #53).

Note that with GC (gradient centralization), we cannot simply inspect the difference between two checkpoints: the centralized gradient by definition contains a contribution equal to the mean of the gradient vectors over all neurons of a layer, so even weights whose raw gradient is zero keep moving (see equation (1) of https://arxiv.org/pdf/2004.01461v2).
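
A small illustration of why this hides dead input features (a sketch assuming GC is applied as in the common Ranger implementation, i.e. the gradient is centered by subtracting its mean over all dimensions except the first):

```python
import torch

# Raw gradient of an input-layer weight [out_features, in_features];
# columns 2 and 3 stand in for "dead" king/piece positions with zero gradient.
grad = torch.randn(4, 6)
grad[:, 2:4] = 0.0

# Gradient centralization: subtract one mean per output neuron (per row).
gc_grad = grad - grad.mean(dim=tuple(range(1, grad.dim())), keepdim=True)

print(gc_grad[:, 2:4])  # generally non-zero: the dead columns still get updated
```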

As a "work-around", I think continuing training without GC (use_gc=False in Ranger) from a checkpoint and then comparing/visualizing the difference with a later checkpoint should also do the trick.
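
A rough sketch of that comparison (assuming two plain state_dicts; `ckpt_a.pt`/`ckpt_b.pt` are placeholder names, and Lightning checkpoints would first need their nested `"state_dict"` entry extracted):

```python
import torch

# Placeholder paths; adapt to the actual checkpoint format of the trainer.
state_a = torch.load("ckpt_a.pt", map_location="cpu")
state_b = torch.load("ckpt_b.pt", map_location="cpu")

for name, wa in state_a.items():
    if not torch.is_floating_point(wa):
        continue
    diff = (state_b[name] - wa).abs()
    print(f"{name}: mean |Δw| = {diff.mean().item():.3e}, max |Δw| = {diff.max().item():.3e}")
    # For the input-layer weight, diff.amax(dim=0) gives a per-feature view:
    # entries that stay ~0 correspond to king/piece positions that never moved.
```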

See also https://discuss.pytorch.org/t/how-to-check-for-vanishing-exploding-gradients/9019
