Why use 5× learning rate and zero weight decay for Engram parameters? #9

@goodluckcwl

In the current implementation, the Engram parameters appear to be trained with a learning rate 5× that of the backbone and with weight_decay=0. Could you clarify the motivation behind these choices? Specifically:

Does the higher LR help overcome gradient attenuation caused by the gating or by the module's late insertion in the network?
Is weight_decay=0 used to avoid regularizing the discrete n-gram memory embeddings, since decaying them might reduce their capacity?
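
For concreteness, here is a minimal sketch of the kind of setup I am asking about. It is not the repository's actual code: the helper name `build_param_groups`, the `"engram"` name filter, and the base values in the usage note are all assumptions made for illustration. It only shows how the Engram parameters could be split into their own optimizer group with a 5× learning rate and no weight decay.

```python
from typing import Any, Dict, Iterable, List, Tuple


def build_param_groups(
    named_params: Iterable[Tuple[str, Any]],
    base_lr: float,
    base_weight_decay: float,
    engram_lr_mult: float = 5.0,   # the 5x multiplier this issue asks about
    engram_key: str = "engram",    # hypothetical naming convention, not confirmed
) -> List[Dict[str, Any]]:
    """Split parameters into a backbone group and an Engram group.

    Parameters whose name contains `engram_key` (case-insensitive) go into
    the Engram group; everything else is treated as backbone.
    """
    backbone: List[Any] = []
    engram: List[Any] = []
    for name, param in named_params:
        if engram_key in name.lower():
            engram.append(param)
        else:
            backbone.append(param)
    return [
        # Backbone: base learning rate and the usual weight decay.
        {"params": backbone, "lr": base_lr, "weight_decay": base_weight_decay},
        # Engram: scaled learning rate, weight decay disabled.
        {"params": engram, "lr": base_lr * engram_lr_mult, "weight_decay": 0.0},
    ]
```

Assuming a PyTorch model, the returned list could be passed straight to an optimizer that accepts per-parameter options, for example `torch.optim.AdamW(build_param_groups(model.named_parameters(), base_lr=1e-3, base_weight_decay=0.1))`, where the base values are placeholders.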
Thanks!
