Question
Suppose the ConstrainedAdam optimizer is not used.
Wouldn't this allow the model to cheat L_sparse in the next step?
For example:
- 1st step:
f_gate shrinks slightly because of L_sparse
- 2nd step:
L_recon increases because f_gate shrank in the first step (affecting f), so the model compensates by increasing the decoder weights
- and the pattern continues.
```python
x_hat_gate = f_gate @ self.ae.decoder.weight.detach().T + self.ae.decoder_bias.detach()
```
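For context, my understanding of why ConstrainedAdam would block this pattern: it keeps each decoder column at unit norm around every update, so the decoder cannot absorb scale from the shrinking f_gate. Below is a minimal sketch of that kind of constrained optimizer; the class body, the gradient projection, and the usage line are my assumptions about a typical implementation, not code taken from this repo.

```python
import torch


class ConstrainedAdam(torch.optim.Adam):
    """Sketch of an Adam variant that keeps the columns of selected
    parameters (e.g. the decoder weight) at unit norm."""

    def __init__(self, params, constrained_params, lr):
        super().__init__(params, lr=lr)
        self.constrained_params = list(constrained_params)

    def step(self, closure=None):
        with torch.no_grad():
            for p in self.constrained_params:
                if p.grad is None:
                    continue
                normed = p / p.norm(dim=0, keepdim=True)
                # Project out the gradient component parallel to each column,
                # so the update direction cannot change the column norms.
                p.grad -= (p.grad * normed).sum(dim=0, keepdim=True) * normed
        super().step(closure=closure)
        with torch.no_grad():
            for p in self.constrained_params:
                # Renormalize the columns after the step to remove any
                # residual drift in their norms.
                p /= p.norm(dim=0, keepdim=True)


# Hypothetical usage: constrain only the decoder weight columns.
# opt = ConstrainedAdam(sae.parameters(), [sae.decoder.weight], lr=3e-4)
```

With the decoder columns pinned to unit norm like this, shrinking f_gate genuinely reduces the reconstruction scale and cannot be silently undone by growing decoder weights, which is why the cheating loop described above shouldn't occur.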